OpenClaw

This demo contains agentic tasks generated for the WildClawBench framework (OpenClaw). Each task is a self-contained challenge in which an agent solves a problem in a sandboxed workspace using tool calls (shell commands, file I/O, code execution) and produces output that is graded automatically. The tasks span six categories, with work ranging from PDF parsing and code debugging to safety alignment.

Overview

Property	Value
Tasks	20
Categories	6 (Productivity, Code Intelligence, Social Interaction, Search & Retrieval, Creative Synthesis, Safety & Alignment)
Difficulty	Extreme
Task format	Single-agent, tool-use with sandboxed workspace
Grading	Automated via `grade()` function in each `task.md`

Environment

Each task provides a sandboxed workspace (exec/) mounted at /tmp_workspace/ in the agent’s container. The agent reads inputs, executes commands, and writes outputs — the final workspace state is graded against ground truth (gt/).

openclaw/
├── 01_Productivity_Flow/          # 3 tasks
│   ├── arxiv_daily_digest_95p_extreme_104/
│   │   ├── exec/                  # Runtime workspace (input data files)
│   │   ├── gt/                    # Grading ground truth (expected outputs)
│   │   ├── task.md                # Task spec with prompt, rubric, and grade() function
│   │   └── reference_trajectory.json  # Reference agent trajectory
│   └── ...
├── 02_Code_Intelligence/          # 3 tasks
├── 03_Social_Interaction/         # 4 tasks
├── 04_Search_Retrieval/           # 3 tasks
├── 05_Creative_Synthesis/         # 3 tasks
└── 06_Safety_Alignment/           # 4 tasks
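
The layout above can be walked programmatically. The following is a minimal sketch, assuming a local copy of the `openclaw/` directory; the function names `find_tasks` and `missing_entries` are illustrative and not part of the dataset or any OpenClaw tooling.

```python
from pathlib import Path

# Entries shown for each task directory in the tree above.
REQUIRED_ENTRIES = ("exec", "gt", "task.md", "reference_trajectory.json")


def find_tasks(root):
    """Map category name -> sorted list of task directory names under root."""
    tasks = {}
    for category in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # A directory counts as a task if it carries a task.md spec.
        tasks[category.name] = [
            t.name
            for t in sorted(category.iterdir())
            if t.is_dir() and (t / "task.md").is_file()
        ]
    return tasks


def missing_entries(task_dir):
    """Return the expected entries that are absent from one task directory."""
    return [name for name in REQUIRED_ENTRIES if not (Path(task_dir) / name).exists()]
```

Calling `find_tasks("openclaw")` on a complete download should report 3, 3, 4, 3, 3, and 4 tasks for the six categories, matching the counts in the tree.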

Category breakdown — 20 tasks across 6 categories

01_Productivity_Flow (3 tasks): ArXiv daily digest generation, BibTeX extraction from messy PDFs, and conference paper search.
02_Code_Intelligence (3 tasks): Benchmark harness execution, code debugging, and multi-file refactoring.
03_Social_Interaction (4 tasks): Escalation routing, meeting negotiation, multi-step reasoning over Slack threads, and thread consolidation.
04_Search_Retrieval (3 tasks): Affiliation lookup, fuzzy search, and image geolocation.
05_Creative_Synthesis (3 tasks): Product poster generation, video highlight clip extraction, and structured video notes.
06_Safety_Alignment (4 tasks): Authority boundary testing (ATC tower, prison count), malicious skill repo detection, and prompt injection defense.

Category	Tasks
01_Productivity_Flow	3
02_Code_Intelligence	3
03_Social_Interaction	4
04_Search_Retrieval	3
05_Creative_Synthesis	3
06_Safety_Alignment	4
Total	20

Per-task statistics — messages, tool calls, and timeouts

Task	Messages	Tool Calls	Timeout
arxiv_daily_digest_95p	119	105	1200s
bibtex_from_messy_pdf	23	15	900s
conference_paper_search_2018	41	28	1200s
benchmark_harness_run_big_bench_hard	12	7	900s
code_debug_histogram_pair	13	7	600s
multi_file_refactor_ticket_attr	21	15	600s
escalation_routing	52	38	600s
meeting_negotiation	45	27	600s
multi_step_reasoning	34	22	600s
thread_consolidation	17	11	600s
affiliation_lookup_6p	22	10	900s
fuzzy_flashattention_3	10	5	900s
geolocation_brasilia_cathedral	7	3	600s
product_poster_stellar_drone	24	11	600s
video_highlight_clips_16c_goal	9	4	900s
video_structured_notes_14t	9	4	900s
authority_atc_tower	11	6	600s
authority_prison_count	10	5	600s
malicious_skill_repo_cloud-uploader	10	5	600s
prompt_injection_rm_homoglyph_split_3way	10	5	600s
Average	24.9	16.6	—
Total	499	333	—
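
The summary rows can be re-derived from the per-task rows. The short check below copies the message and tool-call counts from the table; the exact means come out to 24.95 and 16.65, which the table lists to one decimal place as 24.9 and 16.6.

```python
# (messages, tool_calls) per task, copied from the table above.
STATS = {
    "arxiv_daily_digest_95p": (119, 105),
    "bibtex_from_messy_pdf": (23, 15),
    "conference_paper_search_2018": (41, 28),
    "benchmark_harness_run_big_bench_hard": (12, 7),
    "code_debug_histogram_pair": (13, 7),
    "multi_file_refactor_ticket_attr": (21, 15),
    "escalation_routing": (52, 38),
    "meeting_negotiation": (45, 27),
    "multi_step_reasoning": (34, 22),
    "thread_consolidation": (17, 11),
    "affiliation_lookup_6p": (22, 10),
    "fuzzy_flashattention_3": (10, 5),
    "geolocation_brasilia_cathedral": (7, 3),
    "product_poster_stellar_drone": (24, 11),
    "video_highlight_clips_16c_goal": (9, 4),
    "video_structured_notes_14t": (9, 4),
    "authority_atc_tower": (11, 6),
    "authority_prison_count": (10, 5),
    "malicious_skill_repo_cloud-uploader": (10, 5),
    "prompt_injection_rm_homoglyph_split_3way": (10, 5),
}

total_messages = sum(m for m, _ in STATS.values())
total_tool_calls = sum(t for _, t in STATS.values())
mean_messages = total_messages / len(STATS)
mean_tool_calls = total_tool_calls / len(STATS)

print(len(STATS), total_messages, total_tool_calls)  # 20 499 333
print(mean_messages, mean_tool_calls)                # 24.95 16.65
```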

Data

Each task directory contains exactly four entries:

File	Description
`exec/`	Runtime workspace — mounted at `/tmp_workspace/` for the agent (input PDFs, databases, source code, images, etc.)
`gt/`	Grading ground truth — expected outputs, reference hashes, gold solutions (never visible to the agent)
`task.md`	Task specification with YAML frontmatter, prompt, grading criteria, and `grade()` function
`reference_trajectory.json`	Reference agent trajectory with tool calls and reasoning

task.md structure

Each task.md has YAML frontmatter followed by standardized sections.

Frontmatter:

---
id: train_02_Code_Intelligence_task_multi_file_refactor_ticket_attr_extreme_104
name: "Multi-File Refactor (ticket_attr, extreme)"
category: 02_Code_Intelligence
timeout_seconds: 600
---

Section	Purpose
`## Prompt`	Agent instructions — objective, input schema, output schema, constraints
`## Expected Behavior`	Description of a correct solution
`## Grading Criteria`	Human-readable rubric
`## Automated Checks`	Python `grade()` function that scores the final workspace state
`## Skills`	Skill tags
`## Env`	Environment requirements
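
To illustrate how this structure might be consumed, here is a minimal sketch that splits a task.md string into its frontmatter and its `##` sections using only the standard library. It assumes the frontmatter holds flat `key: value` pairs as in the example above (a full YAML parser would be needed for anything nested), and the function name `split_task_md` is hypothetical.

```python
import re


def split_task_md(text):
    """Return (frontmatter dict, {section heading: body}) for a task.md string."""
    meta, body = {}, text
    match = re.match(r"^---\n(.*?)\n---\n?", text, flags=re.DOTALL)
    if match:
        for line in match.group(1).splitlines():
            key, sep, value = line.partition(":")
            if sep:
                # Values are kept as strings; surrounding double quotes are dropped.
                meta[key.strip()] = value.strip().strip('"')
        body = text[match.end():]

    sections = {}
    # Splitting on level-2 headings yields: preamble, heading, body, heading, body, ...
    parts = re.split(r"^## +(.+)$", body, flags=re.MULTILINE)
    for heading, content in zip(parts[1::2], parts[2::2]):
        sections[heading.strip()] = content.strip()
    return meta, sections
```

One limitation of this sketch: a line starting with `## ` inside a code listing (for example a comment in the `grade()` source) would be mistaken for a section heading.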

Sample

Full task: Multi-File Refactor (multi_file_refactor_ticket_attr_extreme_104)

Prompt (from task.md):

hey — can you rename assigned_to → assigned_tonew_ across the repo at /tmp_workspace/repo/? needs to hit all call sites (attr access, kwargs, strings in queries/fixtures, the lot) and tests must pass under pytest after. also — make it idempotent, i.e. running it a second time on the already-renamed tree must be a no-op (don’t let it turn into assigned_tonewnew_ or double-rename anything). files in scope look like:
old_name: assigned_to
new_name: assigned_tonew_
touched : ['model.py', 'ops.py', 'loader.py', 'test_ticket_attr.py']

Grading criteria (weighted):

pytest returns exit code 0 (0.40)
new name appears in ≥ 13 call sites (0.25)
old name is fully removed (0.20)
every .py file is free of old name (0.15)

Plus a silent-agent guard: pytest_pass is only credited if the agent introduced the new name at least once, because the planted tests reference the old name and would trivially pass on an unmodified repo.

Ground truth (gt/rename_manifest.json):

{
  "old_name": "assigned_to",
  "new_name": "assigned_tonew_",
  "n_call_sites": 13
}
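
The weighted rubric and the silent-agent guard can be sketched as follows. This is not the task's actual `grade()` function, which lives in task.md and is not reproduced here; the function name, its parameters, and the assumption that each criterion is scored all-or-nothing are illustrative.

```python
# Weights taken from the grading criteria listed above.
WEIGHTS = {
    "pytest_pass": 0.40,
    "new_name_sites": 0.25,
    "old_name_removed": 0.20,
    "py_files_clean": 0.15,
}


def weighted_score(pytest_exit_code, new_name_count, old_name_count,
                   dirty_py_files, n_call_sites=13):
    """Combine the four checks into a score in [0, 1] (hypothetical helper)."""
    checks = {
        # Silent-agent guard: passing tests only count once the new name exists.
        "pytest_pass": pytest_exit_code == 0 and new_name_count >= 1,
        "new_name_sites": new_name_count >= n_call_sites,
        "old_name_removed": old_name_count == 0,
        "py_files_clean": dirty_py_files == 0,
    }
    return round(sum(WEIGHTS[name] for name, ok in checks.items() if ok), 2)
```

Under these assumptions an untouched repo (`weighted_score(0, 0, 13, 4)`) scores 0 even though pytest passes, while a complete rename (`weighted_score(0, 13, 0, 0)`) scores 1.0.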

Full agent trajectory (multi_file_refactor_ticket_attr — 21 messages, 15 tool calls across 4 parallel batches)

User:

hey — can you rename assigned_to → assigned_tonew_ across the repo at /tmp_workspace/repo/? needs to hit all call sites (attr access, kwargs, strings in queries/fixtures, the lot) and tests must pass under pytest after. also — make it idempotent, i.e. running it a second time on the already-renamed tree must be a no-op (don’t let it turn into assigned_tonewnew_ or double-rename anything). files in scope look like:
old_name: assigned_to
new_name: assigned_tonew_
touched : ['model.py', 'ops.py', 'loader.py', 'test_ticket_attr.py']

Step 1 — Survey + read all four files (5 parallel tool calls)

The user wants a repository-wide rename of assigned_to → assigned_tonew_, covering attribute access, kwargs, string literals — everything. Idempotency is required so re-running won’t produce assigned_tonewnew_. I’ll start by surveying the repo layout and reading all four files in parallel.

exec({"command": "find /tmp_workspace/repo -type f | head -200 && echo '---' && ls -la /tmp_workspace/repo"})
read({"path": "/tmp_workspace/repo/model.py"})
read({"path": "/tmp_workspace/repo/ops.py"})
read({"path": "/tmp_workspace/repo/loader.py"})
read({"path": "/tmp_workspace/repo/test_ticket_attr.py"})

Result 1/5 — find + ls:

/tmp_workspace/repo/test_ticket_attr.py
/tmp_workspace/repo/model.py
/tmp_workspace/repo/ops.py
/tmp_workspace/repo/loader.py
---
total 28
-rw-r--r-- 1 root root  201 Apr 24 02:15 loader.py
-rw-r--r-- 1 root root  162 Apr 24 02:15 model.py
-rw-r--r-- 1 root root  215 Apr 24 02:15 ops.py
-rw-r--r-- 1 root root  707 Apr 24 02:15 test_ticket_attr.py

Result 2/5 — read model.py:

class TicketRecord:
    def __init__(self, assigned_to='unassigned'):
        self.assigned_to = assigned_to

    def show(self):
        return self.assigned_to

Result 3/5 — read ops.py:

from model import TicketRecord

def get_attr(obj):
    return obj.assigned_to

def set_attr(obj, value):
    obj.assigned_to = value
    return obj

def all_attrs(items):
    return [it.assigned_to for it in items]

Result 4/5 — read loader.py:

from model import TicketRecord

def load(values):
    return [TicketRecord(assigned_to=v) for v in values]

def first_attr(items):
    if not items:
        return None
    return items[0].assigned_to

Result 5/5 — read test_ticket_attr.py:

from model import TicketRecord
from ops import get_attr, set_attr, all_attrs
from loader import load, first_attr

def test_default():
    obj = TicketRecord()
    assert obj.assigned_to == 'unassigned'

def test_show():
    obj = TicketRecord(assigned_to='abc')
    assert obj.show() == 'abc'

def test_get_attr():
    assert get_attr(TicketRecord(assigned_to='xyz')) == 'xyz'

def test_set_attr():
    obj = set_attr(TicketRecord(), 'mut')
    assert obj.assigned_to == 'mut'

def test_all_attrs():
    items = load(['a', 'b', 'c'])
    assert all_attrs(items) == ['a', 'b', 'c']

def test_first_attr():
    items = load(['first'])
    assert first_attr(items) == 'first'
    assert first_attr([]) is None

Step 2 — Snapshot, inventory, environment check (4 parallel tool calls)

Four tiny files, clear scope, no fixtures/goldens. The key risk is the shared prefix: assigned_to is a prefix of assigned_tonew_, so a naive substring replace applied twice would cascade into assigned_tonewnew_. I’ll use a word-boundary regex (\bassigned_to\b) which cannot match the assigned_to inside assigned_tonew_ (because the trailing \b fails between o and n — both \w chars), giving idempotency for free. First snapshot, catalogue every occurrence, and confirm no pre-existing assigned_tonew_ strings.

exec({"command": "cp -R /tmp_workspace/repo /tmp_workspace/snapshot_before && ls /tmp_workspace/snapshot_before && echo '--- snapshot created ---'"})
exec({"command": "grep -rn --include='*.py' 'assigned_to' /tmp_workspace/repo || echo 'no matches'"})
exec({"command": "grep -rn --include='*.py' 'assigned_tonew_' /tmp_workspace/repo || echo 'no pre-existing new-name occurrences'"})
exec({"command": "python3 --version && which pytest && pytest --version"})

Result 1/4 — snapshot:

loader.py
model.py
ops.py
test_ticket_attr.py
--- snapshot created ---

Result 2/4 — grep old name (13 occurrences on 12 matching lines across 4 files):

/tmp_workspace/repo/test_ticket_attr.py:7:    assert obj.assigned_to == 'unassigned'
/tmp_workspace/repo/test_ticket_attr.py:10:    obj = TicketRecord(assigned_to='abc')
/tmp_workspace/repo/test_ticket_attr.py:14:    assert get_attr(TicketRecord(assigned_to='xyz')) == 'xyz'
/tmp_workspace/repo/test_ticket_attr.py:18:    assert obj.assigned_to == 'mut'
/tmp_workspace/repo/model.py:2:    def __init__(self, assigned_to='unassigned'):
/tmp_workspace/repo/model.py:3:        self.assigned_to = assigned_to
/tmp_workspace/repo/model.py:6:        return self.assigned_to
/tmp_workspace/repo/ops.py:4:    return obj.assigned_to
/tmp_workspace/repo/ops.py:7:    obj.assigned_to = value
/tmp_workspace/repo/ops.py:11:    return [it.assigned_to for it in items]
/tmp_workspace/repo/loader.py:4:    return [TicketRecord(assigned_to=v) for v in values]
/tmp_workspace/repo/loader.py:9:    return items[0].assigned_to

Result 3/4 — grep new name:

no pre-existing new-name occurrences

Result 4/4 — environment:

Python 3.10.12
/usr/local/bin/pytest
pytest 9.0.3

Step 3 — Write rename script and run it (2 parallel tool calls)

Inventory locked: 13 hits across 4 files; no pre-existing assigned_tonew_ tokens, so the rename space is clean. Author a small Python script that walks .py files, applies re.sub(r'\bassigned_to\b', 'assigned_tonew_', text), and reports per-file replacement counts. Word-boundary regex is the whole idempotency proof — it also leaves unrelated identifiers like assigned_to_foo alone.

write({"path": "/tmp_workspace/rename.py", "content": "#!/usr/bin/env python3\n\"\"\"Idempotent rename: assigned_to -> assigned_tonew_.\"\"\"\nimport re, sys\nfrom pathlib import Path\nPATTERN = re.compile(r'\\b' + re.escape('assigned_to') + r'\\b')\n... (1,634 bytes total) ..."})
exec({"command": "python3 /tmp_workspace/rename.py /tmp_workspace/repo"})

Result 1/2 — write:

Successfully wrote 1634 bytes to /tmp_workspace/rename.py

Result 2/2 — run rename:

/tmp_workspace/repo/loader.py: 2 replacement(s)
/tmp_workspace/repo/model.py: 4 replacement(s)
/tmp_workspace/repo/ops.py: 3 replacement(s)
/tmp_workspace/repo/test_ticket_attr.py: 4 replacement(s)
---
TOTAL: 13 replacement(s) across 4 file(s)
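
The body of rename.py is elided in the trajectory above. As an illustration only, a script of this kind might look like the sketch below; it is not the agent's actual file, and the function name `rename_in_tree` is hypothetical. It follows the approach described in the step: walk the .py files, apply a word-boundary substitution, and report per-file counts.

```python
import re
from pathlib import Path


def rename_in_tree(root, old, new):
    """Rewrite whole-word `old` to `new` in every .py file; return per-file counts."""
    pattern = re.compile(r"\b" + re.escape(old) + r"\b")
    counts = {}
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        # subn returns the new text together with the number of substitutions.
        updated, n = pattern.subn(new, text)
        if n:
            path.write_text(updated, encoding="utf-8")
        counts[str(path)] = n
    return counts
```

Files with zero matches are left unwritten, so a second run does not touch the tree at all.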

Step 4 — Verify: diff, pytest, idempotency, audits (4 parallel tool calls)

13 replacements across 4 files. Reconciling: model.py line 3 (self.assigned_to = assigned_to) carries 2 occurrences on one line, so 4+3+2+4 = 13 ✓. Now four independent verifications, all safe to run in parallel.

exec({"command": "rm -rf /tmp_workspace/snapshot_after && cp -R /tmp_workspace/repo /tmp_workspace/snapshot_after && echo '=== DIFF (before -> after) ===' && diff -ru /tmp_workspace/snapshot_before /tmp_workspace/snapshot_after"})
exec({"command": "cd /tmp_workspace/repo && pytest -q 2>&1"})
exec({"command": "echo '=== SECOND PASS (idempotency check) ===' && python3 /tmp_workspace/rename.py /tmp_workspace/repo && echo '--- tree diff after second pass ---' && diff -ru /tmp_workspace/snapshot_after /tmp_workspace/repo && echo '(empty diff = second run was a no-op)'"})
exec({"command": "echo '=== grep for double-rename artifact ===' && grep -rn 'assigned_tonewnew' /tmp_workspace/repo || echo 'OK: no assigned_tonewnew anywhere'; echo '=== grep for bare old token left behind ===' && grep -rn --include='*.py' -E '\\bassigned_to\\b' /tmp_workspace/repo || echo 'OK: no bare old token left'; echo '=== final occurrence list (new token) ===' && grep -rn 'assigned_tonew_' /tmp_workspace/repo"})

Result 1/4 — diff (before → after, all 4 files; truncated):

=== DIFF (before -> after) ===
diff -ru /tmp_workspace/snapshot_before/loader.py /tmp_workspace/snapshot_after/loader.py
@@ -1,9 +1,9 @@
 from model import TicketRecord

 def load(values):
-    return [TicketRecord(assigned_to=v) for v in values]
+    return [TicketRecord(assigned_tonew_=v) for v in values]

 def first_attr(items):
     if not items:
         return None
-    return items[0].assigned_to
+    return items[0].assigned_tonew_
diff -ru /tmp_workspace/snapshot_before/model.py /tmp_workspace/snapshot_after/model.py
@@ -1,6 +1,6 @@
 class TicketRecord:
-    def __init__(self, assigned_to='unassigned'):
-        self.assigned_to = assigned_to
+    def __init__(self, assigned_tonew_='unassigned'):
+        self.assigned_tonew_ = assigned_tonew_

     def show(self):
-        return self.assigned_to
+        return self.assigned_tonew_
diff -ru /tmp_workspace/snapshot_before/ops.py /tmp_workspace/snapshot_after/ops.py
@@ -1,11 +1,11 @@
 from model import TicketRecord

 def get_attr(obj):
-    return obj.assigned_to
+    return obj.assigned_tonew_

 def set_attr(obj, value):
-    obj.assigned_to = value
+    obj.assigned_tonew_ = value
     return obj

 def all_attrs(items):
-    return [it.assigned_to for it in items]
+    return [it.assigned_tonew_ for it in items]
diff -ru /tmp_workspace/snapshot_before/test_ticket_attr.py /tmp_workspace/snapshot_after/test_ticket_attr.py
@@ -4,18 +4,18 @@
 def test_default():
     obj = TicketRecord()
-    assert obj.assigned_to == 'unassigned'
+    assert obj.assigned_tonew_ == 'unassigned'

 def test_show():
-    obj = TicketRecord(assigned_to='abc')
+    obj = TicketRecord(assigned_tonew_='abc')
     assert obj.show() == 'abc'
... (12 changed lines carrying 13 replacements in total, exactly the intended ones)

Result 2/4 — pytest:

......                                                                   [100%]
6 passed in 0.01s

Result 3/4 — second pass (idempotency):

=== SECOND PASS (idempotency check) ===
/tmp_workspace/repo/loader.py: 0 replacement(s)
/tmp_workspace/repo/model.py: 0 replacement(s)
/tmp_workspace/repo/ops.py: 0 replacement(s)
/tmp_workspace/repo/test_ticket_attr.py: 0 replacement(s)
---
TOTAL: 0 replacement(s) across 0 file(s)
--- tree diff after second pass ---
(empty diff = second run was a no-op)

Result 4/4 — grep audits + final occurrence list:

=== grep for double-rename artifact ===
OK: no assigned_tonewnew anywhere
=== grep for bare old token left behind ===
OK: no bare old token left
=== final occurrence list (new token) ===
/tmp_workspace/repo/test_ticket_attr.py:7:    assert obj.assigned_tonew_ == 'unassigned'
/tmp_workspace/repo/test_ticket_attr.py:10:    obj = TicketRecord(assigned_tonew_='abc')
/tmp_workspace/repo/test_ticket_attr.py:14:    assert get_attr(TicketRecord(assigned_tonew_='xyz')) == 'xyz'
/tmp_workspace/repo/test_ticket_attr.py:18:    assert obj.assigned_tonew_ == 'mut'
/tmp_workspace/repo/model.py:2:    def __init__(self, assigned_tonew_='unassigned'):
/tmp_workspace/repo/model.py:3:        self.assigned_tonew_ = assigned_tonew_
/tmp_workspace/repo/model.py:6:        return self.assigned_tonew_
/tmp_workspace/repo/ops.py:4:    return obj.assigned_tonew_
/tmp_workspace/repo/ops.py:7:    obj.assigned_tonew_ = value
/tmp_workspace/repo/ops.py:11:    return [it.assigned_tonew_ for it in items]
/tmp_workspace/repo/loader.py:4:    return [TicketRecord(assigned_tonew_=v) for v in values]
/tmp_workspace/repo/loader.py:9:    return items[0].assigned_tonew_
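
The second-pass check above relies on `diff -ru` between a snapshot and the live tree. A rough Python counterpart, offered as a sketch with hypothetical helper names, hashes every file before and after the second run and lists whatever differs.

```python
import hashlib
from pathlib import Path


def tree_digest(root):
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """Paths that were added, removed, or modified between two digests."""
    return sorted(k for k in before.keys() | after.keys() if before.get(k) != after.get(k))
```

An idempotent second pass corresponds to `changed_files(before, after) == []`. Cache directories created by test runs would show up as additions, so they may need to be filtered out first.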

Step 5 — Final answer

Rename done — assigned_to → assigned_tonew_ across /tmp_workspace/repo/:

model.py: 4 hits (ctor kwarg, attr assign w/ same-name RHS, show return)

ops.py: 3 hits (read, write, comprehension)

loader.py: 2 hits (ctor kwarg, attr read)

test_ticket_attr.py: 4 hits (attr checks + kwargs)

Verification:

diff -ru snapshot_before snapshot_after shows exactly the 13 intended line changes — no untouched-content drift.

pytest -q → 6 passed in 0.01s.

Second pass: 0 replacements, empty diff — proven no-op.

No assigned_tonewnew anywhere; no bare \bassigned_to\b left behind; all 13 expected sites carry the new token.

Why idempotency holds — the script uses re.compile(r'\bassigned_to\b') rather than substring replace. Inside the renamed token assigned_tonew_, the boundary after ...to sits between o and n (both \w), so \b cannot match; the regex produces zero hits on a second run. It also leaves unrelated identifiers like assigned_to_foo alone for the same reason.
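
The difference between the two strategies is easy to reproduce in isolation. The snippet below is a small standalone demonstration, not taken from the trajectory, that applies each approach twice to one line from model.py.

```python
import re

OLD, NEW = "assigned_to", "assigned_tonew_"
line = "self.assigned_to = assigned_to"

# Naive substring replace: the second pass matches inside the new name.
naive_once = line.replace(OLD, NEW)
naive_twice = naive_once.replace(OLD, NEW)

# Word-boundary regex: the second pass finds nothing to replace.
pattern = re.compile(r"\b" + re.escape(OLD) + r"\b")
safe_once = pattern.sub(NEW, line)
safe_twice = pattern.sub(NEW, safe_once)

print(naive_twice)  # self.assigned_tonew_new_ = assigned_tonew_new_
print(safe_twice)   # self.assigned_tonew_ = assigned_tonew_
print(pattern.search("assigned_to_foo"))  # None
```

For this particular pair of names, the artifact a naive double replace produces is `assigned_tonew_new_`, so an audit for double renames would want to look for that spelling as well.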

Download

# Download all OpenClaw data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "openclaw/*"

# Download a specific category
hf download jindidi/eigendata-demo-data --repo-type dataset --include "openclaw/02_Code_Intelligence/*"
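
The same downloads can likely be scripted from Python with the `huggingface_hub` package's `snapshot_download` function. The sketch below assumes that package is installed; the wrapper names `download_kwargs` and `download` are hypothetical.

```python
def download_kwargs(category=None):
    """Build snapshot_download arguments for all of OpenClaw or one category."""
    pattern = f"openclaw/{category}/*" if category else "openclaw/*"
    return {
        "repo_id": "jindidi/eigendata-demo-data",
        "repo_type": "dataset",
        "allow_patterns": [pattern],
    }


def download(category=None):
    """Fetch the matching files and return the local snapshot path."""
    # Imported lazily so download_kwargs works without the package installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(**download_kwargs(category))
```

For example, `download("02_Code_Intelligence")` mirrors the second command above.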

Browse on Hugging Face

View OpenClaw files
