Skip to main content
This demo contains agentic tasks generated for the WildClawBench framework (OpenClaw). Each task is a self-contained challenge where an agent must solve a problem in a sandboxed workspace using tool calls (shell commands, file I/O, code execution), then produce graded output. Tasks span six categories from PDF parsing to code debugging to safety alignment.

Overview

PropertyValue
Tasks20
Categories6 (Productivity, Code Intelligence, Social Interaction, Search & Retrieval, Creative Synthesis, Safety & Alignment)
DifficultyExtreme
Task formatSingle-agent, tool-use with sandboxed workspace
GradingAutomated via grade() function in each task.md

Environment

Each task provides a sandboxed workspace (exec/) mounted at /tmp_workspace/ in the agent’s container. The agent reads inputs, executes commands, and writes outputs — the final workspace state is graded against ground truth (gt/).
openclaw/
├── 01_Productivity_Flow/          # 3 tasks
│   ├── arxiv_daily_digest_95p_extreme_104/
│   │   ├── exec/                  # Runtime workspace (input data files)
│   │   ├── gt/                    # Grading ground truth (expected outputs)
│   │   ├── task.md                # Task spec with prompt, rubric, and grade() function
│   │   └── reference_trajectory.json  # Reference agent trajectory
│   └── ...
├── 02_Code_Intelligence/          # 3 tasks
├── 03_Social_Interaction/         # 4 tasks
├── 04_Search_Retrieval/           # 3 tasks
├── 05_Creative_Synthesis/         # 3 tasks
└── 06_Safety_Alignment/           # 4 tasks
01_Productivity_Flow (3 tasks): ArXiv daily digest generation, BibTeX extraction from messy PDFs, and conference paper search.02_Code_Intelligence (3 tasks): Benchmark harness execution, code debugging, and multi-file refactoring.03_Social_Interaction (4 tasks): Escalation routing, meeting negotiation, multi-step reasoning over Slack threads, and thread consolidation.04_Search_Retrieval (3 tasks): Affiliation lookup, fuzzy search, and image geolocation.05_Creative_Synthesis (3 tasks): Product poster generation, video highlight clip extraction, and structured video notes.06_Safety_Alignment (4 tasks): Authority boundary testing (ATC tower, prison count), malicious skill repo detection, and prompt injection defense.
CategoryTasks
01_Productivity_Flow3
02_Code_Intelligence3
03_Social_Interaction4
04_Search_Retrieval3
05_Creative_Synthesis3
06_Safety_Alignment4
Total20
TaskMessagesTool CallsTimeout
arxiv_daily_digest_95p1191051200s
bibtex_from_messy_pdf2315900s
conference_paper_search_201841281200s
benchmark_harness_run_big_bench_hard127900s
code_debug_histogram_pair137600s
multi_file_refactor_ticket_attr2115600s
escalation_routing5238600s
meeting_negotiation4527600s
multi_step_reasoning3422600s
thread_consolidation1711600s
affiliation_lookup_6p2210900s
fuzzy_flashattention_3105900s
geolocation_brasilia_cathedral73600s
product_poster_stellar_drone2411600s
video_highlight_clips_16c_goal94900s
video_structured_notes_14t94900s
authority_atc_tower116600s
authority_prison_count105600s
malicious_skill_repo_cloud-uploader105600s
prompt_injection_rm_homoglyph_split_3way105600s
Average24.916.6
Total499333

Data

Each task directory contains exactly four entries:
FileDescription
exec/Runtime workspace — mounted at /tmp_workspace/ for the agent (input PDFs, databases, source code, images, etc.)
gt/Grading ground truth — expected outputs, reference hashes, gold solutions (never visible to the agent)
task.mdTask specification with YAML frontmatter, prompt, grading criteria, and grade() function
reference_trajectory.jsonReference agent trajectory with tool calls and reasoning
Each task.md has YAML frontmatter and standardized sections:Frontmatter:
---
id: train_02_Code_Intelligence_task_multi_file_refactor_ticket_attr_extreme_104
name: "Multi-File Refactor (ticket_attr, extreme)"
category: 02_Code_Intelligence
timeout_seconds: 600
---
SectionPurpose
## PromptAgent instructions — objective, input schema, output schema, constraints
## Expected BehaviorDescription of a correct solution
## Grading CriteriaHuman-readable rubric
## Automated ChecksPython grade() function that scores the final workspace state
## SkillsSkill tags
## EnvEnvironment requirements

Sample

Prompt (from task.md):
hey — can you rename assigned_toassigned_tonew_ across the repo at /tmp_workspace/repo/? needs to hit all call sites (attr access, kwargs, strings in queries/fixtures, the lot) and tests must pass under pytest after. also — make it idempotent, i.e. running it a second time on the already-renamed tree must be a no-op (don’t let it turn into assigned_tonewnew_ or double-rename anything). files in scope look like:
old_name: assigned_to
new_name: assigned_tonew_
touched : ['model.py', 'ops.py', 'loader.py', 'test_ticket_attr.py']
Grading criteria (weighted):
  • pytest returns exit code 0 (0.40)
  • new name appears in ≥ 13 call sites (0.25)
  • old name is fully removed (0.20)
  • every .py file is free of old name (0.15)
Plus a silent-agent guard: pytest_pass is only credited if the agent introduced the new name at least once — the planted tests reference the old name and would trivially pass on an unmodified repo.Ground truth (gt/rename_manifest.json):
{
  "old_name": "assigned_to",
  "new_name": "assigned_tonew_",
  "n_call_sites": 13
}
User:
hey — can you rename assigned_toassigned_tonew_ across the repo at /tmp_workspace/repo/? needs to hit all call sites (attr access, kwargs, strings in queries/fixtures, the lot) and tests must pass under pytest after. also — make it idempotent, i.e. running it a second time on the already-renamed tree must be a no-op (don’t let it turn into assigned_tonewnew_ or double-rename anything). files in scope look like:
old_name: assigned_to
new_name: assigned_tonew_
touched : ['model.py', 'ops.py', 'loader.py', 'test_ticket_attr.py']

Step 1 — Survey + read all four files (5 parallel tool calls)The user wants a repository-wide rename of assigned_toassigned_tonew_, covering attribute access, kwargs, string literals — everything. Idempotency is required so re-running won’t produce assigned_tonewnew_. I’ll start by surveying the repo layout and reading all four files in parallel.
exec({"command": "find /tmp_workspace/repo -type f | head -200 && echo '---' && ls -la /tmp_workspace/repo"})
read({"path": "/tmp_workspace/repo/model.py"})
read({"path": "/tmp_workspace/repo/ops.py"})
read({"path": "/tmp_workspace/repo/loader.py"})
read({"path": "/tmp_workspace/repo/test_ticket_attr.py"})
Result 1/5 — find + ls:
/tmp_workspace/repo/test_ticket_attr.py
/tmp_workspace/repo/model.py
/tmp_workspace/repo/ops.py
/tmp_workspace/repo/loader.py
---
total 28
-rw-r--r-- 1 root root  201 Apr 24 02:15 loader.py
-rw-r--r-- 1 root root  162 Apr 24 02:15 model.py
-rw-r--r-- 1 root root  215 Apr 24 02:15 ops.py
-rw-r--r-- 1 root root  707 Apr 24 02:15 test_ticket_attr.py
Result 2/5 — read model.py:
class TicketRecord:
    def __init__(self, assigned_to='unassigned'):
        self.assigned_to = assigned_to

    def show(self):
        return self.assigned_to
Result 3/5 — read ops.py:
from model import TicketRecord

def get_attr(obj):
    return obj.assigned_to

def set_attr(obj, value):
    obj.assigned_to = value
    return obj

def all_attrs(items):
    return [it.assigned_to for it in items]
Result 4/5 — read loader.py:
from model import TicketRecord

def load(values):
    return [TicketRecord(assigned_to=v) for v in values]

def first_attr(items):
    if not items:
        return None
    return items[0].assigned_to
Result 5/5 — read test_ticket_attr.py:
from model import TicketRecord
from ops import get_attr, set_attr, all_attrs
from loader import load, first_attr

def test_default():
    obj = TicketRecord()
    assert obj.assigned_to == 'unassigned'

def test_show():
    obj = TicketRecord(assigned_to='abc')
    assert obj.show() == 'abc'

def test_get_attr():
    assert get_attr(TicketRecord(assigned_to='xyz')) == 'xyz'

def test_set_attr():
    obj = set_attr(TicketRecord(), 'mut')
    assert obj.assigned_to == 'mut'

def test_all_attrs():
    items = load(['a', 'b', 'c'])
    assert all_attrs(items) == ['a', 'b', 'c']

def test_first_attr():
    items = load(['first'])
    assert first_attr(items) == 'first'
    assert first_attr([]) is None

Step 2 — Snapshot, inventory, environment check (4 parallel tool calls)Four tiny files, clear scope, no fixtures/goldens. The key risk is the shared prefix: assigned_to is a prefix of assigned_tonew_, so a naive substring replace applied twice would cascade into assigned_tonewnew_. I’ll use a word-boundary regex (\bassigned_to\b) which cannot match the assigned_to inside assigned_tonew_ (because the trailing \b fails between o and n — both \w chars), giving idempotency for free. First snapshot, catalogue every occurrence, and confirm no pre-existing assigned_tonew_ strings.
exec({"command": "cp -R /tmp_workspace/repo /tmp_workspace/snapshot_before && ls /tmp_workspace/snapshot_before && echo '--- snapshot created ---'"})
exec({"command": "grep -rn --include='*.py' 'assigned_to' /tmp_workspace/repo || echo 'no matches'"})
exec({"command": "grep -rn --include='*.py' 'assigned_tonew_' /tmp_workspace/repo || echo 'no pre-existing new-name occurrences'"})
exec({"command": "python3 --version && which pytest && pytest --version"})
Result 1/4 — snapshot:
loader.py
model.py
ops.py
test_ticket_attr.py
--- snapshot created ---
Result 2/4 — grep old name (13 hits across 4 files):
/tmp_workspace/repo/test_ticket_attr.py:7:    assert obj.assigned_to == 'unassigned'
/tmp_workspace/repo/test_ticket_attr.py:10:    obj = TicketRecord(assigned_to='abc')
/tmp_workspace/repo/test_ticket_attr.py:14:    assert get_attr(TicketRecord(assigned_to='xyz')) == 'xyz'
/tmp_workspace/repo/test_ticket_attr.py:18:    assert obj.assigned_to == 'mut'
/tmp_workspace/repo/model.py:2:    def __init__(self, assigned_to='unassigned'):
/tmp_workspace/repo/model.py:3:        self.assigned_to = assigned_to
/tmp_workspace/repo/model.py:6:        return self.assigned_to
/tmp_workspace/repo/ops.py:4:    return obj.assigned_to
/tmp_workspace/repo/ops.py:7:    obj.assigned_to = value
/tmp_workspace/repo/ops.py:11:    return [it.assigned_to for it in items]
/tmp_workspace/repo/loader.py:4:    return [TicketRecord(assigned_to=v) for v in values]
/tmp_workspace/repo/loader.py:9:    return items[0].assigned_to
Result 3/4 — grep new name:
no pre-existing new-name occurrences
Result 4/4 — environment:
Python 3.10.12
/usr/local/bin/pytest
pytest 9.0.3

Step 3 — Write rename script and run it (2 parallel tool calls)Inventory locked: 13 hits across 4 files; no pre-existing assigned_tonew_ tokens, so the rename space is clean. Author a small Python script that walks .py files, applies re.sub(r'\bassigned_to\b', 'assigned_tonew_', text), and reports per-file replacement counts. Word-boundary regex is the whole idempotency proof — it also leaves unrelated identifiers like assigned_to_foo alone.
write({"path": "/tmp_workspace/rename.py", "content": "#!/usr/bin/env python3\n\"\"\"Idempotent rename: assigned_to -> assigned_tonew_.\"\"\"\nimport re, sys\nfrom pathlib import Path\nPATTERN = re.compile(r'\\b' + re.escape('assigned_to') + r'\\b')\n... (1,634 bytes total) ..."})
exec({"command": "python3 /tmp_workspace/rename.py /tmp_workspace/repo"})
Result 1/2 — write:
Successfully wrote 1634 bytes to /tmp_workspace/rename.py
Result 2/2 — run rename:
/tmp_workspace/repo/loader.py: 2 replacement(s)
/tmp_workspace/repo/model.py: 4 replacement(s)
/tmp_workspace/repo/ops.py: 3 replacement(s)
/tmp_workspace/repo/test_ticket_attr.py: 4 replacement(s)
---
TOTAL: 13 replacement(s) across 4 file(s)

Step 4 — Verify: diff, pytest, idempotency, audits (4 parallel tool calls)13 replacements across 4 files. Reconciling: model.py line 3 (self.assigned_to = assigned_to) carries 2 occurrences on one line, so 4+3+2+4 = 13 ✓. Now four independent verifications, all safe to run in parallel.
exec({"command": "rm -rf /tmp_workspace/snapshot_after && cp -R /tmp_workspace/repo /tmp_workspace/snapshot_after && echo '=== DIFF (before -> after) ===' && diff -ru /tmp_workspace/snapshot_before /tmp_workspace/snapshot_after"})
exec({"command": "cd /tmp_workspace/repo && pytest -q 2>&1"})
exec({"command": "echo '=== SECOND PASS (idempotency check) ===' && python3 /tmp_workspace/rename.py /tmp_workspace/repo && echo '--- tree diff after second pass ---' && diff -ru /tmp_workspace/snapshot_after /tmp_workspace/repo && echo '(empty diff = second run was a no-op)'"})
exec({"command": "echo '=== grep for double-rename artifact ===' && grep -rn 'assigned_tonewnew' /tmp_workspace/repo || echo 'OK: no assigned_tonewnew anywhere'; echo '=== grep for bare old token left behind ===' && grep -rn --include='*.py' -E '\\bassigned_to\\b' /tmp_workspace/repo || echo 'OK: no bare old token left'; echo '=== final occurrence list (new token) ===' && grep -rn 'assigned_tonew_' /tmp_workspace/repo"})
Result 1/4 — diff (before → after, all 4 files; truncated):
=== DIFF (before -> after) ===
diff -ru /tmp_workspace/snapshot_before/loader.py /tmp_workspace/snapshot_after/loader.py
@@ -1,9 +1,9 @@
 from model import TicketRecord

 def load(values):
-    return [TicketRecord(assigned_to=v) for v in values]
+    return [TicketRecord(assigned_tonew_=v) for v in values]

 def first_attr(items):
     if not items:
         return None
-    return items[0].assigned_to
+    return items[0].assigned_tonew_
diff -ru /tmp_workspace/snapshot_before/model.py /tmp_workspace/snapshot_after/model.py
@@ -1,6 +1,6 @@
 class TicketRecord:
-    def __init__(self, assigned_to='unassigned'):
-        self.assigned_to = assigned_to
+    def __init__(self, assigned_tonew_='unassigned'):
+        self.assigned_tonew_ = assigned_tonew_

     def show(self):
-        return self.assigned_to
+        return self.assigned_tonew_
diff -ru /tmp_workspace/snapshot_before/ops.py /tmp_workspace/snapshot_after/ops.py
@@ -1,11 +1,11 @@
 from model import TicketRecord

 def get_attr(obj):
-    return obj.assigned_to
+    return obj.assigned_tonew_

 def set_attr(obj, value):
-    obj.assigned_to = value
+    obj.assigned_tonew_ = value
     return obj

 def all_attrs(items):
-    return [it.assigned_to for it in items]
+    return [it.assigned_tonew_ for it in items]
diff -ru /tmp_workspace/snapshot_before/test_ticket_attr.py /tmp_workspace/snapshot_after/test_ticket_attr.py
@@ -4,18 +4,18 @@
 def test_default():
     obj = TicketRecord()
-    assert obj.assigned_to == 'unassigned'
+    assert obj.assigned_tonew_ == 'unassigned'

 def test_show():
-    obj = TicketRecord(assigned_to='abc')
+    obj = TicketRecord(assigned_tonew_='abc')
     assert obj.show() == 'abc'
... (13 line changes total, exactly the intended ones)
Result 2/4 — pytest:
......                                                                   [100%]
6 passed in 0.01s
Result 3/4 — second pass (idempotency):
=== SECOND PASS (idempotency check) ===
/tmp_workspace/repo/loader.py: 0 replacement(s)
/tmp_workspace/repo/model.py: 0 replacement(s)
/tmp_workspace/repo/ops.py: 0 replacement(s)
/tmp_workspace/repo/test_ticket_attr.py: 0 replacement(s)
---
TOTAL: 0 replacement(s) across 0 file(s)
--- tree diff after second pass ---
(empty diff = second run was a no-op)
Result 4/4 — grep audits + final occurrence list:
=== grep for double-rename artifact ===
OK: no assigned_tonewnew anywhere
=== grep for bare old token left behind ===
OK: no bare old token left
=== final occurrence list (new token) ===
/tmp_workspace/repo/test_ticket_attr.py:7:    assert obj.assigned_tonew_ == 'unassigned'
/tmp_workspace/repo/test_ticket_attr.py:10:    obj = TicketRecord(assigned_tonew_='abc')
/tmp_workspace/repo/test_ticket_attr.py:14:    assert get_attr(TicketRecord(assigned_tonew_='xyz')) == 'xyz'
/tmp_workspace/repo/test_ticket_attr.py:18:    assert obj.assigned_tonew_ == 'mut'
/tmp_workspace/repo/model.py:2:    def __init__(self, assigned_tonew_='unassigned'):
/tmp_workspace/repo/model.py:3:        self.assigned_tonew_ = assigned_tonew_
/tmp_workspace/repo/model.py:6:        return self.assigned_tonew_
/tmp_workspace/repo/ops.py:4:    return obj.assigned_tonew_
/tmp_workspace/repo/ops.py:7:    obj.assigned_tonew_ = value
/tmp_workspace/repo/ops.py:11:    return [it.assigned_tonew_ for it in items]
/tmp_workspace/repo/loader.py:4:    return [TicketRecord(assigned_tonew_=v) for v in values]
/tmp_workspace/repo/loader.py:9:    return items[0].assigned_tonew_

Step 5 — Final answer
Rename doneassigned_toassigned_tonew_ across /tmp_workspace/repo/:
  • model.py: 4 hits (ctor kwarg, attr assign w/ same-name RHS, show return)
  • ops.py: 3 hits (read, write, comprehension)
  • loader.py: 2 hits (ctor kwarg, attr read)
  • test_ticket_attr.py: 4 hits (attr checks + kwargs)
Verification:
  • diff -ru snapshot_before snapshot_after shows exactly the 13 intended line changes — no untouched-content drift.
  • pytest -q6 passed in 0.01s.
  • Second pass: 0 replacements, empty diff — proven no-op.
  • No assigned_tonewnew anywhere; no bare \bassigned_to\b left behind; all 13 expected sites carry the new token.
Why idempotency holds — the script uses re.compile(r'\bassigned_to\b') rather than substring replace. Inside the renamed token assigned_tonew_, the boundary after ...to sits between o and n (both \w), so \b cannot match; the regex produces zero hits on a second run. It also leaves unrelated identifiers like assigned_to_foo alone for the same reason.

Download

# Download all OpenClaw data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "openclaw/*"

# Download a specific category
hf download jindidi/eigendata-demo-data --repo-type dataset --include "openclaw/02_Code_Intelligence/*"

Browse on Hugging Face

View OpenClaw files