Want to try it first? A free, fully-verified 30-task sample is available on the Demo Samples page.
What WildClawBench is
A WildClawBench task is a single user prompt describing a deliverable — a renamed repo that still passespytest, a filled-in JSON results file, a personal homepage, a shortest coauthorship chain, a refusal-with-reasoning — paired with a sandboxed workspace mounted at /tmp_workspace/. The agent works in a non-interactive action loop: it deliberates, calls a tool, reads the result on the next turn, and repeats until it emits a final message. There is no human in the loop — every tool call is executed immediately and every message is consumed by the harness verbatim.
When the loop ends, the task’s grade() function inspects the end-state of the workspace — the files the agent wrote, the transcript it produced — and returns a reward in [0, 1]. Graders never take a stated intention as evidence: a claimed refusal is only credited if the filesystem and transcript actually show the safe behavior, and a claimed test pass is only credited if pytest actually returns zero.
At a glance
| Property | Value |
|---|---|
| Categories | 6 — Productivity Flow, Code Intelligence, Social Interaction, Search & Retrieval, Creative Synthesis, Safety & Alignment |
| Samples | 3,450 — task + sandboxed environment + reward verifier, usable for SFT or RL. SFT-ready samples additionally ship a successful agent trajectory; the rest remain RL-trainable via the environment + verifier. |
| Task format | Single prompt + sandboxed /tmp_workspace/ + executable grade() rubric |
| Workspace file types | PDF, XLSX, CSV, DOCX, images, audio/video, JSON/JSONL, SQLite, Markdown, source repos |
| Agent tools | 17 native tools — file I/O, shell/code execution, sub-sessions, web search/fetch, memory, image |
| Grading | Per-task weighted Python grade(); reward ∈ [0, 1], with hard caps and adversarial-output guards |
What’s inside
| Component | Description |
|---|---|
| Samples | The unit of release. Each sample bundles a task, the sandboxed environment, and a reward-verification function. SFT-ready samples additionally include a successful agent trajectory (also RL-trainable); the rest remain RL-trainable via the environment + verifier. |
| Environments | A self-contained sandboxed workspace per task, mounted at /tmp_workspace/ — input PDFs, spreadsheets, source code, images, databases, etc. |
| Tasks & rubrics | Each task ships a prompt, the expected behavior, and a human-readable rubric. Most prompts embed the exact input/output schema the deliverable must conform to. |
| Trajectories | Agent rollouts with chain-of-thought reasoning and native tool calls — the SFT-ready samples ship a successful (passing) rollout as the trajectory. |
| Reward verifiers | An executable grade() function per task that scores the workspace end-state against ground truth — usable directly as an RL reward signal. |
| Tool schemas | The fixed 17-tool native action space, identical on every sample (what the agent could call, not just what a given trajectory invoked). |
The native toolset
WildClawBench agents use a fixed native toolset exposed through the provider’s native tool-calling interface (not MCP servers), identical on every task:read, edit, write, exec, process, cron, sessions_list, sessions_history, sessions_send, sessions_spawn, subagents, session_status, web_search, web_fetch, image, memory_search, memory_get.
This gives the agent a full workstation: a filesystem, a shell and long-running process model, scheduled jobs, spawnable sub-agents, web access, and a memory store.
Task categories
The six categories below are the corpus’s current case studies — each exercises a distinct capability axis, and each is built from a set of generator families (e.g.multi_file_refactor, arxiv_daily_digest, file_overwrite_cascade) instantiated across the four difficulty tiers with many scenario variants. Because every task instance is generated by EigenData-CLI from a generator family plus a grader, the taxonomy is extensible: a new category is a new set of generator families and their reward verifiers (see Extending the taxonomy).
01 — Productivity Flow
Document and knowledge-work automation. The agent parses messy source documents and emits a strictly-schema’d artifact.- Families: ArXiv daily digests, BibTeX extraction from messy PDFs, calendar scheduling, ICS temporal reasoning, email triage, table extraction from PDFs (LaTeX/CSV), spreadsheet audits, image classification.
- What it tests: faithful extraction, schema compliance, and locale/unit/timezone discipline. Output is JSON / CSV / ICS that a grader parses field-by-field.
02 — Code Intelligence
Real software engineering inside a live repo. This is the corpus’s behavioral-execution gold standard.- Families: API-contract implementation, real code debugging, multi-file refactors, stack-trace debugging, benchmark-harness runs, ML-repo inference, plus visual/logic puzzles (link-a-pix, connect-dots, jigsaw).
- What it tests: producing code that runs. Graders execute
pytestor the agent’s own code against a frozen golden harness — no AST-only or regex-only credit — with silent-agent guards so an unmodified repo cannot trivially pass.
03 — Social Interaction
Multi-message, multi-agent communication over a live mock messaging service.- Families: action-item extraction, cross-department updates, multi-step reasoning over threads, thread consolidation, escalation routing, meeting negotiation.
- What it tests: reading a thread, resolving “newest update wins” / “explicit corrections override stale statements” semantics, and routing to the right party. The strongest graders pin content credit to canonical contacts with distractor penalties.
04 — Search & Retrieval
Web and local retrieval with verification.- Families: scholar coauthorship chains, conference/scholar paper search, hybrid (local + web) search with revision verification, fuzzy entity search, affiliation lookup, SQL analytics, wiki crawling, RAG over a corpus.
- What it tests: finding the right facts and committing to an exact, schema-gated answer. Graders strict-match the answer behind a one-key schema gate and check that claims trace to a named source.
05 — Creative Synthesis
Synthesis of multimodal artifacts from rich source material.- Families: video → structured PDF/JSON, paper → poster, repo/paper → slides, multi-image composition, video highlight-clip extraction, video dubbing (EN→ZH).
- What it tests: turning audio/video/papers into faithful, well-formed deliverables. Graders detect adversarial outputs (blank/uniform fills, fenced wrappers, keyword-only stubs, off-screen-hidden HTML) and apply hard caps before the weighted composition.
06 — Safety & Alignment
Refusal and alignment under pressure, graded on actual behavior.- Families: file-overwrite cascades (refuse to destroy downstream-referenced content), authority validation (refuse edits that conflict with a fixed authority), content refusal (misinformation, hate), malicious skill-repo refusal, secret detection, destructive-filesystem refusal.
- What it tests: whether the agent does the safe thing — and articulates why — rather than obeying a harmful instruction. Graders cross-check the real filesystem, transcript, and artifacts; a stated rejection is never accepted as evidence of refusal.
Extending the taxonomy
The six categories are case studies, not a closed set. Each is a bundle of generator families plus reward verifiers, so the corpus extends by adding new families — and natural follow-ups include:- Data & Analytics Ops — BI dashboards, ETL pipelines, multi-table joins and reconciliations.
- DevOps & Infrastructure — CI/CD config, Dockerfiles and IaC, dependency upgrades that must keep a build green.
- Web & Browser Automation — form filling, multi-page navigation, scraping behind state.
- Scientific Reproduction — re-running a paper’s experiment from its repo to a target metric.
- Financial & Spreadsheet Modeling — multi-tab workbook construction and formula-level analysis.
- Long-Horizon Project Execution — multi-stage deliverables that chain several of the above across a single sustained session.
- Deeper Multi-Agent Orchestration — extending Social Interaction into spawned-sub-agent coordination via
sessions_*/subagents.
grade() per task, so additions are immediately RL-trainable and (with a passing rollout) SFT-ready.
Difficulty profile
Every task carries one of four difficulty tiers, and most generator families are instantiated across the full range:| Tier | What scales |
|---|---|
| Easy | Small inputs, direct extraction, lenient formatting. The deliverable is reachable in a handful of tool calls. |
| Medium | Larger inputs, mild ambiguity, stricter schemas. Requires a short plan and verification. |
| Hard | Multi-file / multi-source work, distractors, exact-format output, and partial-credit rubrics that punish near-misses. |
| Extreme | Adversarial inputs, schema traps, cascade dependencies, and safety pressure. Designed to defeat current frontier models. |
How challenging is the data
As a reference point, a frontier-scale open-weight model — Qwen3.5-397B — was evaluated on a diagnostic slice of the corpus: 18 tasks spanning three categories and all four difficulty tiers, scored by each task’s nativegrade(). The slice deliberately probes three different capability axes rather than averaging over easy wins.
Per-category result (mean reward over the slice):
| Category | Family probed | Tasks | Tiers covered | Mean reward |
|---|---|---|---|---|
| 02 Code Intelligence | personal-homepage generation | 6 | easy → extreme | 0.946 |
| 05 Creative Synthesis | video → structured PDF/JSON | 4 | easy → extreme | 0.400 |
| 06 Safety & Alignment | file-overwrite cascade | 8 | extreme | 0.081 |
| All | — | 18 | — | 0.440 |
- Code Intelligence (0.95) — near-ceiling. The model reliably produces valid, accessible, responsive HTML homepages: WCAG checks, JSON-LD, responsive media queries, and content coverage almost all pass. Structured front-end generation is a solved capability for a model this size.
- Creative Synthesis (0.40) — capped, not failed. The model gets the output schema right every time (
schema_keys = 1.0) but completely misses feature extraction (features_f1 = 0.0) and only partially matches fields and summary coverage. It produces a well-formed shell of the right shape with the wrong contents — the grader’s weighting pins this class at ~0.40. - Safety & Alignment (0.08) — near-total failure. On extreme-tier file-overwrite-cascade tasks, the model usually leaves the protected file untouched (
file_unchanged = 1.0) but never articulates a refusal (refusal_articulated = 0.0across all 8) and never produces the coherent two-file safe plan the task demands (coherent_two_file_plan = 0.0across all 8), often hallucinating downstream files. Seven of eight score0.0overall.
Training utility
Supervised fine-tuning (SFT) a smaller open-weight model on successful WildClawBench trajectories yields a measurable lift on the held-out InternLM WildClawBench benchmark — 60 tasks across the six categories, run end-to-end in a live OpenClaw environment (a real bash shell, filesystem, browser, and email/calendar services). Because the SFT data is text-only, the gains are in tool orchestration and multi-step execution, not perception. Training data. ~1,135 SFT samples in single-turn, multi-stepmessages format (system / user / assistant / tool). The system prompt frames an independent task-completion agent and inlines the OpenClaw tool signatures (read, write, exec, web_search, …); each assistant turn carries an explicit reasoning_content field that plans the next batch of tool calls. Trajectories are tool-heavy — ~6–16 assistant turns and 9–17 tool calls per sample — spanning shell, files, browser, and web search. Base model: Qwen3.6-27B.
Metric. Each task is scored 0.00–1.00; the reported number is the weighted average score across the suite (pass@1, temperature = 1.0).
Results. With the simulated-interaction tasks set aside (their outcome depends on an LLM-simulated counterpart’s replies), the SFT model’s overall weighted score improves from 0.43 to 0.52 (+9 pts).
What drove it. Improvement concentrates on retrieval, classification, and information-consolidation tasks — gathering scattered or unstructured inputs, imposing structure, and emitting a precise artifact. Two representative wins: (1) orchestrated retrieval — given a painting, the agent must report where it can be viewed; the image understanding is delegated to a multimodal API the agent calls, and what SFT sharpens is the orchestration around it (invoke that API, search to locate the work, then write the answer in the required format); and (2) information consolidation — a clean text-only case where the agent reads a thread of repeatedly-revised updates and reconciles them into a single accurate status report. In both, SFT improves which tools the model calls, in what order, and how precisely it formats the final output.