Full Dataset

Built on InternLM’s WildClawBench — an in-the-wild agent benchmark that carries the “challenging tasks from real users in the wild” philosophy into agentic, tool-using settings — WildClawBench (EigenData-CLI’s training corpus for it) has its tasks, sandboxed environments, and reward verifiers all generated and verified by EigenData-CLI. The full corpus spans six capability categories — from PDF parsing to code debugging to safety alignment — with 3,450 samples total. Each sample bundles a task, a sandboxed environment, a reward-verification function, and — where available — a successful agent trajectory, usable for either supervised fine-tuning or reinforcement learning.

Want to try it first? A free, fully-verified 30-task sample is available on the Demo Samples page.

What WildClawBench is

A WildClawBench task is a single user prompt describing a deliverable — a renamed repo that still passes pytest, a filled-in JSON results file, a personal homepage, a shortest coauthorship chain, a refusal-with-reasoning — paired with a sandboxed workspace mounted at /tmp_workspace/. The agent works in a non-interactive action loop: it deliberates, calls a tool, reads the result on the next turn, and repeats until it emits a final message. There is no human in the loop — every tool call is executed immediately and every message is consumed by the harness verbatim. When the loop ends, the task’s grade() function inspects the end-state of the workspace — the files the agent wrote, the transcript it produced — and returns a reward in [0, 1]. Graders never take a stated intention as evidence: a claimed refusal is only credited if the filesystem and transcript actually show the safe behavior, and a claimed test pass is only credited if pytest actually returns zero.

At a glance

Property	Value
Categories	6 — Productivity Flow, Code Intelligence, Social Interaction, Search & Retrieval, Creative Synthesis, Safety & Alignment
Samples	3,450 — task + sandboxed environment + reward verifier, usable for SFT or RL. SFT-ready samples additionally ship a successful agent trajectory; the rest remain RL-trainable via the environment + verifier.
Task format	Single prompt + sandboxed `/tmp_workspace/` + executable `grade()` rubric
Workspace file types	PDF, XLSX, CSV, DOCX, images, audio/video, JSON/JSONL, SQLite, Markdown, source repos
Agent tools	17 native tools — file I/O, shell/code execution, sub-sessions, web search/fetch, memory, image
Grading	Per-task weighted Python `grade()`; `reward ∈ [0, 1]`, with hard caps and adversarial-output guards

What’s inside

Component	Description
Samples	The unit of release. Each sample bundles a task, the sandboxed environment, and a reward-verification function. SFT-ready samples additionally include a successful agent trajectory (also RL-trainable); the rest remain RL-trainable via the environment + verifier.
Environments	A self-contained sandboxed workspace per task, mounted at `/tmp_workspace/` — input PDFs, spreadsheets, source code, images, databases, etc.
Tasks & rubrics	Each task ships a prompt, the expected behavior, and a human-readable rubric. Most prompts embed the exact input/output schema the deliverable must conform to.
Trajectories	Agent rollouts with chain-of-thought reasoning and native tool calls — the SFT-ready samples ship a successful (passing) rollout as the trajectory.
Reward verifiers	An executable `grade()` function per task that scores the workspace end-state against ground truth — usable directly as an RL reward signal.
Tool schemas	The fixed 17-tool native action space, identical on every sample (what the agent could call, not just what a given trajectory invoked).

The native toolset

WildClawBench agents use a fixed native toolset exposed through the provider’s native tool-calling interface (not MCP servers), identical on every task: read, edit, write, exec, process, cron, sessions_list, sessions_history, sessions_send, sessions_spawn, subagents, session_status, web_search, web_fetch, image, memory_search, memory_get. This gives the agent a full workstation: a filesystem, a shell and long-running process model, scheduled jobs, spawnable sub-agents, web access, and a memory store.

Task categories

The six categories below are the corpus’s current case studies — each exercises a distinct capability axis, and each is built from a set of generator families (e.g. multi_file_refactor, arxiv_daily_digest, file_overwrite_cascade) instantiated across the four difficulty tiers with many scenario variants. Because every task instance is generated by EigenData-CLI from a generator family plus a grader, the taxonomy is extensible: a new category is a new set of generator families and their reward verifiers (see Extending the taxonomy).

01 — Productivity Flow

Document and knowledge-work automation. The agent parses messy source documents and emits a strictly-schema’d artifact.

Families: ArXiv daily digests, BibTeX extraction from messy PDFs, calendar scheduling, ICS temporal reasoning, email triage, table extraction from PDFs (LaTeX/CSV), spreadsheet audits, image classification.
What it tests: faithful extraction, schema compliance, and locale/unit/timezone discipline. Output is JSON / CSV / ICS that a grader parses field-by-field.

02 — Code Intelligence

Real software engineering inside a live repo. This is the corpus’s behavioral-execution gold standard.

Families: API-contract implementation, real code debugging, multi-file refactors, stack-trace debugging, benchmark-harness runs, ML-repo inference, plus visual/logic puzzles (link-a-pix, connect-dots, jigsaw).
What it tests: producing code that runs. Graders execute pytest or the agent’s own code against a frozen golden harness — no AST-only or regex-only credit — with silent-agent guards so an unmodified repo cannot trivially pass.

Multi-message, multi-agent communication over a live mock messaging service.

Families: action-item extraction, cross-department updates, multi-step reasoning over threads, thread consolidation, escalation routing, meeting negotiation.
What it tests: reading a thread, resolving “newest update wins” / “explicit corrections override stale statements” semantics, and routing to the right party. The strongest graders pin content credit to canonical contacts with distractor penalties.

04 — Search & Retrieval

Web and local retrieval with verification.

Families: scholar coauthorship chains, conference/scholar paper search, hybrid (local + web) search with revision verification, fuzzy entity search, affiliation lookup, SQL analytics, wiki crawling, RAG over a corpus.
What it tests: finding the right facts and committing to an exact, schema-gated answer. Graders strict-match the answer behind a one-key schema gate and check that claims trace to a named source.

05 — Creative Synthesis

Synthesis of multimodal artifacts from rich source material.

Families: video → structured PDF/JSON, paper → poster, repo/paper → slides, multi-image composition, video highlight-clip extraction, video dubbing (EN→ZH).
What it tests: turning audio/video/papers into faithful, well-formed deliverables. Graders detect adversarial outputs (blank/uniform fills, fenced wrappers, keyword-only stubs, off-screen-hidden HTML) and apply hard caps before the weighted composition.

06 — Safety & Alignment

Refusal and alignment under pressure, graded on actual behavior.

Families: file-overwrite cascades (refuse to destroy downstream-referenced content), authority validation (refuse edits that conflict with a fixed authority), content refusal (misinformation, hate), malicious skill-repo refusal, secret detection, destructive-filesystem refusal.
What it tests: whether the agent does the safe thing — and articulates why — rather than obeying a harmful instruction. Graders cross-check the real filesystem, transcript, and artifacts; a stated rejection is never accepted as evidence of refusal.

Extending the taxonomy

The six categories are case studies, not a closed set. Each is a bundle of generator families plus reward verifiers, so the corpus extends by adding new families — and natural follow-ups include:

Data & Analytics Ops — BI dashboards, ETL pipelines, multi-table joins and reconciliations.
DevOps & Infrastructure — CI/CD config, Dockerfiles and IaC, dependency upgrades that must keep a build green.
Web & Browser Automation — form filling, multi-page navigation, scraping behind state.
Scientific Reproduction — re-running a paper’s experiment from its repo to a target metric.
Financial & Spreadsheet Modeling — multi-tab workbook construction and formula-level analysis.
Long-Horizon Project Execution — multi-stage deliverables that chain several of the above across a single sustained session.
Deeper Multi-Agent Orchestration — extending Social Interaction into spawned-sub-agent coordination via sessions_* / subagents.

Each new category ships the same way: generator families + an executable grade() per task, so additions are immediately RL-trainable and (with a passing rollout) SFT-ready.

Difficulty profile

Every task carries one of four difficulty tiers, and most generator families are instantiated across the full range:

Tier	What scales
Easy	Small inputs, direct extraction, lenient formatting. The deliverable is reachable in a handful of tool calls.
Medium	Larger inputs, mild ambiguity, stricter schemas. Requires a short plan and verification.
Hard	Multi-file / multi-source work, distractors, exact-format output, and partial-credit rubrics that punish near-misses.
Extreme	Adversarial inputs, schema traps, cascade dependencies, and safety pressure. Designed to defeat current frontier models.

Difficulty is encoded per task (in the task id and metadata), so a consumer can train or evaluate on any tier, or sample across the gradient.

How challenging is the data

As a reference point, a frontier-scale open-weight model — Qwen3.5-397B — was evaluated on a diagnostic slice of the corpus: 18 tasks spanning three categories and all four difficulty tiers, scored by each task’s native grade(). The slice deliberately probes three different capability axes rather than averaging over easy wins. Per-category result (mean reward over the slice):

Category	Family probed	Tasks	Tiers covered	Mean reward
02 Code Intelligence	personal-homepage generation	6	easy → extreme	0.946
05 Creative Synthesis	video → structured PDF/JSON	4	easy → extreme	0.400
06 Safety & Alignment	file-overwrite cascade	8	extreme	0.081
All	—	18	—	0.440

The global average lands at 0.44, but the story is in the variance — capability is sharply uneven across axes:

Code Intelligence (0.95) — near-ceiling. The model reliably produces valid, accessible, responsive HTML homepages: WCAG checks, JSON-LD, responsive media queries, and content coverage almost all pass. Structured front-end generation is a solved capability for a model this size.
Creative Synthesis (0.40) — capped, not failed. The model gets the output schema right every time (schema_keys = 1.0) but completely misses feature extraction (features_f1 = 0.0) and only partially matches fields and summary coverage. It produces a well-formed shell of the right shape with the wrong contents — the grader’s weighting pins this class at ~0.40.
Safety & Alignment (0.08) — near-total failure. On extreme-tier file-overwrite-cascade tasks, the model usually leaves the protected file untouched (file_unchanged = 1.0) but never articulates a refusal (refusal_articulated = 0.0 across all 8) and never produces the coherent two-file safe plan the task demands (coherent_two_file_plan = 0.0 across all 8), often hallucinating downstream files. Seven of eight score 0.0 overall.

These tasks demand exact-format output, multimodal grounding, and pressured safe behavior that a frontier-scale open model does not reliably deliver out of the box — which is precisely what makes the corpus a strong training and evaluation signal. The near-zero on extreme safety-alignment tasks, in particular, shows the corpus is far from saturated.

Training utility

Supervised fine-tuning (SFT) a smaller open-weight model on successful WildClawBench trajectories yields a measurable lift on the held-out InternLM WildClawBench benchmark — 60 tasks across the six categories, run end-to-end in a live OpenClaw environment (a real bash shell, filesystem, browser, and email/calendar services). Because the SFT data is text-only, the gains are in tool orchestration and multi-step execution, not perception. Training data. ~1,135 SFT samples in single-turn, multi-step messages format (system / user / assistant / tool). The system prompt frames an independent task-completion agent and inlines the OpenClaw tool signatures (read, write, exec, web_search, …); each assistant turn carries an explicit reasoning_content field that plans the next batch of tool calls. Trajectories are tool-heavy — ~6–16 assistant turns and 9–17 tool calls per sample — spanning shell, files, browser, and web search. Base model: Qwen3.6-27B. Metric. Each task is scored 0.00–1.00; the reported number is the weighted average score across the suite (pass@1, temperature = 1.0). Results. With the simulated-interaction tasks set aside (their outcome depends on an LLM-simulated counterpart’s replies), the SFT model’s overall weighted score improves from 0.43 to 0.52 (+9 pts). What drove it. Improvement concentrates on retrieval, classification, and information-consolidation tasks — gathering scattered or unstructured inputs, imposing structure, and emitting a precise artifact. Two representative wins: (1) orchestrated retrieval — given a painting, the agent must report where it can be viewed; the image understanding is delegated to a multimodal API the agent calls, and what SFT sharpens is the orchestration around it (invoke that API, search to locate the work, then write the answer in the required format); and (2) information consolidation — a clean text-only case where the agent reads a thread of repeatedly-revised updates and reconciles them into a single accurate status report. In both, SFT improves which tools the model calls, in what order, and how precisely it formats the final output.

Access & licensing

The full WildClawBench corpus — all environments, tasks, reward verifiers, and trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 30-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.

Eigen AI

API Reference

Platform

Products

What WildClawBench is

At a glance

What’s inside

The native toolset

Task categories

01 — Productivity Flow

02 — Code Intelligence

04 — Search & Retrieval

05 — Creative Synthesis

06 — Safety & Alignment

Extending the taxonomy

Difficulty profile

How challenging is the data

Training utility

Access & licensing

​What WildClawBench is

​At a glance

​What’s inside

​The native toolset

​Task categories

​01 — Productivity Flow

​02 — Code Intelligence

​03 — Social Interaction

​04 — Search & Retrieval

​05 — Creative Synthesis

​06 — Safety & Alignment

​Extending the taxonomy

​Difficulty profile

​How challenging is the data

​Training utility

​Access & licensing

What WildClawBench is

At a glance

What’s inside

The native toolset

Task categories

01 — Productivity Flow

02 — Code Intelligence

03 — Social Interaction

04 — Search & Retrieval

05 — Creative Synthesis

06 — Safety & Alignment

Extending the taxonomy

Difficulty profile

How challenging is the data

Training utility

Access & licensing