- Environment — the simulated world state (MCP server snapshots, databases, or filesystems) that the agent operates in
- Data — generated samples including intents, datapoints, evaluators, and reference payloads
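As a rough illustration of how the two components above might be represented in code, here is a minimal Python sketch. Every class and field name (`Environment`, `Sample`, `kind`, `state`, `evaluate`, `exact_match`) is a hypothetical chosen for this example, as is the sample content. The actual schema of the released samples is not described here and may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Environment:
    """Simulated world state the agent operates in (hypothetical shape)."""
    kind: str  # e.g. "mcp_snapshot", "database", or "filesystem"
    state: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Sample:
    """One generated sample (hypothetical field names)."""
    intent: str
    datapoint: Dict[str, Any]
    reference_payload: Dict[str, Any]
    # Each evaluator compares an agent output against the reference payload.
    evaluators: List[Callable[[Dict[str, Any], Dict[str, Any]], bool]] = field(
        default_factory=list
    )

    def evaluate(self, output: Dict[str, Any]) -> bool:
        """Return True only if every evaluator accepts the output."""
        return all(ev(output, self.reference_payload) for ev in self.evaluators)


def exact_match(output: Dict[str, Any], reference: Dict[str, Any]) -> bool:
    """Illustrative evaluator: the output must equal the reference exactly."""
    return output == reference


# Made-up example values, for illustration only.
env = Environment(kind="database", state={"orders": {"123": "paid"}})
sample = Sample(
    intent="Refund order 123",
    datapoint={"order_id": "123"},
    reference_payload={"status": "refunded"},
    evaluators=[exact_match],
)
```

Calling `sample.evaluate({"status": "refunded"})` passes the output through each evaluator in turn. Note that in this sketch a sample with no evaluators accepts any output.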
Available Datasets
APEX Agent
Professional knowledge work across investment banking, law, and management consulting, synthesized from scratch and inspired by the APEX benchmark.
Personal Agent Bench
Long-horizon personal knowledge-work on a simulated laptop — tax packets, federal returns, reimbursements, and subscription audits across an 8-app environment.
Tau2-Bench
Multi-turn, policy-grounded customer-service dialogs across airline, telecom, and retail, with tool use and machine-checkable success criteria.
Tau3-Bench
Hard, single-domain retail-banking dialogs with dynamically discoverable tools — the agent must search a knowledge base and unlock the right tool at runtime.
Enterprise Bench
Long-horizon agent tasks inside realistic simulated companies — operate the business or answer questions across up to ~40 connected SaaS systems sharing one world state.
WildClawBench
Agentic, tool-using tasks across six capability categories — from PDF parsing to code debugging to safety alignment — built on InternLM’s WildClawBench.
MCP-Atlas
Multi-step, multi-server tool-use tasks over a ~40-server MCP graph — each frozen with a claims-based reward and a replayable environment snapshot. Built on the MCP-Atlas benchmark.
MCPMark
Synthetic, agentic filesystem + GitHub tasks with deterministic Python verifiers — repo archaeology, cross-file joins, and stateful MCP actions, runnable fully offline.
Google Workspace
Everyday Google Workspace tasks — managing emails, calendars, sheets, and contacts across diverse personal and professional scenarios.
Download
The free demo samples are hosted on Hugging Face, where you can browse, view, and download all of them.
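For a scripted download, one possible approach is the `snapshot_download` function from the `huggingface_hub` library. The sketch below assumes the samples live in a dataset-type repository. The repository id is not given in this document, so `PLACEHOLDER_REPO_ID` is a deliberate placeholder, and the helper name `download_demo_samples` and the default target directory are likewise made up for this example.

```python
from pathlib import Path

# Placeholder only: this page does not name the Hugging Face repository.
PLACEHOLDER_REPO_ID = "<org>/<demo-dataset>"


def download_demo_samples(repo_id: str, local_dir: str = "demo_samples") -> Path:
    """Download all files of a Hugging Face dataset repo into local_dir."""
    if "<" in repo_id or ">" in repo_id:
        raise ValueError("Replace the placeholder with the real dataset repo id.")

    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    # Assumption: the demo samples are published as a dataset-type repo.
    path = snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
    )
    return Path(path)
```

Call `download_demo_samples("...")` with the repository id shown on the Hugging Face page. Passing the unchanged placeholder raises `ValueError` before any network request is made.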
License
The demo samples are released under CC BY-NC-ND 4.0.
- For demonstration and evaluation purposes only
- No commercial use
- No redistribution or derivative works
- No use for model training