Want to try it first? A free, individually-verified sample is available on the Demo Samples page.
What MCP-Atlas is
A bundle for downstream RL training and reproducible evaluation:
- Multi-step, multi-server tool-use tasks that fan out across many MCP servers in a single goal-directed trajectory: research-style reads (wikipedia, arxiv, github), stateful writes (airtable, notion, mongodb), and filesystem/git work.
- Claim-based grading — each task ships 1–7 drift-resistant claims that define success without referencing volatile facts (star counts, prices, rankings), so the reward stays stable over time.
- Frozen environment — every task includes a captured tool-I/O sequence and before/after world snapshots, so external-API calls are served deterministically at replay time with no network drift or cross-task state leakage.
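
The frozen-environment idea can be sketched as a lookup keyed on the tool name and its arguments. This is a minimal illustration, not the bundle's actual replay harness: it assumes each `env.jsonl` line is a JSON object with `tool`, `arguments`, and `response` fields (field names inferred from the description, not confirmed), and `ReplayEnv` is a hypothetical name.

```python
import json
from collections import defaultdict, deque


def _key(tool, arguments):
    # Canonical JSON so that argument key order does not affect matching.
    return tool, json.dumps(arguments, sort_keys=True)


class ReplayEnv:
    """Hypothetical helper: serves recorded responses instead of live API calls."""

    def __init__(self, jsonl_lines):
        # Assumed schema: {"tool": ..., "arguments": {...}, "response": ...}
        self._recorded = defaultdict(deque)
        for line in jsonl_lines:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            key = _key(entry["tool"], entry["arguments"])
            self._recorded[key].append(entry["response"])

    def call(self, tool, arguments):
        queue = self._recorded.get(_key(tool, arguments))
        if not queue:
            raise KeyError(f"no recorded response for {tool} with {arguments}")
        # Identical calls replay in recorded order; the last response is reused.
        return queue.popleft() if len(queue) > 1 else queue[0]
```

Because every response comes from the recorded file, a rollout never touches the network. A call that was not captured raises an error rather than silently falling back to a live request.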
At a glance
| Property | Value |
|---|---|
| Tasks | 1,745 |
| Frozen tool-I/O entries | 29,245 |
| Drift-resistant claims | 5,040 (~3 per task) |
| Servers in graph | ~40 (mirroring the benchmark’s 36 servers / 220 tools, plus fixtures and stubs) |
| Per task | task.json + claims.json + trajectory.json (SFT) + env.jsonl (RL replay) + restorable env/workspace_{init,final}/ |
| Grading | per-claim LLM-as-judge; ship gate coverage ≥ 0.75 (matches the MCP-Atlas rubric) |
| Source | grounded proposer + real tool execution; tool data is real frozen API responses, not generated |
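
The ship gate can be read as a threshold on the fraction of claims the judge marks as satisfied. The sketch below assumes coverage is defined as passed claims divided by total claims and that each claim carries a boolean verdict. Both are assumptions, since the exact rubric formula is not spelled out here, and `coverage` and `passes_ship_gate` are hypothetical helper names.

```python
SHIP_GATE = 0.75  # threshold stated in the Grading row


def coverage(verdicts):
    """Fraction of claims judged as satisfied (assumed definition of coverage)."""
    if not verdicts:
        raise ValueError("a task needs at least one claim")
    return sum(1 for v in verdicts if v) / len(verdicts)


def passes_ship_gate(verdicts, threshold=SHIP_GATE):
    # The gate is inclusive: coverage equal to the threshold passes.
    return coverage(verdicts) >= threshold
```

Under this reading, a task with four claims passes with three satisfied, while a task with three claims needs all three, because two of three falls below 0.75.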
What’s inside
| Component | Description |
|---|---|
| Task bundles | One self-contained folder per task, keyed by a 16-hex task_id. Each is simultaneously an SFT trajectory and a replayable RL environment. |
| Reference trajectories | The full system / user / assistant (with reasoning + tool calls) / tool conversation (trajectory.json), suitable for SFT distillation. |
| Tool-I/O snapshots | Frozen (tool, arguments, response) triples in original execution order (env.jsonl) — the reference I/O used to replay external-API calls deterministically. |
| Claim rubrics | Drift-resistant, independently-verifiable assertions (claims.json) with verify_via metadata (exact_match / substring / count / presence) and a per-claim verdict. |
| Restorable world | env/workspace_init/ (before the agent acted) and env/workspace_final/ (the realized goal state) — init to start an RL rollout, final as the golden end state for write tasks. |
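
A quick completeness check for one task bundle might look like the sketch below. It assumes the folder is named after the task_id, that the four files sit at the top level of that folder, and that the hex digits are lowercase. None of these layout details are confirmed here, and `missing_parts` is a hypothetical helper.

```python
import re
from pathlib import Path

# File and directory names as listed for each task; exact placement is assumed.
REQUIRED_FILES = ("task.json", "claims.json", "trajectory.json", "env.jsonl")
REQUIRED_DIRS = ("env/workspace_init", "env/workspace_final")
TASK_ID_RE = re.compile(r"[0-9a-f]{16}")  # assumes lowercase hex


def missing_parts(bundle_dir):
    """Return a list of problems found in one task bundle (empty if complete)."""
    bundle = Path(bundle_dir)
    problems = []
    if not TASK_ID_RE.fullmatch(bundle.name):
        problems.append(f"folder name {bundle.name!r} is not a 16-hex task_id")
    for name in REQUIRED_FILES:
        if not (bundle / name).is_file():
            problems.append(f"missing file {name}")
    for name in REQUIRED_DIRS:
        if not (bundle / name).is_dir():
            problems.append(f"missing directory {name}")
    return problems
```

Running this over every folder before training is a cheap way to catch a truncated download, since an SFT run needs only `trajectory.json` while an RL rollout also needs `env.jsonl` and the workspace snapshots.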
Task composition
Tasks mix three classes of server, by how the environment is reconstructed:

| Composition | Tasks | Share |
|---|---|---|
| Mixed (live + local) | 1,604 | 91.9% |
| Pure local (no live API; fully reproducible without cache hits) | 117 | 6.7% |
| Pure live (no fixture/sandbox) | 24 | 1.4% |
Trajectory length
Per-task agent-side complexity:

| Metric | mean | median | p90 | max |
|---|---|---|---|---|
| Assistant turns (steps) | 12.1 | 11 | 21 | 30 |
| Tool calls per task | 16.8 | 16 | 28 | 78 |
| Distinct servers per task | 4.0 | 4 | 5 | 9 |
| Claims per task | 2.9 | 3 | 5 | 7 |
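
The means in this table can be cross-checked against the totals in the At a glance table. The check assumes each frozen tool-I/O entry corresponds to exactly one tool call, which is an assumption rather than something stated explicitly.

```python
# Totals copied from the At a glance table.
TASKS = 1745
FROZEN_TOOL_IO_ENTRIES = 29245
CLAIMS = 5040

# Assumes one frozen tool-I/O entry per tool call.
mean_tool_calls = round(FROZEN_TOOL_IO_ENTRIES / TASKS, 1)
mean_claims = round(CLAIMS / TASKS, 1)

print(mean_tool_calls, mean_claims)  # prints "16.8 2.9"
```

Both values match the reported means for tool calls per task and claims per task.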
Tool usage
By task count, filesystem dominates as a sink for written artifacts, while github and wikipedia dominate as research-style sources. The most-used of the ~40 servers, with the number of tasks that touch each one and the total tool calls made to it:
| Server | Tasks | Tool calls |
|---|---|---|
| filesystem | 1,590 | 4,297 |
| github | 785 | 5,679 |
| wikipedia | 735 | 3,963 |
| twelvedata | 534 | 2,150 |
| slack | 524 | 1,275 |
| arxiv | 500 | 3,409 |
| airtable | 429 | 2,311 |
| google-workspace | 414 | 1,386 |
| brave-search | 310 | 1,006 |