MCP-Atlas is a corpus of multi-step, multi-server tool-use tasks generated by EigenData-CLI to improve model performance on the MCP-Atlas benchmark. Each task draws on a ~40-server MCP graph (github, wikipedia, arxiv, filesystem, slack, airtable, notion, mongodb, twelvedata, …), is executed against a live MCP environment, and is then frozen so that it can be replayed deterministically. Each task pairs a natural-language request with (a) a claim-based reward rubric, (b) a frozen snapshot of every tool I/O it consumed, and (c) restorable environment snapshots, so the same sample can train an SFT model and seed a reproducible RL rollout.
Want to try it first? A free, individually-verified sample is available on the Demo Samples page.

What MCP-Atlas is

A bundle for downstream RL training and reproducible evaluation:
  • Multi-step, multi-server tool-use tasks that fan out across many MCP servers — research-style reads (wikipedia, arxiv, github), stateful writes (airtable, notion, mongodb), and filesystem/git work — in a single goal-directed trajectory.
  • Claim-based grading — each task ships 1–7 drift-resistant claims that define success without referencing volatile facts (star counts, prices, rankings), so the reward stays stable over time.
  • Frozen environment — every task includes a captured tool-I/O sequence and before/after world snapshots, so external-API calls are served deterministically at replay time with no network drift or cross-task state leakage.
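The deterministic replay described in the last bullet can be pictured as a lookup over recorded calls. The following is a minimal sketch, not the shipped replay server: it assumes each frozen record is a JSON object with keys named `tool`, `arguments`, and `response`, that the tool name `wikipedia.search` is a valid example (it is hypothetical), and that a repeated call should keep returning the last recorded response once the recording is exhausted.

```python
import json
from collections import defaultdict, deque


def _key(tool, arguments):
    # Canonicalise the arguments so that key order does not affect the lookup.
    return tool, json.dumps(arguments, sort_keys=True)


class FrozenToolReplay:
    """Serve recorded tool responses instead of calling a live API."""

    def __init__(self, jsonl_text):
        # One queue per (tool, arguments) pair keeps repeated calls in recorded order.
        self._responses = defaultdict(deque)
        for line in jsonl_text.splitlines():
            if line.strip():
                rec = json.loads(line)
                key = _key(rec["tool"], rec["arguments"])
                self._responses[key].append(rec["response"])

    def call(self, tool, arguments):
        queue = self._responses.get(_key(tool, arguments))
        if not queue:
            raise KeyError(f"no frozen response for {tool} with {arguments}")
        # Assumed policy: once only one response is left, keep returning it.
        return queue.popleft() if len(queue) > 1 else queue[0]


# Hypothetical records, for illustration only.
sample = "\n".join([
    json.dumps({"tool": "wikipedia.search", "arguments": {"q": "MCP"}, "response": "hit-1"}),
    json.dumps({"tool": "wikipedia.search", "arguments": {"q": "MCP"}, "response": "hit-2"}),
])
replay = FrozenToolReplay(sample)
first = replay.call("wikipedia.search", {"q": "MCP"})
second = replay.call("wikipedia.search", {"q": "MCP"})
```

Because no call in this sketch leaves the process, two rollouts that issue the same calls see the same responses, which is the property the frozen environment is meant to provide.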

At a glance

| Property | Value |
|---|---|
| Tasks | 1,745 |
| Frozen tool-I/O entries | 29,245 |
| Drift-resistant claims | 5,040 (~3 per task) |
| Servers in graph | ~40 (mirroring the benchmark’s 36 servers / 220 tools, plus fixtures and stubs) |
| Per task | task.json + claims.json + trajectory.json (SFT) + env.jsonl (RL replay) + restorable env/workspace_{init,final}/ |
| Grading | per-claim LLM-as-judge; ship gate coverage ≥ 0.75 (matches the MCP-Atlas rubric) |
| Source | grounded proposer + real tool execution; tool data is real frozen API responses, not generated |
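As a rough illustration of the ship gate, the sketch below assumes that "coverage" means the fraction of a task's claims that the judge marks as satisfied. That reading, and the function names, are assumptions rather than the documented rubric.

```python
def coverage(verdicts):
    """Fraction of claims marked as satisfied (assumed definition of coverage)."""
    if not verdicts:
        raise ValueError("a task needs at least one claim")
    return sum(1 for v in verdicts if v) / len(verdicts)


def passes_gate(verdicts, threshold=0.75):
    # The 0.75 threshold comes from the table; the comparison is inclusive.
    return coverage(verdicts) >= threshold


four_claims = passes_gate([True, True, True, False])   # 3 of 4 satisfied
three_claims = passes_gate([True, True, False])        # 2 of 3 satisfied
```

Under this reading, a four-claim task can miss one claim and still pass, while a three-claim task must satisfy all three, because 2/3 falls below 0.75.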

What’s inside

| Component | Description |
|---|---|
| Task bundles | One self-contained folder per task, keyed by a 16-hex task_id. Each is simultaneously an SFT trajectory and a replayable RL environment. |
| Reference trajectories | The full system / user / assistant (with reasoning + tool calls) / tool conversation (trajectory.json), suitable for SFT distillation. |
| Tool-I/O snapshots | Frozen (tool, arguments, response) triples in original execution order (env.jsonl); the reference I/O used to replay external-API calls deterministically. |
| Claim rubrics | Drift-resistant, independently-verifiable assertions (claims.json) with verify_via metadata (exact_match / substring / count / presence) and a per-claim verdict. |
| Restorable world | env/workspace_init/ (before the agent acted) and env/workspace_final/ (the realized goal state): init to start an RL rollout, final as the golden end state for write tasks. |
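A bundle laid out as described could be read with a few lines of standard-library Python. This is a sketch under assumptions: it takes the file names listed above to sit directly in the task folder, treats the folder name as the task_id, and makes no claim about the schema inside each JSON file. The helper names `load_bundle` and `start_rollout` are hypothetical.

```python
import json
import re
import shutil
from pathlib import Path

TASK_ID = re.compile(r"[0-9a-fA-F]{16}")  # assumed form of a 16-hex task_id


def load_bundle(folder):
    """Read the per-task files into one dict of parsed JSON plus snapshot paths."""
    folder = Path(folder)
    if not TASK_ID.fullmatch(folder.name):
        raise ValueError(f"{folder.name!r} is not a 16-hex task_id")
    bundle = {
        name: json.loads((folder / f"{name}.json").read_text(encoding="utf-8"))
        for name in ("task", "claims", "trajectory")
    }
    # env.jsonl holds one JSON record per line, in original execution order.
    lines = (folder / "env.jsonl").read_text(encoding="utf-8").splitlines()
    bundle["env"] = [json.loads(line) for line in lines if line.strip()]
    # Snapshots are directories, so expose their paths rather than loading them.
    bundle["workspace_init"] = folder / "env" / "workspace_init"
    bundle["workspace_final"] = folder / "env" / "workspace_final"
    return bundle


def start_rollout(bundle, scratch_dir):
    """Copy the pre-action snapshot into a new directory for one rollout."""
    # copytree requires that scratch_dir does not exist yet.
    return Path(shutil.copytree(bundle["workspace_init"], scratch_dir))
```

Copying workspace_init into a fresh directory for each rollout keeps the original snapshot untouched, so every rollout starts from the same state.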

Task composition

Tasks fall into three groups, according to whether the servers they call are live APIs, local fixtures/sandboxes, or both:
| Composition | Tasks | Share |
|---|---|---|
| Mixed (live + local) | 1,604 | 91.9% |
| Pure local (no live API; fully reproducible without cache hits) | 117 | 6.7% |
| Pure live (no fixture/sandbox) | 24 | 1.4% |
The pure-local subset is the strongest reproducibility tier: every call is served by a deterministic local process, so these tasks can be rerun indefinitely without network access.
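The three-way split can be expressed as a small classifier over the set of servers a task touches. The sketch below is illustrative only: the source does not list which servers count as local, so the `LOCAL_SERVERS` set is a hypothetical placeholder. The share calculation simply reproduces the percentages from the task counts in the table.

```python
# Hypothetical split for illustration; the real local/live assignment is not given.
LOCAL_SERVERS = frozenset({"filesystem", "mongodb"})


def composition(servers, local_servers=LOCAL_SERVERS):
    """Classify a task by whether its servers are local, live, or both."""
    servers = set(servers)
    if not servers:
        raise ValueError("a task must touch at least one server")
    local = servers & set(local_servers)
    if local == servers:
        return "pure local"
    if not local:
        return "pure live"
    return "mixed"


def shares(counts):
    """Convert per-group task counts into percentages rounded to one decimal."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


table_counts = {"mixed": 1604, "pure local": 117, "pure live": 24}
table_shares = shares(table_counts)
```

The counts sum to 1,745, the corpus size, and the computed shares match the 91.9%, 6.7%, and 1.4% figures in the table.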

Trajectory length

Per-task agent-side complexity:
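| Metric | Mean | Median | p90 | Max |
|---|---|---|---|---|
| Assistant turns (steps) | 12.1 | 11 | 21 | 30 |
| Tool calls per task | 16.8 | 16 | 28 | 78 |
| Distinct servers per task | 4.0 | 4 | 5 | 9 |
| Claims per task | 2.9 | 3 | 5 | 7 |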
Tasks use roughly 1.7× as many tool calls as the upstream benchmark’s gold trajectories: the proposer prefers multi-server plans, and real execution often needs retries that the proposer did not anticipate.
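Statistics of this kind could be recomputed from the reference trajectories along the following lines. This is a sketch under assumptions: it supposes that trajectory.json is a list of messages with a `role` key and that assistant messages carry their calls in a `tool_calls` list (both key names are assumed), and it uses the nearest-rank definition of p90, which may differ from the one used for the table.

```python
import math
import statistics


def trajectory_metrics(messages):
    """Count assistant turns and tool calls in one conversation."""
    assistant = [m for m in messages if m.get("role") == "assistant"]
    # `tool_calls` is an assumed key; a missing or null value counts as zero calls.
    calls = sum(len(m.get("tool_calls") or []) for m in assistant)
    return {"assistant_turns": len(assistant), "tool_calls": calls}


def summarize(values):
    """Mean, median, nearest-rank p90, and max of a non-empty list of numbers."""
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least 90% of the data at or below it.
    rank = math.ceil(len(ordered) * 90 / 100)
    return {
        "mean": round(statistics.mean(ordered), 1),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
        "max": ordered[-1],
    }


# Made-up per-task tool-call counts, not corpus data.
example = summarize([9, 12, 14, 16, 16, 17, 19, 22, 28, 40])
```

Reporting the median and p90 next to the mean shows how far a long tail of retry-heavy tasks pulls the average upward.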

Tool usage

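filesystem dominates as the sink for written artifacts, while github and wikipedia dominate as research-style sources. The table lists the nine most-used of the ~40 servers, with the number of tasks that touch each server and the total tool calls made to it: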
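| Server | Tasks | Tool calls |
|---|---|---|
| filesystem | 1,590 | 4,297 |
| github | 785 | 5,679 |
| wikipedia | 735 | 3,963 |
| twelvedata | 534 | 2,150 |
| slack | 524 | 1,275 |
| arxiv | 500 | 3,409 |
| airtable | 429 | 2,311 |
| google-workspace | 414 | 1,386 |
| brave-search | 310 | 1,006 |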
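The two columns count different things, which a short aggregation makes explicit. The sketch below assumes each frozen record carries a `tool` key whose value is prefixed with the server name and a dot (for example `github.search_code`). Both the key and the naming convention are hypothetical.

```python
from collections import Counter


def server_usage(tasks):
    """tasks maps task_id to its list of frozen records.

    Returns (server, tasks touching it, total tool calls) sorted by task count.
    """
    task_counts, call_counts = Counter(), Counter()
    for records in tasks.values():
        servers = [rec["tool"].split(".", 1)[0] for rec in records]
        call_counts.update(servers)        # every call counts once
        task_counts.update(set(servers))   # each task counts once per server
    return [(s, n, call_counts[s]) for s, n in task_counts.most_common()]


# Hypothetical records, for illustration only.
usage = server_usage({
    "task-a": [{"tool": "github.search_code"}, {"tool": "github.get_file"},
               {"tool": "filesystem.write_file"}],
    "task-b": [{"tool": "filesystem.write_file"}],
})
```

The distinction explains why the ranking by tasks and the ranking by calls differ. In the table, github receives about 7.2 calls per task that touches it (5,679 / 785), against about 2.7 for filesystem (4,297 / 1,590).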

Access & licensing

A free demo sample is available now under CC BY-NC-ND 4.0 — see Demo Samples. The full corpus is available for commercial licensing; contact support@eigenai.com.