Full Dataset

MCP-Atlas is a corpus of multi-step, multi-server tool-use tasks generated by EigenData-CLI to improve model performance on the MCP-Atlas benchmark. Each task draws on a ~40-server MCP graph (github, wikipedia, arxiv, filesystem, slack, airtable, notion, mongodb, twelvedata, …), is executed against a live MCP environment, and is then frozen so it can be replayed deterministically. Each task pairs a natural-language request with (a) a claim-based reward rubric, (b) a frozen snapshot of every tool I/O it consumed, and (c) restorable environment snapshots, so the same sample can both train an SFT model and seed a reproducible RL rollout.

Want to try it first? A free, individually verified sample is available on the Demo Samples page.

What MCP-Atlas is

A bundle for downstream RL training and reproducible evaluation:

Multi-step, multi-server tool-use tasks that fan out across many MCP servers — research-style reads (wikipedia, arxiv, github), stateful writes (airtable, notion, mongodb), and filesystem/git work — in a single goal-directed trajectory.
Claim-based grading — each task ships 1–7 drift-resistant claims that define success without referencing volatile facts (star counts, prices, rankings), so the reward stays stable over time.
Frozen environment — every task includes a captured tool-I/O sequence and before/after world snapshots, so external-API calls are served deterministically at replay time with no network drift or cross-task state leakage.
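
To illustrate how a captured tool-I/O sequence can stand in for live APIs at replay time, here is a minimal sketch of a lookup keyed by tool name and arguments. It assumes each frozen entry is a record with `tool`, `arguments`, and `response` fields; the `ReplayEnv` class, the toy tool name, and the retry policy are hypothetical and not the corpus's actual replay harness.

```python
import json
from collections import defaultdict, deque


def _key(tool, arguments):
    # Canonical JSON so that argument order does not affect the lookup.
    return tool, json.dumps(arguments, sort_keys=True)


class ReplayEnv:
    """Serve recorded tool responses instead of calling live APIs (hypothetical helper)."""

    def __init__(self, entries):
        # Queue responses per (tool, arguments) so repeated identical calls
        # are answered in their original execution order.
        self._queues = defaultdict(deque)
        for entry in entries:
            key = _key(entry["tool"], entry["arguments"])
            self._queues[key].append(entry["response"])

    def call(self, tool, arguments):
        queue = self._queues.get(_key(tool, arguments))
        if not queue:
            raise KeyError(f"no frozen response for {tool} with {arguments}")
        # Keep the last response available so extra retries stay deterministic.
        return queue.popleft() if len(queue) > 1 else queue[0]


# Toy entries standing in for a frozen tool-I/O sequence (assumed schema).
entries = [
    {"tool": "wikipedia.search", "arguments": {"query": "MCP"}, "response": {"hits": 3}},
    {"tool": "wikipedia.search", "arguments": {"query": "MCP"}, "response": {"hits": 4}},
]
env = ReplayEnv(entries)
first = env.call("wikipedia.search", {"query": "MCP"})
second = env.call("wikipedia.search", {"query": "MCP"})
```

Because no network call is made, the same sequence of calls yields the same responses on every run; a call with no recorded match fails loudly instead of silently reaching a live API.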

At a glance

Property	Value
Tasks	1,745
Frozen tool-I/O entries	29,245
Drift-resistant claims	5,040 (~3 per task)
Servers in graph	~40 (mirroring the benchmark’s 36 servers / 220 tools, plus fixtures and stubs)
Per task	`task.json` + `claims.json` + `trajectory.json` (SFT) + `env.jsonl` (RL replay) + restorable `env/workspace_{init,final}/`
Grading	per-claim LLM-as-judge; a task ships only if its coverage is ≥ 0.75 (matches the MCP-Atlas rubric)
Source	grounded proposer + real tool execution; tool data is real frozen API responses, not generated
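
The ship gate in the Grading row can be read as a simple threshold on per-claim verdicts. The sketch below assumes coverage is the mean of per-claim scores in [0, 1], with 1.0 for a satisfied claim; the function names are hypothetical, and the actual judge may weight or score claims differently.

```python
def coverage(scores):
    """Mean per-claim score; each score is assumed to lie in [0, 1]."""
    if not scores:
        raise ValueError("a task needs at least one claim")
    return sum(scores) / len(scores)


def passes_ship_gate(scores, threshold=0.75):
    # The 0.75 threshold comes from the Grading row; the averaging rule is assumed.
    return coverage(scores) >= threshold


three_of_four = passes_ship_gate([1.0, 1.0, 1.0, 0.0])
two_of_three = passes_ship_gate([1.0, 1.0, 0.0])
```

Under this reading, a four-claim task with one failed claim sits exactly at the threshold and passes, while a three-claim task with one failed claim does not.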

What’s inside

Component	Description
Task bundles	One self-contained folder per task, keyed by a 16-hex `task_id`. Each is simultaneously an SFT trajectory and a replayable RL environment.
Reference trajectories	The full `system` / `user` / `assistant` (with reasoning + tool calls) / `tool` conversation (`trajectory.json`), suitable for SFT distillation.
Tool-I/O snapshots	Frozen `(tool, arguments, response)` triples in original execution order (`env.jsonl`) — the reference I/O used to replay external-API calls deterministically.
Claim rubrics	Drift-resistant, independently verifiable assertions (`claims.json`) with `verify_via` metadata (`exact_match` / `substring` / `count` / `presence`) and a per-claim verdict.
Restorable world	`env/workspace_init/` (before the agent acted) and `env/workspace_final/` (the realized goal state) — `init` to start an RL rollout, `final` as the golden end state for write tasks.
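
As a rough illustration of how one task bundle might be read, the sketch below loads the JSON parts named in the table and records the two workspace paths. It assumes `env.jsonl` sits at the root of the task folder and holds one JSON object per line; `load_bundle` is a hypothetical helper, and the toy files it reads are placeholders rather than real corpus content.

```python
import json
import tempfile
from pathlib import Path


def load_bundle(task_dir):
    """Read the JSON parts of one task folder into a dict (schema-agnostic)."""
    task_dir = Path(task_dir)
    bundle = {
        name: json.loads((task_dir / f"{name}.json").read_text(encoding="utf-8"))
        for name in ("task", "claims", "trajectory")
    }
    # One JSON object per line; blank lines are skipped.
    with (task_dir / "env.jsonl").open(encoding="utf-8") as fh:
        bundle["env"] = [json.loads(line) for line in fh if line.strip()]
    bundle["workspace_init"] = task_dir / "env" / "workspace_init"
    bundle["workspace_final"] = task_dir / "env" / "workspace_final"
    return bundle


# Build a toy bundle in a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "0123456789abcdef"  # toy 16-hex task_id
    (root / "env" / "workspace_init").mkdir(parents=True)
    (root / "env" / "workspace_final").mkdir(parents=True)
    for name in ("task", "claims", "trajectory"):
        (root / f"{name}.json").write_text(json.dumps({"kind": name}), encoding="utf-8")
    (root / "env.jsonl").write_text(
        '{"tool": "t", "arguments": {}, "response": 1}\n', encoding="utf-8"
    )
    bundle = load_bundle(root)
    init_exists = bundle["workspace_init"].is_dir()
```

An SFT pipeline would mostly consume `bundle["trajectory"]`, while an RL harness would restore `workspace_init` and serve calls from `bundle["env"]`.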

Task composition

Tasks mix three classes of server, by how the environment is reconstructed:

Composition	Tasks	Share
Mixed (live + local)	1,604	91.9%
Pure local (no live API; fully reproducible without cache hits)	117	6.7%
Pure live (no fixture/sandbox)	24	1.4%

The pure-local subset is the strongest reproducibility tier: every call is served by a deterministic local process, so these tasks can be re-run any number of times without network access.
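
The three compositions can be derived from the set of servers a task touches. The sketch below is one way to do that; which servers count as live is an assumption made only for this example (the `LIVE` set is illustrative), and `composition` is a hypothetical helper.

```python
def composition(servers, live_servers):
    """Label a task by whether its servers are live, local, or both."""
    used = set(servers)
    if not used:
        raise ValueError("task uses no servers")
    live = used & set(live_servers)
    if live == used:
        return "pure live"
    if not live:
        return "pure local"
    return "mixed"


# Illustrative assumption only: treat these two servers as live APIs.
LIVE = {"github", "wikipedia"}

mixed = composition(["filesystem", "github"], LIVE)
local = composition(["filesystem"], LIVE)
live = composition(["github", "wikipedia"], LIVE)
```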

Trajectory length

Per-task agent-side complexity:

Metric	mean	median	p90	max
Assistant turns (steps)	12.1	11	21	30
Tool calls per task	16.8	16	28	78
Distinct servers per task	4.0	4	5	9
Claims per task	2.9	3	5	7

Tasks use roughly 1.7× as many tool calls as the upstream benchmark’s gold trajectories: the proposer prefers multi-server plans, and real execution often needs retries that the proposer did not anticipate.
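
For readers who want to recompute the table's columns on their own subset, the sketch below summarizes a list of per-task counts. The p90 uses the nearest-rank definition, which is an assumption (the table does not say which percentile method was used), and the sample values are made up.

```python
import statistics


def summarize(values):
    """Return mean, median, nearest-rank p90, and max of per-task counts."""
    ordered = sorted(values)
    n = len(ordered)
    rank = (90 * n + 99) // 100  # ceil(0.9 * n) in integer arithmetic
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
        "max": ordered[-1],
    }


# Made-up tool-call counts for ten tasks.
stats = summarize([4, 9, 11, 12, 16, 16, 18, 21, 28, 30])
```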

Tool usage

filesystem dominates as a sink for written artifacts; github and wikipedia dominate as research-style sources. For the most-used of the ~40 servers, the table lists the number of tasks that touch each server and the total tool calls made to it:

Server	Tasks	Tool calls
filesystem	1,590	4,297
github	785	5,679
wikipedia	735	3,963
twelvedata	534	2,150
slack	524	1,275
arxiv	500	3,409
airtable	429	2,311
google-workspace	414	1,386
brave-search	310	1,006
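
The two columns count different things: a task adds one to a server's Tasks count no matter how often it calls that server, while every call adds to Tool calls. A minimal sketch of that tally follows, assuming each task is reduced to a list of server names with one entry per tool call; the input shape and `server_usage` are hypothetical.

```python
from collections import Counter


def server_usage(tasks):
    """tasks maps task_id -> list of server names, one entry per tool call."""
    task_counts, call_counts = Counter(), Counter()
    for calls in tasks.values():
        call_counts.update(calls)       # every call counts
        task_counts.update(set(calls))  # each server counts once per task
    return task_counts, call_counts


# Made-up example with two tasks.
tasks = {
    "a": ["filesystem", "github", "github"],
    "b": ["filesystem", "wikipedia"],
}
task_counts, call_counts = server_usage(tasks)
```

This is why a server such as filesystem can lead on tasks touched while github leads on total calls.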

Access & licensing

A free demo sample is available now under CC BY-NC-ND 4.0 — see Demo Samples. The full corpus is available for commercial licensing; contact support@eigenai.com.
