Demo Samples

These demo samples are a free slice of MCP-Atlas Synthesis — agentic tool-use data generated against a local mirror of the MCP-Atlas benchmark (36 MCP servers, 220 tools). Every sample is a both-ready bundle: one folder that is simultaneously an SFT trajectory and a replayable RL environment with a claims-based reward, so you can open one sample, read the task, follow the agent trajectory, restore the world, run the reward grader, and trust the ground truth — with zero machine-specific paths. Unlike a single answer-key, each bundle ships the environment it was solved in (initial + final filesystem / git / DB snapshots) and a claims-based reward aligned to the benchmark’s own coverage ≥ 0.75 rubric — so the same sample trains an SFT model and seeds an RL rollout.

Overview

Property	Value
Bundles	one self-contained folder per task, keyed by a 16-hex `task_id`
Servers	31 of the benchmark’s MCP servers exercised across the set — `filesystem`, `wikipedia`, `osm-mcp-server`, `airtable`, `github`, `mongodb`, `git`, `arxiv`, `whois`, `brave-search`, `national-parks`, …
Per bundle	`task.json` + `claims.json` + `trajectory.json` (SFT) + `env.jsonl` (RL replay) + `env/workspace_{init,final}/` (restorable world) + `manifest.json`
Claims	natural-language, independently-verifiable facts (`exact_match` / `substring` / `count` / `presence`); ~3.4 per task
Grading	per-claim LLM-as-judge; ship gate = coverage ≥ 0.75 (matches MCP-Atlas `mcp_evals_scores.py`); `manifest` records real `coverage` + `all_pass`
Source	grounded proposer + real tool execution; agent re-solve by `qwen3-5-397b` in a per-task sandbox; tool data is real frozen API responses (read-through corpus), not generated
Readiness	every bundle `sft_ready`; ~94% also `rl_ready` (`n_hollow_steps == 0` — no empty/un-replayable tool responses)

Data

mcpatlas/
└── <task_id>/                          # e.g. 15825e654192e79c
    ├── task.json                       # the task: prompt + expected artifact + claims + servers
    ├── claims.json                     # the reward: claims + per-claim grades (pass + reasoning)
    ├── trajectory.json                 # the SFT trajectory: [system, user, assistant(+reasoning+tool_calls), tool, …]
    ├── env.jsonl                       # the RL replay: one line per tool call (tool, arguments, real response)
    ├── manifest.json                   # readiness flags + server roles + counts
    └── env/
        ├── workspace_init/             # world BEFORE the agent acted (workspace/ + git_repo/ + memory.json)
        └── workspace_final/            # world AFTER — the realized goal state

The same task_id is the bundle. The task’s servers live in a field inside task.json ("servers": ["filesystem","git","mongodb"]), not in directory names.

What each file holds

File	Role	Contents
`task.json`	the task	`task` (NL request) · `expected_artifact` · `claims[]` (`id` / `text` / `source_step` / `verify_via`) · `servers[]`
`claims.json`	the reward	the same `claims[]` plus `grades[]` — per claim a `pass` bool and the grader’s `reasoning` (which step satisfied it)
`trajectory.json`	the SFT demo	OpenAI-style messages: `system` / `user` / `assistant` (with `reasoning_content` + `tool_calls`) / `tool` — the real multi-step solve
`env.jsonl`	the RL replay	one JSON line per executed tool call: `tool_call_id` / `tool` / `arguments` / `response` (real frozen output) / `is_error`
`manifest.json`	metadata	`sft_ready` / `rl_ready` / `coverage` / `all_pass` · `n_claims` / `n_steps` / `n_hollow_steps` · server roles (`sandboxed` / `shared_fixture` / `replayable`)
`env/workspace_{init,final}/`	the world	restorable snapshots of the stateful servers (`workspace/` for filesystem, `git_repo/` for git, `memory.json` for the knowledge graph) — `init` to start an RL rollout, `final` as the golden end state

The key design point: a write task has a realized world (the final snapshot is what “success” looks like), while the claims are the system-agnostic reward — a fact like “the ‘community_garden’ database contains a collection ‘garden_plots’” is checkable regardless of how the agent got there.

env.jsonl is a faithful replay of the agent’s tool I/O — every response is the real result the tool returned during the solve (from the read-through corpus, which serves frozen real API data on first reference and grows on miss). A bundle is rl_ready only when n_hollow_steps == 0, i.e. no step returned an empty / un-replayable response.

Server roles (per `manifest`)

Each bundle classifies the MCP servers it touched by how the environment is reconstructed:

Role	Meaning	Examples
`sandboxed_servers`	per-task isolated, snapshotted in `env/workspace_{init,final}`	`filesystem`, `git`, `memory`
`shared_fixture_servers`	stateful, backed by a shared seed fixture	`mongodb`, `notion`, `slack`, `airtable`
`replayable_servers`	stateless lookups, served from the frozen corpus / live read-through	`wikipedia`, `whois`, `arxiv`, `brave-search`, `osm-mcp-server`

Full task: Mustang-restoration project tracker (`03085ddcf7047f3e`)

Task & claims

Task:

I’m restoring a 1967 Ford Mustang in Austin and need to set up a project tracker. Look up Austin’s geographic coordinates, verify the domain mustangrestore.com is legitimate by checking its registration status, calculate a preliminary budget from parts costs of $5000,$ 3500, $2000, and$ 500, and confirm there’s a Project Management base available to track the restoration tasks.

Expected artifact:

A project-tracker summary containing Austin’s coordinates, the domain-verification status, the calculated budget total, and confirmation of a Project Management base.

Servers (5): osm-mcp-server, whois, calculator, airtable, ddg-searchClaims (4):

#	`verify_via`	Claim
c1	`substring`	Austin, Texas has coordinates ≈ 30.27°N latitude and 97.74°W longitude.
c2	`substring`	The domain `mustangrestore.com` has WHOIS status codes including `clientDeleteProhibited`.
c3	`exact_match`	The sum of 5000, 3500, 2000, and 500 equals 11000.
c4	`substring`	An Airtable base named `Project Management` exists in the system.

Agent trajectory (verbatim from trajectory.json)

Each assistant step is shown with its four message components separated exactly as they appear in trajectory.json: Reasoning (reasoning_content, the chain-of-thought) · Tool calls (tool_calls) · Tool results (the following tool-role messages) · and, on the last step, the assistant Content (the user-visible answer).System (role: system):

You are an autonomous agent. Solve the user’s task by calling the provided tools. Think step by step; decide the next action from what you’ve observed. Don’t fabricate tool outputs; rely only on real results.

User (role: user):

I’m restoring a 1967 Ford Mustang in Austin and need to set up a project tracker. Look up Austin’s geographic coordinates, verify the domain mustangrestore.com is legitimate by checking its registration status, calculate a preliminary budget from parts costs of $5000,$ 3500, $2000, and$ 500, and confirm there’s a Project Management base available to track the restoration tasks.

Step 1 — fan out the four independent readsReasoning (reasoning_content):

Let me break down the task: (1) look up Austin’s geographic coordinates, (2) verify the domain mustangrestore.com by checking its registration status, (3) calculate a preliminary budget from the parts costs, (4) confirm a Project Management base. These are independent — I’ll issue them together.

Tool calls (tool_calls):

[
  {"name": "osm-mcp-server__geocode_address", "arguments": {"address": "Austin, Texas"}},
  {"name": "whois__whois_domain",            "arguments": {"domain": "mustangrestore.com"}},
  {"name": "calculator__calculate",          "arguments": {"expression": "5000 + 3500 + 2000 + 500"}},
  {"name": "airtable__list_bases",           "arguments": {}}
]

Tool results (role: tool):

{"place_id": 313995209, "lat": "30.2851494", "lon": "-97.7339352", "display_name": "Austin, Travis County, Texas, ..."}
{"Domain Status": ["clientDeleteProhibited", "clientTransferProhibited", ...], "Registrar": "GoDaddy.com, LLC", "Created": "2013-04-13", "Expires": "2027-04-13"}
{"text": "11000", "is_error": false}
{"bases": [{"id": "BASE001", "name": "Sales CRM"}, {"id": "BASE004", "name": "Project Management"}, ...]}

Step 2 — try to fetch the (not-yet-existing) summary fileReasoning:

Great, I have all four facts: Austin (30.285, -97.734); mustangrestore.com is registered with GoDaddy (created 2013-04-13, expires 2027); budget $11,000; and the Project Management base exists. Let me check the summary file the task implies.

Tool calls:

[{"name": "ddg-search__fetch_content", "arguments": {"url": "file:///.../mustang_restoration_tracker.md"}}]

Tool results:

{"url": "file:///.../mustang_restoration_tracker.md", "note": "content fetch not in grounded corpus; use the search result snippets"}

Step 3 — recognize the limitation, pivot to the Airtable baseReasoning:

The fetch_content tool doesn’t work for local files, and I don’t see a direct file-write tool in the available functions. Let me inspect the Project Management base (BASE004) instead.

Tool calls:

[{"name": "airtable__list_tables", "arguments": {"base_id": "BASE004"}}]

Tool results:

{"tables": [{"id": "tbl_BASE004_0", "name": "Tasks"}, {"name": "Restoration Milestones"}, ...]}

Step 4 — write the kickoff record into the confirmed baseReasoning:

The Project Management base (BASE004) has a “Restoration Milestones” table, which is relevant for tracking the project. I’ll create the kickoff record there.

Tool calls:

[{"name": "airtable__create_record", "arguments": {
  "base_id": "BASE004", "table_name": "Restoration Milestones",
  "fields": {"field_a": "1967 Ford Mustang restoration kickoff", ...}}}]

Tool results:

{"record": {"id": "rec1e1a5a06ac", "fields": {"field_a": "1967 Ford Mustang restoration kickoff", ...}}}

Step 5 — final answerReasoning:

I’ve created the record. The task asked for a “project tracker summary”; since there’s no file-write tool, I’ll deliver the summary directly to the user.

Content (content, user-visible):

I’ve completed all the required tasks for your 1967 Ford Mustang restoration project tracker:

Austin coordinates: 30.2851, -97.7339

Domain: mustangrestore.com is registered (GoDaddy, created 2013-04-13, status clientDeleteProhibited) — legitimate

Preliminary budget: $11,000

Project Management base (BASE004) confirmed; a kickoff record was created in the Restoration Milestones table

Note the agent (a) fans out the four independent reads in one step, (b) discovers BASE004 via list_bases before writing to it, and (c) when fetch_content can’t serve a local file, states the limitation and pivots rather than fabricating a written file.

Grading & ground truth (4/4 claims, coverage 1.0)

claims.json ships the per-claim verdict the grader produced — all four grounded in real tool output:

{ "grades": [
  { "id": "c1", "pass": true, "reasoning": "Step 1 geocoded Austin, Texas → (30.285, -97.734), matching ≈30.27 / 97.74." },
  { "id": "c2", "pass": true, "reasoning": "Step 2 WHOIS for mustangrestore.com shows Domain Status incl. clientDeleteProhibited." },
  { "id": "c3", "pass": true, "reasoning": "Step 3 calculator output is exactly 11000 for 5000+3500+2000+500." },
  { "id": "c4", "pass": true, "reasoning": "Step 4 airtable list_bases shows a base named 'Project Management' (BASE004)." }
] }

coverage = 4/4 = 1.0 (all_pass=true). The realized world — the Airtable kickoff record — is captured in the shared fixture; whois / osm / calculator are stateless replayable_servers served from real frozen responses, so env/workspace_init carries the sandbox baseline an RL rollout restores from.

How to consume

SFT — trajectory.json is ready-to-train OpenAI-style messages (system / user / assistant-with-reasoning_content+tool_calls / tool). The action space (every tool of every server the task touches) comes from the benchmark’s upstream tool schemas. RL — restore env/workspace_init/ into a fresh per-task sandbox, let your policy act, then grade its trajectory against claims.json (coverage ≥ 0.75 = pass). env.jsonl provides the reference tool I/O for replay or for seeding the rollout; env/workspace_final/ is the golden end state for write tasks.

Verification

Every bundle was delivered by a grader, not by signal alone: the agent re-solved the task in a fresh per-task sandbox, and a per-claim LLM judge scored its trajectory. A bundle ships only when coverage ≥ 0.75 (the MCP-Atlas pass threshold) and no tool returned a not-implemented / broken result. The manifest records the real coverage and an all_pass flag, so a strict consumer can post-filter to 100%-claim bundles.

Provenance

Tasks are proposed by an LLM grounded in the live tool graph, then really executed against the local MCP mirror — tool calls and arguments are grounded in actual tool responses, not invented.
Tool data is real. Mocked servers serve frozen real API responses (e.g. real Met objects, real Wikipedia articles, real GitHub repos); on a cache miss the read-through layer fetches the genuine upstream and appends it to the corpus. Mock tool schemas are byte-identical to the benchmark’s upstream servers, so a trajectory transfers to the live eval unchanged.
Claims are factual and non-time-sensitive (no “current price” / “today’s” style claims), extracted from the executed trajectory and graded against it.

Download

# Download MCP-Atlas demo bundles
hf download jindidi/eigendata-demo-data --repo-type dataset --include "mcpatlas/*"

Browse on Hugging Face

View MCP-Atlas files

For the complete MCP-Atlas corpus — its scale, the full server graph, and commercial licensing — see the Full Dataset page.

Eigen AI

API Reference

Platform

Products

Overview

Data

What each file holds

Server roles (per `manifest`)

Full task: Mustang-restoration project tracker (`03085ddcf7047f3e`)

How to consume

Verification

Provenance

Download

Browse on Hugging Face

​Overview

​Data

​What each file holds

​Server roles (per manifest)

​Full task: Mustang-restoration project tracker (03085ddcf7047f3e)

​How to consume

​Verification

​Provenance

​Download

Browse on Hugging Face

Overview

Data

What each file holds

Server roles (per `manifest`)

Full task: Mustang-restoration project tracker (`03085ddcf7047f3e`)

How to consume

Verification

Provenance

Download