These demo samples are a free slice of MCP-Atlas Synthesis: agentic tool-use data generated against a local mirror of the MCP-Atlas benchmark (36 MCP servers, 220 tools). Every sample is a both-ready bundle, one folder that is simultaneously an SFT trajectory and a replayable RL environment with a claims-based reward. You can open one sample, read the task, follow the agent trajectory, restore the world, run the reward grader, and trust the ground truth, all with zero machine-specific paths. Unlike a bare answer key, each bundle ships the environment it was solved in (initial and final filesystem / git / DB snapshots) and a claims-based reward aligned to the benchmark’s own coverage ≥ 0.75 rubric, so the same sample trains an SFT model and seeds an RL rollout.

Overview

Property | Value
Bundles | one self-contained folder per task, keyed by a 16-hex task_id
Servers | 31 of the benchmark’s MCP servers exercised across the set: filesystem, wikipedia, osm-mcp-server, airtable, github, mongodb, git, arxiv, whois, brave-search, national-parks, …
Per bundle | task.json + claims.json + trajectory.json (SFT) + env.jsonl (RL replay) + env/workspace_{init,final}/ (restorable world) + manifest.json
Claims | natural-language, independently-verifiable facts (exact_match / substring / count / presence); ~3.4 per task
Grading | per-claim LLM-as-judge; ship gate = coverage ≥ 0.75 (matches MCP-Atlas mcp_evals_scores.py); manifest records real coverage + all_pass
Source | grounded proposer + real tool execution; agent re-solve by qwen3-5-397b in a per-task sandbox; tool data is real frozen API responses (read-through corpus), not generated
Readiness | every bundle sft_ready; ~94% also rl_ready (n_hollow_steps == 0, i.e. no empty/un-replayable tool responses)

Data

mcpatlas/
└── <task_id>/                          # e.g. 15825e654192e79c
    ├── task.json                       # the task: prompt + expected artifact + claims + servers
    ├── claims.json                     # the reward: claims + per-claim grades (pass + reasoning)
    ├── trajectory.json                 # the SFT trajectory: [system, user, assistant(+reasoning+tool_calls), tool, …]
    ├── env.jsonl                       # the RL replay: one line per tool call (tool, arguments, real response)
    ├── manifest.json                   # readiness flags + server roles + counts
    └── env/
        ├── workspace_init/             # world BEFORE the agent acted (workspace/ + git_repo/ + memory.json)
        └── workspace_final/            # world AFTER — the realized goal state
The task_id is the bundle’s only key. The task’s servers are listed in a field inside task.json ("servers": ["filesystem","git","mongodb"]), not encoded in directory names.
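The layout above can be read with a few lines of standard-library Python. This is a minimal sketch, not an official loader: the function name load_bundle is hypothetical, and it assumes only the file names shown in the tree (it does not assume anything about the keys inside each file).

```python
import json
from pathlib import Path


def load_bundle(bundle_dir):
    """Read the JSON files of one bundle folder into a single dict.

    Hypothetical helper: only the file names come from the documented layout.
    """
    root = Path(bundle_dir)
    bundle = {
        name: json.loads((root / f"{name}.json").read_text(encoding="utf-8"))
        for name in ("task", "claims", "trajectory", "manifest")
    }
    # env.jsonl holds one JSON object per line (one per executed tool call).
    with (root / "env.jsonl").open(encoding="utf-8") as fh:
        bundle["env"] = [json.loads(line) for line in fh if line.strip()]
    return bundle
```

Calling load_bundle("mcpatlas/<task_id>") would then give access to the task's server list via bundle["task"]["servers"], following the field shown above.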

What each file holds

File | Role | Contents
task.json | the task | task (NL request) · expected_artifact · claims[] (id / text / source_step / verify_via) · servers[]
claims.json | the reward | the same claims[] plus grades[]: per claim, a pass bool and the grader’s reasoning (which step satisfied it)
trajectory.json | the SFT demo | OpenAI-style messages: system / user / assistant (with reasoning_content + tool_calls) / tool; the real multi-step solve
env.jsonl | the RL replay | one JSON line per executed tool call: tool_call_id / tool / arguments / response (real frozen output) / is_error
manifest.json | metadata | sft_ready / rl_ready / coverage / all_pass · n_claims / n_steps / n_hollow_steps · server roles (sandboxed / shared_fixture / replayable)
env/workspace_{init,final}/ | the world | restorable snapshots of the stateful servers (workspace/ for filesystem, git_repo/ for git, memory.json for the knowledge graph); init to start an RL rollout, final as the golden end state
The key design point: a write task has a realized world (the final snapshot is what “success” looks like), while the claims are the system-agnostic reward — a fact like “the ‘community_garden’ database contains a collection ‘garden_plots’” is checkable regardless of how the agent got there.
env.jsonl is a faithful replay of the agent’s tool I/O — every response is the real result the tool returned during the solve (from the read-through corpus, which serves frozen real API data on first reference and grows on miss). A bundle is rl_ready only when n_hollow_steps == 0, i.e. no step returned an empty / un-replayable response.
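As a rough illustration of the rl_ready condition, the sketch below counts hollow steps in env.jsonl. The exact definition of "hollow" used by the dataset is not given here, so the emptiness test (a missing, null, blank, or empty response) is an assumption, and count_hollow_steps is a hypothetical name.

```python
import json


def count_hollow_steps(env_lines):
    """Count env.jsonl steps with an empty response.

    Assumption: "hollow" means the response is missing, null, blank, or an
    empty container. The dataset's own rule may be stricter.
    """
    hollow = 0
    for line in env_lines:
        if not line.strip():
            continue  # skip blank lines between records
        response = json.loads(line).get("response")
        if isinstance(response, str):
            response = response.strip()
        if response is None or response in ("", {}, []):
            hollow += 1
    return hollow
```

Under this assumption, a bundle would count as rl_ready when count_hollow_steps returns 0 for its env.jsonl.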

Server roles (per manifest)

Each bundle classifies the MCP servers it touched by how the environment is reconstructed:
Role | Meaning | Examples
sandboxed_servers | per-task isolated, snapshotted in env/workspace_{init,final} | filesystem, git, memory
shared_fixture_servers | stateful, backed by a shared seed fixture | mongodb, notion, slack, airtable
replayable_servers | stateless lookups, served from the frozen corpus / live read-through | wikipedia, whois, arxiv, brave-search, osm-mcp-server
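The three roles translate directly into how a consumer would prepare each server before a rollout. The sketch below assumes, without confirmation from the manifest schema, that each role name is a manifest key holding a list of server names; setup_plan and ROLE_ACTIONS are hypothetical.

```python
# Assumed manifest layout: each role name maps to a list of server names.
ROLE_ACTIONS = {
    "sandboxed_servers": "restore from env/workspace_init",
    "shared_fixture_servers": "use the shared seed fixture",
    "replayable_servers": "serve responses from the frozen corpus",
}


def setup_plan(manifest):
    """Map each server named in the manifest to the setup its role implies."""
    plan = {}
    for role, action in ROLE_ACTIONS.items():
        for server in manifest.get(role, []):
            plan[server] = action
    return plan
```

For a manifest listing filesystem as sandboxed and whois as replayable, the plan would pair filesystem with a snapshot restore and whois with corpus replay.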

Full task: Mustang-restoration project tracker (03085ddcf7047f3e)

Task:
I’m restoring a 1967 Ford Mustang in Austin and need to set up a project tracker. Look up Austin’s geographic coordinates, verify the domain mustangrestore.com is legitimate by checking its registration status, calculate a preliminary budget from parts costs of $5000, $3500, $2000, and $500, and confirm there’s a Project Management base available to track the restoration tasks.
Expected artifact:
A project-tracker summary containing Austin’s coordinates, the domain-verification status, the calculated budget total, and confirmation of a Project Management base.
Servers (5): osm-mcp-server, whois, calculator, airtable, ddg-search
Claims (4):
# | verify_via | Claim
c1 | substring | Austin, Texas has coordinates ≈ 30.27°N latitude and 97.74°W longitude.
c2 | substring | The domain mustangrestore.com has WHOIS status codes including clientDeleteProhibited.
c3 | exact_match | The sum of 5000, 3500, 2000, and 500 equals 11000.
c4 | substring | An Airtable base named Project Management exists in the system.
Each assistant step is shown with its four message components separated exactly as they appear in trajectory.json: Reasoning (reasoning_content, the chain-of-thought) · Tool calls (tool_calls) · Tool results (the following tool-role messages) · and, on the last step, the assistant Content (the user-visible answer).
System (role: system):
You are an autonomous agent. Solve the user’s task by calling the provided tools. Think step by step; decide the next action from what you’ve observed. Don’t fabricate tool outputs; rely only on real results.
User (role: user):
I’m restoring a 1967 Ford Mustang in Austin and need to set up a project tracker. Look up Austin’s geographic coordinates, verify the domain mustangrestore.com is legitimate by checking its registration status, calculate a preliminary budget from parts costs of $5000, $3500, $2000, and $500, and confirm there’s a Project Management base available to track the restoration tasks.

Step 1: fan out the four independent reads
Reasoning (reasoning_content):
Let me break down the task: (1) look up Austin’s geographic coordinates, (2) verify the domain mustangrestore.com by checking its registration status, (3) calculate a preliminary budget from the parts costs, (4) confirm a Project Management base. These are independent — I’ll issue them together.
Tool calls (tool_calls):
[
  {"name": "osm-mcp-server__geocode_address", "arguments": {"address": "Austin, Texas"}},
  {"name": "whois__whois_domain",            "arguments": {"domain": "mustangrestore.com"}},
  {"name": "calculator__calculate",          "arguments": {"expression": "5000 + 3500 + 2000 + 500"}},
  {"name": "airtable__list_bases",           "arguments": {}}
]
Tool results (role: tool):
{"place_id": 313995209, "lat": "30.2851494", "lon": "-97.7339352", "display_name": "Austin, Travis County, Texas, ..."}
{"Domain Status": ["clientDeleteProhibited", "clientTransferProhibited", ...], "Registrar": "GoDaddy.com, LLC", "Created": "2013-04-13", "Expires": "2027-04-13"}
{"text": "11000", "is_error": false}
{"bases": [{"id": "BASE001", "name": "Sales CRM"}, {"id": "BASE004", "name": "Project Management"}, ...]}

Step 2: try to fetch the (not-yet-existing) summary file
Reasoning:
Great, I have all four facts: Austin (30.285, -97.734); mustangrestore.com is registered with GoDaddy (created 2013-04-13, expires 2027); budget $11,000; and the Project Management base exists. Let me check the summary file the task implies.
Tool calls:
[{"name": "ddg-search__fetch_content", "arguments": {"url": "file:///.../mustang_restoration_tracker.md"}}]
Tool results:
{"url": "file:///.../mustang_restoration_tracker.md", "note": "content fetch not in grounded corpus; use the search result snippets"}

Step 3: recognize the limitation, pivot to the Airtable base
Reasoning:
The fetch_content tool doesn’t work for local files, and I don’t see a direct file-write tool in the available functions. Let me inspect the Project Management base (BASE004) instead.
Tool calls:
[{"name": "airtable__list_tables", "arguments": {"base_id": "BASE004"}}]
Tool results:
{"tables": [{"id": "tbl_BASE004_0", "name": "Tasks"}, {"name": "Restoration Milestones"}, ...]}

Step 4: write the kickoff record into the confirmed base
Reasoning:
The Project Management base (BASE004) has a “Restoration Milestones” table, which is relevant for tracking the project. I’ll create the kickoff record there.
Tool calls:
[{"name": "airtable__create_record", "arguments": {
  "base_id": "BASE004", "table_name": "Restoration Milestones",
  "fields": {"field_a": "1967 Ford Mustang restoration kickoff", ...}}}]
Tool results:
{"record": {"id": "rec1e1a5a06ac", "fields": {"field_a": "1967 Ford Mustang restoration kickoff", ...}}}

Step 5: final answer
Reasoning:
I’ve created the record. The task asked for a “project tracker summary”; since there’s no file-write tool, I’ll deliver the summary directly to the user.
Content (content, user-visible):
I’ve completed all the required tasks for your 1967 Ford Mustang restoration project tracker:
  • Austin coordinates: 30.2851, -97.7339
  • Domain: mustangrestore.com is registered (GoDaddy, created 2013-04-13, status clientDeleteProhibited) — legitimate
  • Preliminary budget: $11,000
  • Project Management base (BASE004) confirmed; a kickoff record was created in the Restoration Milestones table
Note the agent (a) fans out the four independent reads in one step, (b) discovers BASE004 via list_bases before writing to it, and (c) when fetch_content can’t serve a local file, states the limitation and pivots rather than fabricating a written file.
claims.json ships the per-claim verdict the grader produced — all four grounded in real tool output:
{ "grades": [
  { "id": "c1", "pass": true, "reasoning": "Step 1 geocoded Austin, Texas → (30.285, -97.734), matching ≈30.27 / 97.74." },
  { "id": "c2", "pass": true, "reasoning": "Step 2 WHOIS for mustangrestore.com shows Domain Status incl. clientDeleteProhibited." },
  { "id": "c3", "pass": true, "reasoning": "Step 3 calculator output is exactly 11000 for 5000+3500+2000+500." },
  { "id": "c4", "pass": true, "reasoning": "Step 4 airtable list_bases shows a base named 'Project Management' (BASE004)." }
] }
coverage = 4/4 = 1.0 (all_pass=true). The realized world — the Airtable kickoff record — is captured in the shared fixture; whois / osm / calculator are stateless replayable_servers served from real frozen responses, so env/workspace_init carries the sandbox baseline an RL rollout restores from.
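The coverage arithmetic can be reproduced from the grades array. The sketch below is illustrative only: the "grades" and "pass" keys follow the claims.json excerpt above, the 0.75 threshold is the ship gate stated in this document, and the function names are hypothetical.

```python
PASS_THRESHOLD = 0.75  # ship gate stated in this document


def coverage(grades):
    """Fraction of claims whose grade has pass == True."""
    if not grades:
        return 0.0
    return sum(1 for g in grades if g["pass"]) / len(grades)


def summarize(claims):
    """Return coverage, all_pass, and whether the bundle clears the gate."""
    grades = claims["grades"]
    cov = coverage(grades)
    return {
        "coverage": cov,
        "all_pass": bool(grades) and all(g["pass"] for g in grades),
        "ships": cov >= PASS_THRESHOLD,
    }
```

With the four passing grades shown above, summarize returns coverage 1.0 with all_pass and ships both true; a bundle with three of four claims passing would still ship (0.75) but would not be all_pass.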

How to consume

SFT: trajectory.json is ready-to-train OpenAI-style messages (system / user / assistant-with-reasoning_content+tool_calls / tool). The action space (every tool of every server the task touches) comes from the benchmark’s upstream tool schemas.
RL: restore env/workspace_init/ into a fresh per-task sandbox, let your policy act, then grade its trajectory against claims.json (coverage ≥ 0.75 = pass). env.jsonl provides the reference tool I/O for replay or for seeding the rollout; env/workspace_final/ is the golden end state for write tasks.
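One way to set up the RL side is sketched below under stated assumptions: the sandbox is just a temporary directory, and replay is a plain lookup keyed by tool name and arguments. The helper names (restore_sandbox, build_replay_index, replay) are hypothetical; only the paths and the env.jsonl fields (tool, arguments, response) come from the bundle description.

```python
import json
import shutil
import tempfile
from pathlib import Path


def restore_sandbox(bundle_dir):
    """Copy env/workspace_init into a fresh temp dir and return its path."""
    src = Path(bundle_dir) / "env" / "workspace_init"
    sandbox = Path(tempfile.mkdtemp(prefix="mcpatlas_")) / "workspace_init"
    shutil.copytree(src, sandbox)  # the bundle's own snapshot stays untouched
    return sandbox


def build_replay_index(bundle_dir):
    """Index reference responses by (tool, canonical JSON of arguments)."""
    index = {}
    with (Path(bundle_dir) / "env.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            step = json.loads(line)
            key = (step["tool"], json.dumps(step["arguments"], sort_keys=True))
            index.setdefault(key, step["response"])  # keep the first occurrence
    return index


def replay(index, tool, arguments):
    """Return the recorded response for an identical call, or None on a miss."""
    return index.get((tool, json.dumps(arguments, sort_keys=True)))
```

Exact-match lookup is the simplest design: a policy that issues a call the reference agent never made gets None, and the caller decides whether to fall back to a live server or end the rollout.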

Verification

Every bundle passed a grader before shipping, not a signal check alone: the agent re-solved the task in a fresh per-task sandbox, and a per-claim LLM judge scored its trajectory. A bundle ships only when coverage ≥ 0.75 (the MCP-Atlas pass threshold) and no tool returned a not-implemented / broken result. The manifest records the real coverage and an all_pass flag, so a strict consumer can post-filter to 100%-claim bundles.
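The strict post-filter mentioned above might look like the following sketch. It assumes all_pass and rl_ready are top-level boolean keys in manifest.json, which matches the field names listed in this document but not a confirmed schema; strict_bundles is a hypothetical name.

```python
import json
from pathlib import Path


def strict_bundles(root, require_rl_ready=True):
    """Yield task_ids whose manifest has all_pass (and optionally rl_ready).

    Assumes all_pass / rl_ready are top-level booleans in manifest.json.
    """
    for manifest_path in sorted(Path(root).glob("*/manifest.json")):
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        if not manifest.get("all_pass"):
            continue
        if require_rl_ready and not manifest.get("rl_ready"):
            continue
        yield manifest_path.parent.name
```

Passing require_rl_ready=False keeps every 100%-claim bundle for SFT use, including those with hollow steps.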

Provenance

  • Tasks are proposed by an LLM grounded in the live tool graph, then actually executed against the local MCP mirror, so tool calls and arguments are grounded in actual tool responses rather than invented.
  • Tool data is real. Mocked servers serve frozen real API responses (e.g. real Met objects, real Wikipedia articles, real GitHub repos); on a cache miss the read-through layer fetches the genuine upstream and appends it to the corpus. Mock tool schemas are byte-identical to the benchmark’s upstream servers, so a trajectory transfers to the live eval unchanged.
  • Claims are factual and non-time-sensitive (no “current price” / “today’s” style claims), extracted from the executed trajectory and graded against it.

Download

# Download MCP-Atlas demo bundles
hf download jindidi/eigendata-demo-data --repo-type dataset --include "mcpatlas/*"

For the complete MCP-Atlas corpus — its scale, the full server graph, and commercial licensing — see the Full Dataset page.