Full Dataset

MCPMark is a corpus of synthetic, agentic tasks designed to stress two MCP services end-to-end: a local filesystem sandbox and a GitHub repository sandbox. Each task is a single prompt that asks the agent to produce a precise deliverable — a CSV written into a directory tree, a pull request that closes a tracking issue, an audit document committed on a specific branch — paired with a virtual world that mounts the inputs required to do the work. The bundle ships every world’s initial state, so the eval runs fully offline once unpacked.

Want to try it first? A free 20-task sample (10 filesystem + 10 GitHub) is available on the Demo Samples page.

What MCPMark is

MCPMark tasks fall into two streams:

filesystem — operate over a directory tree of csv, md, txt, and similar text files. Heavy on cross-file joins, defensive CSV parsing, exact output-format compliance, and date / boundary correctness.
github — operate over a repository snapshot that includes the full git history, branches, tags, issues, and pull requests. Tasks chain repo archaeology (commits, file changes, releases) with stateful GitHub actions (opening issues, creating branches, filing PRs that close issues).

At a glance

Property	Value
Total tasks	1,233
Worlds (environments)	95 (45 filesystem + 50 github)
Sample composition	task + grader + world initial state
Workspace types	directory trees of text/csv/md; full git repos w/ history, issues, PRs
Agent MCPs	`filesystem`, `github`
Grading	Per-task `verify.py` (deterministic Python checker over final world state)
Reward signal	Pass / fail per task; aggregate as pass-rate
Suitable for	Tool-use SFT (with trajectories) · RL (verifier as reward) · benchmark

What’s inside

Component	Description
Worlds	Self-contained initial state. For filesystem, a directory tree (`env/`). For github, an `env/` directory holding the git working tree (`repo/` incl. `.git/`) plus serialized `issues.json` and `pulls.json` (issues, PR reviews, comments, base/head refs, labels, state). Multiple tasks share one world.
Tasks	One prompt per task, written as a realistic user request rather than a stripped-down instruction. Prompts are long and specification-grade (median ~2,000 characters), pinning down field order, rounding, tie-breakers, and acceptable inputs in detail.
Reward verifiers	A `verify.py` per task: an executable grader that reads the resulting world state and returns pass/fail. Verifiers are substantial — median 5.8 KB (filesystem) / 8.1 KB (github), some exceeding 40 KB for tasks with many rubric checks.
Metadata	A `meta.json` per task recording `task_id`, parent world, `difficulty` tier, `tags`, the MCP server(s) required, and a tree-view of the initial workspace.

Per-stream component counts:

Component	filesystem	github
Worlds	45	50
Tasks	627	606
Median prompt length (chars)	2,016	1,818
Median verifier length (chars)	5,801	8,056

Difficulty profile

Each task is tagged with a difficulty tier in meta.json:

Tier	Definition
L1	Single-step retrieval / single-action workflow (github only).
L2	Multi-step but local: one file or one repo subsystem (commits OR issues OR files).
L3	Multi-step across subsystems: cross-file joins, derive-then-act, multi-PR workflows.

Tier	filesystem	github	Total	Share
L1	0	199	199	16.1%
L2	285	195	480	38.9%
L3	342	212	554	44.9%
Total	627	606	1,233	100%

Filesystem skews harder by construction — the data layout itself rewards cross-file work, so easy single-lookup tasks aren’t included.

What makes these tasks hard

Specifications are tight. Prompts pin down field order, rounding, tie-breakers (lex-ascending on activity string), and inclusion boundaries (< pivot vs >= pivot). A single off-by-one on a date or a single rounding mistake fails the verifier.
Inputs are realistically messy. CSVs contain unescaped commas inside free-text columns; date formats vary; some tasks require ignoring near-synonym labels (Filed: vs Answer filed: vs Next hearing:) and using only the one explicitly named.
Verifiers are strict. Most graders check exact final state — exact file bytes, exact branch names, exact issue / PR closure relationships. There is no LLM judge in the loop; pass/fail is deterministic.
GitHub state is full-fidelity. Tasks that close an issue with a PR need real base/head refs, real comments, real labels in the seed state — not a simplified mock — because the verifier checks the resulting GitHub state for those relationships.

Trajectory length

Computed from one passing trajectory per task (1,233 total — full coverage). Assistant steps = number of assistant messages; tool calls = sum of tool_calls across all assistant messages. Steps per trajectory:

Service	n	median	mean	p90	max
filesystem	627	9	10.8	18	41
github	606	13	15.1	27	61
all	1,233	11	12.9	23	61

Tool calls per trajectory:

Service	n	median	mean	p90	max
filesystem	627	10	14.8	28	148
github	606	16	18.4	31	101
all	1,233	12	16.6	31	148

GitHub trajectories run noticeably longer; the long tail comes from deep-derive workflows that walk many commits/files before acting.

Tool usage

The two streams exercise different tools. Most-called (top 5):

filesystem — read_text_file · list_directory · read_multiple_files · write_file · get_file_info.
github — get_commit · get_file_contents · list_commits · create_or_update_file · get_pull_request_files.

The github top-5 reflects the corpus’s emphasis on commit-history archaeology before any state-changing action.

Training utility

Supervised fine-tuning (SFT) a smaller open-weight model on MCPMark trajectories yields strict-pass gains on held-out MCPMark tasks. Evaluation covers the Filesystem (30 tasks) and GitHub (23 tasks) slices, each graded by its programmatic verify.py (all checks must pass). Training data. Synthetic Filesystem + GitHub trajectories in single-turn, multi-step messages format with per-turn reasoning_content. The system prompt inlines the MCP tool signatures (e.g. filesystem_read_text_file, filesystem_list_directory, filesystem_write_file), so the model learns to call the same MCP tools the benchmark exposes. The Filesystem-best mix is ~300 samples (≈150 fs + 150 gh, 33% reasoning); the GitHub-best mix scales to 600 fs + 250 gh (100% reasoning). Horizon is short-to-mid — typically ~5–8 assistant turns and a handful of tool calls, with longer chains for the multi-step workflows. Base model: Qwen3.6-27B. Metric. Strict pass@1 (all verification checks pass), temperature = 1.0. Results.

Slice	Baseline	Best SFT	Lift	Config
Filesystem	33% (10/30)	40% (12/30)	+7 pts	150 fs + 150 gh, 33% reasoning
GitHub	30% (7/23)	39% (9/23)	+9 pts	600 fs + 250 gh, 100% reasoning

What drove it — GitHub. The newly solved tasks are the multi-step, write-heavy workflows that require chaining several repository writes in the right order. Two representative wins: (1) conflict resolution — finding the one open pull request that won’t merge (mergeable: false, state dirty) and unblocking it by creating the missing file it depends on; and (2) issue → PR → commit workflow — opening a bug issue, raising a pull request that fixes it, and committing the fix with correct cross-references between all three. The baseline tends to stall or misorder steps partway through these long chains; the SFT trajectories demonstrate the correct decomposition and tool-call sequence, teaching the model to carry the chain through to completion. What drove it — Filesystem. The newly solved tasks are the structured-analysis / report-generation ones — reading and parsing files, computing or aggregating an answer, and emitting a precisely formatted output. Two representative wins: (1) compute-and-report — reading per-song data from several folders, calculating a popularity score for each from a given formula (to 3 decimals), and writing out a correctly ranked report; and (2) fuzzy retrieval — identifying a specific math-benchmark paper from only a vague description, then renaming the matching file. These reward an exact final artifact, and the SFT trajectories model the full “parse → compute → format-and-write” pattern, improving the precision of the end result.

Access & licensing

A free 20-task sample is available now under CC BY-NC-ND 4.0 — see Demo Samples. The full corpus — all worlds, tasks, verifiers, and reference trajectories — is available for commercial licensing; contact support@eigenai.com.

Eigen AI

API Reference

Platform

Products

What MCPMark is

At a glance

What’s inside

Difficulty profile

What makes these tasks hard

Trajectory length

Tool usage

Training utility

Access & licensing

​What MCPMark is

​At a glance

​What’s inside

​Difficulty profile

​What makes these tasks hard

​Trajectory length

​Tool usage

​Training utility

​Access & licensing

What MCPMark is

At a glance

What’s inside

Difficulty profile

What makes these tasks hard

Trajectory length

Tool usage

Training utility

Access & licensing