MCPMark is a corpus of synthetic, agentic tasks designed to stress two MCP services end-to-end: a local filesystem sandbox and a GitHub repository sandbox. Each task is a single prompt that asks the agent to produce a precise deliverable — a CSV written into a directory tree, a pull request that closes a tracking issue, an audit document committed on a specific branch — paired with a virtual world that mounts the inputs required to do the work. The bundle ships every world’s initial state, so the eval runs fully offline once unpacked.
Want to try it first? A free 20-task sample (10 filesystem + 10 GitHub) is available on the Demo Samples page.

What MCPMark is

MCPMark tasks fall into two streams:
  • filesystem — operate over a directory tree of csv, md, txt, and similar text files. Heavy on cross-file joins, defensive CSV parsing, exact output-format compliance, and date / boundary correctness.
  • github — operate over a repository snapshot that includes the full git history, branches, tags, issues, and pull requests. Tasks chain repo archaeology (commits, file changes, releases) with stateful GitHub actions (opening issues, creating branches, filing PRs that close issues).

At a glance

| Property | Value |
| --- | --- |
| Total tasks | 1,233 |
| Worlds (environments) | 95 (45 filesystem + 50 github) |
| Sample composition | task + grader + world initial state |
| Workspace types | directory trees of text/csv/md; full git repos w/ history, issues, PRs |
| Agent MCPs | filesystem, github |
| Grading | Per-task verify.py (deterministic Python checker over final world state) |
| Reward signal | Pass / fail per task; aggregate as pass-rate |
| Suitable for | Tool-use SFT (with trajectories) · RL (verifier as reward) · benchmark |

What’s inside

| Component | Description |
| --- | --- |
| Worlds | Self-contained initial state. For filesystem, a directory tree (env/). For github, an env/ directory holding the git working tree (repo/ incl. .git/) plus serialized issues.json and pulls.json (issues, PR reviews, comments, base/head refs, labels, state). Multiple tasks share one world. |
| Tasks | One prompt per task, written as a realistic user request rather than a stripped-down instruction. Prompts are long and specification-grade (median ~2,000 characters), pinning down field order, rounding, tie-breakers, and acceptable inputs in detail. |
| Reward verifiers | A verify.py per task: an executable grader that reads the resulting world state and returns pass/fail. Verifiers are substantial: median 5.8 KB (filesystem) / 8.1 KB (github), some exceeding 40 KB for tasks with many rubric checks. |
| Metadata | A meta.json per task recording task_id, parent world, difficulty tier, tags, the MCP server(s) required, and a tree-view of the initial workspace. |

Per-stream component counts:
| Component | filesystem | github |
| --- | --- | --- |
| Worlds | 45 | 50 |
| Tasks | 627 | 606 |
| Median prompt length (chars) | 2,016 | 1,818 |
| Median verifier length (chars) | 5,801 | 8,056 |
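
As a rough illustration of the verifier component described above, a deterministic checker over final world state might look like the sketch below. The file path, the expected bytes, and the `verify` signature are all hypothetical; the actual verify.py interface is not documented here.

```python
from pathlib import Path

# Hypothetical expected deliverable for an imaginary filesystem task.
EXPECTED = b"region,total\nnorth,12.50\nsouth,7.25\n"


def verify(world_root: Path) -> bool:
    """Return True only if the deliverable exists and matches byte for byte.

    `reports/summary.csv` is an assumed output path, not one taken from
    the corpus.
    """
    target = world_root / "reports" / "summary.csv"
    if not target.is_file():
        return False
    # Exact-bytes comparison: a trailing newline or a rounding difference fails.
    return target.read_bytes() == EXPECTED
```

A checker of this shape has no model in the loop, so the same final state always yields the same pass/fail result.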

Difficulty profile

Each task is tagged with a difficulty tier in meta.json:
| Tier | Definition |
| --- | --- |
| L1 | Single-step retrieval / single-action workflow (github only). |
| L2 | Multi-step but local: one file or one repo subsystem (commits OR issues OR files). |
| L3 | Multi-step across subsystems: cross-file joins, derive-then-act, multi-PR workflows. |

| Tier | filesystem | github | Total | Share |
| --- | --- | --- | --- | --- |
| L1 | 0 | 199 | 199 | 16.1% |
| L2 | 285 | 195 | 480 | 38.9% |
| L3 | 342 | 212 | 554 | 44.9% |
| Total | 627 | 606 | 1,233 | 100% |
Filesystem skews harder by construction — the data layout itself rewards cross-file work, so easy single-lookup tasks aren’t included.
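
The Share column follows directly from the per-tier counts. A short sketch that recomputes it (the variable and function names are illustrative only):

```python
# Task counts per difficulty tier and stream, copied from the tier table.
TIER_COUNTS = {
    "L1": {"filesystem": 0, "github": 199},
    "L2": {"filesystem": 285, "github": 195},
    "L3": {"filesystem": 342, "github": 212},
}


def tier_shares(counts: dict) -> dict:
    """Return each tier's share of all tasks, in percent, to one decimal."""
    totals = {tier: sum(by_stream.values()) for tier, by_stream in counts.items()}
    grand_total = sum(totals.values())
    return {tier: round(100 * n / grand_total, 1) for tier, n in totals.items()}


shares = tier_shares(TIER_COUNTS)
```

The rounded shares sum to 99.9 rather than 100 because each is rounded independently.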

What makes these tasks hard

  • Specifications are tight. Prompts pin down field order, rounding, tie-breakers (lex-ascending on activity string), and inclusion boundaries (< pivot vs >= pivot). A single off-by-one on a date or a single rounding mistake fails the verifier.
  • Inputs are realistically messy. CSVs contain unescaped commas inside free-text columns; date formats vary; some tasks require ignoring near-synonym labels (Filed: vs Answer filed: vs Next hearing:) and using only the one explicitly named.
  • Verifiers are strict. Most graders check exact final state — exact file bytes, exact branch names, exact issue / PR closure relationships. There is no LLM judge in the loop; pass/fail is deterministic.
  • GitHub state is full-fidelity. Tasks that close an issue with a PR need real base/head refs, real comments, real labels in the seed state — not a simplified mock — because the verifier checks the resulting GitHub state for those relationships.
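
To make the first two points concrete, here is a minimal sketch of the kind of handling such a task demands: a free-text column with unescaped commas, an inclusive date boundary, a lexicographic tie-breaker, and fixed rounding. The column layout (`id,date,activity,amount`), the half-up rounding mode, and the function names are assumptions for illustration, not taken from any specific task.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP


def parse_row(line: str) -> dict:
    """Parse 'id,date,activity,amount' where only `activity` may hold commas."""
    # Peel the last field from the right, then the first two from the left;
    # whatever remains in the middle is the free-text column.
    head, amount = line.rstrip("\n").rsplit(",", 1)
    row_id, day, activity = head.split(",", 2)
    return {
        "id": row_id,
        "date": date.fromisoformat(day),
        "activity": activity,
        "amount": Decimal(amount),
    }


def ranked_rows(lines, pivot: date):
    """Keep rows dated on or after `pivot`; rank by amount, ties by activity."""
    rows = [parse_row(line) for line in lines]
    kept = [r for r in rows if r["date"] >= pivot]  # inclusive: >= pivot
    # Amount descending, then activity string lex-ascending as the tie-breaker.
    kept.sort(key=lambda r: (-r["amount"], r["activity"]))
    cents = Decimal("0.01")
    return [
        (r["activity"], str(r["amount"].quantize(cents, rounding=ROUND_HALF_UP)))
        for r in kept
    ]
```

Using `Decimal` rather than `float` keeps the rounding step predictable, which matters when a verifier compares exact output.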

Trajectory length

Computed from one passing trajectory per task (1,233 total — full coverage). Assistant steps = number of assistant messages; tool calls = sum of tool_calls across all assistant messages. Steps per trajectory:
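| Service | n | median | mean | p90 | max |
| --- | --- | --- | --- | --- | --- |
| filesystem | 627 | 9 | 10.8 | 18 | 41 |
| github | 606 | 13 | 15.1 | 27 | 61 |
| all | 1,233 | 11 | 12.9 | 23 | 61 |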
Tool calls per trajectory:
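| Service | n | median | mean | p90 | max |
| --- | --- | --- | --- | --- | --- |
| filesystem | 627 | 10 | 14.8 | 28 | 148 |
| github | 606 | 16 | 18.4 | 31 | 101 |
| all | 1,233 | 12 | 16.6 | 31 | 148 |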
GitHub trajectories run noticeably longer; the long tail comes from deep-derive workflows that walk many commits/files before acting.
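
The two counts defined at the top of this section can be sketched as follows. The message shape (dicts with a `role` key and an optional `tool_calls` list) is an assumption based on the common chat-messages convention; the corpus's actual trajectory schema may differ.

```python
def trajectory_stats(messages: list) -> tuple:
    """Return (assistant_steps, tool_calls) for one trajectory.

    assistant_steps = number of assistant messages
    tool_calls      = total tool_calls entries across those messages
    """
    assistant = [m for m in messages if m.get("role") == "assistant"]
    steps = len(assistant)
    # `tool_calls` may be absent or None on a plain-text assistant message.
    calls = sum(len(m.get("tool_calls") or []) for m in assistant)
    return steps, calls
```

Because one assistant message can carry several tool calls, the tool-call count can exceed the step count, which is consistent with the higher means in the second table.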

Tool usage

The two streams exercise different tools. Most-called (top 5):
  • filesystem: read_text_file · list_directory · read_multiple_files · write_file · get_file_info.
  • github: get_commit · get_file_contents · list_commits · create_or_update_file · get_pull_request_files.
The github top-5 reflects the corpus’s emphasis on commit-history archaeology before any state-changing action.

Training utility

Supervised fine-tuning (SFT) of a smaller open-weight model on MCPMark trajectories yields strict-pass gains on held-out MCPMark tasks. Evaluation covers the Filesystem (30 tasks) and GitHub (23 tasks) slices, each graded by its programmatic verify.py (all checks must pass).

Training data. Synthetic Filesystem + GitHub trajectories in single-turn, multi-step messages format with per-turn reasoning_content. The system prompt inlines the MCP tool signatures (e.g. filesystem_read_text_file, filesystem_list_directory, filesystem_write_file), so the model learns to call the same MCP tools the benchmark exposes. The Filesystem-best mix is ~300 samples (≈150 fs + 150 gh, 33% reasoning); the GitHub-best mix scales to 600 fs + 250 gh (100% reasoning). Horizon is short-to-mid: typically ~5–8 assistant turns and a handful of tool calls, with longer chains for the multi-step workflows.

Base model. Qwen3.6-27B.

Metric. Strict pass@1 (all verification checks pass), temperature = 1.0.

Results.
| Slice | Baseline | Best SFT | Lift | Config |
| --- | --- | --- | --- | --- |
| Filesystem | 33% (10/30) | 40% (12/30) | +7 pts | 150 fs + 150 gh, 33% reasoning |
| GitHub | 30% (7/23) | 39% (9/23) | +9 pts | 600 fs + 250 gh, 100% reasoning |

What drove it: GitHub. The newly solved tasks are the multi-step, write-heavy workflows that require chaining several repository writes in the right order. Two representative wins: (1) conflict resolution: finding the one open pull request that won’t merge (mergeable: false, state dirty) and unblocking it by creating the missing file it depends on; and (2) issue → PR → commit workflow: opening a bug issue, raising a pull request that fixes it, and committing the fix with correct cross-references between all three. The baseline tends to stall or misorder steps partway through these long chains; the SFT trajectories demonstrate the correct decomposition and tool-call sequence, teaching the model to carry the chain through to completion.

What drove it: Filesystem. The newly solved tasks are the structured-analysis / report-generation ones: reading and parsing files, computing or aggregating an answer, and emitting a precisely formatted output. Two representative wins: (1) compute-and-report: reading per-song data from several folders, calculating a popularity score for each from a given formula (to 3 decimals), and writing out a correctly ranked report; and (2) fuzzy retrieval: identifying a specific math-benchmark paper from only a vague description, then renaming the matching file. These reward an exact final artifact, and the SFT trajectories model the full “parse → compute → format-and-write” pattern, improving the precision of the end result.
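
For reference, the strict pass@1 figures and the lift values in the results table can be reproduced with a few lines. This is a sketch only: the function names are illustrative, and treating a task with zero checks as a failure is an assumption rather than documented behavior.

```python
def strict_pass(check_results) -> bool:
    """A task passes only if every verification check passed."""
    results = list(check_results)
    # Assumption: a task that reports no checks does not count as a pass.
    return bool(results) and all(results)


def pass_rate(task_outcomes) -> float:
    """Percentage of tasks that passed, given one boolean per task."""
    outcomes = list(task_outcomes)
    return 100 * sum(outcomes) / len(outcomes)


def lift_points(baseline_passed: int, best_passed: int, n_tasks: int) -> int:
    """Lift in percentage points, rounded to the nearest whole point."""
    base = pass_rate([True] * baseline_passed + [False] * (n_tasks - baseline_passed))
    best = pass_rate([True] * best_passed + [False] * (n_tasks - best_passed))
    return round(best - base)


fs_lift = lift_points(10, 12, 30)  # Filesystem slice: 10/30 -> 12/30
gh_lift = lift_points(7, 9, 23)    # GitHub slice: 7/23 -> 9/23
```

With slices of 30 and 23 tasks, each additional solved task moves the pass rate by roughly 3 to 4 points, so both lifts correspond to two more tasks solved.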

Access & licensing

A free 20-task sample is available now under CC BY-NC-ND 4.0 — see Demo Samples. The full corpus — all worlds, tasks, verifiers, and reference trajectories — is available for commercial licensing; contact support@eigenai.com.