Want to try it first? A free 20-task sample (10 filesystem + 10 GitHub) is available on the Demo Samples page.
What MCPMark is
MCPMark tasks fall into two streams:
- filesystem: operate over a directory tree of csv, md, txt, and similar text files. Heavy on cross-file joins, defensive CSV parsing, exact output-format compliance, and date / boundary correctness.
- github: operate over a repository snapshot that includes the full git history, branches, tags, issues, and pull requests. Tasks chain repo archaeology (commits, file changes, releases) with stateful GitHub actions (opening issues, creating branches, filing PRs that close issues).
At a glance
| Property | Value |
|---|---|
| Total tasks | 1,233 |
| Worlds (environments) | 95 (45 filesystem + 50 github) |
| Sample composition | task + grader + world initial state |
| Workspace types | directory trees of text/csv/md; full git repos w/ history, issues, PRs |
| Agent MCPs | filesystem, github |
| Grading | Per-task verify.py (deterministic Python checker over final world state) |
| Reward signal | Pass / fail per task; aggregate as pass-rate |
| Suitable for | Tool-use SFT (with trajectories) · RL (verifier as reward) · benchmark |
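
As a rough illustration of the "verifier as reward" row, the sketch below runs one task's verify.py and maps the outcome to a binary reward, then averages rewards into a pass rate. The calling convention is an assumption, not something the table specifies: the sketch supposes the verifier takes the final workspace path as its only argument and exits with status 0 on pass. The function names `verifier_reward` and `pass_rate` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def verifier_reward(verify_script: Path, workspace: Path, timeout_s: int = 300) -> float:
    """Run a task's verifier on the final world state; 1.0 on pass, else 0.0.

    ASSUMPTION: the verifier accepts the workspace path as its only argument
    and signals pass with exit status 0. Adjust to the real interface.
    """
    try:
        result = subprocess.run(
            [sys.executable, str(verify_script), str(workspace)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # A verifier that hangs is treated as a failed task.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0


def pass_rate(rewards: list[float]) -> float:
    """Aggregate per-task binary rewards into a pass rate in [0, 1]."""
    return sum(rewards) / len(rewards) if rewards else 0.0
```

Because the reward is binary and deterministic, the same function can serve for benchmark scoring and as an RL reward without a separate judge model.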
What’s inside
| Component | Description |
|---|---|
| Worlds | Self-contained initial state. For filesystem, a directory tree (env/). For github, an env/ directory holding the git working tree (repo/ incl. .git/) plus serialized issues.json and pulls.json (issues, PR reviews, comments, base/head refs, labels, state). Multiple tasks share one world. |
| Tasks | One prompt per task, written as a realistic user request rather than a stripped-down instruction. Prompts are long and specification-grade (median ~2,000 characters), pinning down field order, rounding, tie-breakers, and acceptable inputs in detail. |
| Reward verifiers | A verify.py per task: an executable grader that reads the resulting world state and returns pass/fail. Verifiers are substantial — median 5.8 KB (filesystem) / 8.1 KB (github), some exceeding 40 KB for tasks with many rubric checks. |
| Metadata | A meta.json per task recording task_id, parent world, difficulty tier, tags, the MCP server(s) required, and a tree-view of the initial workspace. |
| Component | filesystem | github |
|---|---|---|
| Worlds | 45 | 50 |
| Tasks | 627 | 606 |
| Median prompt length (chars) | 2,016 | 1,818 |
| Median verifier length (chars) | 5,801 | 8,056 |
Difficulty profile
Each task is tagged with a difficulty tier in meta.json:
| Tier | Definition |
|---|---|
| L1 | Single-step retrieval / single-action workflow (github only). |
| L2 | Multi-step but local: one file or one repo subsystem (commits OR issues OR files). |
| L3 | Multi-step across subsystems: cross-file joins, derive-then-act, multi-PR workflows. |
| Tier | filesystem | github | Total | Share |
|---|---|---|---|---|
| L1 | 0 | 199 | 199 | 16.1% |
| L2 | 285 | 195 | 480 | 38.9% |
| L3 | 342 | 212 | 554 | 44.9% |
| Total | 627 | 606 | 1,233 | 100% |
What makes these tasks hard
- Specifications are tight. Prompts pin down field order, rounding, tie-breakers (lex-ascending on activity string), and inclusion boundaries (< pivot vs >= pivot). A single off-by-one on a date or a single rounding mistake fails the verifier.
- Inputs are realistically messy. CSVs contain unescaped commas inside free-text columns; date formats vary; some tasks require ignoring near-synonym labels (Filed: vs Answer filed: vs Next hearing:) and using only the one explicitly named.
- Verifiers are strict. Most graders check exact final state: exact file bytes, exact branch names, exact issue / PR closure relationships. There is no LLM judge in the loop; pass/fail is deterministic.
- GitHub state is full-fidelity. Tasks that close an issue with a PR need real base/head refs, real comments, real labels in the seed state — not a simplified mock — because the verifier checks the resulting GitHub state for those relationships.
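
The pitfalls listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the column layout, the half-up rounding rule, and the helper names are assumptions, since each prompt states its own rules. It shows splitting a row whose free-text column holds unescaped commas, an inclusive date boundary, decimal rounding that avoids float surprises, and a ranking with a lex-ascending tie-breaker.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP


def split_row(line: str, n_cols: int, free_text_idx: int) -> list[str]:
    """Split a CSV line whose free-text column may contain unescaped commas.

    Surplus fields are folded back into the free-text column.
    """
    parts = line.rstrip("\r\n").split(",")
    extra = len(parts) - n_cols
    if extra < 0:
        raise ValueError(f"expected at least {n_cols} fields, got {len(parts)}")
    head = parts[:free_text_idx]
    text = ",".join(parts[free_text_idx:free_text_idx + extra + 1])
    tail = parts[free_text_idx + extra + 1:]
    return head + [text] + tail


def on_or_after(day: date, pivot: date) -> bool:
    """Inclusive boundary (>= pivot); the strict variant would use > instead."""
    return day >= pivot


def round_half_up(value: str, places: int = 3) -> Decimal:
    """Round a decimal string half-up (ASSUMED rule; follow the prompt's own)."""
    return Decimal(value).quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)


def rank(rows: list[tuple[str, Decimal]]) -> list[tuple[str, Decimal]]:
    """Sort by score descending, breaking ties lex-ascending on the name."""
    return sorted(rows, key=lambda row: (-row[1], row[0]))
```

Parsing scores with `Decimal` instead of `float` matters here because the built-in `round()` on floats uses banker's rounding and binary representation, which can differ from the rule a prompt specifies in the last digit.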
Trajectory length
Computed from one passing trajectory per task (1,233 total, full coverage). Assistant steps = number of assistant messages; tool calls = sum of tool_calls across all assistant messages.
Steps per trajectory:
| Service | n | median | mean | p90 | max |
|---|---|---|---|---|---|
| filesystem | 627 | 9 | 10.8 | 18 | 41 |
| github | 606 | 13 | 15.1 | 27 | 61 |
| all | 1,233 | 11 | 12.9 | 23 | 61 |
Tool calls per trajectory:
| Service | n | median | mean | p90 | max |
|---|---|---|---|---|---|
| filesystem | 627 | 10 | 14.8 | 28 | 148 |
| github | 606 | 16 | 18.4 | 31 | 101 |
| all | 1,233 | 12 | 16.6 | 31 | 148 |
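
The two counts defined above can be computed from a trajectory as in the sketch below. It assumes each trajectory is a list of chat-style message dicts with a `role` key and an optional `tool_calls` list; that shape is an assumption about the data, not a documented schema. The percentile method behind the p90 column is not stated, so the sketch uses nearest-rank, which may differ slightly from the published figures.

```python
import math
import statistics


def trajectory_counts(messages: list[dict]) -> tuple[int, int]:
    """Return (assistant steps, total tool calls) for one trajectory."""
    assistant = [m for m in messages if m.get("role") == "assistant"]
    steps = len(assistant)
    # 'tool_calls' may be missing or None on assistant messages without calls.
    tool_calls = sum(len(m.get("tool_calls") or []) for m in assistant)
    return steps, tool_calls


def summarize(values: list[int]) -> dict[str, float]:
    """Median, mean, nearest-rank p90 (ASSUMED method), and max."""
    ordered = sorted(values)
    p90_index = math.ceil(0.9 * len(ordered)) - 1
    return {
        "n": len(ordered),
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p90": ordered[p90_index],
        "max": ordered[-1],
    }
```

A step with several parallel tool calls counts once as a step and several times as tool calls, which is why the tool-call figures can exceed the step figures.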
Tool usage
The two streams exercise different tools. Most-called (top 5):
- filesystem: read_text_file · list_directory · read_multiple_files · write_file · get_file_info.
- github: get_commit · get_file_contents · list_commits · create_or_update_file · get_pull_request_files.
Training utility
Supervised fine-tuning (SFT) a smaller open-weight model on MCPMark trajectories yields strict-pass gains on held-out MCPMark tasks. Evaluation covers the Filesystem (30 tasks) and GitHub (23 tasks) slices, each graded by its programmatic verify.py (all checks must pass).
Training data. Synthetic Filesystem + GitHub trajectories in single-turn, multi-step messages format with per-turn reasoning_content. The system prompt inlines the MCP tool signatures (e.g. filesystem_read_text_file, filesystem_list_directory, filesystem_write_file), so the model learns to call the same MCP tools the benchmark exposes. The Filesystem-best mix is ~300 samples (≈150 fs + 150 gh, 33% reasoning); the GitHub-best mix scales to 600 fs + 250 gh (100% reasoning). Horizon is short-to-mid — typically ~5–8 assistant turns and a handful of tool calls, with longer chains for the multi-step workflows. Base model: Qwen3.6-27B.
Metric. Strict pass@1 (all verification checks pass), temperature = 1.0.
Results.
| Slice | Baseline | Best SFT | Lift | Config |
|---|---|---|---|---|
| Filesystem | 33% (10/30) | 40% (12/30) | +7 pts | 150 fs + 150 gh, 33% reasoning |
| GitHub | 30% (7/23) | 39% (9/23) | +9 pts | 600 fs + 250 gh, 100% reasoning |
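
The percentages and lifts in the table follow directly from the pass counts, as the short sketch below reproduces. The helper names are hypothetical, and `strict_pass` only illustrates the "all checks must pass" rule for a list of per-check booleans; it is not the actual verifier logic.

```python
def strict_pass(checks: list[bool]) -> bool:
    """A task passes only if it has checks and every one of them passes."""
    return bool(checks) and all(checks)


def pass_pct(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100 * passed / total


def lift_points(baseline_passed: int, sft_passed: int, total: int) -> float:
    """Absolute lift in percentage points between two runs on the same slice."""
    return pass_pct(sft_passed, total) - pass_pct(baseline_passed, total)


# Counts taken from the results table above.
for name, base, sft, total in [("Filesystem", 10, 12, 30), ("GitHub", 7, 9, 23)]:
    print(
        f"{name}: {round(pass_pct(base, total))}% -> {round(pass_pct(sft, total))}% "
        f"(+{round(lift_points(base, sft, total))} pts)"
    )
```

With slices of 30 and 23 tasks, each lift corresponds to two additional tasks solved, so the point estimates carry wide uncertainty.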
mergeable: false, state dirty) and unblocking it by creating the missing file it depends on; and (2) issue → PR → commit workflow — opening a bug issue, raising a pull request that fixes it, and committing the fix with correct cross-references between all three. The baseline tends to stall or misorder steps partway through these long chains; the SFT trajectories demonstrate the correct decomposition and tool-call sequence, teaching the model to carry the chain through to completion.
What drove it — Filesystem. The newly solved tasks are the structured-analysis / report-generation ones — reading and parsing files, computing or aggregating an answer, and emitting a precisely formatted output. Two representative wins: (1) compute-and-report — reading per-song data from several folders, calculating a popularity score for each from a given formula (to 3 decimals), and writing out a correctly ranked report; and (2) fuzzy retrieval — identifying a specific math-benchmark paper from only a vague description, then renaming the matching file. These reward an exact final artifact, and the SFT trajectories model the full “parse → compute → format-and-write” pattern, improving the precision of the end result.