- Filesystem: the agent reads, correlates, and writes files through a filesystem MCP server. Demo world: travel_agent (Sorrentino Bespoke Travel).
- GitHub: the agent drives a GitHub MCP server (commits, files, branches, issues, PRs, labels) against a snapshot of a real repository. Demo world: gif-for-cli (google/gif-for-cli).
Overview
| Property | Value |
|---|---|
| Domains | Filesystem, GitHub |
| Tasks | 20 total — 10 Filesystem + 10 GitHub |
| Tooling | MCP servers (filesystem tools / GitHub API tools) |
| Grading | Per-task verify.py — deterministic, programmatic pass/fail |
| Difficulty | L1 (single-hop) → L3 (multi-step, cross-source correlation) |
| Trajectories | One full solve per task, with reasoning_content (chain-of-thought) + tool calls |
Environment
Each domain contributes one world under its directory. A world bundles a reusable environment (env/), the tool schemas the agent may call (tools/), an environment descriptor (bundle.json), and a tasks/ folder where each task carries its spec, its verifier, and a reference trajectory.
Filesystem world — Sorrentino Bespoke Travel (travel_agent)
Persona: Mia Sorrentino, a boutique travel planner. The agent works over her live working files.

Contents:
- 8 trips under trips/<TRIP_ID>/, each with itinerary.md (depart/return dates, base currency), manifests.json (flight + hotel, ISO-8601 with TZ offsets), and a receipts/ subdir holding mixed-currency receipts (USD / EUR / GBP).
- vendor_directory.csv: 60 vendors across multiple cities and service tiers.
- crm_notes.md: client/account notes covering the last four months.
- inbox/: recent client emails (thank-you notes, refund requests, invoices, referrals).
GitHub world — google/gif-for-cli (gif-for-cli)
Repository: a snapshot of google/gif-for-cli (render GIFs/short videos as ASCII/256-color art in the terminal). Default branch master.

Contents (materialised into the eval org by env/setup.py):
- repo/: the full source tree (33 tracked entries; 21 .py files under gif_for_cli/, plus tests/, docs/, setup.py, CI config) with real commit history.
- issues.json: 20 issues.
- pulls.json: 8 pull requests, several with distinct head branches (pr/36-tjni-pillow-10, pr/34-SASUPERNOVA-issue-33-pt1, …).
- meta.json / world_profile.json: owner/repo/default-branch metadata and a structural profile of the tree.
Data
| Path | Description |
|---|---|
| <domain>/<world>/bundle.json | Environment descriptor: runtime, interfaces, and refs to fixtures/tools/MCP config |
| <domain>/<world>/env/ | The materialised world the agent acts on (filesystem tree, or repo snapshot + issues/pulls) |
| <domain>/<world>/tools/tools.json | MCP tool schemas available to the agent for that domain |
| tasks/<task>/task.json | Task spec: user_prompt, env ref, verification ref, limits, and metadata (difficulty, tags, MCP) |
| tasks/<task>/verification/verify.py | Deterministic verifier: inlines the gold value and returns pass/fail |
| tasks/<task>/trajectory.json | One full reference solve: tools, messages (with reasoning_content + tool_calls) |
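As a rough illustration of how the layout above might be consumed, the following Python sketch walks a world's tasks/ folder and reads each task.json. The function name list_tasks is hypothetical, and the exact JSON shape is an assumption: the table only says that task.json carries user_prompt and a metadata entry with the difficulty, so the nesting used here may need adjusting.

```python
import json
from pathlib import Path


def list_tasks(world_dir):
    """Return (task_name, difficulty, user_prompt) for each task under world_dir/tasks/."""
    rows = []
    for spec_path in sorted(Path(world_dir).glob("tasks/*/task.json")):
        spec = json.loads(spec_path.read_text(encoding="utf-8"))
        # Assumption: difficulty is nested under "metadata"; adjust if the real layout differs.
        difficulty = spec.get("metadata", {}).get("difficulty")
        rows.append((spec_path.parent.name, difficulty, spec.get("user_prompt")))
    return rows
```

Called on a world directory, this returns one tuple per task folder, which is enough to rebuild a task overview without opening the trajectories.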
Sample
All 20 tasks overview
Filesystem: travel_agent (Sorrentino Bespoke Travel)

| Task | Difficulty | Description |
|---|---|---|
| aldridge_eur_thanked_receipts | L2 | Receipts for vendors named in a thank-you email, billed in base currency within the trip window |
| aldridge_pref_mismatch | L3 | Thank-you vendors whose receipt currency mismatches their directory preference |
| bramwell_driver_schedule_receipts | L2 | Driver-schedule receipts joined against the trip itinerary |
| bramwell_premium_eur_receipts | L2 | Premium-tier vendor EUR receipt paths for the Bramwell offsite |
| crm_amex_receipt_paths | L3 | Amex receipt paths for the CRM-flagged month via deep traversal |
| crm_currency_mismatch_receipts | L3 | Receipts whose currency contradicts a CRM note |
| della_rosa_wire_stay_receipts | L2 | Wire-paid stay receipts inside the refund/hotel window |
| dellarosa_itinerary_dining_receipts | L2 | Dining receipts tied to itinerary stops for a refund case |
| dual_city_pref_mismatches | L3 | Vendor-preference mismatches across two cities |
| eu_premium_verified | L3 | EU premium-tier verified billing list (currency + window + directory join) |

GitHub: gif-for-cli (google/gif-for-cli)

| Task | Difficulty | Description |
|---|---|---|
| gif-for-cli_L00 | L2 | Audit the commit that made the Tenor API key configurable |
| gif-for-cli_L01 | L2 | Label open PRs sharing generate/__init__.py with the _save_config PR |
| gif-for-cli_L03 | L3 | Document unmerged gif_for_cli/export.py proposals |
| gif-for-cli_L04 | L2 | Coordinate CI-bootstrap follow-up with the unique open style-checking PR |
| gif-for-cli_L05 | L3 | Document export-pipeline provenance (commit ↔ PR ↔ issue ↔ branch) |
| gif-for-cli_L06 | L1 | Count gif_for_cli/ Python files modified by the 256fgbg commit |
| gif-for-cli_L07 | L1 | Find when rows was added to TestRunFfmpeg |
| gif-for-cli_L08 | L1 | Find the UTC date of a test_256 assertion change |
| gif-for-cli_L09 | L2 | Audit the README-grooming commit with derived provenance facts |
| gif-for-cli_L10 | L3 | Document an execute.py bug audit (issues + PR + branches) |
Full task: thanked-vendor receipts (filesystem — aldridge_eur_thanked_receipts)
Prompt:
The Aldridge family sent a thank-you note after their Paris trip (TRP-2025-001). I need the receipt files we can tie to vendors they actually name in that email, but only where we billed in the trip’s base currency during the on-trip window. Work only under trips/TRP-2025-001/receipts/. Read inbox/aldridge_thank_you.txt and trips/TRP-2025-001/itinerary.md for the rules below. A receipt’s vendor counts as thanked when either (a) the vendor’s exact Vendor: name from the receipt appears in the email body, or (b) the first whitespace-delimited token of that vendor name appears in the email body. Use the email body only. Keep a receipt only when all hold: (1) its Trip ID: is TRP-2025-001; (2) the vendor is thanked; (3) the receipt Date: is on/after the itinerary Depart date and on/before the Return date; (4) the receipt Currency: exactly matches the itinerary Base currency: line. Write matching paths to aldridge_eur_thanked_receipts.txt at the filesystem root: one relative path per line, sorted ascending, no header. Do not use python code. Do not modify existing files.

Why it’s hard: four AND-conditions, each grounded in a different file (email body, receipt header, itinerary). A naive substring match on the whole email (including the subject line) or a left-to-right field split picks up the wrong vendors; the date and currency gates must be applied with full precision.

Grading (verification/verify.py): reads the written aldridge_eur_thanked_receipts.txt, compares the exact sorted path set against the inlined gold, and returns pass/fail.
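For readers who want the selection rules in executable form, here is a minimal Python sketch of the four conditions and the output format. It illustrates the logic only and is not the reference solution (the prompt itself forbids the agent from using Python). The helper names are hypothetical, the receipt header is assumed to consist of simple "Key: value" lines, Date: is assumed to be an ISO YYYY-MM-DD string, and the caller is assumed to have already separated the email body from its subject line.

```python
from datetime import date


def parse_headers(text):
    """Parse 'Key: value' lines into a dict (first occurrence of a key wins)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def is_thanked(vendor, email_body):
    """Rule (a): full vendor name in the body; rule (b): its first whitespace token in the body."""
    tokens = vendor.split()
    return bool(tokens) and (vendor in email_body or tokens[0] in email_body)


def keep_receipt(receipt, email_body, trip_id, depart, ret, base_currency):
    """Apply the four AND-conditions from the prompt to one parsed receipt."""
    try:
        when = date.fromisoformat(receipt.get("Date", ""))  # assumed ISO date format
    except ValueError:
        return False
    return (
        receipt.get("Trip ID") == trip_id
        and is_thanked(receipt.get("Vendor", ""), email_body)
        and depart <= when <= ret  # inclusive on-trip window
        and receipt.get("Currency") == base_currency  # exact match, no normalisation
    )


def format_answer(paths):
    """One relative path per line, sorted ascending, no header."""
    return "".join(p + "\n" for p in sorted(paths))
```

The date window is inclusive on both ends and the currency comparison is an exact string match, mirroring conditions (3) and (4). Real receipts may contain distractors that this simple header parser does not anticipate.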
Full agent trajectory (github — gif-for-cli_L06, 3 tool calls)
Each assistant message carries reasoning_content (chain-of-thought) alongside its spoken message and tool calls. Reasoning is shown in italics below. This is the simplest demo task (an L1 single-commit count), so the full solve fits in one view.

System: You are an agent that completes tasks independently. Use the tools provided to you to complete the task to the best of your ability.

User: Please execute the following task in my repository mcpmark-eval-synthetic-tasks/gif-for-cli: I’m auditing the scope of the commit that introduced the 256 foreground/background color display mode. Find the commit authored by Sean Hayes whose subject is exactly Added 256fgbg display mode. and determine how many Python files under the gif_for_cli/ package directory (paths starting with gif_for_cli/ and ending with .py) it modified. Do not count files in tests/. Write ANSWER.md on master containing exactly one line COUNT=<number>.

Step 1: Locate the commit

*The user wants the commit authored by Sean Hayes with subject exactly “Added 256fgbg display mode.” Let me list commits to find its SHA.*

I’ll start by searching for the commit authored by Sean Hayes with the exact subject.

Tool result:

Step 2: Inspect the commit diff

*Found it: SHA 5f35ae5…. Now I need the per-file diff to count the .py files under gif_for_cli/, excluding tests/.*

Tool result (files changed):

*Counting paths that start with gif_for_cli/ and end with .py, excluding tests/: 6 files. The distractors (10 total files changed, 9 Python files including tests) must not be counted.*

Step 3: Write the answer

Tool result: {"content": {"name": "ANSWER.md", "path": "ANSWER.md", ...}}

Final answer: Task completed. The commit 5f35ae5… authored by Sean Hayes (“Added 256fgbg display mode.”) modified 6 Python files under gif_for_cli/ (excluding tests/). Wrote COUNT=6 to ANSWER.md on master.
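The counting rule from this trajectory can be sketched in a few lines of Python. The function name count_package_py and the changed-file list are hypothetical; the list is made up for illustration and is not the real commit's diff.

```python
def count_package_py(paths, package="gif_for_cli/"):
    """Count changed paths that start with the package prefix and end with .py."""
    return sum(1 for p in paths if p.startswith(package) and p.endswith(".py"))


# Hypothetical changed-file list, NOT the files touched by the real commit.
changed = [
    "README.md",
    "gif_for_cli/module_a.py",
    "gif_for_cli/module_b.py",
    "tests/test_module_a.py",
]
print(f"COUNT={count_package_py(changed)}")  # prints COUNT=2
```

Filtering on the gif_for_cli/ prefix already excludes tests/, so the non-Python file and the test file both drop out without a separate check.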
Grading — deterministic verifiers (verify.py)
Every task is graded by a self-contained verify.py that inlines the gold value (it does not read any facts.json at grade time) and checks the agent’s effect on the world. For gif-for-cli_L06, that means checking the contents of ANSWER.md on master against the inlined count. Filesystem verifiers read the file the agent was asked to write (e.g. aldridge_eur_thanked_receipts.txt) and compare the exact, sorted answer set against gold. GitHub verifiers re-query the live repo via the API (branch exists, file contents, PR title/body, labels) and assert each required effect. A task passes only if all checks hold.
Download
Browse on Hugging Face
View MCPMark files
This is a free 20-task sample. The full MCPMark suite spans both domains across dozens of worlds (627 filesystem tasks, 606 github tasks), each with deterministic verifiers and reference SFT trajectories — see the Full Dataset page.