These demo samples are a free 20-task slice of the MCPMark dataset — a synthetic agentic-eval suite where an agent operates a real MCP (Model Context Protocol) tool server to complete grounded, verifiable tasks. The full dataset spans two domains; this demo ships 10 tasks from each:
  • Filesystem — the agent reads, correlates, and writes files through a filesystem MCP server. Demo world: travel_agent (Sorrentino Bespoke Travel).
  • GitHub — the agent drives a GitHub MCP server (commits, files, branches, issues, PRs, labels) against a snapshot of a real repository. Demo world: gif-for-cli (google/gif-for-cli).
Every task ships with its environment (the world the agent acts on), its task spec (prompt + verifier), and a full agent trajectory (chain-of-thought reasoning + tool calls) showing one successful solve.

Overview

| Property | Value |
| --- | --- |
| Domains | Filesystem, GitHub |
| Tasks | 20 total (10 Filesystem + 10 GitHub) |
| Tooling | MCP servers (filesystem tools / GitHub API tools) |
| Grading | Per-task verify.py: deterministic, programmatic pass/fail |
| Difficulty | L1 (single-hop) → L3 (multi-step, cross-source correlation) |
| Trajectories | One full solve per task, with reasoning_content (chain-of-thought) + tool calls |

Environment

Each domain contributes one world under its directory. A world bundles a reusable environment (env/), the tool schemas the agent may call (tools/), an environment descriptor (bundle.json), and a tasks/ folder where each task carries its spec, its verifier, and a reference trajectory.
mcpmark/
├── filesystem/
│   └── travel_agent/
│       ├── bundle.json              # environment descriptor (runtime, refs)
│       ├── env/
│       │   └── fixtures/            # the materialised world (the agent's filesystem)
│       │       ├── README.md, crm_notes.md, vendor_directory.csv
│       │       ├── inbox/           # recent client emails (.txt)
│       │       └── trips/           # 8 trips: itinerary.md, manifests.json, receipts/
│       ├── tools/
│       │   └── tools.json           # filesystem MCP tool schemas
│       └── tasks/                   # 10 tasks (see below)
│           └── travel_agent__<name>/
│               ├── task.json            # prompt, env ref, verifier ref, limits, metadata
│               ├── verification/verify.py
│               └── trajectory.json      # full reference solve
├── github/
│   └── gif-for-cli/
│       ├── bundle.json
│       ├── env/
│       │   ├── setup.py             # materialises the repo into the eval org
│       │   └── fixtures/            # repo/ tree + issues.json + pulls.json + meta.json + world_profile.json
│       ├── tools/
│       │   └── tools.json           # GitHub MCP tool schemas
│       └── tasks/                   # 10 tasks (see below)
│           └── gif-for-cli__<name>/
│               ├── task.json
│               ├── verification/verify.py
│               └── trajectory.json
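The layout above can be walked programmatically. The following is a minimal sketch, assuming the demo has been downloaded so that a local directory mirrors the tree shown; the helper names `iter_tasks` and `missing_files` are hypothetical and not part of the dataset.

```python
from pathlib import Path

# Per-task files named in the directory tree above.
EXPECTED_TASK_FILES = ("task.json", "verification/verify.py", "trajectory.json")


def iter_tasks(root):
    """Yield (domain, world, task_dir) for each task folder that has a task.json."""
    for task_json in sorted(Path(root).glob("*/*/tasks/*/task.json")):
        task_dir = task_json.parent
        world_dir = task_dir.parent.parent  # <root>/<domain>/<world>
        yield world_dir.parent.name, world_dir.name, task_dir


def missing_files(task_dir):
    """Return the expected per-task files that are absent from task_dir."""
    return [rel for rel in EXPECTED_TASK_FILES if not (Path(task_dir) / rel).is_file()]
```

Calling `iter_tasks("mcpmark")` on a complete download would be expected to yield 20 task folders, each with an empty `missing_files` result.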
Filesystem world: travel_agent (Sorrentino Bespoke Travel)
Persona: Mia Sorrentino, a boutique travel planner. The agent works over her live working files.
Contents:
  • 8 trips under trips/<TRIP_ID>/ — each with itinerary.md (depart/return dates, base currency), manifests.json (flight + hotel, ISO-8601 with TZ offsets), and a receipts/ subdir holding mixed-currency receipts (USD / EUR / GBP).
  • vendor_directory.csv — 60 vendors across multiple cities and service tiers.
  • crm_notes.md — client/account notes covering the last four months.
  • inbox/ — recent client emails (thank-you notes, refund requests, invoices, referrals).
Tasks require cross-file joins (email → vendor directory → receipts → itinerary), defensive parsing (currency tags, comma-bearing fields, timezone-aware dates), and exhaustive scans, then writing a structured answer file back to the filesystem root.
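Two of the defensive-parsing hazards named above (comma-bearing fields and timezone-aware dates) can be illustrated with the standard library. This is only a sketch: the CSV columns and the sample row are hypothetical, since the real vendor_directory.csv schema is not shown here.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical sample: the real vendor_directory.csv columns may differ.
SAMPLE_CSV = 'name,city,tier\n"Atelier Nord, Catering",Paris,premium\n'


def load_rows(csv_text):
    """Parse CSV text with the csv module so quoted commas stay inside one field."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def local_and_utc_dates(iso_timestamp):
    """Return (date in the timestamp's own offset, date after converting to UTC)."""
    moment = datetime.fromisoformat(iso_timestamp)
    return moment.date(), moment.astimezone(timezone.utc).date()
```

A plain `line.split(",")` on the sample row would produce four fields instead of three, and a timestamp such as `2025-03-02T00:30:00+02:00` falls on 2 March locally but on 1 March in UTC, so a date-window check has to decide which calendar it means.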
GitHub world: gif-for-cli (google/gif-for-cli)
Repository: a snapshot of google/gif-for-cli (renders GIFs and short videos as ASCII or 256-color art in the terminal). Default branch: master.
Contents (materialised into the eval org by env/setup.py):
  • repo/ — the full source tree (33 tracked entries; 21 .py files under gif_for_cli/, plus tests/, docs/, setup.py, CI config) with real commit history.
  • issues.json — 20 issues.
  • pulls.json — 8 pull requests, several with distinct head branches (pr/36-tjni-pillow-10, pr/34-SASUPERNOVA-issue-33-pt1, …).
  • meta.json / world_profile.json — owner/repo/default-branch metadata and a structural profile of the tree.
Tasks require code archaeology (tracing a fix through commit history), provenance correlation (commit ↔ PR ↔ issue ↔ branch), and mutations (cutting branches, writing report files, opening PRs, applying labels) — all through the GitHub MCP tools.
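As a small illustration of provenance correlation, the head-branch names quoted above appear to embed a PR number and sometimes an issue number. The sketch below parses that pattern; the naming convention is inferred from the two examples only and is an assumption, so a real solve would confirm each link through the GitHub MCP tools rather than trust the branch name.

```python
import re

# Assumed convention, inferred from "pr/36-tjni-pillow-10" and
# "pr/34-SASUPERNOVA-issue-33-pt1": pr/<PR number>-<free text>.
BRANCH_RE = re.compile(r"^pr/(?P<pr>\d+)-(?P<rest>.+)$")
ISSUE_RE = re.compile(r"issue-(\d+)")


def parse_pr_branch(branch):
    """Return {'pr': int, 'issues': [int, ...]} or None if the name does not match."""
    match = BRANCH_RE.match(branch)
    if match is None:
        return None
    issues = [int(number) for number in ISSUE_RE.findall(match.group("rest"))]
    return {"pr": int(match.group("pr")), "issues": issues}
```

For example, `parse_pr_branch("pr/34-SASUPERNOVA-issue-33-pt1")` gives PR 34 with a candidate link to issue 33, while `parse_pr_branch("master")` gives `None`.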

Data

| Path | Description |
| --- | --- |
| <domain>/<world>/bundle.json | Environment descriptor: runtime, interfaces, and refs to fixtures/tools/MCP config |
| <domain>/<world>/env/ | The materialised world the agent acts on (filesystem tree, or repo snapshot + issues/pulls) |
| <domain>/<world>/tools/tools.json | MCP tool schemas available to the agent for that domain |
| <domain>/<world>/tasks/<task>/task.json | Task spec: user_prompt, env ref, verification ref, limits, and metadata (difficulty, tags, MCP) |
| <domain>/<world>/tasks/<task>/verification/verify.py | Deterministic verifier: inlines the gold value and returns pass/fail |
| <domain>/<world>/tasks/<task>/trajectory.json | One full reference solve: tools, messages (with reasoning_content + tool_calls) |
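A trajectory can be summarised message by message. The sketch below assumes an OpenAI-style message layout (`role`, optional `reasoning_content`, and `tool_calls` entries whose name sits either under `function` or at the top level); only the `messages`, `reasoning_content`, and `tool_calls` keys are named by the table above, so everything else is an assumption to check against a real trajectory.json.

```python
def tool_call_name(call):
    """Return the tool name from a call dict (nested 'function' or flat; assumed shapes)."""
    function = call.get("function", call)
    return function.get("name", "<unknown>")


def summarize(trajectory):
    """Return one row per message: role, whether it has reasoning, and tools called."""
    rows = []
    for message in trajectory.get("messages", []):
        calls = message.get("tool_calls") or []
        rows.append({
            "role": message.get("role"),
            "has_reasoning": bool(message.get("reasoning_content")),
            "tools": [tool_call_name(call) for call in calls],
        })
    return rows
```

With a file on disk, `summarize(json.load(open(path)))` (after `import json`) would give a quick view of how many assistant turns carry reasoning and which tools each turn invoked.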

Sample

Filesystem — travel_agent (Sorrentino Bespoke Travel)
| Task | Difficulty | Description |
| --- | --- | --- |
| aldridge_eur_thanked_receipts | L2 | Receipts for vendors named in a thank-you email, billed in base currency within the trip window |
| aldridge_pref_mismatch | L3 | Thank-you vendors whose receipt currency mismatches their directory preference |
| bramwell_driver_schedule_receipts | L2 | Driver-schedule receipts joined against the trip itinerary |
| bramwell_premium_eur_receipts | L2 | Premium-tier vendor EUR receipt paths for the Bramwell offsite |
| crm_amex_receipt_paths | L3 | Amex receipt paths for the CRM-flagged month via deep traversal |
| crm_currency_mismatch_receipts | L3 | Receipts whose currency contradicts a CRM note |
| della_rosa_wire_stay_receipts | L2 | Wire-paid stay receipts inside the refund/hotel window |
| dellarosa_itinerary_dining_receipts | L2 | Dining receipts tied to itinerary stops for a refund case |
| dual_city_pref_mismatches | L3 | Vendor-preference mismatches across two cities |
| eu_premium_verified | L3 | EU premium-tier verified billing list (currency + window + directory join) |
GitHub — gif-for-cli (google/gif-for-cli)
| Task | Difficulty | Description |
| --- | --- | --- |
| gif-for-cli_L00 | L2 | Audit the commit that made the Tenor API key configurable |
| gif-for-cli_L01 | L2 | Label open PRs sharing generate/__init__.py with the _save_config PR |
| gif-for-cli_L03 | L3 | Document unmerged gif_for_cli/export.py proposals |
| gif-for-cli_L04 | L2 | Coordinate CI-bootstrap follow-up with the unique open style-checking PR |
| gif-for-cli_L05 | L3 | Document export-pipeline provenance (commit ↔ PR ↔ issue ↔ branch) |
| gif-for-cli_L06 | L1 | Count gif_for_cli/ Python files modified by the 256fgbg commit |
| gif-for-cli_L07 | L1 | Find when rows was added to TestRunFfmpeg |
| gif-for-cli_L08 | L1 | Find the UTC date of a test_256 assertion change |
| gif-for-cli_L09 | L2 | Audit the README-grooming commit with derived provenance facts |
| gif-for-cli_L10 | L3 | Document an execute.py bug audit (issues + PR + branches) |
Example task: aldridge_eur_thanked_receipts (Filesystem, L2)
Prompt:
The Aldridge family sent a thank-you note after their Paris trip (TRP-2025-001). I need the receipt files we can tie to vendors they actually name in that email, but only where we billed in the trip’s base currency during the on-trip window. Work only under trips/TRP-2025-001/receipts/. Read inbox/aldridge_thank_you.txt and trips/TRP-2025-001/itinerary.md for the rules below. A receipt’s vendor counts as thanked when either (a) the vendor’s exact Vendor: name from the receipt appears in the email body, or (b) the first whitespace-delimited token of that vendor name appears in the email body. Use the email body only. Keep a receipt only when all hold: (1) its Trip ID: is TRP-2025-001; (2) the vendor is thanked; (3) the receipt Date: is on/after the itinerary Depart date and on/before the Return date; (4) the receipt Currency: exactly matches the itinerary Base currency: line. Write matching paths to aldridge_eur_thanked_receipts.txt at the filesystem root: one relative path per line, sorted ascending, no header. Do not use python code. Do not modify existing files.
Why it’s hard: four AND-conditions that together draw on three sources (email body, receipt header, itinerary). A naive substring match on the whole email (including the subject line) or a left-to-right field split picks up the wrong vendors. The date gate is inclusive at both ends, and the currency gate is an exact match against the itinerary's base currency.
Grading (verification/verify.py): reads the written aldridge_eur_thanked_receipts.txt, compares the exact sorted path set against the inlined gold, and returns pass/fail.
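The four keep-conditions in the prompt can be written down as a single predicate. The agent itself is told not to use Python, so this is only a sketch of the rule for readers, for instance when checking an answer offline. It assumes receipts are plain `Key: value` header lines with ISO `YYYY-MM-DD` dates, which the prompt's field labels suggest but do not confirm; the function names are hypothetical.

```python
from datetime import date


def parse_header(text):
    """Parse 'Key: value' lines, splitting on the first colon only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def is_thanked(vendor, email_body):
    """Rule (a): exact vendor name in the body; rule (b): its first token in the body."""
    tokens = vendor.split()
    if not tokens:
        return False
    return vendor in email_body or tokens[0] in email_body


def keep_receipt(receipt, email_body, depart, ret, base_currency, trip_id):
    """Apply conditions (1)-(4); every one must hold for the receipt to be kept."""
    try:
        when = date.fromisoformat(receipt.get("Date", ""))
    except ValueError:
        return False  # unparseable date: treat as outside the window
    return (
        receipt.get("Trip ID") == trip_id
        and is_thanked(receipt.get("Vendor", ""), email_body)
        and depart <= when <= ret  # inclusive at both ends
        and receipt.get("Currency") == base_currency
    )
```

Passing only the email body (not the subject line) as `email_body` is what keeps the thanked check aligned with the prompt's "Use the email body only" rule.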
Example trajectory: gif-for-cli_L06 (GitHub, L1)
Each assistant message carries reasoning_content (chain-of-thought) alongside its spoken message and tool calls. Reasoning is shown in italics below. This is the simplest demo task (an L1 single-commit count), so the full solve fits in one view.
System:
You are an agent that completes tasks independently. Use the tools provided to you to complete the task to the best of your ability.
User:
Please execute the following task in my repository mcpmark-eval-synthetic-tasks/gif-for-cli: I’m auditing the scope of the commit that introduced the 256 foreground/background color display mode. Find the commit authored by Sean Hayes whose subject is exactly Added 256fgbg display mode. and determine how many Python files under the gif_for_cli/ package directory (paths starting with gif_for_cli/ and ending with .py) it modified. Do not count files in tests/. Write ANSWER.md on master containing exactly one line COUNT=<number>.
Step 1: Locate the commit
The user wants the commit authored by Sean Hayes with subject exactly “Added 256fgbg display mode.” Let me list commits to find its SHA.
I’ll start by searching for the commit authored by Sean Hayes with the exact subject.
github_list_commits({"owner": "mcpmark-eval-synthetic-tasks", "repo": "gif-for-cli"})
Tool result:
[{"sha": "...", "commit": {"message": "Version ..."}}, ..., {"sha": "5f35ae5dfb5191c282f7c6cbfee6b4ae189645ae", "commit": {"message": "Added 256fgbg display mode."}}]
Step 2: Inspect the commit diff
Found it: SHA 5f35ae5…. Now I need the per-file diff to count the .py files under gif_for_cli/, excluding tests/.
github_get_commit({"owner": "mcpmark-eval-synthetic-tasks", "repo": "gif-for-cli", "ref": "5f35ae5dfb5191c282f7c6cbfee6b4ae189645ae"})
Tool result (files changed):
README.md                        (not .py)
gif_for_cli/__init__.py          ✓
gif_for_cli/constants.py         ✓
gif_for_cli/execute.py           ✓
gif_for_cli/...                  ✓  (6 total under gif_for_cli/*.py)
tests/test_*.py                  ✗  (excluded — tests/)
Counting paths that start with gif_for_cli/ and end with .py, excluding tests/: 6 files. The distractors (10 total files changed, 9 Python files including tests) must not be counted.
Step 3: Write the answer
github_create_or_update_file({"owner": "...", "repo": "gif-for-cli", "branch": "master", "path": "ANSWER.md", "content": "COUNT=6\n", "message": "Add audit count"})
Tool result: {"content": {"name": "ANSWER.md", "path": "ANSWER.md", ...}}
Final answer:
Task completed. The commit 5f35ae5… authored by Sean Hayes (“Added 256fgbg display mode.”) modified 6 Python files under gif_for_cli/ (excluding tests/). Wrote COUNT=6 to ANSWER.md on master.
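The counting rule in this solve reduces to a two-part path predicate. The sketch below applies it to a partial, illustrative list: only the paths actually named in the tool result above are used, plus one hypothetical tests/ entry, so its count is 3 rather than the task's gold value of 6.

```python
def count_package_py(paths, package="gif_for_cli/"):
    """Count paths that start with the package prefix and end with .py."""
    return sum(1 for path in paths if path.startswith(package) and path.endswith(".py"))


# Partial list: the elided entries from the tool result are deliberately not guessed.
changed = [
    "README.md",
    "gif_for_cli/__init__.py",
    "gif_for_cli/constants.py",
    "gif_for_cli/execute.py",
    "tests/test_example.py",  # hypothetical name standing in for tests/test_*.py
]
partial_count = count_package_py(changed)
```

Because the prefix test already excludes anything under tests/, no separate exclusion step is needed; counting every changed file or every .py file would reproduce the distractor values instead.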
Every task is graded by a self-contained verify.py that inlines the gold value (it does not read any facts.json at grade time) and checks the agent’s effect on the world. For gif-for-cli_L06:
Gold (inlined): COUNT = 6
Distractors that would trip a naive agent:
  total files changed            = 10
  all Python files (incl. tests/) = 9
Checks:
  1. ANSWER.md exists on master
  2. Content contains exactly "COUNT=6"
Filesystem verifiers read the file the agent was asked to write (e.g. aldridge_eur_thanked_receipts.txt) and compare the exact, sorted answer set against gold. GitHub verifiers re-query the live repo via the API (branch exists, file contents, PR title/body, labels) and assert each required effect. A task passes only if all checks hold.
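The two checks described above can be sketched as small functions. This is not the dataset's verify.py: the function names are hypothetical, the strictness (exact line order, whitespace stripped around the count line) is an assumption, and a GitHub verifier would fetch ANSWER.md through the API rather than receive its content as a string.

```python
from pathlib import Path


def verify_path_list(answer_file, gold_paths):
    """Pass only if the file lists exactly the gold paths, one per line, sorted ascending."""
    path = Path(answer_file)
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8").splitlines()
    return lines == sorted(gold_paths)


def verify_count(answer_content, gold_count):
    """Pass only if the content is the single line COUNT=<gold>, ignoring outer whitespace."""
    return answer_content.strip() == f"COUNT={gold_count}"
```

For gif-for-cli_L06, `verify_count("COUNT=6\n", 6)` passes, while the distractor answers `COUNT=10` and `COUNT=9` fail.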

Download

# Download MCPMark demo data (filesystem + github)
hf download jindidi/eigendata-demo-data --repo-type dataset --include "mcpmark/*"

Browse on Hugging Face

View MCPMark files
This is a free 20-task sample. The full MCPMark suite spans both domains across dozens of worlds (627 Filesystem tasks, 606 GitHub tasks), each with deterministic verifiers and reference SFT trajectories; see the Full Dataset page.