Demo Samples

These demo samples are a free 20-task slice of the MCPMark dataset — a synthetic agentic-eval suite where an agent operates a real MCP (Model Context Protocol) tool server to complete grounded, verifiable tasks. The full dataset spans two domains; this demo ships 10 tasks from each:

Filesystem — the agent reads, correlates, and writes files through a filesystem MCP server. Demo world: travel_agent (Sorrentino Bespoke Travel).
GitHub — the agent drives a GitHub MCP server (commits, files, branches, issues, PRs, labels) against a snapshot of a real repository. Demo world: gif-for-cli (google/gif-for-cli).

Every task ships with its environment (the world the agent acts on), its task spec (prompt + verifier), and a full agent trajectory (chain-of-thought reasoning + tool calls) showing one successful solve.

Overview

Property	Value
Domains	Filesystem, GitHub
Tasks	20 total — 10 Filesystem + 10 GitHub
Tooling	MCP servers (filesystem tools / GitHub API tools)
Grading	Per-task `verify.py` — deterministic, programmatic pass/fail
Difficulty	L1 (single-hop) → L3 (multi-step, cross-source correlation)
Trajectories	One full solve per task, with `reasoning_content` (chain-of-thought) + tool calls

Environment

Each domain contributes one world under its directory. A world bundles a reusable environment (env/), the tool schemas the agent may call (tools/), an environment descriptor (bundle.json), and a tasks/ folder where each task carries its spec, its verifier, and a reference trajectory.

mcpmark/
├── filesystem/
│   └── travel_agent/
│       ├── bundle.json              # environment descriptor (runtime, refs)
│       ├── env/
│       │   └── fixtures/            # the materialised world (the agent's filesystem)
│       │       ├── README.md, crm_notes.md, vendor_directory.csv
│       │       ├── inbox/           # recent client emails (.txt)
│       │       └── trips/           # 8 trips: itinerary.md, manifests.json, receipts/
│       ├── tools/
│       │   └── tools.json           # filesystem MCP tool schemas
│       └── tasks/                   # 10 tasks (see below)
│           └── travel_agent__<name>/
│               ├── task.json            # prompt, env ref, verifier ref, limits, metadata
│               ├── verification/verify.py
│               └── trajectory.json      # full reference solve
├── github/
│   └── gif-for-cli/
│       ├── bundle.json
│       ├── env/
│       │   ├── setup.py             # materialises the repo into the eval org
│       │   └── fixtures/            # repo/ tree + issues.json + pulls.json + meta.json + world_profile.json
│       ├── tools/
│       │   └── tools.json           # GitHub MCP tool schemas
│       └── tasks/                   # 10 tasks (see below)
│           └── gif-for-cli__<name>/
│               ├── task.json
│               ├── verification/verify.py
│               └── trajectory.json

Filesystem world — Sorrentino Bespoke Travel (travel_agent)

Persona: Mia Sorrentino, a boutique travel planner. The agent works over her live working files.Contents:

8 trips under trips/<TRIP_ID>/ — each with itinerary.md (depart/return dates, base currency), manifests.json (flight + hotel, ISO-8601 with TZ offsets), and a receipts/ subdir holding mixed-currency receipts (USD / EUR / GBP).
vendor_directory.csv — 60 vendors across multiple cities and service tiers.
crm_notes.md — client/account notes covering the last four months.
inbox/ — recent client emails (thank-you notes, refund requests, invoices, referrals).

Tasks require cross-file joins (email → vendor directory → receipts → itinerary), defensive parsing (currency tags, comma-bearing fields, timezone-aware dates), and exhaustive scans, then writing a structured answer file back to the filesystem root.

GitHub world — google/gif-for-cli (gif-for-cli)

Repository: a snapshot of google/gif-for-cli (render GIFs/short videos as ASCII/256-color art in the terminal). Default branch master.Contents (materialised into the eval org by env/setup.py):

repo/ — the full source tree (33 tracked entries; 21 .py files under gif_for_cli/, plus tests/, docs/, setup.py, CI config) with real commit history.
issues.json — 20 issues.
pulls.json — 8 pull requests, several with distinct head branches (pr/36-tjni-pillow-10, pr/34-SASUPERNOVA-issue-33-pt1, …).
meta.json / world_profile.json — owner/repo/default-branch metadata and a structural profile of the tree.

Tasks require code archaeology (tracing a fix through commit history), provenance correlation (commit ↔ PR ↔ issue ↔ branch), and mutations (cutting branches, writing report files, opening PRs, applying labels) — all through the GitHub MCP tools.

Data

Path	Description
`<domain>/<world>/bundle.json`	Environment descriptor — runtime, interfaces, and refs to fixtures/tools/MCP config
`<domain>/<world>/env/`	The materialised world the agent acts on (filesystem tree, or repo snapshot + issues/pulls)
`<domain>/<world>/tools/tools.json`	MCP tool schemas available to the agent for that domain
`tasks/<task>/task.json`	Task spec: `user_prompt`, `env` ref, `verification` ref, `limits`, and `metadata` (difficulty, tags, MCP)
`tasks/<task>/verification/verify.py`	Deterministic verifier — inlines the gold value and returns pass/fail
`tasks/<task>/trajectory.json`	One full reference solve: `tools`, `messages` (with `reasoning_content` + `tool_calls`)

Sample

All 20 tasks overview

Filesystem — travel_agent (Sorrentino Bespoke Travel)

Task	Difficulty	Description
`aldridge_eur_thanked_receipts`	L2	Receipts for vendors named in a thank-you email, billed in base currency within the trip window
`aldridge_pref_mismatch`	L3	Thank-you vendors whose receipt currency mismatches their directory preference
`bramwell_driver_schedule_receipts`	L2	Driver-schedule receipts joined against the trip itinerary
`bramwell_premium_eur_receipts`	L2	Premium-tier vendor EUR receipt paths for the Bramwell offsite
`crm_amex_receipt_paths`	L3	Amex receipt paths for the CRM-flagged month via deep traversal
`crm_currency_mismatch_receipts`	L3	Receipts whose currency contradicts a CRM note
`della_rosa_wire_stay_receipts`	L2	Wire-paid stay receipts inside the refund/hotel window
`dellarosa_itinerary_dining_receipts`	L2	Dining receipts tied to itinerary stops for a refund case
`dual_city_pref_mismatches`	L3	Vendor-preference mismatches across two cities
`eu_premium_verified`	L3	EU premium-tier verified billing list (currency + window + directory join)

GitHub — gif-for-cli (google/gif-for-cli)

Task	Difficulty	Description
`gif-for-cli_L00`	L2	Audit the commit that made the Tenor API key configurable
`gif-for-cli_L01`	L2	Label open PRs sharing `generate/__init__.py` with the `_save_config` PR
`gif-for-cli_L03`	L3	Document unmerged `gif_for_cli/export.py` proposals
`gif-for-cli_L04`	L2	Coordinate CI-bootstrap follow-up with the unique open style-checking PR
`gif-for-cli_L05`	L3	Document export-pipeline provenance (commit ↔ PR ↔ issue ↔ branch)
`gif-for-cli_L06`	L1	Count `gif_for_cli/` Python files modified by the `256fgbg` commit
`gif-for-cli_L07`	L1	Find when `rows` was added to `TestRunFfmpeg`
`gif-for-cli_L08`	L1	Find the UTC date of a `test_256` assertion change
`gif-for-cli_L09`	L2	Audit the README-grooming commit with derived provenance facts
`gif-for-cli_L10`	L3	Document an `execute.py` bug audit (issues + PR + branches)

Full task: thanked-vendor receipts (filesystem — aldridge_eur_thanked_receipts)

Prompt:

The Aldridge family sent a thank-you note after their Paris trip (TRP-2025-001). I need the receipt files we can tie to vendors they actually name in that email, but only where we billed in the trip’s base currency during the on-trip window. Work only under trips/TRP-2025-001/receipts/. Read inbox/aldridge_thank_you.txt and trips/TRP-2025-001/itinerary.md for the rules below. A receipt’s vendor counts as thanked when either (a) the vendor’s exact Vendor: name from the receipt appears in the email body, or (b) the first whitespace-delimited token of that vendor name appears in the email body. Use the email body only. Keep a receipt only when all hold: (1) its Trip ID: is TRP-2025-001; (2) the vendor is thanked; (3) the receipt Date: is on/after the itinerary Depart date and on/before the Return date; (4) the receipt Currency: exactly matches the itinerary Base currency: line. Write matching paths to aldridge_eur_thanked_receipts.txt at the filesystem root: one relative path per line, sorted ascending, no header. Do not use python code. Do not modify existing files.

Why it’s hard: four AND-conditions, each grounded in a different file (email body, receipt header, itinerary). A naive substring match on the whole email (including the subject line) or a left-to-right field split picks up the wrong vendors; the date and currency gates must be applied with full precision.Grading (verification/verify.py): reads the written aldridge_eur_thanked_receipts.txt, compares the exact sorted path set against the inlined gold, and returns pass/fail.

Full agent trajectory (github — gif-for-cli_L06, 3 tool calls)

Each assistant message carries reasoning_content (chain-of-thought) alongside its spoken message and tool calls. Reasoning is shown in italics below. This is the simplest demo task (an L1 single-commit count) so the full solve fits in one view.System:

You are an agent that completes tasks independently. Use the tools provided to you to complete the task to the best of your ability.

User:

Please execute the following task in my repository mcpmark-eval-synthetic-tasks/gif-for-cli: I’m auditing the scope of the commit that introduced the 256 foreground/background color display mode. Find the commit authored by Sean Hayes whose subject is exactly Added 256fgbg display mode. and determine how many Python files under the gif_for_cli/ package directory (paths starting with gif_for_cli/ and ending with .py) it modified. Do not count files in tests/. Write ANSWER.md on master containing exactly one line COUNT=<number>.

Step 1 — Locate the commit:The user wants the commit authored by Sean Hayes with subject exactly “Added 256fgbg display mode.” Let me list commits to find its SHA.

I’ll start by searching for the commit authored by Sean Hayes with the exact subject.

github_list_commits({"owner": "mcpmark-eval-synthetic-tasks", "repo": "gif-for-cli"})

Tool result:

[{"sha": "...", "commit": {"message": "Version ..."}}, ..., {"sha": "5f35ae5dfb5191c282f7c6cbfee6b4ae189645ae", "commit": {"message": "Added 256fgbg display mode."}}]

Step 2 — Inspect the commit diff:Found it — SHA 5f35ae5…. Now I need the per-file diff to count the .py files under gif_for_cli/, excluding tests/.

github_get_commit({"owner": "mcpmark-eval-synthetic-tasks", "repo": "gif-for-cli", "ref": "5f35ae5dfb5191c282f7c6cbfee6b4ae189645ae"})

Tool result (files changed):

README.md                        (not .py)
gif_for_cli/__init__.py          ✓
gif_for_cli/constants.py         ✓
gif_for_cli/execute.py           ✓
gif_for_cli/...                  ✓  (6 total under gif_for_cli/*.py)
tests/test_*.py                  ✗  (excluded — tests/)

Counting paths that start with gif_for_cli/ and end with .py, excluding tests/: 6 files. The distractors — 10 total files changed, 9 Python files including tests — must not be counted.Step 3 — Write the answer:

github_create_or_update_file({"owner": "...", "repo": "gif-for-cli", "branch": "master", "path": "ANSWER.md", "content": "COUNT=6\n", "message": "Add audit count"})

Tool result: {"content": {"name": "ANSWER.md", "path": "ANSWER.md", ...}}Final answer:

Task completed. The commit 5f35ae5… authored by Sean Hayes (“Added 256fgbg display mode.”) modified 6 Python files under gif_for_cli/ (excluding tests/). Wrote COUNT=6 to ANSWER.md on master.

Grading — deterministic verifiers (verify.py)

Every task is graded by a self-contained verify.py that inlines the gold value (it does not read any facts.json at grade time) and checks the agent’s effect on the world. For gif-for-cli_L06:

Gold (inlined): COUNT = 6
Distractors that would trip a naive agent:
  total files changed            = 10
  all Python files (incl. tests/) = 9
Checks:
  1. ANSWER.md exists on master
  2. Content contains exactly "COUNT=6"

Filesystem verifiers read the file the agent was asked to write (e.g. aldridge_eur_thanked_receipts.txt) and compare the exact, sorted answer set against gold. GitHub verifiers re-query the live repo via the API (branch exists, file contents, PR title/body, labels) and assert each required effect. A task passes only if all checks hold.

Download

# Download MCPMark demo data (filesystem + github)
hf download jindidi/eigendata-demo-data --repo-type dataset --include "mcpmark/*"

Browse on Hugging Face

View MCPMark files

This is a free 20-task sample. The full MCPMark suite spans both domains across dozens of worlds (627 filesystem tasks, 606 github tasks), each with deterministic verifiers and reference SFT trajectories — see the Full Dataset page.

Eigen AI

API Reference

Platform

Products

Overview

Environment

Data

Sample

Download

Browse on Hugging Face

​Overview

​Environment

​Data

​Sample

​Download

Browse on Hugging Face

Overview

Environment

Data

Sample

Download