Demo Samples

These demo samples are a free 10-task slice of the Toolathlon dataset — a corpus of single-turn, tool-using agent tasks set inside a shared multi-application MCP workspace. Each task gives the agent one user request spanning real-world services (Canvas LMS, GitHub, BigQuery, WooCommerce, arXiv, Notion, Google Workspace, and more), and grades the result deterministically on what the agent actually produced. The full dataset ships 4,300 RL environments and 1,682 SFT trajectories across 102 task families — see the Full Dataset page.

Overview

Property	Value
Demo tasks	10 RL tasks across 10 families (one per family)
Demo SFT trajectories	10 graded-correct rollouts (`reward == 1.0`), one per task
Environment	32 MCP tool servers over one consistent world (the full ~32 GB shared environment ships with the full corpus)
Task format	Single-turn: one user request → tool calls → deterministic grading
Grading	Per-task deterministic grader — programmatic, no LLM judge
Trajectories	OpenAI chat format with `reasoning_content` (chain-of-thought) + `tool_calls`
Full corpus	4,300 RL tasks · 1,682 SFT trajectories · 102 families — see Full Dataset

Environment

Every task runs against one reusable environment — a multi-application workspace backed by 32 MCP tool servers launched as local subprocesses, with each task overlaying its own mock data and launching only the servers it needs. The demo bundles the 10 tasks and their SFT trajectories; the full ~32 GB shared environment (all 32 servers + fixtures) ships with the full corpus.

toolathlon/
├── sft/
│   ├── trajectories.jsonl          # 10 demo SFT rollouts, one JSON object per line
│   └── stats.json                  # per-domain counts
└── rl/
    └── tasks/task_<family>-rNNN/    # 10 demo task instances (one per family)
        ├── task.json                #   prompts, visible tools, limits, grading config
        ├── env/
        │   ├── initial_workspace/   #   files the agent starts with
        │   └── mock_mcp_output/     #   task-specific mock data layered on the shared world
        └── verification/
            ├── verify.py            #   the reward function
            └── grader_spec.json     #   the checks (the answer key)

The full corpus additionally ships rl/shared/toolathlon/ — the one shared world all 4,300 tasks run against (bundle.json + env/ ~32 GB + mcp/ 32 servers + tools/ registry + verification_lib/ grader library).
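As a rough sketch of how the demo layout above might be sanity-checked after download, the snippet below walks `rl/tasks/` and reports which of the documented files are missing from each task directory. The relative paths come from the tree above; the helper name `summarize_tasks` is hypothetical and not something the dataset ships.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED = (
    "task.json",
    "env/initial_workspace",
    "env/mock_mcp_output",
    "verification/verify.py",
    "verification/grader_spec.json",
)


def summarize_tasks(root):
    """Return {task_dir_name: [missing relative paths]} for rl/tasks/ under root.

    Hypothetical helper for illustration; not part of the dataset tooling.
    """
    tasks_dir = Path(root) / "rl" / "tasks"
    report = {}
    for task_dir in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
        report[task_dir.name] = [
            rel for rel in EXPECTED if not (task_dir / rel).exists()
        ]
    return report
```

Calling `summarize_tasks("toolathlon")` on an intact download would be expected to map every task directory to an empty list.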

Data

Path	Description
`sft/trajectories.jsonl`	10 demo SFT rollouts (1,682 in the full corpus) in OpenAI chat format: `tools`, `messages` (system / user / assistant / tool), and `metadata`
`sft/stats.json`	Aggregate counts, per-domain breakdown, token distribution
`rl/tasks/<id>/task.json`	Task spec: `system_prompt`, `user_prompt`, `env`, `tools[]`, `limits`, `metadata`
`rl/tasks/<id>/env/initial_workspace/`	Files the agent starts with in its working directory
`rl/tasks/<id>/env/mock_mcp_output/`	Task-specific mock data layered on the shared world
`rl/tasks/<id>/verification/verify.py`	Deterministic reward function: reads the world end state and returns a score in the range `[0.0, 1.0]`
`rl/tasks/<id>/verification/grader_spec.json`	The concrete checks: file existence, CSV content, tool-call effects, service state
`rl/shared/toolathlon/`	The shared environment — 32 MCP servers, fixture data, runner, grader library (full corpus only; not bundled in the demo download)
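A minimal sketch of inspecting one line of `sft/trajectories.jsonl`. The keys `messages`, `reasoning_content`, and `tool_calls` come from this page; the nested `function.name` layout is assumed from the standard OpenAI chat format, and the helper name `summarize_rollout` is illustrative only.

```python
import json


def summarize_rollout(line):
    """Count reasoning-bearing assistant turns and list the tool names called."""
    record = json.loads(line)
    tool_names = []
    reasoning_turns = 0
    for message in record.get("messages", []):
        if message.get("role") != "assistant":
            continue
        if message.get("reasoning_content"):
            reasoning_turns += 1
        for call in message.get("tool_calls") or []:
            # Assumed OpenAI-style nesting: {"function": {"name": ..., "arguments": ...}}
            tool_names.append(call.get("function", {}).get("name"))
    return {"reasoning_turns": reasoning_turns, "tool_calls": tool_names}
```

Applied to each line of the file in turn, this gives a quick per-rollout view of how many tool calls a trajectory makes before its final message.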

Sample

All task families (102)

The 102 families span education, research, data/ML, business, commerce, developer workflows, documents, and more — each instantiated across many scenario variants (the rNNN suffix):

Domain	Families
Education & LMS	`canvas-arrange-exam`, `canvas-art-manager`, `canvas-art-quiz`, `canvas-do-quiz`, `canvas-homework-grader-python`, `canvas-list-test`, `canvas-new-students-notification`, `canvas-submit-late-work`, `course-assistant`, `course-schedule`, `courses-ta-hws`, `university-course-selection`
Research & academia	`academic-pdf-report`, `academic-warning`, `add-bibtex`, `apply-phd-email`, `cvpr-research`, `find-alita-paper`, `hk-top-conf`, `paper-checker`
Data & ML	`ab-testing`, `imagenet`, `llm-training-dataset`, `logical-datasets-collection`, `merge-hf-datasets`, `verl-dataset`
Business & finance	`flagged-transactions`, `gdp-cr5-analysis`, `investment-decision-analysis`, `invoice-org`, `live-transactions`, `nvidia-market`, `nvidia-stock-analysis`, `oil-price`, `payable-invoice-checker`, `quantitative-financial-analysis`, `sales-accounting`, `stock-build-position`, `yahoo-analysis`
Commerce	`filter-low-selling-products`, `price-comparison`, `woocommerce-customer-survey`, `woocommerce-new-product`, `woocommerce-new-welcome`, `woocommerce-product-recall`, `woocommerce-stock-alert`, `woocommerce-update-cover`
Productivity & communication	`arrange-workspace`, `cooking-guidance`, `detect-revised-terms`, `dietary-health`, `email-paper-homepage`, `fillout-online-forms`, `game-statistics`, `identify-all-songs`, `interview-report`, `landing-task-reminder`, `meeting-assign`, `music-analysis`, `nhl-b2b-analysis`, `profile-update-online`, `set-conf-cr-ddl`, `student-interview`, `task-tracker`
Developer & infra	`dataset-license-issue`, `git-bug-hunt`, `git-milestone`, `git-repo`, `k8s-safety-audit`, `personal-website-construct`, `sla-timeout-monitor`, `sync-todo-to-readme`, `youtube-repo`
Documents & analysis	`excel-data-transformation`, `excel-market-research`, `ipad-edu-price`, `latex-prompt-box`, `machine-operating`, `ppt-analysis`, `privacy-desensitization`, `reimbursement-form-filler`
Travel & logistics	`subway-planning`, `train-ticket-plan`, `travel-exchange`, `travel-expense-reimbursement`, `trip-adviser`, `trip-itinerary-generator`, `upenn-campus-route`, `search-ca-school`
Productivity apps	`language-school`, `notion-find-job`, `notion-hr`, `notion-movies`, `notion-personal-website`, `update-material-inventory`, `vlm-history-completer`, `wandb-best-score`, `wandb-shortest-length`, `mrbeast-analysis`, `inter-final-performance-analysis`, `experiments-recordings`, `inventory-sync`

Representative tasks (10 diverse examples)

Task	MCP servers	Description
`ab-testing-r000`	BigQuery, Filesystem	Compute per-segment conversion rates from warehouse tables, fill a CSV, then conditionally create a GCS bucket or write a log entry based on the winner
`canvas-do-quiz-r003`	Memory, Canvas, Filesystem	Look up a student identity, list unfinished quizzes across courses, fill a catch-up plan CSV, then actually submit each quiz on Canvas
`find-alita-paper-r010`	arXiv, Filesystem, Scholarly	Search a paper catalog by title keyword and date cutoff, extract a code URL buried in the abstract, write a provenance card and a JSON record
`invoice-org-r008`	PDF Tools, Filesystem, Yahoo Finance, Excel	Parse invoices from a document inbox, convert multi-currency amounts using a rate series, write a summary CSV and a JSON total
`excel-data-transformation-r000`	Excel, Filesystem, Terminal	Flatten a stacked-banner-header workbook into a normalized table with derived columns
`interview-report-r000`	Filesystem, Word	Read 6 per-item documents, synthesize an interview report in Word format
`git-milestone-r000`	Filesystem, Terminal, Fetch	Vet open-source dependencies from a GitHub account — audit commits, releases, and license compliance
`cooking-guidance-r000`	Filesystem, HowToCook	Plan a weekly dish lineup, pull recipes, and produce a consolidated grocery top-up list
`machine-operating-r000`	BigQuery, Filesystem, Excel	Pull sensor readings from a wind-farm feed, identify calibration exceptions, produce a worklist
`music-analysis-r000`	Excel, Google Sheets, Terminal, Filesystem	Analyze chart-streak data from a weekly grid, compute statistics, and build an A&R briefing

Full task: A/B test analysis (ab-testing-r000)

System prompt:

Accessible workspace directory: /workspace/dumps/workspace […] If you believe the task is completed, you can either call the local-claim_done tool or respond without calling any tool to indicate completion.

User prompt:

We ran an A/B test on two product-recommendation email templates (A and B) across our catalog. Each catalog table in the warehouse logs per-window clicks and store_views for one template variant. For each catalog and each variant, compute the conversion rate as total store_views over total clicks, and fill the provided segment_rates.csv (keep its headers). Add one final OVERALL row per variant whose rate is the arithmetic mean of that variant’s per-catalog rates (not the pooled totals). Then decide the winner by comparing the two OVERALL means. If template B’s overall mean is higher, roll it out: create one GCS bucket named with prefix winner-template-b. Otherwise (A wins or a tie), make no bucket and instead write a single cloud_logging entry with decision template_A_retained and note No template change.

MCP servers: google-cloud (BigQuery + Cloud Storage + Logging), filesystem

Initial workspace: a segment_rates.csv with headers and empty rate cells:

segment,variant,rate
Apparel,A,
Apparel,B,
Books,A,
...
overall,A,
overall,B,

The agent must query BigQuery tables, compute rates, fill the CSV, compare the overall means, then take the correct conditional action (bucket creation or log entry — but not both).
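The arithmetic in the prompt can be sketched as follows. The numbers are made up for illustration and are not values from the dataset, the function name `decide` and the action labels are hypothetical, and rates are left as plain ratios because the unit and rounding the grader expects are not stated here. The example is chosen so that the mean of per-catalog rates and the pooled totals disagree, which is the trap the prompt warns about.

```python
from statistics import mean


def decide(totals):
    """totals maps (segment, variant) -> (clicks, store_views)."""
    # Per-segment rate: total store_views over total clicks.
    rates = {key: views / clicks for key, (clicks, views) in totals.items()}
    # OVERALL is the arithmetic mean of per-catalog rates, not the pooled ratio.
    overall = {
        variant: mean(r for (_, v), r in rates.items() if v == variant)
        for variant in ("A", "B")
    }
    # B must be strictly higher to win; a tie keeps template A.
    action = "create_bucket" if overall["B"] > overall["A"] else "write_log"
    return rates, overall, action


# Made-up totals for illustration only.
totals = {
    ("Apparel", "A"): (100, 50),
    ("Books", "A"): (100, 50),
    ("Apparel", "B"): (1000, 400),
    ("Books", "B"): (10, 7),
}
rates, overall, action = decide(totals)
```

With these invented totals, B's mean of per-catalog rates is about 0.55 against 0.5 for A, so the sketch selects the bucket branch, whereas B's pooled ratio (407 / 1010, about 0.40) would have pointed the other way.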

Full agent trajectory: paper provenance card (find-alita-paper-r010, 3 tool calls)

Each assistant message carries reasoning_content (chain-of-thought) alongside tool calls. This is a short task (3 tool calls) that illustrates the search → filter → write pattern.

System:

Accessible workspace directory: /workspace/dumps/workspace […]

User:

We’re building provenance cards for the “Lumio” model series in the papers catalog. From the Lumio-titled records, take only those published on or before 2025-11-02, and pick the most recent of those. Write its provenance card to paper_report.txt (title / arxiv_abs_url / code_url lines) […] Mirror the same id and abs link into selected_record.json.

Step 1 — search the paper catalog:

Reasoning:

We need to search the papers catalog for title containing “Lumio”. Use arxiv_local-search_papers.

Tool call:

arxiv_local-search_papers({"query": "Lumio", "max_results": 20})

Result (4 papers found):

{"total": 4, "results": [
  {"id": "1905.11946", "title": "Lumio: EfficientNet: ...", "published": "2025-12-12"},
  {"id": "2605.05104v1", "title": "Lumio: Scaling Vision ...", "published": "2025-02-20"},
  {"id": "2103.14023", "title": "Lumio: Swin Transformer ...", "published": "2024-11-21"},
  {"id": "1409.4842", "title": "Lumio: Going Deeper with Convolutions", "published": "2025-05-11"}
]}

Step 2 — filter by date, pick the most recent, write the report:

Reasoning:

Check dates: 1905.11946 published 2025-12-12 (after cutoff — exclude). 2605.05104v1 published 2025-02-20 (before). 2103.14023 published 2024-11-21 (before). 1409.4842 published 2025-05-11 (before). Among qualifying papers, the most recent is 1409.4842 (2025-05-11). Its abstract contains the code repo URL https://github.com/Helios-Lab/Lumio.

Tool call:

filesystem-write_file({
  "path": "paper_report.txt",
  "content": "Lumio: Going Deeper with Convolutions\nhttps://arxiv.org/abs/1409.4842\nhttps://github.com/Helios-Lab/Lumio"
})

Result: Successfully wrote to paper_report.txt

Step 3 — write the JSON record:

Tool call:

filesystem-write_file({
  "path": "selected_record.json",
  "content": "{\"id\": \"1409.4842\", \"arxiv_abs_url\": \"https://arxiv.org/abs/1409.4842\", ...}"
})

Result: Successfully wrote to selected_record.json

Step 4 — final message:

The provenance card for the most recent Lumio-titled paper published on or before 2025-11-02 has been created: Lumio: Going Deeper with Convolutions (arxiv.org/abs/1409.4842, code at github.com/Helios-Lab/Lumio).

Why it’s hard: The agent must (a) exclude the one paper published after the cutoff, treating “on or before” as an inclusive boundary (a common off-by-one trap), (b) pick the most recent of the remaining three rather than the first result, and (c) extract the code URL from the abstract text, because it is not a stored field.
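The filter-then-select logic can be sketched with the four search results shown in Step 1. The helper names `pick_latest` and `first_github_url` are illustrative, and the abstract string is an invented placeholder, since the real abstract text is not reproduced on this page.

```python
import re
from datetime import date

# The four search results shown in Step 1 (ids and dates only).
results = [
    {"id": "1905.11946", "published": "2025-12-12"},
    {"id": "2605.05104v1", "published": "2025-02-20"},
    {"id": "2103.14023", "published": "2024-11-21"},
    {"id": "1409.4842", "published": "2025-05-11"},
]


def pick_latest(records, cutoff):
    """Return the most recent record published on or before cutoff (inclusive)."""
    eligible = [r for r in records if date.fromisoformat(r["published"]) <= cutoff]
    if not eligible:
        return None
    return max(eligible, key=lambda r: date.fromisoformat(r["published"]))


def first_github_url(text):
    """Return the first GitHub repository URL in free text, or None."""
    match = re.search(r"https://github\.com/[\w.-]+/[\w.-]+", text)
    # Strip sentence punctuation that the pattern may have swallowed.
    return match.group(0).rstrip(".,;)") if match else None


selected = pick_latest(results, date(2025, 11, 2))

# Placeholder abstract for illustration; not the actual catalog text.
abstract = "We present Lumio. Code: https://github.com/Helios-Lab/Lumio."
code_url = first_github_url(abstract)
```

Here `selected["id"]` is `1409.4842`, matching the trajectory: the 2025-12-12 record is dropped by the inclusive cutoff, and 2025-05-11 is the latest of the remaining three.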

Grading — deterministic checks (ab-testing-r000)

The grader runs 6 concrete checks — no LLM judge:

Check	Kind	What it verifies
`file_exists:segment_rates.csv`	`file_exists`	The output CSV was written
`csv_columns:segment_rates.csv`	`csv_columns`	Columns are exactly `segment`, `variant`, `rate`
`csv_row_bounds:segment_rates.csv`	`csv_row_bounds`	Exactly 16 rows (7 segments × 2 variants + 2 overall)
`tabular_expected_rows:segment_rates.csv`	`tabular_expected_rows`	Every row matches the gold values (e.g. `Apparel,A,70.936`)
`mock_fired:storage_create_bucket`	`mock_tool_calls`	A bucket with prefix `winner-template-b` was created (because B wins)
`mock_forbidden:logging_write_log`	`mock_tool_calls`	No log entry was written (wrong branch for B-wins)

The task passes only when all 6 checks pass (score == 1.0). A single wrong rate, a missing overall row, or taking the wrong conditional action fails the task.

For find-alita-paper-r010, the grader checks that paper_report.txt contains the exact title (Lumio: Going Deeper with Convolutions), the correct arXiv URL (arxiv.org/abs/1409.4842), and the code repository link (github.com/Helios-Lab/Lumio). Picking the wrong paper or missing the GitHub URL from the abstract fails the task.
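To illustrate what deterministic, file-level checks of the kinds listed for ab-testing-r000 might look like, here is a small sketch covering `file_exists`, `csv_columns`, and `csv_row_bounds`. It is not the dataset's `verify.py`: the function name `run_checks` and the fraction-of-checks-passed scoring are assumptions made for this example.

```python
import csv
from pathlib import Path


def run_checks(csv_path, expected_columns, expected_rows):
    """Run three file-level checks and return (score, per-check results)."""
    path = Path(csv_path)
    results = {"file_exists": path.is_file()}
    header, rows = [], []
    if results["file_exists"]:
        with path.open(newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])
            rows = [row for row in reader if row]  # ignore blank lines
    results["csv_columns"] = header == list(expected_columns)
    results["csv_row_bounds"] = len(rows) == expected_rows
    # Assumption: score is the fraction of checks passed, so 1.0 means all passed.
    score = sum(results.values()) / len(results)
    return score, results
```

For the A/B task this would be called as `run_checks("segment_rates.csv", ["segment", "variant", "rate"], 16)`; the gold-value and mock-tool-call checks would need the grader spec and the environment's call log, which this sketch does not model.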

Download

# Download Toolathlon demo data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "toolathlon/*"

Browse on Hugging Face

View Toolathlon files

For the complete Toolathlon corpus — all 4,300 tasks, 1,682 SFT trajectories, the full 32-server environment, and commercial licensing — see the Full Dataset page.
