Skip to main content
These demo samples are a free 10-task slice of the Toolathlon dataset — a corpus of single-turn, tool-using agent tasks set inside a shared multi-application MCP workspace. Each task gives the agent one user request spanning real-world services (Canvas LMS, GitHub, BigQuery, WooCommerce, arXiv, Notion, Google Workspace, and more), and grades the result deterministically on what the agent actually produced. The full dataset ships 4,300 RL environments and 1,682 SFT trajectories across 102 task families — see the Full Dataset page.

Overview

PropertyValue
Demo tasks10 RL tasks across 10 families (one per family)
Demo SFT trajectories10 graded-correct rollouts (reward == 1.0), one per task
Environment32 MCP tool servers over one consistent world (the full ~32 GB shared environment ships with the full corpus)
Task formatSingle-turn: one user request → tool calls → deterministic grading
GradingPer-task deterministic grader — programmatic, no LLM judge
TrajectoriesOpenAI chat format with reasoning_content (chain-of-thought) + tool_calls
Full corpus4,300 RL tasks · 1,682 SFT trajectories · 102 families — see Full Dataset

Environment

Every task runs against one reusable environment — a multi-application workspace backed by 32 MCP tool servers launched as local subprocesses, with each task overlaying its own mock data and launching only the servers it needs. The demo bundles the 10 tasks and their SFT trajectories; the full ~32 GB shared environment (all 32 servers + fixtures) ships with the full corpus.
toolathlon/
├── sft/
│   ├── trajectories.jsonl          # 10 demo SFT rollouts, one JSON object per line
│   └── stats.json                  # per-domain counts
└── rl/
    └── tasks/task_<family>-rNNN/    # 10 demo task instances (one per family)
        ├── task.json                #   prompts, visible tools, limits, grading config
        ├── env/
        │   ├── initial_workspace/   #   files the agent starts with
        │   └── mock_mcp_output/     #   task-specific mock data layered on the shared world
        └── verification/
            ├── verify.py            #   the reward function
            └── grader_spec.json     #   the checks (the answer key)
The full corpus additionally ships rl/shared/toolathlon/ — the one shared world all 4,300 tasks run against (bundle.json + env/ ~32 GB + mcp/ 32 servers + tools/ registry + verification_lib/ grader library).

Data

PathDescription
sft/trajectories.jsonl1,682 SFT rollouts in OpenAI chat format — tools, messages (system / user / assistant / tool), and metadata
sft/stats.jsonAggregate counts, per-domain breakdown, token distribution
rl/tasks/<id>/task.jsonTask spec: system_prompt, user_prompt, env, tools[], limits, metadata
rl/tasks/<id>/env/initial_workspace/Files the agent starts with in its working directory
rl/tasks/<id>/env/mock_mcp_output/Task-specific mock data layered on the shared world
rl/tasks/<id>/verification/verify.pyDeterministic reward function — reads the world end state, returns [0.0, 1.0]
rl/tasks/<id>/verification/grader_spec.jsonThe concrete checks: file existence, CSV content, tool-call effects, service state
rl/shared/toolathlon/The shared environment — 32 MCP servers, fixture data, runner, grader library (full corpus only; not bundled in the demo download)

Sample

The 102 families span education, research, data/ML, business, commerce, developer workflows, documents, and more — each instantiated across many scenario variants (the rNNN suffix):
DomainFamilies
Education & LMScanvas-arrange-exam, canvas-art-manager, canvas-art-quiz, canvas-do-quiz, canvas-homework-grader-python, canvas-list-test, canvas-new-students-notification, canvas-submit-late-work, course-assistant, course-schedule, courses-ta-hws, university-course-selection
Research & academiaacademic-pdf-report, academic-warning, add-bibtex, apply-phd-email, cvpr-research, find-alita-paper, hk-top-conf, paper-checker
Data & MLab-testing, imagenet, llm-training-dataset, logical-datasets-collection, merge-hf-datasets, verl-dataset
Business & financeflagged-transactions, gdp-cr5-analysis, investment-decision-analysis, invoice-org, live-transactions, nvidia-market, nvidia-stock-analysis, oil-price, payable-invoice-checker, quantitative-financial-analysis, sales-accounting, stock-build-position, yahoo-analysis
Commercefilter-low-selling-products, price-comparison, woocommerce-customer-survey, woocommerce-new-product, woocommerce-new-welcome, woocommerce-product-recall, woocommerce-stock-alert, woocommerce-update-cover
Productivity & communicationarrange-workspace, cooking-guidance, detect-revised-terms, dietary-health, email-paper-homepage, fillout-online-forms, game-statistics, identify-all-songs, interview-report, landing-task-reminder, meeting-assign, music-analysis, nhl-b2b-analysis, profile-update-online, set-conf-cr-ddl, student-interview, task-tracker
Developer & infradataset-license-issue, git-bug-hunt, git-milestone, git-repo, k8s-safety-audit, personal-website-construct, sla-timeout-monitor, sync-todo-to-readme, youtube-repo
Documents & analysisexcel-data-transformation, excel-market-research, ipad-edu-price, latex-prompt-box, machine-operating, ppt-analysis, privacy-desensitization, reimbursement-form-filler
Travel & logisticssubway-planning, train-ticket-plan, travel-exchange, travel-expense-reimbursement, trip-adviser, trip-itinerary-generator, upenn-campus-route, search-ca-school
Productivity appslanguage-school, notion-find-job, notion-hr, notion-movies, notion-personal-website, update-material-inventory, vlm-history-completer, wandb-best-score, wandb-shortest-length, mrbeast-analysis, inter-final-performance-analysis, experiments-recordings, inventory-sync
TaskMCP serversDescription
ab-testing-r000BigQuery, FilesystemCompute per-segment conversion rates from warehouse tables, fill a CSV, then conditionally create a GCS bucket or write a log entry based on the winner
canvas-do-quiz-r003Memory, Canvas, FilesystemLook up a student identity, list unfinished quizzes across courses, fill a catch-up plan CSV, then actually submit each quiz on Canvas
find-alita-paper-r010arXiv, Filesystem, ScholarlySearch a paper catalog by title keyword and date cutoff, extract a code URL buried in the abstract, write a provenance card and a JSON record
invoice-org-r008PDF Tools, Filesystem, Yahoo Finance, ExcelParse invoices from a document inbox, convert multi-currency amounts using a rate series, write a summary CSV and a JSON total
excel-data-transformation-r000Excel, Filesystem, TerminalFlatten a stacked-banner-header workbook into a normalized table with derived columns
interview-report-r000Filesystem, WordRead 6 per-item documents, synthesize an interview report in Word format
git-milestone-r000Filesystem, Terminal, FetchVet open-source dependencies from a GitHub account — audit commits, releases, and license compliance
cooking-guidance-r000Filesystem, HowToCookPlan a weekly dish lineup, pull recipes, and produce a consolidated grocery top-up list
machine-operating-r000BigQuery, Filesystem, ExcelPull sensor readings from a wind-farm feed, identify calibration exceptions, produce a worklist
music-analysis-r000Excel, Google Sheets, Terminal, FilesystemAnalyze chart-streak data from a weekly grid, compute statistics, and build an A&R briefing
System prompt:
Accessible workspace directory: /workspace/dumps/workspace […] If you believe the task is completed, you can either call the local-claim_done tool or respond without calling any tool to indicate completion.
User prompt:
We ran an A/B test on two product-recommendation email templates (A and B) across our catalog. Each catalog table in the warehouse logs per-window clicks and store_views for one template variant. For each catalog and each variant, compute the conversion rate as total store_views over total clicks, and fill the provided segment_rates.csv (keep its headers). Add one final OVERALL row per variant whose rate is the arithmetic mean of that variant’s per-catalog rates (not the pooled totals). Then decide the winner by comparing the two OVERALL means. If template B’s overall mean is higher, roll it out: create one GCS bucket named with prefix winner-template-b. Otherwise (A wins or a tie), make no bucket and instead write a single cloud_logging entry with decision template_A_retained and note No template change.
MCP servers: google-cloud (BigQuery + Cloud Storage + Logging), filesystemInitial workspace: a segment_rates.csv with headers and empty rate cells:
segment,variant,rate
Apparel,A,
Apparel,B,
Books,A,
...
overall,A,
overall,B,
The agent must query BigQuery tables, compute rates, fill the CSV, compare the overall means, then take the correct conditional action (bucket creation or log entry — but not both).
Each assistant message carries reasoning_content (chain-of-thought) alongside tool calls. This is a short task (3 tool calls) that illustrates the search → filter → write pattern.System:
Accessible workspace directory: /workspace/dumps/workspace […]
User:
We’re building provenance cards for the “Lumio” model series in the papers catalog. From the Lumio-titled records, take only those published on or before 2025-11-02, and pick the most recent of those. Write its provenance card to paper_report.txt (title / arxiv_abs_url / code_url lines) […] Mirror the same id and abs link into selected_record.json.

Step 1 — search the paper catalog:Reasoning:
We need to search the papers catalog for title containing “Lumio”. Use arxiv_local-search_papers.
Tool call:
arxiv_local-search_papers({"query": "Lumio", "max_results": 20})
Result (4 papers found):
{"total": 4, "results": [
  {"id": "1905.11946", "title": "Lumio: EfficientNet: ...", "published": "2025-12-12"},
  {"id": "2605.05104v1", "title": "Lumio: Scaling Vision ...", "published": "2025-02-20"},
  {"id": "2103.14023", "title": "Lumio: Swin Transformer ...", "published": "2024-11-21"},
  {"id": "1409.4842", "title": "Lumio: Going Deeper with Convolutions", "published": "2025-05-11"}
]}

Step 2 — filter by date, pick the most recent, write the report:Reasoning:
Check dates: 1905.11946 published 2025-12-12 (after cutoff — exclude). 2605.05104v1 published 2025-02-20 (before). 2103.14023 published 2024-11-21 (before). 1409.4842 published 2025-05-11 (before). Among qualifying papers, the most recent is 1409.4842 (2025-05-11). Its abstract contains the code repo URL https://github.com/Helios-Lab/Lumio.
Tool call:
filesystem-write_file({
  "path": "paper_report.txt",
  "content": "Lumio: Going Deeper with Convolutions\nhttps://arxiv.org/abs/1409.4842\nhttps://github.com/Helios-Lab/Lumio"
})
Result: Successfully wrote to paper_report.txt
Step 3 — write the JSON record:Tool call:
filesystem-write_file({
  "path": "selected_record.json",
  "content": "{\"id\": \"1409.4842\", \"arxiv_abs_url\": \"https://arxiv.org/abs/1409.4842\", ...}"
})
Result: Successfully wrote to selected_record.json
Step 4 — final message:
The provenance card for the most recent Lumio-titled paper published on or before 2025-11-02 has been created: Lumio: Going Deeper with Convolutions (arxiv.org/abs/1409.4842, code at github.com/Helios-Lab/Lumio).
Why it’s hard: The agent must (a) exclude one paper that falls after the date cutoff (a common off-by-one trap), (b) pick the most recent among the remaining three (not the first result), and (c) extract the code URL from the abstract text — it’s not a stored field.
The grader runs 6 concrete checks — no LLM judge:
CheckKindWhat it verifies
file_exists:segment_rates.csvfile_existsThe output CSV was written
csv_columns:segment_rates.csvcsv_columnsColumns are exactly segment, variant, rate
csv_row_bounds:segment_rates.csvcsv_row_boundsExactly 16 rows (7 segments × 2 variants + 2 overall)
tabular_expected_rows:segment_rates.csvtabular_expected_rowsEvery row matches the gold values (e.g. Apparel,A,70.936)
mock_fired:storage_create_bucketmock_tool_callsA bucket with prefix winner-template-b was created (because B wins)
mock_forbidden:logging_write_logmock_tool_callsNo log entry was written (wrong branch for B-wins)
The task passes only when all 6 checks pass (score == 1.0). A single wrong rate, a missing overall row, or taking the wrong conditional action fails the task.For find-alita-paper-r010, the grader checks that paper_report.txt contains the exact title (Lumio: Going Deeper with Convolutions), the correct arXiv URL (arxiv.org/abs/1409.4842), and the code repository link (github.com/Helios-Lab/Lumio). Picking the wrong paper or missing the GitHub URL from the abstract fails the task.

Download

# Download Toolathlon demo data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "toolathlon/*"

Browse on Hugging Face

View Toolathlon files
For the complete Toolathlon corpus — all 4,300 tasks, 1,682 SFT trajectories, the full 32-server environment, and commercial licensing — see the Full Dataset page.