Overview
| Property | Value |
|---|---|
| Demo tasks | 10 RL tasks across 10 families (one per family) |
| Demo SFT trajectories | 10 graded-correct rollouts (reward == 1.0), one per task |
| Environment | 32 MCP tool servers over one consistent world (the full ~32 GB shared environment ships with the full corpus) |
| Task format | Single-turn: one user request → tool calls → deterministic grading |
| Grading | Per-task deterministic grader — programmatic, no LLM judge |
| Trajectories | OpenAI chat format with reasoning_content (chain-of-thought) + tool_calls |
| Full corpus | 4,300 RL tasks · 1,682 SFT trajectories · 102 families — see Full Dataset |
Environment
Every task runs against one reusable environment — a multi-application workspace backed by 32 MCP tool servers launched as local subprocesses, with each task overlaying its own mock data and launching only the servers it needs. The demo bundles the 10 tasks and their SFT trajectories; the full ~32 GB shared environment (all 32 servers + fixtures) ships with the full corpus.rl/shared/toolathlon/ — the one shared world all 4,300 tasks run against (bundle.json + env/ ~32 GB + mcp/ 32 servers + tools/ registry + verification_lib/ grader library).
Data
| Path | Description |
|---|---|
sft/trajectories.jsonl | 1,682 SFT rollouts in OpenAI chat format — tools, messages (system / user / assistant / tool), and metadata |
sft/stats.json | Aggregate counts, per-domain breakdown, token distribution |
rl/tasks/<id>/task.json | Task spec: system_prompt, user_prompt, env, tools[], limits, metadata |
rl/tasks/<id>/env/initial_workspace/ | Files the agent starts with in its working directory |
rl/tasks/<id>/env/mock_mcp_output/ | Task-specific mock data layered on the shared world |
rl/tasks/<id>/verification/verify.py | Deterministic reward function — reads the world end state, returns [0.0, 1.0] |
rl/tasks/<id>/verification/grader_spec.json | The concrete checks: file existence, CSV content, tool-call effects, service state |
rl/shared/toolathlon/ | The shared environment — 32 MCP servers, fixture data, runner, grader library (full corpus only; not bundled in the demo download) |
Sample
All task families (102)
All task families (102)
The 102 families span education, research, data/ML, business, commerce, developer workflows, documents, and more — each instantiated across many scenario variants (the
rNNN suffix):| Domain | Families |
|---|---|
| Education & LMS | canvas-arrange-exam, canvas-art-manager, canvas-art-quiz, canvas-do-quiz, canvas-homework-grader-python, canvas-list-test, canvas-new-students-notification, canvas-submit-late-work, course-assistant, course-schedule, courses-ta-hws, university-course-selection |
| Research & academia | academic-pdf-report, academic-warning, add-bibtex, apply-phd-email, cvpr-research, find-alita-paper, hk-top-conf, paper-checker |
| Data & ML | ab-testing, imagenet, llm-training-dataset, logical-datasets-collection, merge-hf-datasets, verl-dataset |
| Business & finance | flagged-transactions, gdp-cr5-analysis, investment-decision-analysis, invoice-org, live-transactions, nvidia-market, nvidia-stock-analysis, oil-price, payable-invoice-checker, quantitative-financial-analysis, sales-accounting, stock-build-position, yahoo-analysis |
| Commerce | filter-low-selling-products, price-comparison, woocommerce-customer-survey, woocommerce-new-product, woocommerce-new-welcome, woocommerce-product-recall, woocommerce-stock-alert, woocommerce-update-cover |
| Productivity & communication | arrange-workspace, cooking-guidance, detect-revised-terms, dietary-health, email-paper-homepage, fillout-online-forms, game-statistics, identify-all-songs, interview-report, landing-task-reminder, meeting-assign, music-analysis, nhl-b2b-analysis, profile-update-online, set-conf-cr-ddl, student-interview, task-tracker |
| Developer & infra | dataset-license-issue, git-bug-hunt, git-milestone, git-repo, k8s-safety-audit, personal-website-construct, sla-timeout-monitor, sync-todo-to-readme, youtube-repo |
| Documents & analysis | excel-data-transformation, excel-market-research, ipad-edu-price, latex-prompt-box, machine-operating, ppt-analysis, privacy-desensitization, reimbursement-form-filler |
| Travel & logistics | subway-planning, train-ticket-plan, travel-exchange, travel-expense-reimbursement, trip-adviser, trip-itinerary-generator, upenn-campus-route, search-ca-school |
| Productivity apps | language-school, notion-find-job, notion-hr, notion-movies, notion-personal-website, update-material-inventory, vlm-history-completer, wandb-best-score, wandb-shortest-length, mrbeast-analysis, inter-final-performance-analysis, experiments-recordings, inventory-sync |
Representative tasks (10 diverse examples)
Representative tasks (10 diverse examples)
| Task | MCP servers | Description |
|---|---|---|
ab-testing-r000 | BigQuery, Filesystem | Compute per-segment conversion rates from warehouse tables, fill a CSV, then conditionally create a GCS bucket or write a log entry based on the winner |
canvas-do-quiz-r003 | Memory, Canvas, Filesystem | Look up a student identity, list unfinished quizzes across courses, fill a catch-up plan CSV, then actually submit each quiz on Canvas |
find-alita-paper-r010 | arXiv, Filesystem, Scholarly | Search a paper catalog by title keyword and date cutoff, extract a code URL buried in the abstract, write a provenance card and a JSON record |
invoice-org-r008 | PDF Tools, Filesystem, Yahoo Finance, Excel | Parse invoices from a document inbox, convert multi-currency amounts using a rate series, write a summary CSV and a JSON total |
excel-data-transformation-r000 | Excel, Filesystem, Terminal | Flatten a stacked-banner-header workbook into a normalized table with derived columns |
interview-report-r000 | Filesystem, Word | Read 6 per-item documents, synthesize an interview report in Word format |
git-milestone-r000 | Filesystem, Terminal, Fetch | Vet open-source dependencies from a GitHub account — audit commits, releases, and license compliance |
cooking-guidance-r000 | Filesystem, HowToCook | Plan a weekly dish lineup, pull recipes, and produce a consolidated grocery top-up list |
machine-operating-r000 | BigQuery, Filesystem, Excel | Pull sensor readings from a wind-farm feed, identify calibration exceptions, produce a worklist |
music-analysis-r000 | Excel, Google Sheets, Terminal, Filesystem | Analyze chart-streak data from a weekly grid, compute statistics, and build an A&R briefing |
Full task: A/B test analysis (ab-testing-r000)
Full task: A/B test analysis (ab-testing-r000)
System prompt:The agent must query BigQuery tables, compute rates, fill the CSV, compare the overall means, then take the correct conditional action (bucket creation or log entry — but not both).
Accessible workspace directory: /workspace/dumps/workspace […]
If you believe the task is completed, you can either call the local-claim_done tool or respond without calling any tool to indicate completion.
User prompt:We ran an A/B test on two product-recommendation email templates (A and B) across our catalog. Each catalog table in the warehouse logs per-window clicks and store_views for one template variant. For each catalog and each variant, compute the conversion rate as total store_views over total clicks, and fill the providedMCP servers:segment_rates.csv(keep its headers). Add one final OVERALL row per variant whose rate is the arithmetic mean of that variant’s per-catalog rates (not the pooled totals). Then decide the winner by comparing the two OVERALL means. If template B’s overall mean is higher, roll it out: create one GCS bucket named with prefixwinner-template-b. Otherwise (A wins or a tie), make no bucket and instead write a single cloud_logging entry with decisiontemplate_A_retainedand noteNo template change.
google-cloud (BigQuery + Cloud Storage + Logging), filesystemInitial workspace: a segment_rates.csv with headers and empty rate cells:Full agent trajectory: paper provenance card (find-alita-paper-r010, 3 tool calls)
Full agent trajectory: paper provenance card (find-alita-paper-r010, 3 tool calls)
Each assistant message carries
Step 1 — search the paper catalog:Reasoning:Result (4 papers found):
Step 2 — filter by date, pick the most recent, write the report:Reasoning:Result:
Step 3 — write the JSON record:Tool call:Result:
Step 4 — final message:
reasoning_content (chain-of-thought) alongside tool calls. This is a short task (3 tool calls) that illustrates the search → filter → write pattern.System:Accessible workspace directory: /workspace/dumps/workspace […]User:
We’re building provenance cards for the “Lumio” model series in thepaperscatalog. From the Lumio-titled records, take only those published on or before 2025-11-02, and pick the most recent of those. Write its provenance card topaper_report.txt(title / arxiv_abs_url / code_url lines) […] Mirror the same id and abs link intoselected_record.json.
Step 1 — search the paper catalog:Reasoning:
We need to search the papers catalog for title containing “Lumio”. Use arxiv_local-search_papers.Tool call:
Step 2 — filter by date, pick the most recent, write the report:Reasoning:
Check dates: 1905.11946 published 2025-12-12 (after cutoff — exclude). 2605.05104v1 published 2025-02-20 (before). 2103.14023 published 2024-11-21 (before). 1409.4842 published 2025-05-11 (before). Among qualifying papers, the most recent is 1409.4842 (2025-05-11). Its abstract contains the code repo URL https://github.com/Helios-Lab/Lumio.
Tool call:Successfully wrote to paper_report.txtStep 3 — write the JSON record:Tool call:
Successfully wrote to selected_record.jsonStep 4 — final message:
The provenance card for the most recent Lumio-titled paper published on or before 2025-11-02 has been created: Lumio: Going Deeper with Convolutions (arxiv.org/abs/1409.4842, code at github.com/Helios-Lab/Lumio).Why it’s hard: The agent must (a) exclude one paper that falls after the date cutoff (a common off-by-one trap), (b) pick the most recent among the remaining three (not the first result), and (c) extract the code URL from the abstract text — it’s not a stored field.
Grading — deterministic checks (ab-testing-r000)
Grading — deterministic checks (ab-testing-r000)
The grader runs 6 concrete checks — no LLM judge:
The task passes only when all 6 checks pass (
| Check | Kind | What it verifies |
|---|---|---|
file_exists:segment_rates.csv | file_exists | The output CSV was written |
csv_columns:segment_rates.csv | csv_columns | Columns are exactly segment, variant, rate |
csv_row_bounds:segment_rates.csv | csv_row_bounds | Exactly 16 rows (7 segments × 2 variants + 2 overall) |
tabular_expected_rows:segment_rates.csv | tabular_expected_rows | Every row matches the gold values (e.g. Apparel,A,70.936) |
mock_fired:storage_create_bucket | mock_tool_calls | A bucket with prefix winner-template-b was created (because B wins) |
mock_forbidden:logging_write_log | mock_tool_calls | No log entry was written (wrong branch for B-wins) |
score == 1.0). A single wrong rate, a missing overall row, or taking the wrong conditional action fails the task.For find-alita-paper-r010, the grader checks that paper_report.txt contains the exact title (Lumio: Going Deeper with Convolutions), the correct arXiv URL (arxiv.org/abs/1409.4842), and the code repository link (github.com/Helios-Lab/Lumio). Picking the wrong paper or missing the GitHub URL from the abstract fails the task.Download
Browse on Hugging Face
View Toolathlon files
For the complete Toolathlon corpus — all 4,300 tasks, 1,682 SFT trajectories, the full 32-server environment, and commercial licensing — see the Full Dataset page.