Want to try it first? The Demo Samples page walks through the environment layout, representative tasks, a full agent trajectory, and the grading system.
What Toolathlon is
A Toolathlon task drops the agent into a shared simulated world — one reusable environment of interconnected services populated with courses, students, emails, repositories, datasets, storefronts, spreadsheets, documents, and more — then asks it to accomplish a concrete goal. The agent works by calling tools through the Model Context Protocol (MCP), namespaced<server>-<tool> (e.g. canvas-canvas_get_quiz, filesystem-write_file, google_sheet-google_sheet_read_range). When the rollout ends, a per-task deterministic grader reads the resulting world state and returns a reward in [0.0, 1.0].
Tasks fall into two roles:
- SFT + RL (1,682 tasks) — ship both a graded-correct trajectory (for SFT) and a runnable environment (for RL). The SFT trajectories are successful rollouts of these RL tasks.
- RL-only (2,618 tasks) — ship the runnable environment and grader but no example trajectory. They are extra environments for RL training.
At a glance
| Property | Value |
|---|---|
| Total RL tasks | 4,300 |
| SFT trajectories | 1,682 (graded-correct rollouts of 1,682 of the RL tasks) |
| RL-only tasks | 2,618 (no example trajectory; fully runnable and scorable) |
| Task families | 102 distinct task types |
| Shared environment | 1 (toolathlon) — 32 MCP tool servers over one consistent world |
| Agent tools | 32 MCP servers — 26 mock service servers + 6 real local servers |
| Task format | Single-turn: one user request, graded on final world state |
| Grading | Per-task deterministic grader — programmatic, no LLM judge |
| Reward signal | [0.0, 1.0]; passed = true only when every check passes (score == 1.0) |
| Suitable for | Tool-use SFT (with trajectories) · RL (verifier as reward) · benchmark |
What’s inside
| Component | Description |
|---|---|
| Shared environment | One reusable world used by all 4,300 tasks — 32 tool servers launched as local subprocesses, backed by ~32 GB of shared fixture data. Each task overlays its own mock data on top. |
| Tasks | One folder per task, each containing a task spec (prompts, visible tools, limits), an environment directory (initial workspace + mock overlays), and a verification directory (grader + answer key). |
| SFT trajectories | 1,682 rollouts in OpenAI chat format with tools, messages (system / user / assistant / tool), and metadata. |
| Reward verifiers | A deterministic grader per task. Checks are concrete: required output files, exact CSV/JSON content, correct tool-call effects, expected mock-service end state. |
| Reference runner | A complete runner script that brings up the environment, drives a model, and scores the result — usable as-is or as a template for custom agent loops. |
Task categories
The 102 task families span a wide range of real-world workflows. Each family is instantiated across many scenario variants, producing the 4,300 total tasks. The environment provides 32 MCP tool servers: 26 mock service servers (Canvas, email, WooCommerce, GitHub, Git, Hugging Face, Notion, Google Sheets/Calendar/Forms/Maps, Snowflake, BigQuery, Weights & Biases, arXiv, Scholarly, Yahoo Finance, YouTube, HowToCook, Memory, Kubernetes, Railway 12306, Fetch, Playwright) and 6 real local servers (filesystem, terminal, Excel, Word, PowerPoint, PDF). A task only launches the servers it needs. Everything is local; no network required. Representative families:| Domain | Example families |
|---|---|
| Education & LMS | canvas-do-quiz, canvas-arrange-exam, canvas-homework-grader-python, canvas-art-quiz, canvas-submit-late-work, course-schedule, course-assistant |
| Research & academia | find-alita-paper, cvpr-research, paper-checker, academic-pdf-report, add-bibtex, apply-phd-email |
| Data & ML | imagenet, llm-training-dataset, verl-dataset, merge-hf-datasets, logical-datasets-collection, ab-testing |
| Business & finance | invoice-org, payable-invoice-checker, sales-accounting, investment-decision-analysis, flagged-transactions, travel-expense-reimbursement |
| Commerce | woocommerce-new-welcome, woocommerce-product-recall, woocommerce-stock-alert, woocommerce-customer-survey |
| Productivity | meeting-assign, interview-report, reimbursement-form-filler, arrange-workspace, fillout-online-forms |
| Developer workflows | git-repo, git-milestone, personal-website-construct, youtube-repo |
| Document & analysis | ppt-analysis, excel-data-transformation, excel-market-research, detect-revised-terms, privacy-desensitization |
Training utility
Supervised fine-tuning (SFT) a smaller open-weight model on Toolathlon trajectories yields strict-pass gains on the held-out Toolathlon benchmark — 108 expert-authored tasks (thefinalpool split) run against 600+ real tools across ~10 live services (Google Workspace, Snowflake, email, GitHub, Canvas LMS, Notion, WooCommerce, Kubernetes, arXiv, and a local filesystem). Each task ships a curated initial state in containerized services; grading scripts are injected only after the agent finishes, eliminating leakage.
Training data. ~341 SFT samples in single-turn, multi-step messages format with per-turn reasoning_content (100% coverage). The system prompt inlines the same MCP tool signatures the benchmark exposes. Trajectories are genuinely long-horizon and tool-heavy — median 13 assistant turns (p25–p75: 9–21) and 23 tool calls (14–31) per sample — spanning shell, files, Snowflake SQL, email, GitHub, Canvas, and Notion. Base model: Qwen3.6-27B.
Metric. Strict pass@1 (a task passes only if all post-hoc verification checks pass), temperature = 0.6.
Results.
| Model | Passed | pass@1 | Config |
|---|---|---|---|
| Baseline | 26/108 | 24.0% | — |
| Toolathlon-SFT | 30/108 | 27.8% | +3.8 pts |
- Git-history forensics + communication (
git-bug-hunt): walk a repo’s history to find the earliest commit that introduced a given variable, extract its hash, author, and full message, format them against a template, and email the author with an exact required subject line — a chain across git → filesystem → email that must stay correctly ordered end-to-end. - Structured data analysis → report generation (
nhl-b2b-analysis): parse a full season schedule spreadsheet, compute each team’s back-to-back sets broken down across all four home/away configurations, and emit a new spreadsheet with exact headers. - Conditional, policy-branching execution (
ab-testing): analyze BigQuery clickstream data, compute overall conversion rates, and then branch correctly — create a Cloud Storage bucket if B wins, or write a specific log entry otherwise — getting both the computation and the conditional side-effect right. - Document extraction → structured fill (
reimbursement-form-filler): pull fields out of PDF receipts, fill a fixed Excel template without disturbing its layout, and rename the output — a parse → compute → format-and-write pattern that repeats across e-commerce recall workflows and LMS grading tasks as well.