Skip to main content
Toolathlon is a corpus of single-turn, tool-using agent tasks set inside a shared multi-application workspace backed by 32 MCP tool servers — Canvas (LMS), email, WooCommerce, GitHub, Hugging Face, Notion, Google Sheets/Calendar/Forms/Maps, Snowflake, BigQuery, arXiv, a filesystem, a terminal, and document tools (Excel, Word, PowerPoint, PDF). Each task gives the agent one user request and grades the result deterministically on what it actually produced: the files it wrote, the tool calls it made, and the final state of the mock services. The full corpus spans 102 task families and 4,300 RL environments; data generated and verified by EigenData-CLI.
Want to try it first? The Demo Samples page walks through the environment layout, representative tasks, a full agent trajectory, and the grading system.

What Toolathlon is

A Toolathlon task drops the agent into a shared simulated world — one reusable environment of interconnected services populated with courses, students, emails, repositories, datasets, storefronts, spreadsheets, documents, and more — then asks it to accomplish a concrete goal. The agent works by calling tools through the Model Context Protocol (MCP), namespaced <server>-<tool> (e.g. canvas-canvas_get_quiz, filesystem-write_file, google_sheet-google_sheet_read_range). When the rollout ends, a per-task deterministic grader reads the resulting world state and returns a reward in [0.0, 1.0]. Tasks fall into two roles:
  • SFT + RL (1,682 tasks) — ship both a graded-correct trajectory (for SFT) and a runnable environment (for RL). The SFT trajectories are successful rollouts of these RL tasks.
  • RL-only (2,618 tasks) — ship the runnable environment and grader but no example trajectory. They are extra environments for RL training.

At a glance

PropertyValue
Total RL tasks4,300
SFT trajectories1,682 (graded-correct rollouts of 1,682 of the RL tasks)
RL-only tasks2,618 (no example trajectory; fully runnable and scorable)
Task families102 distinct task types
Shared environment1 (toolathlon) — 32 MCP tool servers over one consistent world
Agent tools32 MCP servers — 26 mock service servers + 6 real local servers
Task formatSingle-turn: one user request, graded on final world state
GradingPer-task deterministic grader — programmatic, no LLM judge
Reward signal[0.0, 1.0]; passed = true only when every check passes (score == 1.0)
Suitable forTool-use SFT (with trajectories) · RL (verifier as reward) · benchmark

What’s inside

ComponentDescription
Shared environmentOne reusable world used by all 4,300 tasks — 32 tool servers launched as local subprocesses, backed by ~32 GB of shared fixture data. Each task overlays its own mock data on top.
TasksOne folder per task, each containing a task spec (prompts, visible tools, limits), an environment directory (initial workspace + mock overlays), and a verification directory (grader + answer key).
SFT trajectories1,682 rollouts in OpenAI chat format with tools, messages (system / user / assistant / tool), and metadata.
Reward verifiersA deterministic grader per task. Checks are concrete: required output files, exact CSV/JSON content, correct tool-call effects, expected mock-service end state.
Reference runnerA complete runner script that brings up the environment, drives a model, and scores the result — usable as-is or as a template for custom agent loops.

Task categories

The 102 task families span a wide range of real-world workflows. Each family is instantiated across many scenario variants, producing the 4,300 total tasks. The environment provides 32 MCP tool servers: 26 mock service servers (Canvas, email, WooCommerce, GitHub, Git, Hugging Face, Notion, Google Sheets/Calendar/Forms/Maps, Snowflake, BigQuery, Weights & Biases, arXiv, Scholarly, Yahoo Finance, YouTube, HowToCook, Memory, Kubernetes, Railway 12306, Fetch, Playwright) and 6 real local servers (filesystem, terminal, Excel, Word, PowerPoint, PDF). A task only launches the servers it needs. Everything is local; no network required. Representative families:
DomainExample families
Education & LMScanvas-do-quiz, canvas-arrange-exam, canvas-homework-grader-python, canvas-art-quiz, canvas-submit-late-work, course-schedule, course-assistant
Research & academiafind-alita-paper, cvpr-research, paper-checker, academic-pdf-report, add-bibtex, apply-phd-email
Data & MLimagenet, llm-training-dataset, verl-dataset, merge-hf-datasets, logical-datasets-collection, ab-testing
Business & financeinvoice-org, payable-invoice-checker, sales-accounting, investment-decision-analysis, flagged-transactions, travel-expense-reimbursement
Commercewoocommerce-new-welcome, woocommerce-product-recall, woocommerce-stock-alert, woocommerce-customer-survey
Productivitymeeting-assign, interview-report, reimbursement-form-filler, arrange-workspace, fillout-online-forms
Developer workflowsgit-repo, git-milestone, personal-website-construct, youtube-repo
Document & analysisppt-analysis, excel-data-transformation, excel-market-research, detect-revised-terms, privacy-desensitization

Training utility

Supervised fine-tuning (SFT) a smaller open-weight model on Toolathlon trajectories yields strict-pass gains on the held-out Toolathlon benchmark108 expert-authored tasks (the finalpool split) run against 600+ real tools across ~10 live services (Google Workspace, Snowflake, email, GitHub, Canvas LMS, Notion, WooCommerce, Kubernetes, arXiv, and a local filesystem). Each task ships a curated initial state in containerized services; grading scripts are injected only after the agent finishes, eliminating leakage. Training data. ~341 SFT samples in single-turn, multi-step messages format with per-turn reasoning_content (100% coverage). The system prompt inlines the same MCP tool signatures the benchmark exposes. Trajectories are genuinely long-horizon and tool-heavy — median 13 assistant turns (p25–p75: 9–21) and 23 tool calls (14–31) per sample — spanning shell, files, Snowflake SQL, email, GitHub, Canvas, and Notion. Base model: Qwen3.6-27B. Metric. Strict pass@1 (a task passes only if all post-hoc verification checks pass), temperature = 0.6. Results.
ModelPassedpass@1Config
Baseline26/10824.0%
Toolathlon-SFT30/10827.8%+3.8 pts
What drove it. The newly solved tasks are the long-horizon, cross-service workflows that chain many tool calls across several systems and demand an exactly-formatted final artifact — precisely the pattern the training data models. The gains span four distinct skill types:
  • Git-history forensics + communication (git-bug-hunt): walk a repo’s history to find the earliest commit that introduced a given variable, extract its hash, author, and full message, format them against a template, and email the author with an exact required subject line — a chain across git → filesystem → email that must stay correctly ordered end-to-end.
  • Structured data analysis → report generation (nhl-b2b-analysis): parse a full season schedule spreadsheet, compute each team’s back-to-back sets broken down across all four home/away configurations, and emit a new spreadsheet with exact headers.
  • Conditional, policy-branching execution (ab-testing): analyze BigQuery clickstream data, compute overall conversion rates, and then branch correctly — create a Cloud Storage bucket if B wins, or write a specific log entry otherwise — getting both the computation and the conditional side-effect right.
  • Document extraction → structured fill (reimbursement-form-filler): pull fields out of PDF receipts, fill a fixed Excel template without disturbing its layout, and rename the output — a parse → compute → format-and-write pattern that repeats across e-commerce recall workflows and LMS grading tasks as well.
Across all of these, SFT sharpens which tools the model calls, in what order, and how precisely it formats the final output — teaching it to carry a long multi-service chain through to completion rather than stalling or misordering steps partway.

Access & licensing

The full Toolathlon corpus — all environments, tasks, reward verifiers, and SFT trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free demo sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.