Want to try it first? A free, individually-audited 20-task sample (10 operations + 10 investigation) is available on the Demo Samples page.
What Enterprise Bench is
A simulated company in Enterprise Bench is not a single API; it is a federation of enterprise systems over a shared database. Salesforce, Slack, Linear/Jira, Notion, GitHub, PagerDuty, Stripe, Grafana, SonarQube, Optimizely, LaunchDarkly, Greenhouse, Lattice, and more are all mounted as MCP tool servers backed by the same coherent world: the Linear ticket Slack is discussing is the same ticket the PagerDuty incident references, and the Salesforce account the email mentions is the one billing tracks in Stripe. An agent works the task with native tool calls (searching, reading, and, on write tasks, mutating records across systems), exactly as a human operator would move between browser tabs.
Tasks fall into two complementary families, both set in these enterprise environments:
- Top-down: operations & coordination (WRITE). A persona-driven, multi-turn request to get something done across several systems: triage a CVE across Linear and Slack, onboard a customer across Salesforce/Stripe/Calendar, run a pre-launch readiness check and start an experiment. The agent discovers context across systems, then executes coordinated writes. Graded on the final state of the world: a DB-diff against a golden delta plus an LLM rubric of goal / process criteria.
- Bottom-up: investigation & QA (READ-ONLY). An answer-mined question over the same environments: “has this feature flag ever actually been enabled?”, “are these two engineers duplicating work?”, “how many of these incidents are unresolved and are they all linked to one ticket?” The agent investigates with read-only tools and returns a grounded, cited answer. Graded on answer recall against mined ground truth (entity / fact / source recall, minus a distractor penalty).
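
The bottom-up grading described above (entity / fact / source recall, minus a distractor penalty) can be sketched as follows. This is a minimal illustration, not the shipped verifier: the equal weighting of the three recalls, the flat `penalty_per_distractor` value, the `GoldAnswer` container, and all record IDs are assumptions made for the example.

```python
from dataclasses import dataclass


def recall(gold: set, found: set) -> float:
    """Fraction of gold items present in the answer (1.0 if nothing is required)."""
    if not gold:
        return 1.0
    return len(gold & found) / len(gold)


@dataclass
class GoldAnswer:
    # Hypothetical container for mined ground truth.
    entities: set
    facts: set
    sources: set
    distractors: set  # look-alike records that should NOT be cited


def recall_reward(gold: GoldAnswer, entities: set, facts: set, sources: set,
                  penalty_per_distractor: float = 0.25) -> float:
    # Assumption: the three recalls are averaged with equal weight.
    base = (recall(gold.entities, entities)
            + recall(gold.facts, facts)
            + recall(gold.sources, sources)) / 3
    # Assumption: a flat penalty for each distractor the answer cites.
    hits = len(gold.distractors & (entities | sources))
    # Clamp so the reward stays in [0, 1].
    return max(0.0, min(1.0, base - penalty_per_distractor * hits))


gold = GoldAnswer(
    entities={"LIN-101"},
    facts={"flag never enabled", "rollout at 0%"},
    sources={"launchdarkly:audit-log", "slack:#release"},
    distractors={"LIN-110"},
)
reward = recall_reward(
    gold,
    entities={"LIN-101", "LIN-110"},   # right ticket plus a near-duplicate
    facts={"flag never enabled"},
    sources={"launchdarkly:audit-log"},
)
print(round(reward, 3))  # 0.417
```

Here the answer finds the right ticket but also cites a look-alike, so a base recall of 2/3 drops to about 0.42 once the penalty is applied.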
At a glance
| Property | Value |
|---|---|
| Enterprise environments | 27 industry verticals — each a self-contained simulated company |
| Connected systems | up to ~40 enterprise SaaS systems per environment (Salesforce, Slack, Linear/Jira, Notion, GitHub, PagerDuty, Stripe, Grafana, SonarQube, Optimizely, LaunchDarkly, Greenhouse, Lattice, Snyk, Zendesk, Google Workspace, …), all over one shared world state |
| Samples | 10,800 — task + environment + reward verifier, usable for SFT or RL; SFT-ready samples additionally ship a successful agent trajectory |
| Task families | 2 — top-down (multi-system writes), bottom-up (read-only QA) |
| Task / answer formats | write-and-coordinate; QA in factual / count / comparison / list / explanation |
| Grading | top-down: DB-diff against a golden delta + LLM rubric (goal / process / policy criteria). bottom-up: recall metrics (entity / fact / source recall) − distractor penalty |
| Layout | tau-bench four-folder (intent / datapoints / evaluators / reference_payloads), one NNNNNN id per sample |
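
To make the top-down "DB-diff against a golden delta" idea concrete, here is a rough sketch of the diff half only (the LLM rubric is not modelled). The snapshot representation, the scoring rule, and the table and record names are all assumptions for illustration; the actual verifiers may compare state differently.

```python
from typing import Optional

Row = tuple                       # (table name, primary key)
Delta = dict                      # Row -> new field values, or None if deleted


def db_delta(before: dict, after: dict) -> Delta:
    """Rows that were added, changed, or deleted between two snapshots."""
    delta: Delta = {}
    for key in before.keys() | after.keys():
        if before.get(key) != after.get(key):
            delta[key] = after.get(key)   # None marks a deleted row
    return delta


def diff_score(actual: Delta, golden: Delta) -> float:
    """Share of golden changes reproduced exactly, diluted by unexpected writes."""
    if not golden:
        return 1.0 if not actual else 0.0
    matched = sum(1 for key, row in golden.items()
                  if key in actual and actual[key] == row)
    unexpected = len(actual.keys() - golden.keys())
    return matched / (len(golden) + unexpected)


# Hypothetical world state before and after an agent rollout.
before = {("linear_tickets", "LIN-101"): {"status": "open", "linked": []}}
after = {
    ("linear_tickets", "LIN-101"): {"status": "open", "linked": ["LIN-204"]},
    ("slack_messages", "m-9"): {"channel": "#sec", "text": "CVE triaged"},
}
golden: Delta = {
    ("linear_tickets", "LIN-101"): {"status": "open", "linked": ["LIN-204"]},
}

score = diff_score(db_delta(before, after), golden)
print(score)  # 0.5
```

In this example the agent makes the expected ticket link but also posts a message the golden delta does not contain, so the sketch scores it 0.5. Whether extra writes should be penalized at all is a design choice of this example, not a documented property of the benchmark.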
What’s inside
| Component | Description |
|---|---|
| Tasks | A persona, a goal, and constraints (top-down) or an investigative question with answer type, difficulty, and required reasoning (bottom-up). Most carry the evidence chain / gold sources the answer or write depends on. |
| Environments | A self-contained simulated company per environment — up to ~40 SaaS systems over one shared relational world, so cross-system references resolve consistently. |
| Reward verifiers | An executable scorer per task. Top-down diffs the final DB state against a golden delta and runs LLM rubrics; bottom-up scores the cited answer with recall metrics. Both return a reward in [0, 1]. |
| Trajectories | Agent rollouts with chain-of-thought reasoning and native tool calls — SFT-ready samples ship a successful (passing) rollout. |
| Tool schemas | The action space of the systems a sample operates in: every tool of every MCP server the trajectory touched, in OpenAI function format (not just the tools a given rollout happened to call). |
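
For orientation, a tool schema in OpenAI function format generally looks like the sketch below. Only the tool name `slack__search` comes from this card; the description text and the `query` / `limit` parameters are hypothetical, and the shipped schemas may use different fields or the older flat `name` / `description` / `parameters` layout.

```python
import json

# Illustrative schema; parameters and description are assumptions.
slack_search_tool = {
    "type": "function",
    "function": {
        "name": "slack__search",
        "description": "Search Slack messages across channels.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search query."},
                "limit": {"type": "integer", "description": "Maximum number of results."},
            },
            "required": ["query"],
        },
    },
}


def tool_names(tools: list) -> list:
    """List the callable names exposed by a sample's tool schemas."""
    return [t["function"]["name"] for t in tools if t.get("type") == "function"]


print(tool_names([slack_search_tool]))          # ['slack__search']
print(json.dumps(slack_search_tool)[:40])       # schemas are plain JSON
```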
Task categories
Enterprise Bench is organized by enterprise environment: each of the 27 verticals below is a distinct simulated company, and serves as a case study for the kind of cross-system enterprise work an agent must handle. Each environment is realized as a concrete named company (the ai_ml_startup world ships as Lattice AI, ecommerce_scale as Cartable, mobility_fleet as RideGrid, …), populated with its own people, teams, tickets, documents, incidents, accounts, experiments, and feature flags. A task at Lattice AI is therefore about INT8 quantization rollbacks and GPU quota, while one in freight_logistics is about denied-party screening and shipment latency.
| Sector | Environments |
|---|---|
| Software & infrastructure | ai_ml_startup, b2b_saas_startup, dev_tools_company, oss_data_infra, nexus_cloud_platform, cybersecurity_vendor |
| Payments & fintech | payments_api, enterprise_fintech, insurance_carrier |
| Commerce & marketplaces | ecommerce_scale, omnichannel_retail, proptech_marketplace, travel_booking |
| Media & consumer | streaming_media, gaming_studio, creative_agency, edtech_platform |
| Regulated & public sector | govtech_platform, healthcare_saas, legaltech_saas, nonprofit_ngo, biotech_rnd |
| Logistics & physical ops | freight_logistics, mobility_fleet, climate_iot_hardware, regional_telecom |
| Professional services | consulting_firm |
Expanding the categories
The 27 environments are case studies, not a closed set. Each is a world + tool servers + task generators + reward verifiers, so the corpus grows along three axes:
- More verticals: manufacturing/ERP ops, banking back-office, clinical/EHR workflows, retail POS and supply chain, telco network operations, and other industries, each with its own systems and domain content.
- More task families: beyond top-down writes and bottom-up QA, scheduled / long-running operations, approval and escalation chains, policy-and-compliance enforcement, cross-system reconciliation, and incident-response runbooks.
- Deeper system coverage: pulling more of each environment’s ~40 connected systems into tasks (HR/Workday, IaC/ArgoCD, secrets/Bitwarden, observability/Datadog/Sentry, and the wider doc and CRM surface).
Difficulty profile
Bottom-up tasks carry an explicit difficulty label and reasoning tags; the distribution skews hard by design (counts from the evaluated read-only slice):
| Tier | Share | What scales |
|---|---|---|
| Medium | 11% | A single cross-system hop; the answer is reachable once the right two systems are joined. |
| Hard | 65% | Multi-hop joins, near-duplicate records to disambiguate, contradiction between a config’s nominal and effective state. |
| Very hard | 24% | Deeper chains, aggregation over scattered records, and answer-mined facts that only a thorough sweep surfaces. |
How challenging is the data
As a reference point, a frontier-scale open-weight model — Qwen3.5-397B (qwen3-5-397b) — was evaluated on the bottom-up read-only QA family: 1,932 answer-mined questions across all 27 environments, scored both by each task’s recall verifier and by an independent LLM rubric.
Headline (LLM-judge rubric, reward ∈ [0, 1]):
| Metric | Value |
|---|---|
| Mean rubric reward | 0.558 (median 0.605) |
| Mean recall reward (entity/fact/source − distractors) | ≈ 0.40 |
| Strict pass (rubric = 1.0) | 13.6% (262 / 1,932) |
| Reward ≥ 0.9 | 24.4% |
| Reward = 0.0 (total miss) | 12.0% |

By answer type:
| Answer type | Tasks | Mean reward |
|---|---|---|
| explanation | 702 | 0.607 |
| factual | 1,150 | 0.550 |
| count | 30 | 0.433 |
| comparison | 29 | 0.165 |
| list | 21 | 0.095 |
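
As a quick consistency check, the per-answer-type rows can be combined into a task-weighted average, which should reproduce the headline mean rubric reward of 0.558 over 1,932 tasks. The snippet only re-uses the numbers from the table; the variable names are arbitrary.

```python
# (tasks, mean reward) per answer type, copied from the table.
by_type = {
    "explanation": (702, 0.607),
    "factual": (1150, 0.550),
    "count": (30, 0.433),
    "comparison": (29, 0.165),
    "list": (21, 0.095),
}

total = sum(n for n, _ in by_type.values())
weighted = sum(n * mean for n, mean in by_type.values()) / total

print(total, round(weighted, 3))  # 1932 0.558
```

The overall mean is dominated by the explanation and factual types, which together account for 1,852 of the 1,932 tasks. The low comparison and list scores barely move the headline number.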

By environment (five hardest and five easiest by mean reward):
| Hardest | Mean | Easiest | Mean |
|---|---|---|---|
| nonprofit_ngo | 0.434 | omnichannel_retail† | 0.884 |
| regional_telecom | 0.445 | mobility_fleet | 0.684 |
| creative_agency | 0.491 | govtech_platform | 0.667 |
| enterprise_fintech | 0.501 | b2b_saas_startup | 0.652 |
| nexus_cloud_platform | 0.501 | healthcare_saas | 0.629 |

† omnichannel_retail is a small slice (n=13); treat its mean as indicative, not robust.
Why it loses points. A second-pass audit of the non-perfect answers (given the gold and the docked criteria) found the misses are overwhelmingly real agent errors, not grader noise — three failure modes dominate:
- Wrong-similar-entity (≈39%) — the model grabs a near-duplicate of the right record (the adjacent ticket, the other engineer, PR #2 instead of PR #4). The environments deliberately plant look-alikes that demand ID / assignee / status cross-verification.
- Retrieval-gives-up (≈28%) — when one tool returns thin data the model concludes “no history available” instead of trying the audit log, transition history, or another system where the fact actually lives.
- Fabrication (≈17%) — on sparse data the model invents plausible-but-unsupported evidence (a Slack message, a metric, an ID) rather than stating absence.
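
The wrong-similar-entity failure suggests a simple discipline: accept a record only when it is the unique candidate matching every identifying field, and otherwise keep investigating. The helper below is an illustrative sketch of that check; the record fields and values are hypothetical and are not drawn from any environment.

```python
from typing import Optional


def pick_verified(candidates: list, expected: dict) -> Optional[dict]:
    """Return the single candidate whose fields all match `expected`, else None."""
    matches = [
        c for c in candidates
        if all(c.get(field) == value for field, value in expected.items())
    ]
    # Zero matches or several matches both mean "not verified yet".
    return matches[0] if len(matches) == 1 else None


# Two look-alike pull requests (hypothetical data).
pull_requests = [
    {"number": 2, "assignee": "dana", "status": "closed"},
    {"number": 4, "assignee": "dana", "status": "merged"},
]

print(pick_verified(pull_requests, {"assignee": "dana", "status": "merged"}))
# {'number': 4, 'assignee': 'dana', 'status': 'merged'}
print(pick_verified(pull_requests, {"assignee": "dana"}))
# None
```

Matching on assignee alone is ambiguous, so the helper returns None rather than guessing. Adding the status check singles out PR #4.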
Trajectory length
The bottom-up tasks are read-only but genuinely investigative. The table below summarizes, for the evaluated read-only slice, the assistant turns (steps) and tool calls per rollout, shown as mean / median / p90 and broken out by difficulty tier:
| Tier | Steps (turns) | Tool calls |
|---|---|---|
| Medium | 9.1 / 10 / 12 | 17.5 / 16 / 30 |
| Hard | 8.7 / 10 / 12 | 16.7 / 17 / 28 |
| Very hard | 9.0 / 9 / 12 | 18.6 / 18 / 30 |
| All | 8.8 / 10 / 12 | 17.2 / 17 / 28 |
- Tool calls (~17) run at roughly twice the number of assistant turns (~9): the agent typically issues more than one search or read in parallel per turn.
- Trajectory length is roughly flat across tiers — the difficulty comes from which records to find and reconcile, not from longer rollouts.
- The top-down write family runs longer in turns: in the demo set, 21–49 turns and 10–26 tool calls across 4–5 systems each, since it interleaves discovery with coordinated writes and multi-turn confirmation.
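
For readers reproducing the mean / median / p90 summary on their own rollouts, a small helper along these lines would do. The nearest-rank definition of p90 is an assumption (the card does not say which percentile method was used), and the sample counts are made up.

```python
import math
import statistics


def summarize(counts: list) -> tuple:
    """Return (mean, median, p90) for per-rollout counts; `counts` must be non-empty."""
    ordered = sorted(counts)
    # Nearest-rank percentile: smallest value with at least 90% of data at or below it.
    rank = math.ceil(9 * len(ordered) / 10)
    return (
        round(statistics.mean(ordered), 1),
        statistics.median(ordered),
        ordered[rank - 1],
    )


# Hypothetical tool-call counts for ten rollouts.
tool_calls = [9, 12, 14, 16, 17, 17, 18, 21, 28, 30]
print(summarize(tool_calls))  # (18.2, 17.0, 28)
```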
Tool usage
The read-only investigation family is search-dominated: the agent sweeps chat, ticketing, docs, and code, then reads the specific records it needs. Across the 1,932-task slice, tool calls concentrate in:
- Chat (Slack), ≈29% of all calls: slack__search, slack__read_channel, slack__list_channels.
- Ticketing (Jira / Linear / Asana / Zendesk): jira__search_tickets, jira__get_ticket, linear__search_tickets, zendesk__search_tickets.
- Docs (Confluence / Notion): confluence__search_docs, notion__search_docs.
- Code & incidents (GitHub / PagerDuty): github__list_pull_requests, github__search_code, plus incident lookups.
The top-down write family additionally uses mutating tools (linear__link_tickets, salesforce__add_note / log_activity, Stripe customer creation, notion__add_space_member, calendar writes) and is graded by the resulting database delta, not by the answer text.
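
Because tool names follow a `system__tool` pattern, per-system call shares like the Slack figure above can be tallied by splitting on the double underscore. The sketch below assumes a flat list of called tool names has already been extracted from a trajectory; the sample log is invented.

```python
from collections import Counter


def calls_by_system(tool_calls: list) -> dict:
    """Share of calls per system, keyed by the prefix before '__'."""
    counts = Counter(name.split("__", 1)[0] for name in tool_calls)
    total = sum(counts.values())
    # most_common() orders systems from most to least used.
    return {system: n / total for system, n in counts.most_common()}


# Hypothetical call log from one read-only rollout.
log = [
    "slack__search", "slack__read_channel", "slack__search",
    "jira__search_tickets", "jira__get_ticket",
    "linear__search_tickets",
    "confluence__search_docs", "notion__search_docs",
    "github__list_pull_requests", "github__search_code",
]

shares = calls_by_system(log)
print(shares["slack"])   # 0.3
print(list(shares)[0])   # slack
```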