Enterprise Bench is a corpus of long-horizon, tool-using agent tasks set inside realistic simulated companies, synthesized by EigenData-CLI. Each task drops an agent into a self-contained enterprise environment — one company backed by up to ~40 connected SaaS systems (CRM, ticketing, docs, chat, code, observability, billing, HR) that all share one consistent world state — and asks it either to operate the business (multi-system write and coordination tasks) or to answer questions about it (read-only investigative QA). The full corpus spans 27 enterprise environments and 10,800 samples. Each sample bundles a task, the environment, a reward-verification script, and — where available — a successful agent trajectory, usable for either supervised fine-tuning or reinforcement learning. The difficulty, trajectory, and tool-usage sections below characterize the bottom-up read-only QA family, the slice evaluated end-to-end.
Want to try it first? A free, individually-audited 20-task sample (10 operations + 10 investigation) is available on the Demo Samples page.

What Enterprise Bench is

A simulated company in Enterprise Bench is not a single API — it is a federation of enterprise systems over a shared database. Salesforce, Slack, Linear/Jira, Notion, GitHub, PagerDuty, Stripe, Grafana, SonarQube, Optimizely, LaunchDarkly, Greenhouse, Lattice, and more are all mounted as MCP tool servers backed by the same coherent world: the Linear ticket Slack is discussing is the same ticket the PagerDuty incident references, the Salesforce account the email mentions is the one billing tracks in Stripe. An agent works the task with native tool calls — searching, reading, and (on write tasks) mutating records across systems — exactly as a human operator would move between browser tabs. Tasks fall into two complementary families, both set in these enterprise environments:
  • Top-down — operations & coordination (WRITE). A persona-driven, multi-turn request to get something done across several systems: triage a CVE across Linear and Slack, onboard a customer across Salesforce/Stripe/Calendar, run a pre-launch readiness check and start an experiment. The agent discovers context across systems, then executes coordinated writes. Graded on the final state of the world — a DB-diff against a golden delta plus an LLM rubric of goal / process criteria.
  • Bottom-up — investigation & QA (READ-ONLY). An answer-mined question over the same environments — “has this feature flag ever actually been enabled?”, “are these two engineers duplicating work?”, “how many of these incidents are unresolved and are they all linked to one ticket?” The agent investigates with read-only tools and returns a grounded, cited answer. Graded on answer recall against mined ground truth (entity / fact / source recall, minus a distractor penalty).
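The top-down "DB-diff against a golden delta" can be pictured with a minimal sketch. Everything below is an assumption made for illustration: the in-memory state shape, the table and field names, and the scoring rule (fraction of golden changes applied, minus unexpected writes). The actual verifier, and its LLM rubric component, are not shown in this document.

```python
from typing import Any

Row = dict[str, Any]
State = dict[str, dict[str, Row]]    # table -> row id -> row (hypothetical shape)
Change = tuple[str, str, str, Any]   # (table, row_id, field, new_value)


def state_delta(before: State, after: State) -> set[Change]:
    """Field-level changes that turn `before` into `after`.

    Values are assumed to be hashable scalars (str, int, None, ...).
    A deleted row shows up as its fields changing to None.
    """
    changes: set[Change] = set()
    for table in before.keys() | after.keys():
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        for row_id in old_rows.keys() | new_rows.keys():
            old = old_rows.get(row_id, {})
            new = new_rows.get(row_id, {})
            for field in old.keys() | new.keys():
                if old.get(field) != new.get(field):
                    changes.add((table, row_id, field, new.get(field)))
    return changes


def diff_reward(before: State, after: State, golden: set[Change]) -> float:
    """Assumed rule: (golden changes applied - unexpected changes) / golden size."""
    observed = state_delta(before, after)
    if not golden:
        return 1.0 if not observed else 0.0
    hits = len(observed & golden)
    extras = len(observed - golden)
    return max(0.0, (hits - extras) / len(golden))


# Hypothetical triage task: set a ticket's status and link it to another ticket.
before = {"tickets": {"T-1": {"status": "open", "linked_to": None}}}
golden = {("tickets", "T-1", "status", "triaged"),
          ("tickets", "T-1", "linked_to", "T-2")}
after_full = {"tickets": {"T-1": {"status": "triaged", "linked_to": "T-2"}}}
after_partial = {"tickets": {"T-1": {"status": "triaged", "linked_to": None}}}
```

Under these assumptions `diff_reward(before, after_full, golden)` is 1.0 and the partial rollout scores 0.5. Penalizing unexpected writes is one possible design choice; a verifier could equally ignore them or fail the task outright.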

At a glance

Property | Value
Enterprise environments | 27 industry verticals, each a self-contained simulated company
Connected systems | up to ~40 enterprise SaaS systems per environment (Salesforce, Slack, Linear/Jira, Notion, GitHub, PagerDuty, Stripe, Grafana, SonarQube, Optimizely, LaunchDarkly, Greenhouse, Lattice, Snyk, Zendesk, Google Workspace, …), all over one shared world state
Samples | 10,800, each a task + environment + reward verifier, usable for SFT or RL; SFT-ready samples additionally ship a successful agent trajectory
Task families | 2: top-down (multi-system writes), bottom-up (read-only QA)
Task / answer formats | write-and-coordinate; QA in factual / count / comparison / list / explanation
Grading | top-down: DB-diff against a golden delta + LLM rubric (goal / process / policy criteria); bottom-up: recall metrics (entity / fact / source recall) − distractor penalty
Layout | tau-bench four-folder (intent / datapoints / evaluators / reference_payloads), one NNNNNN id per sample
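As a rough illustration of the bottom-up grading row above (entity / fact / source recall minus a distractor penalty), here is a minimal sketch. The equal weighting of the three recalls, the per-distractor penalty of 0.1, the clamping to [0, 1], and all identifiers in the example are assumptions; the corpus's actual verifier may combine these terms differently.

```python
FACETS = ("entities", "facts", "sources")


def recall(cited: set[str], gold: set[str]) -> float:
    """Share of gold items the answer cites; an empty gold set counts as satisfied."""
    return len(cited & gold) / len(gold) if gold else 1.0


def recall_reward(answer: dict[str, set[str]],
                  gold: dict[str, set[str]],
                  distractors: set[str],
                  penalty: float = 0.1) -> float:
    """Mean of the three recalls minus a penalty per cited distractor (assumed rule)."""
    mean_recall = sum(
        recall(answer.get(f, set()), gold.get(f, set())) for f in FACETS
    ) / len(FACETS)
    cited = set().union(*(answer.get(f, set()) for f in FACETS))
    distractor_hits = len(cited & distractors)
    return min(1.0, max(0.0, mean_recall - penalty * distractor_hits))


# Hypothetical question: the answer finds one of two gold tickets, cites a
# planted look-alike (ENG-399), and misses one of the two gold sources.
gold = {
    "entities": {"ENG-412", "ENG-398"},
    "facts": {"flag_never_enabled"},
    "sources": {"linear:ENG-412", "slack:msg-1"},
}
answer = {
    "entities": {"ENG-412", "ENG-399"},
    "facts": {"flag_never_enabled"},
    "sources": {"linear:ENG-412"},
}
reward = recall_reward(answer, gold, distractors={"ENG-399"})
```

With these inputs the recalls are 0.5, 1.0, and 0.5, so the reward is their mean (about 0.667) minus one distractor penalty, roughly 0.567.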

What’s inside

Component | Description
Tasks | A persona, a goal, and constraints (top-down) or an investigative question with answer type, difficulty, and required reasoning (bottom-up). Most carry the evidence chain / gold sources the answer or write depends on.
Environments | A self-contained simulated company per environment: up to ~40 SaaS systems over one shared relational world, so cross-system references resolve consistently.
Reward verifiers | An executable scorer per task. Top-down diffs the final DB state against a golden delta and runs LLM rubrics; bottom-up scores the cited answer with recall metrics. Both return a reward in [0, 1].
Trajectories | Agent rollouts with chain-of-thought reasoning and native tool calls; SFT-ready samples ship a successful (passing) rollout.
Tool schemas | The action space of the systems a sample operates in: every tool of every MCP server the trajectory touched, in OpenAI function format (not just the tools a given rollout happened to call).
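For readers unfamiliar with the OpenAI function format mentioned above, a tool entry might look like the sketch below. The tool name `slack__search` appears in this document; the description text and the `query` / `limit` parameters are hypothetical placeholders, not the corpus's real schema.

```python
import json

# Hypothetical schema entry in OpenAI function-calling format.
# Only the tool name comes from the corpus description; the rest is illustrative.
slack_search_tool = {
    "type": "function",
    "function": {
        "name": "slack__search",
        "description": "Search Slack messages in the simulated workspace.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search text."},
                "limit": {"type": "integer", "description": "Maximum results."},
            },
            "required": ["query"],
        },
    },
}


def system_of(tool: dict) -> str:
    """Return the system prefix of a `system__tool` style name."""
    return tool["function"]["name"].split("__", 1)[0]


# A sample's tool list would be a JSON array of such entries.
serialized = json.dumps([slack_search_tool], indent=2)
```

Here `system_of(slack_search_tool)` returns `"slack"`, which is how a schema list could be grouped by the system each tool belongs to.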

Task categories

Enterprise Bench is organized by enterprise environment: each of the 27 verticals below is a distinct simulated company, and serves as a case study for the kind of cross-system enterprise work an agent must handle. Each environment is realized as a concrete named company (the ai_ml_startup world ships as Lattice AI, ecommerce_scale as Cartable, mobility_fleet as RideGrid, …), populated with its own people, teams, tickets, documents, incidents, accounts, experiments, and feature flags — so a task at Lattice AI is about INT8 quantization rollbacks and GPU quota, while one in freight_logistics is about denied-party screening and shipment latency.
Sector | Environments
Software & infrastructure | ai_ml_startup, b2b_saas_startup, dev_tools_company, oss_data_infra, nexus_cloud_platform, cybersecurity_vendor
Payments & fintech | payments_api, enterprise_fintech, insurance_carrier
Commerce & marketplaces | ecommerce_scale, omnichannel_retail, proptech_marketplace, travel_booking
Media & consumer | streaming_media, gaming_studio, creative_agency, edtech_platform
Regulated & public sector | govtech_platform, healthcare_saas, legaltech_saas, nonprofit_ngo, biotech_rnd
Logistics & physical ops | freight_logistics, mobility_fleet, climate_iot_hardware, regional_telecom
Professional services | consulting_firm
Both task families instantiate across these worlds, giving broad capability coverage without re-using surface content.

Expanding the categories

The 27 environments are case studies, not a closed set. Each is a world + tool servers + task generators + reward verifiers, so the corpus grows along three axes:
  • More verticals — manufacturing/ERP ops, banking back-office, clinical/EHR workflows, retail POS and supply chain, telco network operations, and other industries, each with its own systems and domain content.
  • More task families — beyond top-down writes and bottom-up QA: scheduled / long-running operations, approval and escalation chains, policy-and-compliance enforcement, cross-system reconciliation, and incident-response runbooks.
  • Deeper system coverage — pulling more of each environment’s ~40 connected systems into tasks (HR/Workday, IaC/ArgoCD, secrets/Bitwarden, observability/Datadog/Sentry, and the wider doc and CRM surface).
Each addition ships the same way — a coherent world plus an executable reward verifier per task — so it is immediately RL-trainable and (with a passing rollout) SFT-ready.

Difficulty profile

Bottom-up tasks carry an explicit difficulty label and reasoning tags; the distribution skews hard by design (counts from the evaluated read-only slice):
Tier | Share | What scales
Medium | 11% | A single cross-system hop; the answer is reachable once the right two systems are joined.
Hard | 65% | Multi-hop joins, near-duplicate records to disambiguate, contradiction between a config’s nominal and effective state.
Very hard | 24% | Deeper chains, aggregation over scattered records, and answer-mined facts that only a thorough sweep surfaces.
Answer types are dominated by factual (60%) and explanation (36%), with a sharp tail of count, comparison, and list questions (≈4% combined) that prove the most punishing — they require complete, exhaustive enumeration rather than locating a single fact. Top-down tasks scale on a different axis: number of systems touched, number of coordinated writes, and the order constraints between them.

How challenging is the data

As a reference point, a frontier-scale open-weight model — Qwen3.5-397B (qwen3-5-397b) — was evaluated on the bottom-up read-only QA family: 1,932 answer-mined questions across all 27 environments, scored both by each task’s recall verifier and by an independent LLM rubric. Headline (LLM-judge rubric, reward ∈ [0, 1]):
Metric | Value
Mean rubric reward | 0.558 (median 0.605)
Mean recall reward (entity/fact/source − distractors) | ≈ 0.40
Strict pass (rubric = 1.0) | 13.6% (262 / 1,932)
Reward ≥ 0.9 | 24.4%
Reward = 0.0 (total miss) | 12.0%
The global mean sits near 0.56, but the signal is in the variance. By answer type — locating one fact is tractable; exhaustive enumeration is not:
Answer type | Tasks | Mean reward
explanation | 702 | 0.607
factual | 1,150 | 0.550
count | 30 | 0.433
comparison | 29 | 0.165
list | 21 | 0.095
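The per-type rows can be cross-checked against the headline numbers with a few lines of arithmetic. The sketch below uses only the task counts and mean rewards reported in the table; the variable names are ours.

```python
# (tasks, mean rubric reward) per answer type, copied from the table above.
by_type = {
    "explanation": (702, 0.607),
    "factual": (1150, 0.550),
    "count": (30, 0.433),
    "comparison": (29, 0.165),
    "list": (21, 0.095),
}

total_tasks = sum(n for n, _ in by_type.values())

# Task-weighted mean of the per-type means.
weighted_mean = sum(n * mean for n, mean in by_type.values()) / total_tasks

# Share of each answer type in the evaluated slice.
shares = {name: n / total_tasks for name, (n, _) in by_type.items()}

# Combined share of the enumeration-style types (count, comparison, list).
enumeration_share = sum(shares[name] for name in ("count", "comparison", "list"))

print(total_tasks, round(weighted_mean, 3), round(enumeration_share, 3))
```

The counts sum to 1,932 and the task-weighted mean rounds to 0.558, matching the reported mean rubric reward; the three enumeration-style types together make up about 4% of the slice.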
By environment — capability is sharply uneven across verticals (selected; patched rubric mean):
Hardest | Mean | Easiest | Mean
nonprofit_ngo | 0.434 | omnichannel_retail | 0.884
regional_telecom | 0.445 | mobility_fleet | 0.684
creative_agency | 0.491 | govtech_platform | 0.667
enterprise_fintech | 0.501 | b2b_saas_startup | 0.652
nexus_cloud_platform | 0.501 | healthcare_saas | 0.629
omnichannel_retail is a small slice (n=13); treat its mean as indicative, not robust.

Why it loses points. A second-pass audit of the non-perfect answers (given the gold and the docked criteria) found that the misses are overwhelmingly real agent errors, not grader noise. Three failure modes dominate:
  • Wrong-similar-entity (≈39%) — the model grabs a near-duplicate of the right record (the adjacent ticket, the other engineer, PR #2 instead of PR #4). The environments deliberately plant look-alikes that demand ID / assignee / status cross-verification.
  • Retrieval-gives-up (≈28%) — when one tool returns thin data the model concludes “no history available” instead of trying the audit log, transition history, or another system where the fact actually lives.
  • Fabrication (≈17%) — on sparse data the model invents plausible-but-unsupported evidence (a Slack message, a metric, an ID) rather than stating absence.
A further ≈5% are format failures (empty / unparseable final answers). These tasks demand exact cross-system grounding, complete enumeration, and the discipline to state absence rather than invent — which a frontier-scale open model does not reliably deliver out of the box, and which is precisely what makes the corpus a strong training and evaluation signal.
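The cross-verification that the wrong-similar-entity failures call for can be sketched as a strict filter over candidate records. The record fields, the values, and the "exactly one match or nothing" rule are hypothetical; the point is only that an ambiguous or empty match is reported as absence rather than guessed.

```python
from typing import Any, Optional

Record = dict[str, Any]


def pick_record(candidates: list[Record], required: dict[str, Any]) -> Optional[Record]:
    """Return the single candidate matching every required field, else None.

    None covers both "nothing matches" and "still ambiguous"; in either case
    the caller should keep searching or state absence instead of guessing.
    """
    matches = [
        record for record in candidates
        if all(record.get(key) == value for key, value in required.items())
    ]
    return matches[0] if len(matches) == 1 else None


# Hypothetical near-duplicate pull requests returned by one title search.
candidates = [
    {"number": 2, "title": "Fix retry backoff", "author": "sam", "state": "closed"},
    {"number": 4, "title": "Fix retry backoff", "author": "dana", "state": "merged"},
]

by_title_only = pick_record(candidates, {"title": "Fix retry backoff"})
cross_checked = pick_record(candidates, {"author": "dana", "state": "merged"})
```

Matching on the title alone is ambiguous and yields `None`, while checking author and state together selects PR #4.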

Trajectory length

The bottom-up tasks are read-only but genuinely investigative. The table below summarizes, for the evaluated read-only slice, the assistant turns (steps) and tool calls per rollout — shown as mean / median / p90, broken out by difficulty tier:
Tier | Steps (turns) | Tool calls
Medium | 9.1 / 10 / 12 | 17.5 / 16 / 30
Hard | 8.7 / 10 / 12 | 16.7 / 17 / 28
Very hard | 9.0 / 9 / 12 | 18.6 / 18 / 30
All | 8.8 / 10 / 12 | 17.2 / 17 / 28
  • Tool calls (~17) outnumber assistant turns (~9) roughly two to one: the agent typically issues about two searches or reads in parallel per turn.
  • Trajectory length is roughly flat across tiers — the difficulty comes from which records to find and reconcile, not from longer rollouts.
  • The top-down write family runs longer in turns: in the demo set, 21–49 turns and 10–26 tool calls across 4–5 systems each, since it interleaves discovery with coordinated writes and multi-turn confirmation.
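The mean / median / p90 summaries in the table can be reproduced from per-rollout counts along the following lines. The sample numbers are made up, and the nearest-rank definition of p90 is an assumption, since the document does not say which percentile method was used.

```python
import statistics


def summarize(values: list[int]) -> dict[str, float]:
    """Mean, median, and nearest-rank 90th percentile of per-rollout counts."""
    ordered = sorted(values)
    # Nearest-rank p90: the smallest value with at least 90% of samples at or below it.
    rank = (9 * len(ordered) + 9) // 10  # integer ceil(0.9 * n)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
    }


# Hypothetical rollouts: assistant turns and tool calls for ten tasks.
steps = [6, 8, 9, 10, 10, 10, 11, 12, 12, 13]
tool_calls = [9, 12, 15, 16, 17, 17, 19, 22, 28, 31]

steps_summary = summarize(steps)
calls_summary = summarize(tool_calls)
calls_per_turn = sum(tool_calls) / sum(steps)
```

On this toy data the steps summary is mean 10.1, median 10, p90 12, and `calls_per_turn` is about 1.8, the same kind of ratio the first bullet above describes.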

Tool usage

The read-only investigation family is search-dominated: the agent sweeps chat, ticketing, docs, and code, then reads the specific records it needs. Across the 1,932-task slice, tool calls concentrate in:
  • Chat (Slack): ≈29% of all calls; slack__search, slack__read_channel, slack__list_channels.
  • Ticketing (Jira / Linear / Asana / Zendesk): jira__search_tickets, jira__get_ticket, linear__search_tickets, zendesk__search_tickets.
  • Docs (Confluence / Notion): confluence__search_docs, notion__search_docs.
  • Code & incidents (GitHub / PagerDuty): github__list_pull_requests, github__search_code, plus incident lookups.
The top-down write family additionally exercises the mutating tools — linear__link_tickets, salesforce__add_note / log_activity, Stripe customer creation, notion__add_space_member, calendar writes — and is graded by the resulting database delta, not by the answer text.
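Per-system shares like the Slack figure above can be derived from trajectories by grouping tool names on their `system__tool` prefix. The sketch below assumes each tool call is available as a plain name string; the short call list is invented for illustration.

```python
from collections import Counter


def share_by_system(tool_calls: list[str]) -> dict[str, float]:
    """Fraction of calls per system, using the prefix before the double underscore."""
    counts = Counter(name.split("__", 1)[0] for name in tool_calls)
    total = sum(counts.values())
    return {system: n / total for system, n in counts.most_common()}


# Hypothetical calls from one read-only rollout (tool names as listed above).
calls = [
    "slack__search",
    "slack__read_channel",
    "jira__search_tickets",
    "jira__get_ticket",
    "confluence__search_docs",
    "github__list_pull_requests",
    "slack__search",
    "linear__search_tickets",
]

shares = share_by_system(calls)
```

For this toy rollout Slack accounts for 3 of 8 calls (0.375), and an empty call list yields an empty mapping.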

Access & licensing

The full Enterprise Bench corpus — all environments, tasks, reward verifiers, and trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 20-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.