These demo samples are a free 20-sample slice of Enterprise Bench, organized in the tau-bench four-folder layout so you can open one sample, read the task, follow the agent trajectory, run the reward verifier, and trust the ground truth — with zero machine-specific paths. They come as two packages of 10, one per task family:
  • top_down/ — 10 multi-system operations & coordination (WRITE) trajectories, generated by EigenData-CLI.
  • bottom_up/ — 10 read-only investigation & QA trajectories, mined from the same environments and rolled out by EnvSim.
For the complete dataset — its scale, the 27 environments, and benchmark results — see the Full Dataset page. Every shipped sample was first coarse-filtered by signal and then delivered by inspection — each trajectory read end-to-end by a skeptical auditor against four checks before release (see Verification).

Overview

| Property | Value |
| --- | --- |
| Packages | top_down (10 write trajectories) + bottom_up (10 read-only QA), 000001–000010 each |
| Environments | top_down: named companies (Lattice AI, Cartable, NexusTech, RideGrid, …). bottom_up: the 27-vertical keys (ai_ml_startup, govtech_platform, mobility_fleet, …) |
| Layout | Four folders (intent / datapoints / evaluators / reference_payloads) joined by a shared NNNNNN id |
| Grading | top_down: DB-diff vs golden delta + LLM rubric. bottom_up: inlined recall scorer, self-contained, with a __main__ self-check that reproduces the reference reward |
| Source (bottom_up) | EnvSim rollouts, qwen3-5-397b, perfect-score set; verbatim turns, reconstructed system preamble |
| Tool schema | Per sample, every tool of every MCP the trajectory touched (the action space, not a call log) |

Data

enterprise/
├── top_down/                 # 10 multi-tool WRITE-task trajectories
│   ├── intent/               000001.json … 000010.json
│   ├── datapoints/           000001.json … 000010.json
│   ├── evaluators/           evaluator_000001.py … evaluator_000010.py
│   └── reference_payloads/   reference_payload_000001.json … reference_payload_000010.json
└── bottom_up/                # 10 read-only QA trajectories
    ├── intent/               000001.json … 000010.json
    ├── datapoints/           000001.json … 000010.json
    ├── evaluators/           evaluator_000001.py … evaluator_000010.py
    └── reference_payloads/   reference_payload_000001.json … reference_payload_000010.json
Both packages use the same four folders, and the same NNNNNN id refers to one sample across all four files. The task’s environment (vertical) is a field inside intent/ ("simulation": "ai_ml_startup"), not a directory.
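Because the four files of a sample are joined only by the shared id, resolving them is a matter of string formatting. A minimal sketch, assuming the download is unpacked under a local enterprise/ directory; the helper name sample_paths is hypothetical and not part of the dataset:

```python
from pathlib import Path

def sample_paths(root, package, sample_id):
    """Return the four files that make up one sample, keyed by folder name."""
    if package not in ("top_down", "bottom_up"):
        raise ValueError(f"unknown package: {package!r}")
    nnnnnn = f"{int(sample_id):06d}"  # zero-padded six-digit id, e.g. 000006
    base = Path(root) / package
    return {
        "intent": base / "intent" / f"{nnnnnn}.json",
        "datapoints": base / "datapoints" / f"{nnnnnn}.json",
        "evaluators": base / "evaluators" / f"evaluator_{nnnnnn}.py",
        "reference_payloads": base / "reference_payloads" / f"reference_payload_{nnnnnn}.json",
    }

paths = sample_paths("enterprise", "bottom_up", 6)
print(paths["evaluators"].as_posix())
```

With the files on disk, the environment key described above should then be readable from the "simulation" field of the JSON at paths["intent"].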

What each folder holds

| Folder | Role | top_down (WRITE) | bottom_up (READ-ONLY QA) |
| --- | --- | --- | --- |
| intent/ | the task | goal / persona / motivations / constraints | question / answer_type / difficulty / reasoning_required / hop_count |
| datapoints/ | the SFT trajectory | formatted_dialog (system / user / assistant + reasoning + tool_calls / tool) + tool_schema | same shape, normalized from the rollout |
| evaluators/ | the reward script | REFERENCE_DIFF + RUBRICS (DB-diff + LLM rubric) | inlined gold + inlined recall scorer (self-contained, __main__ self-check) |
| reference_payloads/ | the ground truth | golden DB snapshots (per-MCP table counts + the changed rows, with content) | gold_answer + gold_sources + the specific evidence[] records the answer depends on |
The one semantic adaptation is reference_payloads. A write task has a golden DB delta — the exact rows it should insert/update/delete. A read-only QA task never mutates the DB, so its analog is the gold answer plus the specific evidence records it depends on; the full simulation DB is not shipped, because only the touched records matter for grading.
A top_down reference_payload is the golden delta, not a full DB dump: per system it holds initial_state / final_state table row-counts and a finalization.diff of the rows the trajectory changed, with content. Records the agent only read are absent — they live in the source environment, not the payload.

Operations samples (top_down)

| ID | Company | Goal (abridged) | Systems | Turns / calls |
| --- | --- | --- | --- | --- |
| 000001 | Lattice AI | Update the Model-Rollback Guru card with INT8 quantization steps, then propagate to related Notion docs and GitHub issues | figma, github, greenhouse, guru, notion | 49 / 26 |
| 000002 | Lattice AI | vLLM CVE triage: link INF-19/INF-21 to INF-18 as duplicates and PLAT-11 as related, create a CRM contact, post to #general, set Slack status, start the batch experiment + record the baseline | gmail, linear, optimizely, salesforce, slack | 30 / 19 |
| 000003 | Ripple Analytics | Investigate the Orbitra Health re-engagement situation across systems, then file a Salesforce support case with the discovered context | amplitude, guru, pagerduty, salesforce, sonarqube | 31 / 14 |
| 000004 | Cartable | Pre-launch readiness check on the checkout experiment, then launch it, record the 0% coverage gap, and set the SonarQube gate to WARN | gmail, jira, lattice, optimizely, sonarqube | 21 / 12 |
| 000005 | Lattice AI | Billing-RFC prep: verify cross-system readiness, grant Fiona Notion access, flag GPU-quota risk on the CRM account, log the prep activity, check the calendar | google_calendar, grafana, notion, salesforce, sonarqube | 25 / 10 |
| 000006 | Bridgewater Relief | Compile a security audit note on a PagerDuty incident, pulling Notion policy, a Snyk review, and Stripe billing status | bitwarden, notion, pagerduty, snyk, stripe | 29 / 17 |
| 000007 | NexusTech | Create a Q1-2026 Lattice objective for Observability, nested under an existing objective, after a chain of readiness checks | amplitude, greenhouse, lattice, optimizely, salesforce | 25 / 15 |
| 000008 | RideGrid | Onboard Lakeside Shuttle Co. across billing, CRM, and calendar; handle a conflicting meeting by attempted-delete-then-fallback | argocd, google_calendar, salesforce, snyk, stripe | 22 / 12 |
| 000009 | Cartable | Create a Cyber-Monday-prep Google Doc synthesizing post-Black-Friday status from accessible sources; add a CRM note; set a Slack status | google_drive, optimizely, salesforce, slack | 24 / 15 |
| 000010 | Forge CLI | Review a migration-guide PR (#6) with customer-impact context, set a Slack status, start an experiment, create a weekend hotfix on-call schedule | github, optimizely, pagerduty, salesforce, slack | 29 / 15 |
Each task is multi-turn (the persona confirms write actions along the way) and spans 4–5 systems. Each company is a concrete instance of an environment type — Lattice AI (AI startup), Cartable (e-commerce), RideGrid (mobility), Forge CLI (dev tools), Bridgewater Relief (nonprofit).
Worked example: sample 000005 (Lattice AI).
Persona: Jasmine Wu, Head of Product / Co-founder at an AI-inference startup.
Goal:
Prepare for the Monday billing RFC sync with Elena by gathering a cross-system readiness picture (code quality of the PLAT project, GPU capacity from Grafana, affected-account status in Salesforce), then grant Fiona Walsh access to the Product Notion space so she can review the GPU-quota meeting notes, update the DocuSearch AI Salesforce account to reflect the GPU-quota risk, log the prep activity against that account, and find a slot to reschedule the sync if the calendar conflicts.
Rubric (8 criteria — goal / process / process_order):
| # | Type | Criterion |
| --- | --- | --- |
| R1 | goal | Fiona Walsh is added to the Product Notion space |
| R2 | goal | The DocuSearch AI Salesforce account reflects the GPU-quota risk |
| R3 | goal | A prep activity is logged against that account |
| R4 | process | PLAT code-quality metrics are checked in SonarQube |
| R5 | process | The GPU Cluster Health dashboard is retrieved from Grafana |
| R6 | process | The GPU-quota meeting notes are found in Notion |
| R7 | goal | The calendar is checked for conflicts around the sync |
| R8 | process_order | The account ID is discovered via search before any write against it |
Golden delta (reference_payload): crm_notes +1 (GPU-risk note), crm_activities +1 (prep activity), document_spaces +1 member (Fiona) — graded against REFERENCE_DIFF.
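The row-count half of that comparison can be sketched as follows. This is an illustration only, not the shipped evaluator: the real REFERENCE_DIFF also carries the changed rows with content, the dictionary shape and the names count_delta and matches_golden are assumptions, and the starting counts are made-up placeholders. A membership change such as adding Fiona to a space would need a row-level comparison, so it is left out here.

```python
def count_delta(initial_state, final_state):
    """Per-table change in row count, dropping tables that did not change."""
    tables = set(initial_state) | set(final_state)
    delta = {t: final_state.get(t, 0) - initial_state.get(t, 0) for t in tables}
    return {t: d for t, d in delta.items() if d != 0}

def matches_golden(realized, golden):
    """True only if the run changed exactly the tables the golden delta changed."""
    return realized == golden

# Placeholder counts; only the +1 deltas come from the sample above.
initial = {"crm_notes": 12, "crm_activities": 30, "crm_accounts": 8}
final = {"crm_notes": 13, "crm_activities": 31, "crm_accounts": 8}
golden = {"crm_notes": 1, "crm_activities": 1}

realized = count_delta(initial, final)
print(matches_golden(realized, golden))
```

Comparing deltas rather than absolute counts keeps the check independent of how many rows the environment started with.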
Assistant turns carry reasoning (chain-of-thought); where the excerpt shows it, the reasoning is the text that directly follows the step heading.
User:
I want to make sure I’m prepped before my billing RFC sync with Elena on Monday… [readiness checks, then grant Fiona access, flag the GPU risk on the account, log the prep, check my calendar].
Step 1 – Gather readiness context (3 parallel reads):
Tool-first; the request spans multiple systems. Note: there’s no monitoring tool for GPU cluster health yet, so I should not invent that data.
sonarqube__get_quality_gate({"project": "PLAT"})
notion__search_docs({"query": "GPU quota meeting notes", "space": "Product"})
salesforce__search_records({"query": "DocuSearch AI", "object_type": "Account"})
So far: PLAT’s SonarQube gate is passing (0 bugs / 0 vulns / 0 smells) but coverage is 0.0%, so the readiness signal is weak. I found the Product Notion page Fiona needs, and the DocuSearch AI account.
User: Yes, go ahead with both. And try Grafana for the GPU Cluster Health dashboard.
Step 2 – First write + Grafana grounding:
notion__add_space_member({"space_key": "Product", "user": "Fiona Walsh"})
grafana__search_dashboards({"query": "GPU Cluster Health"})
grafana__get_dashboard({"uid": "f8a59a48ebaf"})
Step 3 — Flag the risk on the discovered account (write after search — satisfies R8):
salesforce__add_note({"parent_id": "001CAF4D19FBA0F",
  "title": "GPU quota risk flagged before customer outreach", "body": "Flagging the 30% …"})
Done — Fiona has Product Notion access, and I flagged the quota risk on the DocuSearch AI account with the billing/GPU context.
User: Did you also log the prep activity for audit purposes? And check my calendar for conflicts.
Step 4 – Calendar check (no fabrication):
google_calendar__search_events({"query": "billing RFC sync Elena"})
google_calendar__list_events({"time_min": "2024-01-08T13:00:00Z", "time_max": "2024-01-08T13:30:00Z"})
Not yet — I added the note but haven’t logged a separate activity record. On the calendar I see only the Monday 13:00–13:30 UTC sync, no conflict.
User: Yes, log that activity. No reschedule needed, thanks.
Step 5 – Final write:
salesforce__log_activity({"what_id": "001CAF4D19FBA0F",
  "subject": "Pre-outreach prep review completed for billing/GPU quota risk", "description": "…"})
All set — prep activity logged on the DocuSearch AI account, and the Monday sync is conflict-free.
Note the agent stays tool-first, performs writes only after the user confirms them, and — when asked for GPU health with no monitoring tool available — states the limitation instead of fabricating numbers.
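Criterion R8 is an ordering constraint, which is mechanical to check once the calls are listed in order. The sketch below is a simplified stand-in, not the shipped rubric grader: the (tool_name, arguments) tuple format, the set of write tools, and the function name are assumptions. It also treats the search call itself as having surfaced the account ID, whereas a fuller check would read the ID out of the search result.

```python
WRITE_TOOLS = {"salesforce__add_note", "salesforce__log_activity"}
SEARCH_TOOL = "salesforce__search_records"

def search_precedes_writes(calls):
    """Return True if every Salesforce write happens after a record search."""
    searched = False
    for name, _args in calls:
        if name == SEARCH_TOOL:
            searched = True
        elif name in WRITE_TOOLS and not searched:
            return False
    return True

# Tool names and IDs are taken from the excerpt above; arguments are abridged.
trajectory = [
    ("sonarqube__get_quality_gate", {"project": "PLAT"}),
    ("salesforce__search_records", {"query": "DocuSearch AI", "object_type": "Account"}),
    ("salesforce__add_note", {"parent_id": "001CAF4D19FBA0F"}),
    ("salesforce__log_activity", {"what_id": "001CAF4D19FBA0F"}),
]
print(search_precedes_writes(trajectory))
```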

Investigation samples (bottom_up)

| ID | Environment | Question (abridged) | Type | Reasoning | Hops |
| --- | --- | --- | --- | --- | --- |
| 000001 | creative_agency | Status of all incidents tied to the Asana API rate-limit issue staling Tapestry burn data: how many unresolved, all linked to tracking tickets? | count | aggregation, cross-system | 2 |
| 000002 | nonprofit_ngo | How many active incidents are open for the FieldPulse PostgreSQL latency issue: any resolved, all linked to one ticket? | count | aggregation, cross-system | 2 |
| 000003 | b2b_saas_startup | Duplicate tickets around PostgreSQL event-storage / batch-ingestion: are Lin Wei and Ravi Patel coordinating or siloed? | factual | duplicate_effort, multi-hop, temporal | 3 |
| 000004 | govtech_platform | Are Cameron Reyes and Christine Park coordinating on civicflow-workflow-engine work, or do their tickets overlap silently? | factual | duplicate_effort, multi-hop, temporal | 3 |
| 000005 | mobility_fleet | Who owns the AV Operations Safety & Data Policy, and who’s the dev on the geofence-management API that ties into it? | factual | ownership_resolution, cross-system | 2 |
| 000006 | ai_ml_startup | Current state of the batch-inference-v2 feature flag: enabled/configured, or created and left default-off? | factual | contradiction, cross-system | 2 |
| 000007 | biotech_rnd | Is the Forge Scoring Model v3 flag actually rolling out, or created without configuration? | factual | contradiction, cross-system | 2 |
| 000008 | edtech_platform | State of new_assessment_engine: has anyone enabled/configured it? Check the audit history. | factual | contradiction, cross-system | 2 |
| 000009 | freight_logistics | Status of the automated denied-party-screening flag: ever enabled/modified, or no audit history? | factual | contradiction, cross-system | 2 |
| 000010 | proptech_marketplace | Actual status of new_search_ui: ever toggled / rollout %, or sitting since creation? | factual | contradiction, cross-system | 2 |
The flag-state questions (000006–000010) share a contradiction pattern: a flag can exist yet be effectively off, so the agent must reconcile the flag’s nominal config against its audit log rather than trust either alone.
Worked example: sample 000006 (ai_ml_startup).
Question:
What’s the current state of the batch-inference-v2 feature flag? Has it actually been enabled or configured for any users, or was it just created and left in a default-off state?
Answer type: factual · Difficulty: hard · Reasoning: contradiction, cross-system · Hops: 2
Gold answer:
The flag batch-inference-v2 is effectively disabled — created by System but never enabled or configured with rollout targeting, so the feature is unreachable for all users until targeting rules or an explicit enable are applied.
Gold sources: batch-inference-v2 (flag record + audit log). Planted distractors: GPU-POLICY, ONCALL-POLICY, and three unrelated record ids — citing any of them incurs the distractor penalty.
System (reconstructed — see Provenance):
You are an enterprise operations analyst with read-only access to the company’s internal systems… Ground every claim in tool results; cite the source records (IDs/keys) you relied on. If the evidence is insufficient, say so rather than guess.
User:
What’s the current state of the batch-inference-v2 feature flag? …
Step 1 – Read the flag:
Retrieve the flag details to check its current state and configuration.
launchdarkly__get_flag({"flag_key": "batch-inference-v2"})
Tool result:
{"flag_key": "batch-inference-v2", "kind": "boolean",
 "environments_json": {"production": {"on": false}, "staging": {"on": false}},
 "created_by": "System", "created": "2026-05-23 05:08:02", "modified": "2026-05-23 05:08:02"}
Step 2 – Don’t trust the config alone; check the audit log:
It’s off in both envs, but I need the audit log to confirm whether it was ever toggled or just created.
launchdarkly__list_audit_log({"flag_key": "batch-inference-v2", "limit": 50})
Tool result: one entry, a create by System at 2026-05-23 05:08:02, with no later modifications.
Step 3 – Final answer:
The batch-inference-v2 flag was created and left in a default-off state. Both production and staging show on: false; the audit log has a single create action by System on 2026-05-23 with no subsequent toggle or config change (created and modified timestamps are identical).
The agent resolves the contradiction correctly — an existing flag is not an enabled flag — by cross-checking the config against the audit log, exactly what the task probes.
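The reconciliation itself can be sketched in a few lines. The flag record mirrors the tool result shown above; the audit-entry shape (action, actor, timestamp), the three labels, and the function name are assumptions, since the page shows the audit log only in summary form.

```python
def flag_state(flag, audit_entries):
    """Classify a flag by combining its current config with its audit history."""
    envs = flag.get("environments_json", {})
    on_anywhere = any(env.get("on") for env in envs.values())
    # Any audit action other than the initial create counts as a later change.
    touched = any(entry.get("action") != "create" for entry in audit_entries)
    if on_anywhere:
        return "enabled"
    if touched:
        return "modified but currently off"
    return "created and left default-off"

flag = {
    "flag_key": "batch-inference-v2",
    "kind": "boolean",
    "environments_json": {"production": {"on": False}, "staging": {"on": False}},
    "created_by": "System",
    "created": "2026-05-23 05:08:02",
    "modified": "2026-05-23 05:08:02",
}
# Assumed entry shape for the single create action reported by the audit log.
audit = [{"action": "create", "actor": "System", "timestamp": "2026-05-23 05:08:02"}]
print(flag_state(flag, audit))
```

Neither input is sufficient alone: the config cannot distinguish "never touched" from "turned on and back off", and the audit log does not state the current value.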
The evaluator inlines the recall scorer from EnvSim and exposes the live contract:
evaluator.score(agent_answer, agent_sources, agent_distractor_ids=None)  # -> reward in [0,1]
Re-scoring the shipped reference answer reproduces the reference reward:
{ "entity_recall": 1.0, "fact_recall": 1.0, "source_recall": 1.0,
  "distractor_penalty": 0.0, "reward": 0.8 }
0.8 is a perfect score here. Per EnvSim’s scoring, a perfect factual answer maxes out at 0.80 (a perfect count at 0.85) at difficulty=hard; these are the weight ceilings, not partial credit. Running python3 evaluator_000006.py executes this self-check and confirms that the scorer reproduces reference_reward.
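To score your own answer rather than run the self-check, the evaluator file can be imported as a module, which leaves its __main__ block unexecuted. A hedged sketch: it assumes score is a module-level function with the signature shown above, which this page does not confirm, and it substitutes a stub file that returns a constant so the example runs without the dataset.

```python
import importlib.util
import tempfile
from pathlib import Path

def load_evaluator(path):
    """Import an evaluator file as a module without running its __main__ block."""
    spec = importlib.util.spec_from_file_location("evaluator", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Stand-in file (an assumption): the real evaluator inlines the recall scorer
# instead of returning a constant.
stub = (
    "def score(agent_answer, agent_sources, agent_distractor_ids=None):\n"
    "    return 0.8\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "evaluator_000006.py"
    path.write_text(stub)
    evaluator = load_evaluator(path)
    reward = evaluator.score("answer text", ["batch-inference-v2"])
print(reward)
```

Against the real file, path would point at bottom_up/evaluators/evaluator_000006.py in the unpacked download.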

Scoring note

Each bottom_up/evaluator_NNNNNN.py is self-contained — the recall-scoring functions are inlined from EnvSim’s eval_scoring.py — and ships a __main__ self-check that re-scores the reference answer and reproduces its reward. Full marks are family-capped: factual → 0.80, count → 0.85 at difficulty=hard (weight ceilings × the hard multiplier of 1.0), so a shipped reference_reward of 0.8 is a perfect answer, not a near-miss. The top_down evaluators grade the realized world: a DB-diff against REFERENCE_DIFF plus the LLM RUBRICS.
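Because the ceilings differ by answer type, raw rewards are easier to compare once divided by the relevant ceiling. A small sketch using only the two hard-difficulty ceilings stated above; the helper names and the tolerance are assumptions, and ceilings for other difficulties are not given on this page, so they are not modelled.

```python
import math

# Ceilings at difficulty=hard, as stated in the scoring note.
HARD_CEILING = {"factual": 0.80, "count": 0.85}

def normalized_reward(reward, answer_type):
    """Scale a raw reward so that a perfect answer maps to 1.0."""
    return reward / HARD_CEILING[answer_type]

def is_perfect(reward, answer_type, tol=1e-9):
    """True when the raw reward sits at the ceiling for its answer type."""
    return math.isclose(reward, HARD_CEILING[answer_type], abs_tol=tol)

print(is_perfect(0.8, "factual"))
print(is_perfect(0.8, "count"))
```

The same raw reward of 0.8 is a perfect factual answer but falls short of the ceiling for a count question.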

Verification

Sampled by signal, delivered by inspection. Candidates were first coarse-filtered (top_down: non-LQ validation reports scoring 1.0; bottom_up: the perfect-score rollout set), then every delivered trajectory was read end-to-end by an auditing sub-agent against four checks: the task actually completed; no hallucinated tool results; every argument is grounded in a real tool result; and the evaluator is sound — would a wrong run actually fail? One bottom_up “perfect” candidate (dev_tools_company, EVAL-70B9C2) was rejected as a false positive — its gold answer was contradicted by the environment — and replaced. The remaining 20 passed.

Provenance

  • top_down turns are generated trajectories; tool calls and arguments are grounded in the simulated environment’s real responses.
  • bottom_up user / assistant / tool turns and the final answer are verbatim from the EnvSim rollout. Only the system turn is a faithful reconstruction — the rollout records don’t store the original system prompt, so the read-only-analyst preamble is reproduced, not invented from the answer.

Download

# Download Enterprise Bench demo data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "enterprise/*"

Browse on Hugging Face

View Enterprise Bench files
For the complete Enterprise Bench corpus — its 27 environments, difficulty profile, and benchmark results — see the Full Dataset page.