Demo Samples

These demo samples are a free 20-sample slice of Enterprise Bench, organized in the tau-bench four-folder layout so you can open one sample, read the task, follow the agent trajectory, run the reward verifier, and trust the ground truth — with zero machine-specific paths. They come as two packages of 10, one per task family:

top_down/ — 10 multi-system operations & coordination (WRITE) trajectories, generated by EigenData-CLI.
bottom_up/ — 10 read-only investigation & QA trajectories, mined from the same environments and rolled out by EnvSim.

For the complete dataset — its scale, the 27 environments, and benchmark results — see the Full Dataset page. Every shipped sample was first coarse-filtered by signal and then delivered by inspection — each trajectory read end-to-end by a skeptical auditor against four checks before release (see Verification).

Overview

Property	Value
Packages	`top_down` (10 write trajectories) + `bottom_up` (10 read-only QA), `000001`–`000010` each
Environments	`top_down`: named companies (Lattice AI, Cartable, NexusTech, RideGrid, …). `bottom_up`: the 27-vertical keys (`ai_ml_startup`, `govtech_platform`, `mobility_fleet`, …)
Layout	Four folders (`intent` / `datapoints` / `evaluators` / `reference_payloads`) joined by a shared `NNNNNN` id
Grading	`top_down`: DB-diff vs golden delta + LLM rubric. `bottom_up`: inlined recall scorer, self-contained, with a `__main__` self-check that reproduces the reference reward
Source (bottom_up)	EnvSim rollouts, `qwen3-5-397b`, perfect-score set; verbatim turns, reconstructed system preamble
Tool schema	Per sample, every tool of every MCP the trajectory touched (the action space, not a call log)

Data

enterprise/
├── top_down/                 # 10 multi-tool WRITE-task trajectories
│   ├── intent/               000001.json … 000010.json
│   ├── datapoints/           000001.json … 000010.json
│   ├── evaluators/           evaluator_000001.py … evaluator_000010.py
│   └── reference_payloads/   reference_payload_000001.json … reference_payload_000010.json
└── bottom_up/                # 10 read-only QA trajectories
    ├── intent/               000001.json … 000010.json
    ├── datapoints/           000001.json … 000010.json
    ├── evaluators/           evaluator_000001.py … evaluator_000010.py
    └── reference_payloads/   reference_payload_000001.json … reference_payload_000010.json

Both packages use the same four folders, and the same NNNNNN id refers to one sample across all four files. The task’s environment (vertical) is a field inside intent/ ("simulation": "ai_ml_startup"), not a directory.

What each folder holds

Folder	Role	`top_down` (WRITE)	`bottom_up` (READ-ONLY QA)
`intent/`	the task	goal / persona / motivations / constraints	question / `answer_type` / `difficulty` / `reasoning_required` / `hop_count`
`datapoints/`	the SFT trajectory	`formatted_dialog` (system / user / assistant `+reasoning` `+tool_calls` / tool) + `tool_schema`	same shape, normalized from the rollout
`evaluators/`	the reward script	`REFERENCE_DIFF` + `RUBRICS` (DB-diff + LLM rubric)	inlined gold + inlined recall scorer (self-contained, `__main__` self-check)
`reference_payloads/`	the ground truth	golden DB snapshots (per-MCP table counts + the changed rows, with content)	`gold_answer` + `gold_sources` + the specific `evidence[]` records the answer depends on

The one semantic adaptation is reference_payloads. A write task has a golden DB delta — the exact rows it should insert/update/delete. A read-only QA task never mutates the DB, so its analog is the gold answer plus the specific evidence records it depends on; the full simulation DB is not shipped, because only the touched records matter for grading.

A top_down reference_payload is the golden delta, not a full DB dump: per system it holds initial_state / final_state table row-counts and a finalization.diff of the rows the trajectory changed, with content. Records the agent only read are absent — they live in the source environment, not the payload.

Operations samples (`top_down`)

All 10 top_down tasks

ID	Company	Goal (abridged)	Systems	Turns / calls
`000001`	Lattice AI	Update the Model-Rollback Guru card with INT8 quantization steps, then propagate to related Notion docs and GitHub issues	figma, github, greenhouse, guru, notion	49 / 26
`000002`	Lattice AI	vLLM CVE triage: link INF-19/INF-21 to INF-18 as duplicates and PLAT-11 as related, create a CRM contact, post to #general, set Slack status, start the batch experiment + record the baseline	gmail, linear, optimizely, salesforce, slack	30 / 19
`000003`	Ripple Analytics	Investigate the Orbitra Health re-engagement situation across systems, then file a Salesforce support case with the discovered context	amplitude, guru, pagerduty, salesforce, sonarqube	31 / 14
`000004`	Cartable	Pre-launch readiness check on the checkout experiment, then launch it, record the 0% coverage gap, and set the SonarQube gate to WARN	gmail, jira, lattice, optimizely, sonarqube	21 / 12
`000005`	Lattice AI	Billing-RFC prep: verify cross-system readiness, grant Fiona Notion access, flag GPU-quota risk on the CRM account, log the prep activity, check the calendar	google_calendar, grafana, notion, salesforce, sonarqube	25 / 10
`000006`	Bridgewater Relief	Compile a security audit note on a PagerDuty incident, pulling Notion policy, a Snyk review, and Stripe billing status	bitwarden, notion, pagerduty, snyk, stripe	29 / 17
`000007`	NexusTech	Create a Q1-2026 Lattice objective for Observability, nested under an existing objective, after a chain of readiness checks	amplitude, greenhouse, lattice, optimizely, salesforce	25 / 15
`000008`	RideGrid	Onboard Lakeside Shuttle Co. across billing, CRM, and calendar; handle a conflicting meeting by attempted-delete-then-fallback	argocd, google_calendar, salesforce, snyk, stripe	22 / 12
`000009`	Cartable	Create a Cyber-Monday-prep Google Doc synthesizing post-Black-Friday status from accessible sources; add a CRM note; set a Slack status	google_drive, optimizely, salesforce, slack	24 / 15
`000010`	Forge CLI	Review a migration-guide PR (#6) with customer-impact context, set a Slack status, start an experiment, create a weekend hotfix on-call schedule	github, optimizely, pagerduty, salesforce, slack	29 / 15

Each task is multi-turn (the persona confirms write actions along the way) and spans 4–5 systems. Each company is a concrete instance of an environment type — Lattice AI (AI startup), Cartable (e-commerce), RideGrid (mobility), Forge CLI (dev tools), Bridgewater Relief (nonprofit).

Full task: Billing-RFC prep (top_down 000005)

Persona: Jasmine Wu, Head of Product / Co-founder at an AI-inference startup.Goal:

Prepare for the Monday billing RFC sync with Elena by gathering a cross-system readiness picture (code quality of the PLAT project, GPU capacity from Grafana, affected-account status in Salesforce), then grant Fiona Walsh access to the Product Notion space so she can review the GPU-quota meeting notes, update the DocuSearch AI Salesforce account to reflect the GPU-quota risk, log the prep activity against that account, and find a slot to reschedule the sync if the calendar conflicts.

Rubric (8 criteria — goal / process / process_order):

#	Type	Criterion
R1	goal	Fiona Walsh is added to the Product Notion space
R2	goal	The DocuSearch AI Salesforce account reflects the GPU-quota risk
R3	goal	A prep activity is logged against that account
R4	process	PLAT code-quality metrics are checked in SonarQube
R5	process	The GPU Cluster Health dashboard is retrieved from Grafana
R6	process	The GPU-quota meeting notes are found in Notion
R7	goal	The calendar is checked for conflicts around the sync
R8	process_order	The account ID is discovered via search before any write against it

Golden delta (reference_payload): crm_notes +1 (GPU-risk note), crm_activities +1 (prep activity), document_spaces +1 member (Fiona) — graded against REFERENCE_DIFF.

Agent trajectory (top_down 000005 — 25 turns, 10 tool calls, 5 systems)

Assistant turns carry reasoning (chain-of-thought); reasoning is shown in italics.User:

I want to make sure I’m prepped before my billing RFC sync with Elena on Monday… [readiness checks, then grant Fiona access, flag the GPU risk on the account, log the prep, check my calendar].

Step 1 — Gather readiness context (3 parallel reads):Tool-first; the request spans multiple systems. Note: there’s no monitoring tool for GPU cluster health yet, so I should not invent that data.

sonarqube__get_quality_gate({"project": "PLAT"})
notion__search_docs({"query": "GPU quota meeting notes", "space": "Product"})
salesforce__search_records({"query": "DocuSearch AI", "object_type": "Account"})

So far: PLAT’s SonarQube gate is passing (0 bugs / 0 vulns / 0 smells) but coverage is 0.0%, so the readiness signal is weak. I found the Product Notion page Fiona needs, and the DocuSearch AI account.

User: Yes, go ahead with both. And try Grafana for the GPU Cluster Health dashboard.Step 2 — First write + Grafana grounding:

notion__add_space_member({"space_key": "Product", "user": "Fiona Walsh"})
grafana__search_dashboards({"query": "GPU Cluster Health"})
grafana__get_dashboard({"uid": "f8a59a48ebaf"})

Step 3 — Flag the risk on the discovered account (write after search — satisfies R8):

salesforce__add_note({"parent_id": "001CAF4D19FBA0F",
  "title": "GPU quota risk flagged before customer outreach", "body": "Flagging the 30% …"})

Done — Fiona has Product Notion access, and I flagged the quota risk on the DocuSearch AI account with the billing/GPU context.

User: Did you also log the prep activity for audit purposes? And check my calendar for conflicts.Step 4 — Calendar check (no fabrication):

google_calendar__search_events({"query": "billing RFC sync Elena"})
google_calendar__list_events({"time_min": "2024-01-08T13:00:00Z", "time_max": "2024-01-08T13:30:00Z"})

Not yet — I added the note but haven’t logged a separate activity record. On the calendar I see only the Monday 13:00–13:30 UTC sync, no conflict.

User: Yes, log that activity. No reschedule needed — thanks.Step 5 — Final write:

salesforce__log_activity({"what_id": "001CAF4D19FBA0F",
  "subject": "Pre-outreach prep review completed for billing/GPU quota risk", "description": "…"})

All set — prep activity logged on the DocuSearch AI account, and the Monday sync is conflict-free.

Note the agent stays tool-first, performs writes only after the user confirms them, and — when asked for GPU health with no monitoring tool available — states the limitation instead of fabricating numbers.

Investigation samples (`bottom_up`)

All 10 bottom_up tasks

ID	Environment	Question (abridged)	Type	Reasoning	Hops
`000001`	creative_agency	Status of all incidents tied to the Asana API rate-limit issue staling Tapestry burn data — how many unresolved, all linked to tracking tickets?	count	aggregation, cross-system	2
`000002`	nonprofit_ngo	How many active incidents are open for the FieldPulse PostgreSQL latency issue — any resolved, all linked to one ticket?	count	aggregation, cross-system	2
`000003`	b2b_saas_startup	Duplicate tickets around PostgreSQL event-storage / batch-ingestion — are Lin Wei and Ravi Patel coordinating or siloed?	factual	duplicate_effort, multi-hop, temporal	3
`000004`	govtech_platform	Are Cameron Reyes and Christine Park coordinating on civicflow-workflow-engine work, or do their tickets overlap silently?	factual	duplicate_effort, multi-hop, temporal	3
`000005`	mobility_fleet	Who owns the AV Operations Safety & Data Policy, and who’s the dev on the geofence-management API that ties into it?	factual	ownership_resolution, cross-system	2
`000006`	ai_ml_startup	Current state of the `batch-inference-v2` feature flag — enabled/configured, or created and left default-off?	factual	contradiction, cross-system	2
`000007`	biotech_rnd	Is the Forge Scoring Model v3 flag actually rolling out, or created without configuration?	factual	contradiction, cross-system	2
`000008`	edtech_platform	State of `new_assessment_engine` — has anyone enabled/configured it? Check the audit history.	factual	contradiction, cross-system	2
`000009`	freight_logistics	Status of the automated denied-party-screening flag — ever enabled/modified, or no audit history?	factual	contradiction, cross-system	2
`000010`	proptech_marketplace	Actual status of `new_search_ui` — ever toggled / rollout %, or sitting since creation?	factual	contradiction, cross-system	2

The flag-state questions (000006–000010) share a contradiction pattern: a flag can exist yet be effectively off, so the agent must reconcile the flag’s nominal config against its audit log rather than trust either alone.

Full task: Feature-flag state (bottom_up 000006 — ai_ml_startup)

Question:

What’s the current state of the batch-inference-v2 feature flag? Has it actually been enabled or configured for any users, or was it just created and left in a default-off state?

Answer type: factual · Difficulty: hard · Reasoning: contradiction, cross-system · Hops: 2Gold answer:

The flag batch-inference-v2 is effectively disabled — created by System but never enabled or configured with rollout targeting, so the feature is unreachable for all users until targeting rules or an explicit enable are applied.

Gold sources: batch-inference-v2 (flag record + audit log). Planted distractors: GPU-POLICY, ONCALL-POLICY, and three unrelated record ids — citing any of them incurs the distractor penalty.

Agent trajectory (bottom_up 000006 — 7 turns, 2 tool calls, read-only)

System (reconstructed — see Provenance):

You are an enterprise operations analyst with read-only access to the company’s internal systems… Ground every claim in tool results; cite the source records (IDs/keys) you relied on. If the evidence is insufficient, say so rather than guess.

User:

What’s the current state of the batch-inference-v2 feature flag? …

Step 1 — Read the flag:Retrieve the flag details to check its current state and configuration.

launchdarkly__get_flag({"flag_key": "batch-inference-v2"})

Tool result:

{"flag_key": "batch-inference-v2", "kind": "boolean",
 "environments_json": {"production": {"on": false}, "staging": {"on": false}},
 "created_by": "System", "created": "2026-05-23 05:08:02", "modified": "2026-05-23 05:08:02"}

Step 2 — Don’t trust the config alone; check the audit log:It’s off in both envs, but I need the audit log to confirm whether it was ever toggled or just created.

launchdarkly__list_audit_log({"flag_key": "batch-inference-v2", "limit": 50})

Tool result: one entry — create by System at 2026-05-23 05:08:02, no later modifications.Step 3 — Final answer:

The batch-inference-v2 flag was created and left in a default-off state. Both production and staging show on: false; the audit log has a single create action by System on 2026-05-23 with no subsequent toggle or config change (created and modified timestamps are identical).

The agent resolves the contradiction correctly — an existing flag is not an enabled flag — by cross-checking the config against the audit log, exactly what the task probes.

Grading & self-check (bottom_up 000006 — reward 0.8 = full marks)

The evaluator inlines the recall scorer from EnvSim and exposes the live contract:

evaluator.score(agent_answer, agent_sources, agent_distractor_ids=None)  # -> reward in [0,1]

Re-scoring the shipped reference answer reproduces the reference reward:

{ "entity_recall": 1.0, "fact_recall": 1.0, "source_recall": 1.0,
  "distractor_penalty": 0.0, "reward": 0.8 }

0.8 is a perfect score here. Per EnvSim’s scoring, a perfect factual answer maxes at 0.80 (a perfect count at 0.85) at difficulty=hard — these are the weight ceilings, not partial credit. Running python3 evaluator_000006.py materializes this self-check and confirms the scorer reproduces reference_reward.

Scoring note

Each bottom_up/evaluator_NNNNNN.py is self-contained — the recall-scoring functions are inlined from EnvSim’s eval_scoring.py — and ships a __main__ self-check that re-scores the reference answer and reproduces its reward. Full marks are family-capped: factual → 0.80, count → 0.85 at difficulty=hard (weight ceilings × the hard multiplier of 1.0), so a shipped reference_reward of 0.8 is a perfect answer, not a near-miss. The top_down evaluators grade the realized world: a DB-diff against REFERENCE_DIFF plus the LLM RUBRICS.

Verification

Sampled by signal, delivered by inspection. Candidates were first coarse-filtered (top_down: non-LQ validation reports scoring 1.0; bottom_up: the perfect-score rollout set), then every delivered trajectory was read end-to-end by an auditing sub-agent against four checks: the task actually completed; no hallucinated tool results; every argument is grounded in a real tool result; and the evaluator is sound — would a wrong run actually fail? One bottom_up “perfect” candidate (dev_tools_company, EVAL-70B9C2) was rejected as a false positive — its gold answer was contradicted by the environment — and replaced. The remaining 20 passed.

Provenance

top_down turns are generated trajectories; tool calls and arguments are grounded in the simulated environment’s real responses.
bottom_up user / assistant / tool turns and the final answer are verbatim from the EnvSim rollout. Only the system turn is a faithful reconstruction — the rollout records don’t store the original system prompt, so the read-only-analyst preamble is reproduced, not invented from the answer.

Download

# Download Enterprise Bench demo data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "enterprise/*"

Browse on Hugging Face

View Enterprise Bench files

For the complete Enterprise Bench corpus — its 27 environments, difficulty profile, and benchmark results — see the Full Dataset page.

Eigen AI

API Reference

Platform

Products

Overview

Data

What each folder holds

Operations samples (`top_down`)

Investigation samples (`bottom_up`)

Scoring note

Verification

Provenance

Download

Browse on Hugging Face

​Overview

​Data

​What each folder holds

​Operations samples (top_down)

​Investigation samples (bottom_up)

​Scoring note

​Verification

​Provenance

​Download

Browse on Hugging Face

Overview

Data

What each folder holds

Operations samples (`top_down`)

Investigation samples (`bottom_up`)

Scoring note

Verification

Provenance

Download