top_down/— 10 multi-system operations & coordination (WRITE) trajectories, generated by EigenData-CLI.bottom_up/— 10 read-only investigation & QA trajectories, mined from the same environments and rolled out by EnvSim.
Overview
| Property | Value |
|---|---|
| Packages | top_down (10 write trajectories) + bottom_up (10 read-only QA), 000001–000010 each |
| Environments | top_down: named companies (Lattice AI, Cartable, NexusTech, RideGrid, …). bottom_up: the 27-vertical keys (ai_ml_startup, govtech_platform, mobility_fleet, …) |
| Layout | Four folders (intent / datapoints / evaluators / reference_payloads) joined by a shared NNNNNN id |
| Grading | top_down: DB-diff vs golden delta + LLM rubric. bottom_up: inlined recall scorer, self-contained, with a __main__ self-check that reproduces the reference reward |
| Source (bottom_up) | EnvSim rollouts, qwen3-5-397b, perfect-score set; verbatim turns, reconstructed system preamble |
| Tool schema | Per sample, every tool of every MCP the trajectory touched (the action space, not a call log) |
Data
NNNNNN id refers to one sample across all four files. The task’s environment (vertical) is a field inside intent/ ("simulation": "ai_ml_startup"), not a directory.
What each folder holds
| Folder | Role | top_down (WRITE) | bottom_up (READ-ONLY QA) |
|---|---|---|---|
intent/ | the task | goal / persona / motivations / constraints | question / answer_type / difficulty / reasoning_required / hop_count |
datapoints/ | the SFT trajectory | formatted_dialog (system / user / assistant +reasoning +tool_calls / tool) + tool_schema | same shape, normalized from the rollout |
evaluators/ | the reward script | REFERENCE_DIFF + RUBRICS (DB-diff + LLM rubric) | inlined gold + inlined recall scorer (self-contained, __main__ self-check) |
reference_payloads/ | the ground truth | golden DB snapshots (per-MCP table counts + the changed rows, with content) | gold_answer + gold_sources + the specific evidence[] records the answer depends on |
reference_payloads. A write task has a golden DB delta — the exact rows it should insert/update/delete. A read-only QA task never mutates the DB, so its analog is the gold answer plus the specific evidence records it depends on; the full simulation DB is not shipped, because only the touched records matter for grading.
A
top_down reference_payload is the golden delta, not a full DB dump: per system it holds initial_state / final_state table row-counts and a finalization.diff of the rows the trajectory changed, with content. Records the agent only read are absent — they live in the source environment, not the payload.Operations samples (top_down)
All 10 top_down tasks
All 10 top_down tasks
| ID | Company | Goal (abridged) | Systems | Turns / calls |
|---|---|---|---|---|
000001 | Lattice AI | Update the Model-Rollback Guru card with INT8 quantization steps, then propagate to related Notion docs and GitHub issues | figma, github, greenhouse, guru, notion | 49 / 26 |
000002 | Lattice AI | vLLM CVE triage: link INF-19/INF-21 to INF-18 as duplicates and PLAT-11 as related, create a CRM contact, post to #general, set Slack status, start the batch experiment + record the baseline | gmail, linear, optimizely, salesforce, slack | 30 / 19 |
000003 | Ripple Analytics | Investigate the Orbitra Health re-engagement situation across systems, then file a Salesforce support case with the discovered context | amplitude, guru, pagerduty, salesforce, sonarqube | 31 / 14 |
000004 | Cartable | Pre-launch readiness check on the checkout experiment, then launch it, record the 0% coverage gap, and set the SonarQube gate to WARN | gmail, jira, lattice, optimizely, sonarqube | 21 / 12 |
000005 | Lattice AI | Billing-RFC prep: verify cross-system readiness, grant Fiona Notion access, flag GPU-quota risk on the CRM account, log the prep activity, check the calendar | google_calendar, grafana, notion, salesforce, sonarqube | 25 / 10 |
000006 | Bridgewater Relief | Compile a security audit note on a PagerDuty incident, pulling Notion policy, a Snyk review, and Stripe billing status | bitwarden, notion, pagerduty, snyk, stripe | 29 / 17 |
000007 | NexusTech | Create a Q1-2026 Lattice objective for Observability, nested under an existing objective, after a chain of readiness checks | amplitude, greenhouse, lattice, optimizely, salesforce | 25 / 15 |
000008 | RideGrid | Onboard Lakeside Shuttle Co. across billing, CRM, and calendar; handle a conflicting meeting by attempted-delete-then-fallback | argocd, google_calendar, salesforce, snyk, stripe | 22 / 12 |
000009 | Cartable | Create a Cyber-Monday-prep Google Doc synthesizing post-Black-Friday status from accessible sources; add a CRM note; set a Slack status | google_drive, optimizely, salesforce, slack | 24 / 15 |
000010 | Forge CLI | Review a migration-guide PR (#6) with customer-impact context, set a Slack status, start an experiment, create a weekend hotfix on-call schedule | github, optimizely, pagerduty, salesforce, slack | 29 / 15 |
Full task: Billing-RFC prep (top_down 000005)
Full task: Billing-RFC prep (top_down 000005)
Persona: Jasmine Wu, Head of Product / Co-founder at an AI-inference startup.Goal:
Golden delta (
Prepare for the Monday billing RFC sync with Elena by gathering a cross-system readiness picture (code quality of the PLAT project, GPU capacity from Grafana, affected-account status in Salesforce), then grant Fiona Walsh access to the Product Notion space so she can review the GPU-quota meeting notes, update the DocuSearch AI Salesforce account to reflect the GPU-quota risk, log the prep activity against that account, and find a slot to reschedule the sync if the calendar conflicts.Rubric (8 criteria —
goal / process / process_order):| # | Type | Criterion |
|---|---|---|
| R1 | goal | Fiona Walsh is added to the Product Notion space |
| R2 | goal | The DocuSearch AI Salesforce account reflects the GPU-quota risk |
| R3 | goal | A prep activity is logged against that account |
| R4 | process | PLAT code-quality metrics are checked in SonarQube |
| R5 | process | The GPU Cluster Health dashboard is retrieved from Grafana |
| R6 | process | The GPU-quota meeting notes are found in Notion |
| R7 | goal | The calendar is checked for conflicts around the sync |
| R8 | process_order | The account ID is discovered via search before any write against it |
reference_payload): crm_notes +1 (GPU-risk note), crm_activities +1 (prep activity), document_spaces +1 member (Fiona) — graded against REFERENCE_DIFF.Agent trajectory (top_down 000005 — 25 turns, 10 tool calls, 5 systems)
Agent trajectory (top_down 000005 — 25 turns, 10 tool calls, 5 systems)
Assistant turns carry Step 3 — Flag the risk on the discovered account (write after search — satisfies R8):
reasoning (chain-of-thought); reasoning is shown in italics.User:I want to make sure I’m prepped before my billing RFC sync with Elena on Monday… [readiness checks, then grant Fiona access, flag the GPU risk on the account, log the prep, check my calendar].Step 1 — Gather readiness context (3 parallel reads):Tool-first; the request spans multiple systems. Note: there’s no monitoring tool for GPU cluster health yet, so I should not invent that data.
So far: PLAT’s SonarQube gate is passing (0 bugs / 0 vulns / 0 smells) but coverage is 0.0%, so the readiness signal is weak. I found the Product Notion page Fiona needs, and the DocuSearch AI account.User: Yes, go ahead with both. And try Grafana for the GPU Cluster Health dashboard.Step 2 — First write + Grafana grounding:
Done — Fiona has Product Notion access, and I flagged the quota risk on the DocuSearch AI account with the billing/GPU context.User: Did you also log the prep activity for audit purposes? And check my calendar for conflicts.Step 4 — Calendar check (no fabrication):
Not yet — I added the note but haven’t logged a separate activity record. On the calendar I see only the Monday 13:00–13:30 UTC sync, no conflict.User: Yes, log that activity. No reschedule needed — thanks.Step 5 — Final write:
All set — prep activity logged on the DocuSearch AI account, and the Monday sync is conflict-free.Note the agent stays tool-first, performs writes only after the user confirms them, and — when asked for GPU health with no monitoring tool available — states the limitation instead of fabricating numbers.
Investigation samples (bottom_up)
All 10 bottom_up tasks
All 10 bottom_up tasks
| ID | Environment | Question (abridged) | Type | Reasoning | Hops |
|---|---|---|---|---|---|
000001 | creative_agency | Status of all incidents tied to the Asana API rate-limit issue staling Tapestry burn data — how many unresolved, all linked to tracking tickets? | count | aggregation, cross-system | 2 |
000002 | nonprofit_ngo | How many active incidents are open for the FieldPulse PostgreSQL latency issue — any resolved, all linked to one ticket? | count | aggregation, cross-system | 2 |
000003 | b2b_saas_startup | Duplicate tickets around PostgreSQL event-storage / batch-ingestion — are Lin Wei and Ravi Patel coordinating or siloed? | factual | duplicate_effort, multi-hop, temporal | 3 |
000004 | govtech_platform | Are Cameron Reyes and Christine Park coordinating on civicflow-workflow-engine work, or do their tickets overlap silently? | factual | duplicate_effort, multi-hop, temporal | 3 |
000005 | mobility_fleet | Who owns the AV Operations Safety & Data Policy, and who’s the dev on the geofence-management API that ties into it? | factual | ownership_resolution, cross-system | 2 |
000006 | ai_ml_startup | Current state of the batch-inference-v2 feature flag — enabled/configured, or created and left default-off? | factual | contradiction, cross-system | 2 |
000007 | biotech_rnd | Is the Forge Scoring Model v3 flag actually rolling out, or created without configuration? | factual | contradiction, cross-system | 2 |
000008 | edtech_platform | State of new_assessment_engine — has anyone enabled/configured it? Check the audit history. | factual | contradiction, cross-system | 2 |
000009 | freight_logistics | Status of the automated denied-party-screening flag — ever enabled/modified, or no audit history? | factual | contradiction, cross-system | 2 |
000010 | proptech_marketplace | Actual status of new_search_ui — ever toggled / rollout %, or sitting since creation? | factual | contradiction, cross-system | 2 |
000006–000010) share a contradiction pattern: a flag can exist yet be effectively off, so the agent must reconcile the flag’s nominal config against its audit log rather than trust either alone.Full task: Feature-flag state (bottom_up 000006 — ai_ml_startup)
Full task: Feature-flag state (bottom_up 000006 — ai_ml_startup)
Question:
What’s the current state of the batch-inference-v2 feature flag? Has it actually been enabled or configured for any users, or was it just created and left in a default-off state?
Answer type: factual · Difficulty: hard · Reasoning: contradiction, cross-system · Hops: 2Gold answer:
The flag batch-inference-v2 is effectively disabled — created by System but never enabled or configured with rollout targeting, so the feature is unreachable for all users until targeting rules or an explicit enable are applied.
Gold sources: batch-inference-v2 (flag record + audit log). Planted distractors: GPU-POLICY, ONCALL-POLICY, and three unrelated record ids — citing any of them incurs the distractor penalty.Agent trajectory (bottom_up 000006 — 7 turns, 2 tool calls, read-only)
Agent trajectory (bottom_up 000006 — 7 turns, 2 tool calls, read-only)
System (reconstructed — see Provenance):Tool result:Step 2 — Don’t trust the config alone; check the audit log:It’s off in both envs, but I need the audit log to confirm whether it was ever toggled or just created.Tool result: one entry —
You are an enterprise operations analyst with read-only access to the company’s internal systems… Ground every claim in tool results; cite the source records (IDs/keys) you relied on. If the evidence is insufficient, say so rather than guess.User:
What’s the current state of the batch-inference-v2 feature flag? …
Step 1 — Read the flag:Retrieve the flag details to check its current state and configuration.create by System at 2026-05-23 05:08:02, no later modifications.Step 3 — Final answer:TheThe agent resolves the contradiction correctly — an existing flag is not an enabled flag — by cross-checking the config against the audit log, exactly what the task probes.batch-inference-v2flag was created and left in a default-off state. Both production and staging showon: false; the audit log has a singlecreateaction by System on 2026-05-23 with no subsequent toggle or config change (created and modified timestamps are identical).
Grading & self-check (bottom_up 000006 — reward 0.8 = full marks)
Grading & self-check (bottom_up 000006 — reward 0.8 = full marks)
The evaluator inlines the recall scorer from EnvSim and exposes the live contract:Re-scoring the shipped reference answer reproduces the reference reward:0.8 is a perfect score here. Per EnvSim’s scoring, a perfect
factual answer maxes at 0.80 (a perfect count at 0.85) at difficulty=hard — these are the weight ceilings, not partial credit. Running python3 evaluator_000006.py materializes this self-check and confirms the scorer reproduces reference_reward.Scoring note
Eachbottom_up/evaluator_NNNNNN.py is self-contained — the recall-scoring functions are inlined from EnvSim’s eval_scoring.py — and ships a __main__ self-check that re-scores the reference answer and reproduces its reward. Full marks are family-capped: factual → 0.80, count → 0.85 at difficulty=hard (weight ceilings × the hard multiplier of 1.0), so a shipped reference_reward of 0.8 is a perfect answer, not a near-miss. The top_down evaluators grade the realized world: a DB-diff against REFERENCE_DIFF plus the LLM RUBRICS.
Verification
Sampled by signal, delivered by inspection. Candidates were first coarse-filtered (top_down: non-LQ validation reports scoring 1.0; bottom_up: the perfect-score rollout set), then every delivered trajectory was read end-to-end by an auditing sub-agent against four checks: the task actually completed; no hallucinated tool results; every argument is grounded in a real tool result; and the evaluator is sound — would a wrong run actually fail?
One bottom_up “perfect” candidate (dev_tools_company, EVAL-70B9C2) was rejected as a false positive — its gold answer was contradicted by the environment — and replaced. The remaining 20 passed.
Provenance
top_downturns are generated trajectories; tool calls and arguments are grounded in the simulated environment’s real responses.bottom_upuser / assistant / tool turns and the final answer are verbatim from the EnvSim rollout. Only thesystemturn is a faithful reconstruction — the rollout records don’t store the original system prompt, so the read-only-analyst preamble is reproduced, not invented from the answer.
Download
Browse on Hugging Face
View Enterprise Bench files
For the complete Enterprise Bench corpus — its 27 environments, difficulty profile, and benchmark results — see the Full Dataset page.