Inspired by the APEX benchmark, this dataset was synthesized entirely from scratch by EigenData-CLI — all environments, tasks, and evaluation rubrics are original. The full dataset spans three professional domains: investment banking, law, and management consulting. The demo samples below showcase the management consulting cases — Project Terrace (Floor & Decor retail strategy) and Project Roku (CTV platform analysis) — where the agent must reason over store-level spreadsheets, SEC filings, industry research, customer surveys, and competitor benchmarks.

Overview

| Property | Value |
| --- | --- |
| Domains | Investment Banking, Law, Management Consulting |
| Demo domain | Management Consulting |
| Scenario | Floor & Decor (FND) operational performance deep-dive, Project Terrace |
| Tasks | 10 consulting analysis tasks with rubric-based grading |
| Task types | Inventory risk analysis, competitor benchmarking, store productivity, installed sales economics, NPS computation, CTV market analysis, ad revenue modeling |

Environment

The demo showcases two management consulting environments: Project Terrace (Floor & Decor retail optimization, 104 files) and Project Roku (CTV/streaming platform strategy). The sample below uses Project Terrace:
apex-agent/
├── project_terrace/
│   └── filesystem/
│       ├── 1. Project Admin/          # Engagement charter, timeline, team roster (DOCX)
│       ├── 2. Company Filings/        # SEC 10-K filings for 7 companies (PDF)
│       │   ├── Floor & Decor/
│       │   ├── Home Depot/
│       │   ├── LL Flooring/
│       │   ├── Lowes/
│       │   ├── Mohawk Industries/
│       │   ├── MSC Industrial/
│       │   └── Tile Shop Holdings/
│       ├── 3. Industry Research/      # BLS, Census, FRED, Harvard JCHS data (PDF/XLSX)
│       ├── 4. Analysis Models/        # 30+ FND analysis spreadsheets (XLSX/CSV)
│       ├── 5. Surveys & Research Notes/ # Analyst notes, customer surveys (DOCX/XLSX)
│       ├── 6. Deliverables/           # SteerCo deck, memos, playbooks (PPTX/DOCX)
│       └── 7. Reference & Frameworks/ # Methodology docs, glossary (DOCX)
└── project_roku/
    └── filesystem/                    # Roku CTV platform data (similar structure)
Project Admin (5 DOCX): Engagement charter, kickoff meeting notes, project timeline, scope & work plan, and team roster for Project Terrace.
Company Filings (21 PDFs): 10-K annual filings for 7 comparable companies, covering 2021–2025: Floor & Decor (FND), Home Depot (HD), LL Flooring (LL), Lowe’s (LOW), Mohawk Industries (MHK), MSC Industrial (MSM), and Tile Shop Holdings (TTS).
Industry Research (14 files): Government and academic data sources: BLS employment, consumer expenditures, CPI shelter, JOLTS, occupational wage surveys (OEWS); Census construction spending, retail sales, e-commerce; FRED housing starts, retail sales; Harvard JCHS state housing reports.
Analysis Models (30+ XLSX + 2 CSV): Store-level and company-level analysis spreadsheets including:
  • FND_Installed_Sales_Economics.xlsx — 250 stores with margin, callback, and revenue data
  • FND_Store_Performance_v3.xlsx — store productivity metrics by region
  • FND_Inventory_Analysis.xlsx — inventory turns, aging, slow-moving SKU analysis
  • FND_Competitor_Benchmarking.xlsx — peer comparison across key metrics
  • FND_Customer_Cohort_Analysis.xlsx, FND_Gross_Margin_Bridge.xlsx, FND_Scenario_Sensitivity_Model.xlsx, and 20+ more
  • Company financial extracts for HD, LL, LOW, MHK, MSM, TTS
Surveys & Research Notes (10 files): Analyst notes for each peer company, customer survey data and column guide, expert call notes on the flooring industry.
Deliverables (5 files): Competitive landscape analysis (PPTX), executive summary memo, implementation playbook, SteerCo presentation, and store productivity deep dive.
Reference & Frameworks (7 DOCX): Benchmarking methodology, connected customer analysis framework, industry glossary, pro customer economics, renovation ROI framework, SG&A efficiency analysis, and store classification framework.

Data

| File | Description |
| --- | --- |
| tasks_and_rubrics.json | 10 tasks with prompts, gold responses, and rubric criteria |
| trajectories.jsonl | Agent trajectories with chain-of-thought reasoning and tool calls |
| tool_schemas.json | Tool definitions available to the agent (filesystem, spreadsheets, code execution, documents) |
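A minimal sketch of reading these files in Python, assuming tasks_and_rubrics.json and tool_schemas.json are each a single JSON document and trajectories.jsonl holds one JSON object per line (the usual meaning of the .jsonl extension). The helper names `load_json` and `load_jsonl` are hypothetical, and the field layout inside each record is not documented here, so the helpers return the parsed objects without assuming any keys.

```python
import json
from pathlib import Path


def load_json(path):
    """Load a single JSON document, e.g. tasks_and_rubrics.json."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def load_jsonl(path):
    """Load a JSON Lines file, e.g. trajectories.jsonl, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

Typical use would be `tasks = load_json("tasks_and_rubrics.json")` followed by `trajectories = load_jsonl("trajectories.jsonl")`, then inspecting the first record to discover the actual keys.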

Sample

| Task | World | Description | Expected Output |
| --- | --- | --- | --- |
| terrace_01 | Project Terrace | Inventory risk analysis by region | SE inventory $46.5M, 16 problem stores, worst FND-187 |
| terrace_02 | Project Terrace | Benchmark Floor & Decor against peers | FND growth 3.7%, LL Flooring -15.7%, Tile Shop GM 64.9% |
| terrace_03 | Project Terrace | Store productivity and profitability by region | NE $58.28/sqft, MW $54.02/sqft, company avg $56.27/sqft |
| terrace_04 | Project Terrace | Rank stores by net installation margin | Best FND-109 at 12.0%, worst FND-180 at 5.0%, avg 8.7% |
| terrace_05 | Project Terrace | Compute NPS from raw customer survey data | DIY NPS -43.1, Design-Assisted -43.8, Pro -47.5 |
| roku_01 | Project Roku | FAST channel lineup economics | Top: Entertainment $84.0M, News $50.4M |
| roku_02 | Project Roku | CTV platform market concentration (HHI) | HHI 1500, CR3 56% |
| roku_03 | Project Roku | CTV viewer and advertiser NPS | Viewer -39.2, Advertiser -54.0, Overall -42.9 |
| roku_04 | Project Roku | Quarterly roll-up of Roku Channel performance | Q2 2023 Ad Rev $545.7M, margin 68.1% |
| roku_05 | Project Roku | Cross-analysis of advertising categories | Top efficiency: Automotive (5.04), CPG/FMCG (4.80) |
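Several of these tasks rest on standard metrics. As a rough illustration, the sketch below uses the conventional definitions: NPS as the percentage of promoters (ratings 9 to 10) minus the percentage of detractors (ratings 0 to 6) on a 0 to 10 scale, HHI as the sum of squared market shares expressed in percent, and CR3 as the combined share of the three largest players. Whether the dataset's surveys and market tables follow exactly these conventions is an assumption, and the inputs are made up for illustration.

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("scores must not be empty")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def hhi(shares_pct):
    """Herfindahl-Hirschman Index: sum of squared shares (shares in percent)."""
    return sum(s * s for s in shares_pct)


def concentration_ratio(shares_pct, n=3):
    """CR-n: combined share of the n largest firms, in percent."""
    return sum(sorted(shares_pct, reverse=True)[:n])


# Hypothetical inputs, not taken from the dataset.
ratings = [10, 9, 8, 7, 6, 3, 2, 0]   # 2 promoters, 4 detractors out of 8
shares = [40, 30, 20, 10]             # market shares in percent

print(nps(ratings))                   # -25.0
print(hhi(shares))                    # 3000
print(concentration_ratio(shares))    # 90
```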
Prompt:
Which stores have the best and worst economics on installed sales? Rank all stores by net installation margin, then give me these key highlights: the #1 store and its margin, the lowest-margin store and its margin, the portfolio average net installation margin, and the gap in percentage points between the best and worst store. Also report the installed revenue for FND-194 and FND-144 specifically, and the callback rate for FND-307. Round percentages to one decimal place and revenue to the nearest dollar. Reply here.
Expected output:
  • Top store: FND-109 with net installation margin of 12.0%
  • Lowest store: FND-180 with net installation margin of 5.0%
  • Portfolio avg net installation margin: 8.7%
  • Gap between best and worst: 7.0pp
  • FND-194 installed revenue: $5,223,275
  • FND-144 installed revenue: $4,933,729
  • FND-307 callback rate: 7.4%
Rubric criteria:
  1. States that FND-109 is the top-ranked store with net installation margin of 12.0%
  2. States that FND-180 is among the lowest net installation margin stores at 5.0%
  3. States that portfolio average net installation margin is 8.7%
  4. States that the gap between best and worst store is 7.0pp
  5. States that FND-194 has installed revenue of $5,223,275
  6. States that FND-307 has a callback rate of 7.4%
  7. States that FND-144 has installed revenue of $4,933,729
Each assistant message includes reasoning_content (chain-of-thought) showing the agent’s step-by-step analytical reasoning. Reasoning is shown in italics below.
System:
You are an AI assistant with access to Code Execution, Excel/Spreadsheets, Filesystem, and Word/Documents. You MUST use tools to explore files and data before answering. NEVER assume. Keep full precision through intermediate calculations and only round final reported values.
User:
Which stores have the best and worst economics on installed sales? Rank all stores by net installation margin, then give me these key highlights: the #1 store and its margin, the lowest-margin store and its margin, the portfolio average net installation margin, and the gap in percentage points between the best and worst store. Also report the installed revenue for FND-194 and FND-144 specifically, and the callback rate for FND-307. Round percentages to one decimal place and revenue to the nearest dollar. Reply here.
Assistant:
Step 1: Explore directory tree
The user is asking about installed sales economics across stores, specifically ranking all stores by net installation margin, identifying best/worst, computing the portfolio average, finding the gap, plus specific revenue figures for FND-194 and FND-144, and the callback rate for FND-307. I need to locate the relevant data file(s) first…
I’ll start by exploring the filesystem to find relevant files, then thoroughly analyze the data.
filesystem_get_directory_tree({"path": "/", "max_depth": 3})
Tool result:
/
├── 1. Project Admin/
│   ├── Engagement Charter - Project Terrace.docx
│   ├── Meeting Notes - Week 1 Kickoff.docx
│   └── ...
├── 2. Company Filings/
│   ├── Floor & Decor/ (FND 10-K 2023–2025)
│   ├── Home Depot/ ...
│   └── ...
├── 3. Industry Research/ ...
├── 4. Analysis Models/
│   ├── FND_Installed_Sales_Economics.xlsx
│   ├── FND_Store_Performance_v3.xlsx
│   └── ... (30+ analysis files)
├── 5. Surveys & Research Notes/ ...
├── 6. Deliverables/ ...
└── 7. Reference & Frameworks/ ...
Step 2: List spreadsheet tabs
The directory tree reveals a file that’s an exact match for the user’s question: /4. Analysis Models/FND_Installed_Sales_Economics.xlsx. “Installed Sales Economics” maps directly to the request for net installation margins, callback rates, and installed revenue by store…
The key file is FND_Installed_Sales_Economics.xlsx. Let me explore it.
excel_list_tabs_in_spreadsheet({"file_path": "/4. Analysis Models/FND_Installed_Sales_Economics.xlsx"})
Tool result:
{"worksheets": [
  {"name": "Installed_Sales_Econ", "index": 0, "row_count": 251, "column_count": 11}
]}
Step 3: Read the full spreadsheet
Single tab called ‘Installed_Sales_Econ’ with 251 rows and 11 columns. That’s 1 header row + 250 data rows, a manageable size to read in full. With 11 columns, the data likely has all the metrics I need in one consolidated table…
excel_read_tab({"file_path": "/4. Analysis Models/FND_Installed_Sales_Economics.xlsx", "tab_index": 0})
Tool result:
Store_ID  Installed_Revenue  Product_Margin_%  Installation_Markup_%
Subcontractor_Cost  Net_Installation_Margin_%  Callback_Rate_%
Callback_Cost_per_Job  Active_Subcontractors  Avg_Project_Size  Projects_per_Month

FND-101   3,283,341   0.442   0.273   1,608,686   0.097   0.119   403   9    5,932   74
FND-102   3,001,568   0.439   0.348   1,395,241   0.058   0.110   263   23   2,525   58
FND-103   2,195,965   0.403   0.329     949,084   0.088   0.083   161   21   7,996   26
... (250 stores total — FND-101 through FND-350)
FND-350   2,165,799   0.462   0.393     878,750   0.116   0.041   303   17   7,852   67
Step 4: Execute Python computation
Now I have the full dataset: 250 stores (FND-101 through FND-350) with 11 columns. The columns relevant to the user’s question are: column B (Installed_Revenue), column F (Net_Installation_Margin_%), and column G (Callback_Rate_%). I need to rank all 250 stores by column F…
Now I have all the data. Let me run a Python script to compute the rankings and extract the specific values requested.
code_execution_code_exec({"request": {"code": "# Rank stores by net installation margin, extract highlights ..."}})
Tool result:
Total stores: 250

#1 (Best) Store: FND-109 — Net Installation Margin: 12.0%
Worst Store: FND-293 — Net Installation Margin: 5.0%
Portfolio Average Net Installation Margin: 8.7%
Gap (Best - Worst): 7.0 percentage points
FND-144 Installed Revenue: $4,933,729
FND-194 Installed Revenue: $5,223,275
FND-307 Callback Rate: 7.4%

--- Checking for ties ---
Stores tied at top (0.12): ['FND-109', 'FND-144', 'FND-153']
Stores tied at bottom (0.05): ['FND-180', 'FND-293']

--- Full Ranking (Top 10 and Bottom 10) ---
  Rank 1: FND-109 — 12.0%
  Rank 2: FND-144 — 12.0%
  Rank 3: FND-153 — 12.0%
  ...
  Rank 249: FND-180 — 5.0%
  Rank 250: FND-293 — 5.0%
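The script passed to code_execution_code_exec is elided in the transcript. The following is a minimal sketch of the kind of computation it might perform, not the agent's actual code. It runs on six rows that appear elsewhere in this transcript rather than on the full 250-store sheet, and the tie-break rule (store ID ascending) is an assumption chosen because it reproduces the rank order printed above.

```python
# (store_id, installed_revenue, net_installation_margin, callback_rate)
# Six rows copied from the transcript; the real sheet has 250 rows.
rows = [
    ("FND-109", 1_930_484, 0.12, 0.073),
    ("FND-144", 4_933_729, 0.12, 0.049),
    ("FND-153", 1_510_841, 0.12, 0.120),
    ("FND-180", 3_081_065, 0.05, 0.060),
    ("FND-293", 2_238_899, 0.05, 0.052),
    ("FND-307", 2_781_939, 0.051, 0.074),
]

# Rank by margin, highest first; ties broken by store ID (assumed rule).
ranked = sorted(rows, key=lambda r: (-r[2], r[0]))

best_margin = ranked[0][2]
worst_margin = ranked[-1][2]

# Report every store sharing the extreme values so ties are visible.
tied_top = [r[0] for r in ranked if r[2] == best_margin]
tied_bottom = [r[0] for r in ranked if r[2] == worst_margin]

# Keep full precision until the final rounding step.
gap_pp = round((best_margin - worst_margin) * 100, 1)

by_id = {r[0]: r for r in rows}
print(f"Top: {tied_top} at {best_margin:.1%}")
print(f"Bottom: {tied_bottom} at {worst_margin:.1%}")
print(f"Gap: {gap_pp} pp")
print(f"FND-144 installed revenue: ${by_id['FND-144'][1]:,}")
print(f"FND-307 callback rate: {by_id['FND-307'][3]:.1%}")
```

Listing all tied stores, rather than only the first and last rows of the sort, matters here because the rubric names FND-180 as the lowest-margin store while a plain sort places FND-293 last.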
Step 5: Verify tied stores and callback rate (6 parallel reads)
The code found a three-way tie at the top: FND-109, FND-144, and FND-153 all at 0.12 (12.0%), and a two-way tie at the bottom: FND-180 and FND-293 both at 0.05 (5.0%). But these margin values were stored as rounded decimals. Let me verify the exact stored values to confirm tie-breaking…
There’s a three-way tie at the top (12.0%) and a two-way tie at the bottom (5.0%). Let me verify the exact stored values to confirm the tie-breaking and also double-check the callback rate for FND-307.
excel_read_tab({"file_path": "...", "cell_range": "A10:K10"})    // FND-109
excel_read_tab({"file_path": "...", "cell_range": "A45:K45"})    // FND-144
excel_read_tab({"file_path": "...", "cell_range": "A54:K54"})    // FND-153
excel_read_tab({"file_path": "...", "cell_range": "A81:K81"})    // FND-180
excel_read_tab({"file_path": "...", "cell_range": "A194:K194"})  // FND-293
excel_read_tab({"file_path": "...", "cell_range": "A208:K208"})  // FND-307
Tool results:
FND-109   1,930,484   0.399   0.265   907,674   0.12   0.073   428   16   7,071   31
FND-144   4,933,729   0.452   0.332   2,111,506   0.12   0.049   263   24   6,241   35
FND-153   1,510,841   0.440   0.398   539,545   0.12   0.120   472   21   3,865   54
FND-180   3,081,065   0.390   0.373   1,266,627   0.05   0.060   375   8   6,243   77
FND-293   2,238,899   0.470   0.367   784,513   0.05   0.052   417   12   3,170   53
FND-307   2,781,939   0.473   0.326   1,277,247   0.051   0.074   264   9   6,298   70
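The cell ranges in these six reads follow from the sheet layout: one header row, then stores FND-101 through FND-350 in order, across columns A to K. A small sketch of that mapping is shown below; the function name `store_row_range` and its parameters are hypothetical, and it assumes the store IDs are contiguous and sorted in the sheet.

```python
def store_row_range(store_id, first_id=101, header_rows=1, last_col="K"):
    """Map a store ID such as 'FND-109' to its single-row cell range.

    Assumes IDs are contiguous from first_id and sorted in sheet order,
    with header_rows header rows above the first data row.
    """
    number = int(store_id.split("-")[1])
    if number < first_id:
        raise ValueError(f"{store_id} precedes the first store ID")
    row = number - first_id + header_rows + 1
    return f"A{row}:{last_col}{row}"


for sid in ["FND-109", "FND-144", "FND-153", "FND-180", "FND-293", "FND-307"]:
    print(sid, store_row_range(sid))
```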
Step 6: Final answer
All six verification reads confirm the stored values exactly match the computation. The three top-margin stores all show F = 0.12 with no hidden precision differences: FND-109, FND-144, and FND-153 are genuinely tied. The two bottom stores both show F = 0.05: FND-180 and FND-293 are also a true tie.

Installed Sales Economics — Store Ranking Highlights

| Metric | Value |
| --- | --- |
| #1 Store (Best Margin) | FND-109: 12.0% (tied with FND-144 and FND-153) |
| Lowest-Margin Store | FND-180: 5.0% (tied with FND-293) |
| Portfolio Avg Net Installation Margin | 8.7% |
| Gap (Best vs. Worst) | 7.0 percentage points |

| Store | Metric | Value |
| --- | --- | --- |
| FND-194 | Installed Revenue | $5,223,275 |
| FND-144 | Installed Revenue | $4,933,729 |
| FND-307 | Callback Rate | 7.4% |
The 7.0 pp spread from best to worst signals meaningful variation in subcontractor cost management and installation markup discipline across the portfolio, suggesting room for margin improvement at the bottom of the distribution. FND-307 ranks near the bottom of the margin table (5.1%) and carries a 7.4% callback rate, which is roughly in line with the portfolio but paired with weak margin economics.
{
  "task_id": "terrace_04",
  "rubric_score": 1.0,
  "strict_pass": true,
  "scoring_values": {
    "passed_count": 7,
    "failed_count": 0,
    "total_count": 7,
    "grade_score_percentage": 100.0
  },
  "steps": 6,
  "tool_calls": 10,
  "grading_time_seconds": 18.2
}
Each rubric criterion is independently verified with a detailed rationale:
{
  "verifier_id": "ver_terrace_04_01",
  "score": 1.0,
  "grade": "pass",
  "rationale": "## Evidence\n\n... detailed analysis ...\n\n## Assessment\n\n... MEETS criterion ..."
}
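The two JSON records above suggest how per-criterion verdicts could roll up into a task-level score. The sketch below assumes an unweighted scheme: rubric_score is the fraction of criteria passed, and strict_pass requires every criterion to pass. That scheme is consistent with the 7-of-7 example shown, but it is an assumption rather than the documented grading logic, and the "fail" grade value and the `summarize` helper are hypothetical.

```python
def summarize(verifier_results):
    """Aggregate per-criterion verifier records into a task-level summary.

    Each record is expected to carry a "grade" key; only "pass" is counted
    as passing (the "fail" label used for other outcomes is an assumption).
    """
    total = len(verifier_results)
    passed = sum(1 for v in verifier_results if v["grade"] == "pass")
    failed = total - passed
    score = passed / total if total else 0.0
    return {
        "rubric_score": score,
        "strict_pass": total > 0 and failed == 0,
        "scoring_values": {
            "passed_count": passed,
            "failed_count": failed,
            "total_count": total,
            "grade_score_percentage": round(100.0 * score, 1),
        },
    }


# Seven passing criteria, mirroring the counts in the example above.
results = [{"verifier_id": f"v{i}", "grade": "pass"} for i in range(1, 8)]
print(summarize(results))
```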

Download

# Download APEX Agent data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "apex_agent/*"

Browse on Hugging Face

View APEX Agent files