APEX Agent

Inspired by the APEX benchmark, this dataset was synthesized entirely from scratch by EigenData-CLI — all environments, tasks, and evaluation rubrics are original. The full dataset spans three professional domains: investment banking, law, and management consulting. The demo samples below showcase the management consulting cases — Project Terrace (Floor & Decor retail strategy) and Project Roku (CTV platform analysis) — where the agent must reason over store-level spreadsheets, SEC filings, industry research, customer surveys, and competitor benchmarks.

Overview

Property	Value
Domains	Investment Banking, Law, Management Consulting
Demo domain	Management Consulting
Scenarios	Project Terrace (Floor & Decor (FND) operational performance deep-dive) and Project Roku (CTV platform analysis)
Tasks	10 consulting analysis tasks with rubric-based grading
Task types	Inventory risk analysis, competitor benchmarking, store productivity, installed sales economics, NPS computation, CTV market analysis, ad revenue modeling

Environment

The demo showcases two management consulting environments: Project Terrace (Floor & Decor retail optimization, 104 files) and Project Roku (CTV/streaming platform strategy). The sample below uses Project Terrace:

apex-agent/
├── project_terrace/
│   └── filesystem/
│       ├── 1. Project Admin/          # Engagement charter, timeline, team roster (DOCX)
│       ├── 2. Company Filings/        # SEC 10-K filings for 7 companies (PDF)
│       │   ├── Floor & Decor/
│       │   ├── Home Depot/
│       │   ├── LL Flooring/
│       │   ├── Lowes/
│       │   ├── Mohawk Industries/
│       │   ├── MSC Industrial/
│       │   └── Tile Shop Holdings/
│       ├── 3. Industry Research/      # BLS, Census, FRED, Harvard JCHS data (PDF/XLSX)
│       ├── 4. Analysis Models/        # 30+ FND analysis spreadsheets (XLSX/CSV)
│       ├── 5. Surveys & Research Notes/ # Analyst notes, customer surveys (DOCX/XLSX)
│       ├── 6. Deliverables/           # SteerCo deck, memos, playbooks (PPTX/DOCX)
│       └── 7. Reference & Frameworks/ # Methodology docs, glossary (DOCX)
└── project_roku/
    └── filesystem/                    # Roku CTV platform data (similar structure)

Environment details — 104 files across 7 categories (Project Terrace)

Project Admin (5 DOCX): Engagement charter, kickoff meeting notes, project timeline, scope & work plan, and team roster for Project Terrace.

Company Filings (21 PDFs): 10-K annual filings for 7 comparable companies, covering 2021–2025: Floor & Decor (FND), Home Depot (HD), LL Flooring (LL), Lowe’s (LOW), Mohawk Industries (MHK), MSC Industrial (MSM), and Tile Shop Holdings (TTS).

Industry Research (14 files): Government and academic data sources: BLS employment, consumer expenditures, CPI shelter, JOLTS, occupational wage surveys (OEWS); Census construction spending, retail sales, e-commerce; FRED housing starts, retail sales; Harvard JCHS state housing reports.

Analysis Models (30+ XLSX + 2 CSV): Store-level and company-level analysis spreadsheets including:

FND_Installed_Sales_Economics.xlsx — 250 stores with margin, callback, and revenue data
FND_Store_Performance_v3.xlsx — store productivity metrics by region
FND_Inventory_Analysis.xlsx — inventory turns, aging, slow-moving SKU analysis
FND_Competitor_Benchmarking.xlsx — peer comparison across key metrics
FND_Customer_Cohort_Analysis.xlsx, FND_Gross_Margin_Bridge.xlsx, FND_Scenario_Sensitivity_Model.xlsx, and 20+ more
Company financial extracts for HD, LL, LOW, MHK, MSM, TTS

Surveys & Research Notes (10 files): Analyst notes for each peer company, customer survey data and column guide, expert call notes on the flooring industry.

Deliverables (5 files): Competitive landscape analysis (PPTX), executive summary memo, implementation playbook, SteerCo presentation, and store productivity deep dive.

Reference & Frameworks (7 DOCX): Benchmarking methodology, connected customer analysis framework, industry glossary, pro customer economics, renovation ROI framework, SG&A efficiency analysis, and store classification framework.

Data

File	Description
`tasks_and_rubrics.json`	10 tasks with prompts, gold responses, and rubric criteria
`trajectories.jsonl`	Agent trajectories with chain-of-thought reasoning and tool calls
`tool_schemas.json`	Tool definitions available to the agent (filesystem, spreadsheets, code execution, documents)
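As a rough illustration of how these files might be read, the sketch below parses JSON Lines text one record per line and loads whichever of the three files are present. The local directory name `apex_agent` is an assumption taken from the download pattern, and the helper names `parse_jsonl` and `load_demo_files` are hypothetical. The record schemas are not documented here, so the code does not rely on any field names.

```python
import json
from pathlib import Path


def parse_jsonl(text):
    """Parse JSON Lines text: one JSON value per non-empty line."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def load_demo_files(root="apex_agent"):
    """Load whichever of the three data files exist under `root` (assumed path)."""
    root = Path(root)
    loaded = {}
    for name in ("tasks_and_rubrics.json", "tool_schemas.json"):
        path = root / name
        if path.exists():
            loaded[name] = json.loads(path.read_text(encoding="utf-8"))
    trajectories = root / "trajectories.jsonl"
    if trajectories.exists():
        loaded[trajectories.name] = parse_jsonl(
            trajectories.read_text(encoding="utf-8")
        )
    return loaded


if __name__ == "__main__":
    # Print each loaded file's name, container type, and top-level length.
    for name, content in load_demo_files().items():
        print(name, type(content).__name__, len(content))
```

Reading the trajectory file line by line keeps one malformed record from hiding the rest during inspection, since each line is an independent JSON value.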

Sample

All 10 tasks overview

Task	World	Description	Expected Output
`terrace_01`	Project Terrace	Inventory risk analysis by region	SE inventory $46.5M, 16 problem stores, worst FND-187
`terrace_02`	Project Terrace	Benchmark Floor & Decor against peers	FND growth 3.7%, LL Flooring -15.7%, Tile Shop GM 64.9%
`terrace_03`	Project Terrace	Store productivity and profitability by region	NE $58.28/sqft, MW $54.02/sqft, company avg $56.27/sqft
`terrace_04`	Project Terrace	Rank stores by net installation margin	Best FND-109 at 12.0%, worst FND-180 at 5.0%, avg 8.7%
`terrace_05`	Project Terrace	Compute NPS from raw customer survey data	DIY NPS -43.1, Design-Assisted -43.8, Pro -47.5
`roku_01`	Project Roku	FAST channel lineup economics	Top: Entertainment $84.0M, News $50.4M
`roku_02`	Project Roku	CTV platform market concentration (HHI)	HHI 1500, CR3 56%
`roku_03`	Project Roku	CTV viewer and advertiser NPS	Viewer -39.2, Advertiser -54.0, Overall -42.9
`roku_04`	Project Roku	Quarterly roll-up of Roku Channel performance	Q2 2023 Ad Rev $545.7M, margin 68.1%
`roku_05`	Project Roku	Cross-analysis of advertising categories	Top efficiency: Automotive (5.04), CPG/FMCG (4.80)
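Several of these tasks rest on standard formulas. The sketch below shows the usual definitions of Net Promoter Score (percentage of promoters minus percentage of detractors), the Herfindahl-Hirschman Index (sum of squared percentage shares), and the top-k concentration ratio. It assumes the surveys use the conventional 0-10 likelihood-to-recommend scale and that market shares are expressed in percent; neither is confirmed by this page. The sample inputs are hypothetical and are not taken from the dataset.

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("scores must not be empty")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def hhi(shares_pct):
    """Herfindahl-Hirschman Index on the 0-10,000 scale (shares in percent)."""
    return sum(share * share for share in shares_pct)


def concentration_ratio(shares_pct, k=3):
    """Combined share of the k largest players (CR3 when k=3)."""
    return sum(sorted(shares_pct, reverse=True)[:k])


# Hypothetical inputs for illustration only.
sample_scores = [10, 9, 8, 7, 6, 5, 3, 0]
sample_shares = [30, 20, 10, 10, 10, 10, 10]

print(nps(sample_scores))                     # -25.0
print(hhi(sample_shares))                     # 1800
print(concentration_ratio(sample_shares, 3))  # 60
```

A negative NPS, as in the expected outputs above, simply means detractors outnumber promoters in the sample.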

Full task: Installed sales economics ranking (terrace_04)

Prompt:

Which stores have the best and worst economics on installed sales? Rank all stores by net installation margin, then give me these key highlights: the #1 store and its margin, the lowest-margin store and its margin, the portfolio average net installation margin, and the gap in percentage points between the best and worst store. Also report the installed revenue for FND-194 and FND-144 specifically, and the callback rate for FND-307. Round percentages to one decimal place and revenue to the nearest dollar. Reply here.

Expected output:

Top store: FND-109 with net installation margin of 12.0%
Lowest store: FND-180 with net installation margin of 5.0%
Portfolio avg net installation margin: 8.7%
Gap between best and worst: 7.0pp
FND-194 installed revenue: $5,223,275
FND-144 installed revenue: $4,933,729
FND-307 callback rate: 7.4%

Rubric criteria:

States that FND-109 is the top-ranked store with net installation margin of 12.0%
States that FND-180 is among the lowest net installation margin stores at 5.0%
States that portfolio average net installation margin is 8.7%
States that the gap between best and worst store is 7.0pp
States that FND-194 has installed revenue of $5,223,275
States that FND-307 has a callback rate of 7.4%
States that FND-144 has installed revenue of $4,933,729

Full agent trajectory (terrace_04 — 18 messages, 10 tool calls)

Each assistant message includes reasoning_content (chain-of-thought) showing the agent’s step-by-step analytical reasoning. In each step below, the reasoning excerpt appears first, followed by the assistant’s visible message (if any) and its tool calls.

System:

You are an AI assistant with access to Code Execution, Excel/Spreadsheets, Filesystem, and Word/Documents. You MUST use tools to explore files and data before answering. NEVER assume. Keep full precision through intermediate calculations and only round final reported values.

User:

Which stores have the best and worst economics on installed sales? Rank all stores by net installation margin, then give me these key highlights: the #1 store and its margin, the lowest-margin store and its margin, the portfolio average net installation margin, and the gap in percentage points between the best and worst store. Also report the installed revenue for FND-194 and FND-144 specifically, and the callback rate for FND-307. Round percentages to one decimal place and revenue to the nearest dollar. Reply here.

Assistant:

Step 1: Explore directory tree

The user is asking about installed sales economics across stores: specifically, ranking all stores by net installation margin, identifying best/worst, computing the portfolio average, finding the gap, plus specific revenue figures for FND-194 and FND-144, and the callback rate for FND-307. I need to locate the relevant data file(s) first…

I’ll start by exploring the filesystem to find relevant files, then thoroughly analyze the data.

filesystem_get_directory_tree({"path": "/", "max_depth": 3})

Tool result:

/
├── 1. Project Admin/
│   ├── Engagement Charter - Project Terrace.docx
│   ├── Meeting Notes - Week 1 Kickoff.docx
│   └── ...
├── 2. Company Filings/
│   ├── Floor & Decor/ (FND 10-K 2023–2025)
│   ├── Home Depot/ ...
│   └── ...
├── 3. Industry Research/ ...
├── 4. Analysis Models/
│   ├── FND_Installed_Sales_Economics.xlsx
│   ├── FND_Store_Performance_v3.xlsx
│   └── ... (30+ analysis files)
├── 5. Surveys & Research Notes/ ...
├── 6. Deliverables/ ...
└── 7. Reference & Frameworks/ ...

Step 2: List spreadsheet tabs

The directory tree reveals a file that’s an exact match for the user’s question: /4. Analysis Models/FND_Installed_Sales_Economics.xlsx. “Installed Sales Economics” maps directly to the request for net installation margins, callback rates, and installed revenue by store…

The key file is FND_Installed_Sales_Economics.xlsx. Let me explore it.

excel_list_tabs_in_spreadsheet({"file_path": "/4. Analysis Models/FND_Installed_Sales_Economics.xlsx"})

Tool result:

{"worksheets": [
  {"name": "Installed_Sales_Econ", "index": 0, "row_count": 251, "column_count": 11}
]}

Step 3: Read the full spreadsheet

Single tab called ‘Installed_Sales_Econ’ with 251 rows and 11 columns. That’s 1 header row + 250 data rows, a manageable size to read in full. With 11 columns, the data likely has all the metrics I need in one consolidated table…

excel_read_tab({"file_path": "/4. Analysis Models/FND_Installed_Sales_Economics.xlsx", "tab_index": 0})

Tool result:

Store_ID  Installed_Revenue  Product_Margin_%  Installation_Markup_%
Subcontractor_Cost  Net_Installation_Margin_%  Callback_Rate_%
Callback_Cost_per_Job  Active_Subcontractors  Avg_Project_Size  Projects_per_Month

FND-101   3,283,341   0.442   0.273   1,608,686   0.097   0.119   403   9    5,932   74
FND-102   3,001,568   0.439   0.348   1,395,241   0.058   0.110   263   23   2,525   58
FND-103   2,195,965   0.403   0.329     949,084   0.088   0.083   161   21   7,996   26
... (250 stores total — FND-101 through FND-350)
FND-350   2,165,799   0.462   0.393     878,750   0.116   0.041   303   17   7,852   67

Step 4: Execute Python computation

Now I have the full dataset: 250 stores (FND-101 through FND-350) with 11 columns. The columns relevant to the user’s question are: column B (Installed_Revenue), column F (Net_Installation_Margin_%), and column G (Callback_Rate_%). I need to rank all 250 stores by column F…

Now I have all the data. Let me run a Python script to compute the rankings and extract the specific values requested.

code_execution_code_exec({"request": {"code": "# Rank stores by net installation margin, extract highlights ..."}})

Tool result:

Total stores: 250

#1 (Best) Store: FND-109 — Net Installation Margin: 12.0%
Worst Store: FND-293 — Net Installation Margin: 5.0%
Portfolio Average Net Installation Margin: 8.7%
Gap (Best - Worst): 7.0 percentage points
FND-144 Installed Revenue: $4,933,729
FND-194 Installed Revenue: $5,223,275
FND-307 Callback Rate: 7.4%

--- Checking for ties ---
Stores tied at top (0.12): ['FND-109', 'FND-144', 'FND-153']
Stores tied at bottom (0.05): ['FND-180', 'FND-293']

--- Full Ranking (Top 10 and Bottom 10) ---
  Rank 1: FND-109 — 12.0%
  Rank 2: FND-144 — 12.0%
  Rank 3: FND-153 — 12.0%
  ...
  Rank 249: FND-180 — 5.0%
  Rank 250: FND-293 — 5.0%
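The script passed to the code-execution tool is elided in this trajectory, so the following is only a sketch of how such a ranking could be computed, not the agent's actual code. It uses the ten rows that appear verbatim in the tool results on this page (a subset of the 250-store sheet), and the tie-break by store ID is an assumption. Function names are hypothetical.

```python
# (Store_ID, Installed_Revenue, Net_Installation_Margin_%, Callback_Rate_%)
# Rows copied from the tool results shown in this trajectory; this is a
# 10-store subset, not the full 250-store sheet.
ROWS = [
    ("FND-101", 3_283_341, 0.097, 0.119),
    ("FND-102", 3_001_568, 0.058, 0.110),
    ("FND-103", 2_195_965, 0.088, 0.083),
    ("FND-109", 1_930_484, 0.12, 0.073),
    ("FND-144", 4_933_729, 0.12, 0.049),
    ("FND-153", 1_510_841, 0.12, 0.120),
    ("FND-180", 3_081_065, 0.05, 0.060),
    ("FND-293", 2_238_899, 0.05, 0.052),
    ("FND-307", 2_781_939, 0.051, 0.074),
    ("FND-350", 2_165_799, 0.116, 0.041),
]


def rank_by_margin(rows):
    """Sort by margin, highest first; ties fall back to store ID (assumed rule)."""
    return sorted(rows, key=lambda r: (-r[2], r[0]))


def summarize(rows):
    """Report tied leaders, tied laggards, and the best-to-worst gap in pp."""
    ranked = rank_by_margin(rows)
    best, worst = ranked[0][2], ranked[-1][2]
    return {
        "top_tied": [r[0] for r in ranked if r[2] == best],
        "bottom_tied": [r[0] for r in ranked if r[2] == worst],
        "gap_pp": round((best - worst) * 100, 1),
    }


summary = summarize(ROWS)
by_id = {r[0]: r for r in ROWS}
print(summary)
print(f"FND-144 installed revenue: ${by_id['FND-144'][1]:,}")
print(f"FND-307 callback rate: {by_id['FND-307'][3]:.1%}")
```

Reporting every store that shares the best or worst margin, rather than only the first and last rows of the sort, surfaces ties such as the ones in the tool result above. The portfolio average is left out because it is only meaningful over all 250 stores.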

Step 5: Verify tied stores and callback rate (6 parallel reads)

The code found a three-way tie at the top: FND-109, FND-144, and FND-153 all at 0.12 (12.0%), and a two-way tie at the bottom: FND-180 and FND-293 both at 0.05 (5.0%). But these margin values were stored as rounded decimals. Let me verify the exact stored values to confirm tie-breaking…

There’s a three-way tie at the top (12.0%) and a two-way tie at the bottom (5.0%). Let me verify the exact stored values to confirm the tie-breaking and also double-check the callback rate for FND-307.

excel_read_tab({"file_path": "...", "cell_range": "A10:K10"})    // FND-109
excel_read_tab({"file_path": "...", "cell_range": "A45:K45"})    // FND-144
excel_read_tab({"file_path": "...", "cell_range": "A54:K54"})    // FND-153
excel_read_tab({"file_path": "...", "cell_range": "A81:K81"})    // FND-180
excel_read_tab({"file_path": "...", "cell_range": "A194:K194"})  // FND-293
excel_read_tab({"file_path": "...", "cell_range": "A208:K208"})  // FND-307

Tool results:

FND-109   1,930,484   0.399   0.265   907,674   0.12   0.073   428   16   7,071   31
FND-144   4,933,729   0.452   0.332   2,111,506   0.12   0.049   263   24   6,241   35
FND-153   1,510,841   0.440   0.398   539,545   0.12   0.120   472   21   3,865   54
FND-180   3,081,065   0.390   0.373   1,266,627   0.05   0.060   375   8   6,243   77
FND-293   2,238,899   0.470   0.367   784,513   0.05   0.052   417   12   3,170   53
FND-307   2,781,939   0.473   0.326   1,277,247   0.051   0.074   264   9   6,298   70
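The six cell ranges requested in this step follow from the sheet layout reported earlier: one header row, then FND-101 through FND-350 in consecutive rows, across 11 columns (A to K). A minimal sketch of that mapping, assuming the rows really are contiguous and ordered by store number (the helper name `cell_range_for_store` is hypothetical):

```python
HEADER_ROWS = 1            # row 1 holds the column names
FIRST_STORE_NUMBER = 101   # FND-101 sits in the first data row
LAST_STORE_NUMBER = 350


def cell_range_for_store(store_id, first_col="A", last_col="K"):
    """Return the single-row range for a store, e.g. 'A10:K10' for FND-109."""
    prefix, _, digits = store_id.partition("-")
    if prefix != "FND" or not digits.isdigit():
        raise ValueError(f"unexpected store ID: {store_id!r}")
    number = int(digits)
    if not FIRST_STORE_NUMBER <= number <= LAST_STORE_NUMBER:
        raise ValueError(f"store number out of range: {number}")
    row = number - FIRST_STORE_NUMBER + HEADER_ROWS + 1
    return f"{first_col}{row}:{last_col}{row}"


for store in ("FND-109", "FND-144", "FND-153", "FND-180", "FND-293", "FND-307"):
    print(store, cell_range_for_store(store))
```

The printed ranges match the ones in the six tool calls above (A10:K10 through A208:K208).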

Step 6: Final answer

All six verification reads confirm the stored values exactly match the computation. The three top-margin stores all show F = 0.12 with no hidden precision differences: FND-109, FND-144, and FND-153 are genuinely tied. The two bottom stores both show F = 0.05, so FND-180 and FND-293 are also a true tie.

Installed Sales Economics — Store Ranking Highlights

The 7.0 pp spread from best to worst signals meaningful variation in subcontractor cost management and installation markup discipline across the portfolio, suggesting room for margin improvement at the bottom of the distribution. FND-307 ranks near the bottom of the margin table (5.1%) and carries a 7.4% callback rate, which is roughly in line with the portfolio but paired with weak margin economics.

Metric	Value
#1 Store (Best Margin)	FND-109 — 12.0% (tied with FND-144 and FND-153)
Lowest-Margin Store	FND-180 — 5.0% (tied with FND-293)
Portfolio Avg Net Installation Margin	8.7%
Gap (Best vs. Worst)	7.0 percentage points

Store	Metric	Value
FND-194	Installed Revenue	$5,223,275
FND-144	Installed Revenue	$4,933,729
FND-307	Callback Rate	7.4%

Grading result (terrace_04 — 7/7 criteria passed)

{
  "task_id": "terrace_04",
  "rubric_score": 1.0,
  "strict_pass": true,
  "scoring_values": {
    "passed_count": 7,
    "failed_count": 0,
    "total_count": 7,
    "grade_score_percentage": 100.0
  },
  "steps": 6,
  "tool_calls": 10,
  "grading_time_seconds": 18.2
}

Each rubric criterion is independently verified with a detailed rationale:

{
  "verifier_id": "ver_terrace_04_01",
  "score": 1.0,
  "grade": "pass",
  "rationale": "## Evidence\n\n... detailed analysis ...\n\n## Assessment\n\n... MEETS criterion ..."
}
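One plausible way the per-criterion verifier records roll up into the summary object is sketched below. It assumes that `rubric_score` is the fraction of criteria passed and that `strict_pass` requires every criterion to pass; this page does not define either field, although both readings are consistent with the 7/7 result shown. The function name is hypothetical, and verifier IDs other than `ver_terrace_04_01` are made up for the example.

```python
def summarize_grades(verifier_results):
    """Aggregate per-criterion verifier records into a summary dict."""
    total = len(verifier_results)
    passed = sum(1 for v in verifier_results if v["grade"] == "pass")
    fraction = passed / total if total else 0.0
    return {
        "rubric_score": fraction,                      # assumed: share passed
        "strict_pass": total > 0 and passed == total,  # assumed: all must pass
        "scoring_values": {
            "passed_count": passed,
            "failed_count": total - passed,
            "total_count": total,
            "grade_score_percentage": 100.0 * fraction,
        },
    }


# Seven passing records; only the first ID appears on this page.
example_results = [
    {"verifier_id": f"ver_terrace_04_{i:02d}", "score": 1.0, "grade": "pass"}
    for i in range(1, 8)
]
example_summary = summarize_grades(example_results)
print(example_summary)
```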

Download

# Download APEX Agent data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "apex_agent/*"
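For scripted use, roughly the same download can be sketched in Python with `snapshot_download` from the `huggingface_hub` package. This assumes the package is installed and network access is available; the wrapper name and the `local_dir` default are hypothetical choices, not part of the dataset's documentation.

```python
REPO_ID = "jindidi/eigendata-demo-data"
INCLUDE_PATTERN = "apex_agent/*"


def download_apex_agent(local_dir="eigendata-demo-data"):
    """Fetch only the APEX Agent files from the dataset repo; return the local path."""
    # Imported lazily so the module still loads where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=[INCLUDE_PATTERN],
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(download_apex_agent())
```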

Browse on Hugging Face

View APEX Agent files
