Inspired by the APEX benchmark, the APEX Agent dataset was synthesized entirely from scratch by EigenData-CLI — every environment, task, and grading rubric is original. The full corpus spans three professional domains: investment banking, law, and management consulting. The scale, difficulty, and benchmark figures on this page cover the investment banking (IB) and management consulting (MC) portion of the corpus — a graded universe of 2,759 tasks.
Want to try it first? A free 10-task sample is available on the Demo Samples page.

What APEX Agent is

APEX Agent is a corpus of long-horizon, finance-domain, tool-using tasks. Each task is a single user prompt describing a deliverable (a number, a memo, or a filled-in spreadsheet), paired with a virtual workspace mounted with the relevant files (PDFs, spreadsheets, documents). The agent works the task using tools for filesystem access, PDF reading, spreadsheets, document editing, and code execution, then returns a final answer that is graded against a hand-written rubric. The tasks covered on this page fall into two streams:
  • IB (Investment Banking) — questions about company filings, 10-Ks, equity research notes, and deal memos. PDF-heavy.
  • MC (Management Consulting) — questions about financial models in Excel: navigating multi-tab workbooks, applying formulas, and producing analysis. Spreadsheet-heavy.

At a glance

| Property | Value |
| --- | --- |
| Domains | Investment Banking, Law, Management Consulting |
| Graded task universe (IB + MC) | 2,759 tasks (1,437 IB, 1,322 MC) |
| Task format | Single prompt + virtual workspace + grading rubric |
| Workspace file types | PDF, XLSX, CSV, DOCX, PPTX |
| Agent tools | Filesystem, PDF reading, spreadsheets, documents, code execution |
| Grading | Rubric of 3–6 criteria per task; reward = criteria passed ÷ total |
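
The grading rule in the table (reward = criteria passed ÷ total) can be sketched in a few lines. This is an illustration only: the function name `task_reward` and the list-of-booleans representation of a graded rubric are assumptions, not the dataset's actual schema.

```python
def task_reward(criteria_passed: list[bool]) -> float:
    """Return the fraction of rubric criteria that passed.

    `criteria_passed` is a hypothetical representation: one boolean per
    rubric criterion (the page states each task has 3-6 criteria).
    """
    if not criteria_passed:
        raise ValueError("a rubric needs at least one criterion")
    return sum(criteria_passed) / len(criteria_passed)


# A task with four criteria, three of which passed.
print(task_reward([True, True, False, True]))
```

Under this rule a task can earn partial credit, so a reward of 1.0 is reached only when every criterion passes.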

What’s inside

| Component | Description |
| --- | --- |
| Environments | A self-contained virtual workspace per task: domain files organized into folders (e.g. one management-consulting environment holds 104 files across 7 categories). |
| Tasks & rubrics | Each task ships a prompt, a gold response, and 3–6 rubric criteria the answer must satisfy. |
| Trajectories | Agent rollouts with chain-of-thought reasoning and tool calls. |
| Tool schemas | Definitions for the filesystem, spreadsheet, document, and code-execution tools available to the agent. |

Difficulty profile

Every task in the IB + MC universe is sorted into one of three difficulty tiers, based on whether a strong open-weight baseline agent can solve it — where “solve” means a perfect score, every rubric criterion passed:
| Tier | Definition |
| --- | --- |
| Easy | The baseline agent solves it unaided. |
| Medium | The baseline fails unaided, but succeeds when given short protocol guidance. |
| Hard | Neither mode solves it. |
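
The tier definitions above amount to a two-flag decision rule. A minimal sketch, assuming each task has been run once unaided and once with protocol guidance, and that "solved" means a perfect rubric score; the function and parameter names are hypothetical.

```python
def difficulty_tier(solved_unaided: bool, solved_with_guidance: bool) -> str:
    """Map the baseline agent's two outcomes to a difficulty tier.

    Both flags are hypothetical inputs; each is True only when the run
    passed every rubric criterion.
    """
    if solved_unaided:
        return "Easy"
    if solved_with_guidance:
        return "Medium"
    return "Hard"


print(difficulty_tier(solved_unaided=False, solved_with_guidance=True))
```

The unaided result is checked first, so a task the baseline solves on its own is Easy regardless of how the guided run went.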
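
Task counts by tier and stream:

| Tier | IB | MC | Total | Share |
| --- | --- | --- | --- | --- |
| Easy | 898 | 718 | 1,616 | 58.6% |
| Medium | 243 | 236 | 479 | 17.4% |
| Hard | 296 | 368 | 664 | 24.1% |
| Total | 1,437 | 1,322 | 2,759 | 100% |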
The mix gives the corpus a broad difficulty gradient: a large solvable core plus a substantial hard tier (24%) that defeats the baseline agent in both modes and that current frontier models pass only about 30% of the time on a sampled subset.
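
The totals and shares in the tier table follow directly from the per-stream counts. The short check below recomputes them from the figures on this page; the variable names are illustrative only.

```python
# (IB, MC) task counts per tier, copied from the table above.
counts = {"Easy": (898, 718), "Medium": (243, 236), "Hard": (296, 368)}

ib_total = sum(ib for ib, _ in counts.values())
mc_total = sum(mc for _, mc in counts.values())
total = ib_total + mc_total

# Share of the graded universe per tier, rounded to one decimal place.
shares = {
    tier: round(100 * (ib + mc) / total, 1)
    for tier, (ib, mc) in counts.items()
}

print(ib_total, mc_total, total)
print(shares)
```

Because each share is rounded independently, the three rounded percentages add up to slightly more than 100.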

How challenging is the data

As a reference point, frontier closed-source models were evaluated on sampled subsets of the corpus.

Balanced pilot (100 tasks: 50 IB + 50 MC):

| Model | All | IB | MC | Avg reward |
| --- | --- | --- | --- | --- |
| opus-4.7 | 69.0% | 82.0% | 56.0% | 0.890 |
| gpt-5.5 | 62.0% | 68.0% | 56.0% | 0.825 |

Hardest-tier subset (200 tasks):

| Model | Pass rate |
| --- | --- |
| opus-4.7 | 30.0% |
| gpt-5.5 | 28.0% |
On the hardest tasks, frontier models land around 30%. These tasks demand multi-step domain reasoning, file-format-compliant outputs, and multi-row trajectory handling that current models do not reliably deliver out of the box — which makes the corpus a strong training and evaluation signal.
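
The balanced-pilot table reports both percentage columns and an average reward, and the two measure different things. The sketch below assumes the percentage columns count only perfect-score tasks (the same "solve" definition used for the difficulty tiers) while the average reward includes partial credit. The reward values are made up for illustration; only the 50/50 split and the IB/MC percentages come from the table.

```python
def pass_rate(rewards: list[float]) -> float:
    """Fraction of tasks with a perfect reward (every criterion passed)."""
    return sum(r == 1.0 for r in rewards) / len(rewards)


def average_reward(rewards: list[float]) -> float:
    """Mean reward, counting partial credit."""
    return sum(rewards) / len(rewards)


def combined_rate(ib_rate: float, mc_rate: float,
                  n_ib: int = 50, n_mc: int = 50) -> float:
    """Pool two per-stream rates, weighting by the number of tasks in each."""
    return (ib_rate * n_ib + mc_rate * n_mc) / (n_ib + n_mc)


# Hypothetical per-task rewards: three perfect scores, two partial.
rewards = [1.0, 1.0, 0.75, 0.5, 1.0]
print(pass_rate(rewards), average_reward(rewards))

# With equal stream sizes, the All column is the mean of IB and MC.
print(combined_rate(82.0, 56.0), combined_rate(68.0, 56.0))
```

Partial credit is why an average reward can sit well above the corresponding pass rate: a task that passes most but not all criteria raises the average without counting as a pass.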

Trajectory length

APEX Agent tasks are genuinely long-horizon. The table below summarizes, for baseline agent rollouts, the assistant turns (steps) and tool calls per rollout — shown as mean / median / p90, broken out by domain and difficulty tier:
| Tier | IB steps | MC steps | IB tool calls | MC tool calls |
| --- | --- | --- | --- | --- |
| Easy | 22.1 / 19 / 45 | 17.2 / 15 / 30 | 25.6 / 22 / 50 | 22.3 / 20 / 37 |
| Medium | 17.0 / 15 / 30 | 16.9 / 16 / 28 | 20.3 / 20 / 32 | 21.7 / 20 / 34 |
| Hard | 17.1 / 16 / 30 | 19.1 / 18 / 30 | 19.6 / 18 / 32 | 24.4 / 24 / 38 |
  • IB easy tasks have the longest trajectories (mean 22 steps, p90 45) — reading PDFs and 10-Ks involves many page reads.
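  • MC trajectories are more uniform across tiers (mean 17–19 steps). They are shorter than IB only on easy tasks, where spreadsheet navigation is more direct than paging through PDFs; on hard tasks MC runs slightly longer (mean 19.1 vs 17.1 steps).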
  • Tool calls exceed step counts throughout, since a single assistant turn can issue several tool calls in parallel.
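
Each cell in the table above is a mean / median / p90 triple over per-rollout counts. A minimal sketch of how such a summary could be computed with Python's standard library; the sample step counts are invented, and the percentile method (`inclusive` interpolation) is an assumption, since the page does not say which definition was used.

```python
import statistics


def summarize(values: list[int]) -> tuple[float, float, float]:
    """Return (mean, median, p90) for a list of per-rollout counts."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    # quantiles(n=10) returns the 9 decile cut points; the last is p90.
    p90 = statistics.quantiles(values, n=10, method="inclusive")[-1]
    return mean, median, p90


# Hypothetical step counts for ten rollouts (not real dataset values).
steps = [10, 12, 15, 15, 18, 20, 22, 25, 30, 45]
mean, median, p90 = summarize(steps)
print(f"{mean:.1f} / {median:g} / {p90:g}")
```

A long right tail, like the single 45-step rollout in the sample, pulls the mean above the median, which matches the pattern in most cells of the table.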

Tool usage

The two streams exercise different tools, reflecting their different source material:
  • IB (PDF-heavy) — rollouts are dominated by pdfs_read_pdf_pages, filesystem_search_files, and pdfs_search_pdf, with code_execution_code_exec used for numerical work. Code execution rises in prominence on harder tasks, which demand more quantitative analysis.
  • MC (spreadsheet-heavy) — rollouts are dominated by excel_read_tab, excel_list_tabs_in_spreadsheet, and filesystem_search_files, navigating multi-tab financial models. The document tool word_read_document_content also appears on memo-writing tasks.
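
A per-tool breakdown like the one described above can be derived by counting tool names across a rollout. In the sketch below the rollout structure (a list of assistant turns, each holding the names of the tools called in that turn) is a hypothetical simplification, not the dataset's trajectory format; the tool names are the ones listed on this page.

```python
from collections import Counter

# Hypothetical IB-style rollout: one inner list per assistant turn.
rollout = [
    ["filesystem_search_files"],
    ["pdfs_search_pdf", "pdfs_read_pdf_pages"],     # two calls in one turn
    ["pdfs_read_pdf_pages", "pdfs_read_pdf_pages"],
    ["code_execution_code_exec"],
    [],                                             # final answer, no tool call
]


def tool_usage(turns: list[list[str]]) -> Counter:
    """Count how often each tool is called across all turns."""
    return Counter(name for turn in turns for name in turn)


usage = tool_usage(rollout)
steps = len(rollout)               # assistant turns
tool_calls = sum(usage.values())   # total calls across all turns

print(usage.most_common())
print(steps, tool_calls)
```

Since one turn can carry several calls, the tool-call total can exceed the step count even when the final turn makes no call at all.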

Access & licensing

The full APEX Agent corpus — all environments, tasks, rubrics, and trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 10-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.