Inspired by the APEX benchmark, the APEX Agent dataset was synthesized entirely from scratch by EigenData-CLI — every environment, task, and grading rubric is original. The full corpus spans three professional domains: investment banking, law, and management consulting.
The scale, difficulty, and benchmark figures on this page cover the investment banking (IB) and management consulting (MC) portion of the corpus — a graded universe of 2,759 tasks.
Want to try it first? A free 10-task sample is available on the Demo Samples page.
What APEX Agent is
APEX Agent is a corpus of long-horizon, finance-domain, tool-using tasks. Each task is a single user prompt describing a deliverable — a number, a memo, a filled-in spreadsheet — paired with a virtual workspace mounted with the relevant files (PDFs, spreadsheets, documents). The agent works the task using tools for filesystem, PDF reading, spreadsheets, document editing, and code execution, then returns a final answer that is graded against a hand-written rubric.
Tasks fall into two streams:
- IB (Investment Banking) — questions about company filings, 10-Ks, equity research notes, and deal memos. PDF-heavy.
- MC (Management Consulting) — questions about financial models in Excel: navigating multi-tab workbooks, applying formulas, and producing analysis. Spreadsheet-heavy.
At a glance
| Property | Value |
|---|---|
| Domains | Investment Banking, Law, Management Consulting |
| Graded task universe (IB + MC) | 2,759 tasks — 1,437 IB, 1,322 MC |
| Task format | Single prompt + virtual workspace + grading rubric |
| Workspace file types | PDF, XLSX, CSV, DOCX, PPTX |
| Agent tools | Filesystem, PDF reading, spreadsheets, documents, code execution |
| Grading | Rubric of 3–6 criteria per task; reward = criteria passed ÷ total |
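The grading rule in the last row can be sketched in a few lines. This is a minimal illustration of "criteria passed ÷ total", not the actual grader; the function name and the boolean-list input are assumptions made for the example.

```python
from typing import Sequence


def rubric_reward(criteria_passed: Sequence[bool]) -> float:
    """Return the fraction of rubric criteria passed, in [0.0, 1.0].

    `criteria_passed` is a hypothetical representation: one boolean per
    rubric criterion, True when the final answer satisfies it.
    """
    if not 3 <= len(criteria_passed) <= 6:
        raise ValueError("expected 3-6 rubric criteria per task")
    return sum(criteria_passed) / len(criteria_passed)


# Hypothetical outcome for a task with four criteria, three of them passed.
print(rubric_reward([True, True, False, True]))  # 0.75
```

A reward of 1.0 therefore means every criterion passed; anything lower is partial credit.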
What’s inside
| Component | Description |
|---|---|
| Environments | A self-contained virtual workspace per task — domain files organized into folders (e.g. one management-consulting environment holds 104 files across 7 categories). |
| Tasks & rubrics | Each task ships a prompt, a gold response, and 3–6 rubric criteria the answer must satisfy. |
| Trajectories | Agent rollouts with chain-of-thought reasoning and tool calls. |
| Tool schemas | Definitions for the filesystem, spreadsheet, document, and code-execution tools available to the agent. |
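For orientation, a task record could be modeled roughly as follows. This is only a sketch: the class name, field names, and sample values are hypothetical and are not the dataset's published schema. The one constraint taken from this page is the 3–6 rubric criteria per task.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # All field names are hypothetical, chosen for illustration only.
    prompt: str                       # the single user prompt
    gold_response: str                # reference answer shipped with the task
    rubric: list[str]                 # 3-6 criteria the answer must satisfy
    workspace_files: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 3 <= len(self.rubric) <= 6:
            raise ValueError(
                f"rubric has {len(self.rubric)} criteria; expected 3-6"
            )


# Made-up example values, not taken from the corpus.
task = Task(
    prompt="Report the net debt figure from the attached filing.",
    gold_response="(reference answer)",
    rubric=[
        "States a net debt figure",
        "Figure matches the filing",
        "Units are stated",
    ],
    workspace_files=["filings/annual_report.pdf"],
)
print(len(task.rubric))  # 3
```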
Difficulty profile
Every task in the IB + MC universe is sorted into one of three difficulty tiers, based on whether a strong open-weight baseline agent can solve it — where “solve” means a perfect score, every rubric criterion passed:
| Tier | Definition |
|---|---|
| Easy | The baseline agent solves it unaided. |
| Medium | The baseline fails unaided, but succeeds when given short protocol guidance. |
| Hard | Neither mode solves it. |

| Tier | IB | MC | Total | Share |
|---|---|---|---|---|
| Easy | 898 | 718 | 1,616 | 58.6% |
| Medium | 243 | 236 | 479 | 17.4% |
| Hard | 296 | 368 | 664 | 24.1% |
| Total | 1,437 | 1,322 | 2,759 | 100% |
The mix gives the corpus a broad difficulty gradient: a large solvable core plus a substantial hard tier (24%) that the baseline agent cannot solve in either mode and that frontier models pass only about 30% of the time.
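The tier definitions and the share column can be reproduced with a short sketch. The function and argument names are assumptions for illustration; the counts are copied from the table above.

```python
def difficulty_tier(solved_unaided: bool, solved_with_guidance: bool) -> str:
    """Map the baseline agent's two outcomes to a tier.

    "Solved" means a perfect score: every rubric criterion passed.
    """
    if solved_unaided:
        return "Easy"
    if solved_with_guidance:
        return "Medium"
    return "Hard"


# (IB, MC) task counts per tier, copied from the table above.
counts = {"Easy": (898, 718), "Medium": (243, 236), "Hard": (296, 368)}
total = sum(ib + mc for ib, mc in counts.values())

for tier, (ib, mc) in counts.items():
    share = 100 * (ib + mc) / total
    print(f"{tier}: {ib + mc} tasks, {share:.1f}%")
# Easy: 1616 tasks, 58.6%
# Medium: 479 tasks, 17.4%
# Hard: 664 tasks, 24.1%
```

The rounded shares sum to 100.1% because each is rounded to one decimal place independently.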
How challenging is the data
As a reference point, frontier closed-source models were evaluated on sampled subsets of the corpus.
Balanced pilot — 100 tasks (50 IB + 50 MC):
| Model | All | IB | MC | Avg reward |
|---|---|---|---|---|
| opus-4.7 | 69.0% | 82.0% | 56.0% | 0.890 |
| gpt-5.5 | 62.0% | 68.0% | 56.0% | 0.825 |
Hardest-tier subset — 200 tasks:
| Model | Pass rate |
|---|---|
| opus-4.7 | 30.0% |
| gpt-5.5 | 28.0% |
On the hardest tasks, frontier models pass only 28–30% of the time. These tasks demand multi-step domain reasoning, file-format-compliant outputs, and multi-row trajectory handling that current models do not reliably deliver out of the box, which makes the corpus a strong training and evaluation signal.
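The gap between the percentage columns and the average reward is easier to see in code. The sketch below assumes that the percentages are pass rates in the "perfect score" sense used for the difficulty tiers, and that average reward is the mean of per-task rubric fractions. The per-task rewards are made up; the pass counts are derived from the pilot table (82% and 56% of 50 tasks).

```python
def pass_rate(rewards: list[float]) -> float:
    """Share of tasks with a perfect score (assumed meaning of the % columns)."""
    return sum(1 for r in rewards if r == 1.0) / len(rewards)


def average_reward(rewards: list[float]) -> float:
    """Mean per-task reward, counting partial credit."""
    return sum(rewards) / len(rewards)


# Hypothetical per-task rewards: three perfect scores, two partial.
rewards = [1.0, 1.0, 0.75, 0.5, 1.0]
print(pass_rate(rewards))       # 0.6
print(average_reward(rewards))  # 0.85

# In the balanced pilot, 82% of 50 IB tasks is 41 passes and
# 56% of 50 MC tasks is 28 passes, giving the overall figure.
ib_passed, mc_passed = 41, 28
print((ib_passed + mc_passed) / 100)  # 0.69
```

Partial credit is why a model can pass 69% of tasks yet average a reward near 0.89.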
Trajectory length
APEX Agent tasks are genuinely long-horizon. The table below summarizes, for baseline agent rollouts, the assistant turns (steps) and tool calls per rollout — shown as mean / median / p90, broken out by domain and difficulty tier:
| Tier | IB steps | MC steps | IB tool calls | MC tool calls |
|---|---|---|---|---|
| Easy | 22.1 / 19 / 45 | 17.2 / 15 / 30 | 25.6 / 22 / 50 | 22.3 / 20 / 37 |
| Medium | 17.0 / 15 / 30 | 16.9 / 16 / 28 | 20.3 / 20 / 32 | 21.7 / 20 / 34 |
| Hard | 17.1 / 16 / 30 | 19.1 / 18 / 30 | 19.6 / 18 / 32 | 24.4 / 24 / 38 |
- IB easy tasks have the longest trajectories (mean 22 steps, p90 45) — reading PDFs and 10-Ks involves many page reads.
- MC trajectories are more uniform across tiers (mean 17–19 steps, p90 28–30) and shorter than IB on easy tasks, since spreadsheet navigation is more direct.
- Tool calls exceed step counts throughout, since a single assistant turn can issue several tool calls in parallel.
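Each "mean / median / p90" triple in the table could be computed along these lines. This is a sketch under assumptions: the page does not say which percentile method was used, so nearest-rank is chosen here for simplicity, and the step counts are invented for the example.

```python
import math
import statistics


def summarize(values: list[int]) -> tuple[float, float, int]:
    """Return (mean, median, p90) for per-rollout counts.

    p90 uses the nearest-rank method, which is an assumption: the smallest
    value with at least 90% of the observations at or below it.
    """
    ordered = sorted(values)
    rank = math.ceil(len(ordered) * 9 / 10)
    return statistics.mean(ordered), statistics.median(ordered), ordered[rank - 1]


# Hypothetical step counts for ten rollouts.
steps = [12, 15, 15, 18, 19, 20, 22, 25, 30, 45]
mean, median, p90 = summarize(steps)
print(f"{mean:.1f} / {median} / {p90}")  # 22.1 / 19.5 / 30
```

Interpolating methods such as `statistics.quantiles` would give slightly different p90 values on small samples.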
The two streams exercise different tools, reflecting their different source material:
- IB (PDF-heavy): rollouts are dominated by `pdfs_read_pdf_pages`, `filesystem_search_files`, and `pdfs_search_pdf`, with `code_execution_code_exec` used for numerical work. Code execution rises in prominence on harder tasks, which demand more quantitative analysis.
- MC (spreadsheet-heavy): rollouts are dominated by `excel_read_tab`, `excel_list_tabs_in_spreadsheet`, and `filesystem_search_files`, navigating multi-tab financial models. The document tool `word_read_document_content` also appears on memo-writing tasks.
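A tool-usage profile like the one described above can be tallied from a trajectory with a counter. The trajectory layout below (a list of turns, each with a `tool_calls` list of tool names) is a hypothetical shape chosen for the example, not the dataset's actual trajectory format; only the tool names come from this page.

```python
from collections import Counter

# Hypothetical IB-style rollout: one dict per assistant turn.
trajectory = [
    {"tool_calls": ["filesystem_search_files"]},
    {"tool_calls": ["pdfs_search_pdf", "pdfs_read_pdf_pages"]},
    {"tool_calls": ["pdfs_read_pdf_pages", "pdfs_read_pdf_pages"]},
    {"tool_calls": ["code_execution_code_exec"]},
    {"tool_calls": []},  # final-answer turn with no tool call
]


def tool_usage(turns: list[dict]) -> Counter:
    """Count tool calls by name across all assistant turns."""
    return Counter(name for turn in turns for name in turn["tool_calls"])


usage = tool_usage(trajectory)
steps = len(trajectory)
calls = sum(usage.values())

print(usage.most_common(1))  # [('pdfs_read_pdf_pages', 3)]
print(steps, calls)          # 5 6
```

Two of the five turns issue calls in parallel, which is how tool calls come to outnumber steps.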
Access & licensing
The full APEX Agent corpus — all environments, tasks, rubrics, and trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com.
A free 10-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.