Skip to main content
Inspired by the APEX benchmark, the APEX Agent dataset was synthesized entirely from scratch by EigenData-CLI — every environment, task, and grading rubric is original. The full corpus spans three professional domains: investment banking (IB), management consulting (MC), and law, with ~1,000 training-ready samples per domain. Each sample bundles a task, a successful agent trajectory, a reward-verification function, and the workspace environment — usable for either supervised fine-tuning or reinforcement learning. The difficulty, trajectory, and tool-usage sections below characterize the IB+MC portion in detail. The Law slice is summarized in the Training utility section.
Want to try it first? A free 10-task sample is available on the Demo Samples page.

What APEX Agent is

APEX Agent is a corpus of long-horizon, tool-using tasks in finance and law. Each task is a single user prompt describing a deliverable — a number, a memo, a filled-in spreadsheet, a legal analysis — paired with a virtual workspace mounted with the relevant files (PDFs, spreadsheets, documents). The agent works the task using tools for filesystem, PDF reading, spreadsheets, document editing, and code execution, then returns a final answer that is graded against a hand-written rubric. Tasks fall into three streams:
  • IB (Investment Banking) — questions about company filings, 10-Ks, equity research notes, and deal memos. PDF-heavy.
  • MC (Management Consulting) — questions about financial models in Excel: navigating multi-tab workbooks, applying formulas, and producing analysis. Spreadsheet-heavy.
  • Law — long-document analysis: locating relevant clauses, citing them correctly, and chaining textual facts into an argument. Text-retrieval-heavy.

At a glance

PropertyValue
DomainsInvestment Banking, Management Consulting, Law
Samples~1,000 per domain (≈3,000 total) — task + trajectory + reward verifier + environment, usable for SFT or RL. The IB+MC hard tier additionally provides RL-only samples (env + verifier, no gold trajectory).
Task formatSingle prompt + virtual workspace + grading rubric
Workspace file typesPDF, XLSX, CSV, DOCX, PPTX
Agent toolsFilesystem, PDF reading, spreadsheets, documents, code execution
GradingRubric of 3–6 criteria per task; reward = criteria passed ÷ total

What’s inside

ComponentDescription
SamplesThe unit of release. Each sample bundles a task, the workspace environment, and a reward-verification function. ~1,000 samples per domain additionally include a successful agent trajectory (SFT-ready, also RL-trainable). The hard-tier samples lack a gold trajectory but remain RL-trainable via the environment + verifier.
EnvironmentsA self-contained virtual workspace per task — domain files organized into folders (e.g. one management-consulting environment holds 104 files across 7 categories).
Tasks & rubricsEach task ships a prompt, a gold response, and 3–6 rubric criteria the answer must satisfy.
TrajectoriesAgent rollouts with chain-of-thought reasoning and tool calls — the shortest pass-rate-1.0 rollout per task is shipped as the trajectory in the corresponding sample.
Reward verifiersExecutable verification functions that score rollouts against the task rubric — usable as a reward signal for RL.
Tool schemasDefinitions for the filesystem, spreadsheet, document, and code-execution tools available to the agent.

Difficulty profile

To characterize the dataset’s difficulty distribution, a 2,759-task sub-sample of IB+MC tasks was difficulty-classified by running a strong open-weight baseline agent. Each task is assigned to one of three tiers — where “solve” means a perfect score, every rubric criterion passed:
TierDefinition
EasyThe baseline agent solves it unaided.
MediumThe baseline fails unaided, but succeeds when given short protocol guidance.
HardNeither mode solves it.
TierIBMCTotalShare
Easy8987181,61658.6%
Medium24323647917.4%
Hard29636866424.1%
Total1,4371,3222,759100%
The sub-sample shows the IB+MC corpus has a broad difficulty gradient: a large solvable core plus a substantial hard tier (~24%) that defeats both the baseline and current frontier models. Within the dataset’s samples, easy- and medium-tier samples ship with successful trajectories (SFT-ready, also RL-trainable); hard-tier samples ship without trajectories but remain RL-trainable via the environment and reward verifiers.

How challenging is the data

As a reference point, frontier closed-source models were evaluated on sampled subsets of the IB+MC corpus. Balanced pilot — 100 tasks (50 IB + 50 MC):
ModelAllIBMCAvg reward
opus-4.769.0%82.0%56.0%0.890
gpt-5.562.0%68.0%56.0%0.825
Hardest-tier subset — 200 tasks:
ModelPass rate
opus-4.730.0%
gpt-5.528.0%
On the hardest tasks, frontier models land around 30%. These tasks demand multi-step domain reasoning, file-format-compliant outputs, and multi-row trajectory handling that current models do not reliably deliver out of the box — which makes the corpus a strong training and evaluation signal.

Trajectory length

APEX Agent tasks are genuinely long-horizon. The table below summarizes, for IB+MC baseline rollouts, the assistant turns (steps) and tool calls per rollout — shown as mean / median / p90, broken out by domain and difficulty tier:
TierIB stepsMC stepsIB tool callsMC tool calls
Easy22.1 / 19 / 4517.2 / 15 / 3025.6 / 22 / 5022.3 / 20 / 37
Medium17.0 / 15 / 3016.9 / 16 / 2820.3 / 20 / 3221.7 / 20 / 34
Hard17.1 / 16 / 3019.1 / 18 / 3019.6 / 18 / 3224.4 / 24 / 38
  • IB easy tasks have the longest trajectories (mean 22 steps, p90 45) — reading PDFs and 10-Ks involves many page reads.
  • MC trajectories are shorter and more uniform (mean 17–19 steps) — spreadsheet navigation is more direct.
  • Tool calls exceed step counts throughout, since a single assistant turn can issue several tool calls in parallel.

Tool usage

The IB and MC streams exercise different tools, reflecting their different source material:
  • IB (PDF-heavy) — rollouts are dominated by pdfs_read_pdf_pages, filesystem_search_files, and pdfs_search_pdf, with code_execution_code_exec used for numerical work. Code execution rises in prominence on harder tasks, which demand more quantitative analysis.
  • MC (spreadsheet-heavy) — rollouts are dominated by excel_read_tab, excel_list_tabs_in_spreadsheet, and filesystem_search_files, navigating multi-tab financial models. The document tool word_read_document_content also appears on memo-writing tasks.

Training utility

Supervised fine-tuning (SFT) a smaller open-weight model on successful APEX Agent trajectories yields substantial lifts on held-out tasks from the public Mercor APEX-Agents benchmark — a third-party eval set distinct from the training corpus. The IB+MC and Law slices were trained separately because they exercise different agent capabilities — IB+MC emphasizes data, formula, and numerical computation, while Law emphasizes information retrieval and text-based reasoning over long documents. Training data. The dataset provides ~1,000 samples per domain (task + trajectory + reward verifier + environment) for SFT or RL; the hard tier additionally supports RL via its environments and verifiers. The proof-of-concept SFT run below used a 500-per-domain subset drawn from the easy and medium difficulty tiers — half the available SFT-compatible samples per slice. Base model: Qwen3.6-27B. Evaluation sampling: pass@1, max-steps=30, temperature=1.0. Metrics. Strict pass = fraction of rollouts where every rubric criterion was passed (reward = 1.0). Mean reward = average fraction of rubric criteria passed per rollout (captures partial credit). IB+MC results (320 held-out tasks: 160 IB + 160 MC):
ModelIB strict passIB meanMC strict passMC meanOverall strict passOverall mean
Qwen3.6-27B base9.4% (15/160)0.1274.4% (7/160)0.1326.9% (22/320)0.130
+ SFT on IB+MC12.5% (20/160)0.1688.8% (14/160)0.23410.6% (34/320)0.201
Strict pass improves +3.7 pp overall (≈54% relative). MC sees the largest lift — strict pass doubles (4.4% → 8.8%) and mean reward rises +77% relative, reflecting that MC tasks (Excel-tab navigation, multi-row trajectories) benefit most from the cohort-trained protocol. Law results (160 held-out tasks):
ModelStrict passMean reward
Qwen3.6-27B base6.9% (11/160)0.207
+ SFT on Law12.5% (20/160)0.314
Strict pass nearly doubles (6.9% → 12.5%, +5.6 pp, ≈82% relative), and mean reward rises +0.107. The baseline already produced partial credit on many Law tasks, and SFT concentrates on pushing partial-credit rollouts into fully-correct ones. Caveats. Baseline rollouts were graded by claude-haiku-4.5; SFT rollouts by the stricter claude-sonnet-4-5 (≈2–3 pp lower strict pass on shared spot-checks), so the SFT lifts above are a lower bound. All results are pass@1; the headline deltas of 3.7–5.6 pp are robust to sampling noise.

Access & licensing

The full APEX Agent corpus — all environments, tasks, rubrics, and trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 10-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.