Full Dataset - Documentation

Inspired by the APEX benchmark, the APEX Agent dataset was synthesized entirely from scratch by EigenData-CLI — every environment, task, and grading rubric is original. The full corpus spans three professional domains: investment banking (IB), management consulting (MC), and law, with ~1,000 training-ready samples per domain (task + successful agent trajectory + reward-verification function + environment), usable for either supervised fine-tuning or reinforcement learning. On top of these, each domain (law, IB, and MC) adds ~1,000 hard-tier tasks released RL-only — environment + reward-verification function, no gold trajectory — for reinforcement learning. The difficulty, trajectory, and tool-usage sections below characterize the IB+MC portion in detail. The Law slice is summarized in the Training utility section.

Want to try it first? A free 10-task sample is available on the Demo Samples page.

What APEX Agent is

APEX Agent is a corpus of long-horizon, tool-using tasks in investment banking, management consulting, and law. Each task is a single user prompt describing a deliverable (a number, a memo, a filled-in spreadsheet, a legal analysis) paired with a virtual workspace mounted with the relevant files (PDFs, spreadsheets, documents, and correspondence). The agent works the task using tools for filesystem access, PDF reading, spreadsheets, document editing, code execution, and correspondence (email, chat, calendar), then returns a final answer that is graded against a hand-written rubric. Tasks fall into three streams:

IB (Investment Banking) — questions about company filings, 10-Ks, equity research notes, and deal memos. PDF-heavy.
MC (Management Consulting) — questions about financial models in Excel: navigating multi-tab workbooks, applying formulas, and producing analysis. Spreadsheet-heavy.
Law — long-document analysis: locating relevant clauses, citing them correctly, and chaining textual facts into an argument. Text-retrieval-heavy.

At a glance

Property	Value
Domains	Investment Banking, Management Consulting, Law
Samples	~1,000 per domain (≈3,000 total) — task + trajectory + reward verifier + environment, usable for SFT or RL. Plus ~1,000 hard-tier RL-only tasks per domain (law, IB, MC) — env + reward-verification function, no gold trajectory.
Task format	Single prompt + virtual workspace + grading rubric
Workspace file types	PDF, XLSX, CSV, DOCX, PPTX; plus email (mbox) threads and chat channels
Agent tools	Filesystem, PDF reading, spreadsheets, documents, code execution, email, chat, calendar
Grading	Rubric of 3–6 criteria per task; `reward = criteria passed ÷ total`
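
The grading rule in the last row can be sketched in a few lines. This is a minimal illustration rather than the shipped verifier: the function name `rubric_reward` and the list-of-booleans input are assumptions made for the example.

```python
from typing import Sequence


def rubric_reward(criteria_passed: Sequence[bool]) -> float:
    """Fraction of rubric criteria passed (reward = criteria passed / total).

    `criteria_passed` is a hypothetical representation: one boolean per
    rubric criterion, True if the final answer satisfied it.
    """
    if not criteria_passed:
        raise ValueError("a rubric needs at least one criterion")
    return sum(criteria_passed) / len(criteria_passed)


# A made-up 4-criterion rubric with 3 criteria satisfied.
print(rubric_reward([True, True, False, True]))  # 0.75
```

A reward of 1.0 therefore means every criterion passed, which is the condition the difficulty tiers and the strict-pass metric rely on.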

What’s inside

Component	Description
Samples	The unit of release. Each sample bundles a task, the workspace environment, and a reward-verification function. ~1,000 samples per domain additionally include a successful agent trajectory (SFT-ready, also RL-trainable). The hard-tier samples lack a gold trajectory but remain RL-trainable via the environment + verifier.
Environments	A self-contained virtual workspace per task — domain files organized into folders (e.g. one management-consulting environment holds 104 files across 7 categories).
Tasks & rubrics	Each task ships a prompt, a gold response, and 3–6 rubric criteria the answer must satisfy.
Trajectories	Agent rollouts with chain-of-thought reasoning and tool calls — the shortest pass-rate-1.0 rollout per task is shipped as the trajectory in the corresponding sample.
Reward verifiers	Executable verification functions that score rollouts against the task rubric — usable as a reward signal for RL.
Tool schemas	Definitions for the filesystem, spreadsheet, document, code-execution, and correspondence (email, chat, calendar) tools available to the agent.
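
The trajectory-selection rule in the Trajectories row (ship the shortest rollout that passes every criterion) can be sketched as follows. The `Rollout` fields are hypothetical and the released trajectory format may differ; "pass-rate-1.0" is read here as a reward of exactly 1.0, and length is measured in assistant turns, both of which are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Rollout:
    # Hypothetical fields chosen for illustration only.
    task_id: str
    num_steps: int   # assistant turns in the rollout
    reward: float    # fraction of rubric criteria passed


def pick_trajectory(rollouts: Sequence[Rollout]) -> Optional[Rollout]:
    """Return the shortest perfect-score rollout, or None if no rollout is perfect."""
    perfect = [r for r in rollouts if r.reward == 1.0]
    if not perfect:
        return None  # e.g. a hard-tier task, which ships without a trajectory
    return min(perfect, key=lambda r: r.num_steps)


# Made-up rollouts for a single task.
candidates = [
    Rollout("task-001", num_steps=24, reward=1.0),
    Rollout("task-001", num_steps=17, reward=1.0),
    Rollout("task-001", num_steps=12, reward=0.75),
]
chosen = pick_trajectory(candidates)
print(chosen.num_steps)  # 17
```

The 12-step rollout is skipped even though it is shortest, because it did not pass every criterion.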

Difficulty profile

To characterize the dataset’s difficulty distribution, IB+MC tasks are difficulty-classified by running a strong open-weight baseline agent. Each task is assigned to one of three tiers — where “solve” means a perfect score, every rubric criterion passed:

Tier	Definition
Easy	The baseline agent solves it unaided.
Medium	The baseline fails unaided, but succeeds when given short protocol guidance.
Hard	Neither mode solves it.
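
The tier definitions above amount to a two-step check. The sketch below assumes each task has one reward from the unaided run and one from the guided run; the function and parameter names are illustrative, not part of the dataset tooling.

```python
def assign_tier(unaided_reward: float, guided_reward: float) -> str:
    """Map baseline results to a difficulty tier.

    "Solve" means a perfect score, i.e. a reward of exactly 1.0.
    """
    if unaided_reward == 1.0:
        return "Easy"    # solved without help
    if guided_reward == 1.0:
        return "Medium"  # solved only with short protocol guidance
    return "Hard"        # neither mode reached a perfect score


print(assign_tier(1.0, 1.0))    # Easy
print(assign_tier(0.75, 1.0))   # Medium
print(assign_tier(0.75, 0.8))   # Hard
```

Partial credit never promotes a task: a rollout passing most criteria still counts as unsolved.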

Tier	IB	MC	Total	Share
Easy	898	718	1,616	40.1%
Medium	243	236	479	11.9%
Hard	931	999	1,930	48.0%
Total	2,072	1,953	4,025	100%
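
The Total and Share columns follow directly from the per-domain counts. A short check, using only the numbers in the table (the function name is illustrative):

```python
from typing import Dict, Tuple

# (IB, MC) task counts per tier, copied from the table above.
TIER_COUNTS: Dict[str, Tuple[int, int]] = {
    "Easy": (898, 718),
    "Medium": (243, 236),
    "Hard": (931, 999),
}


def tier_shares(counts: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Percentage share of each tier in the combined IB+MC corpus, to one decimal."""
    totals = {tier: ib + mc for tier, (ib, mc) in counts.items()}
    grand_total = sum(totals.values())
    return {tier: round(100 * n / grand_total, 1) for tier, n in totals.items()}


print(tier_shares(TIER_COUNTS))
# {'Easy': 40.1, 'Medium': 11.9, 'Hard': 48.0}
```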

Across the IB+MC corpus the difficulty gradient is broad: a large solvable core plus a substantial hard tier (~48%) that the baseline cannot solve in either mode and on which current frontier models pass only about 30% of a sampled hardest-tier subset. Easy- and medium-tier samples ship with successful trajectories (SFT-ready, also RL-trainable); hard-tier samples ship without trajectories but remain RL-trainable via the environment and reward verifiers.

How challenging is the data

As a reference point, frontier closed-source models were evaluated on sampled subsets of the IB+MC corpus.

Balanced pilot, 100 tasks (50 IB + 50 MC):

Model	All pass rate	IB pass rate	MC pass rate	Avg reward
opus-4.7	69.0%	82.0%	56.0%	0.890
gpt-5.5	62.0%	68.0%	56.0%	0.825

Hardest-tier subset — 200 tasks:

Model	Pass rate
opus-4.7	30.0%
gpt-5.5	28.0%

On the hardest tasks, frontier models land around 30%. These tasks demand multi-step domain reasoning, file-format-compliant outputs, and multi-row trajectory handling that current models do not reliably deliver out of the box — which makes the corpus a strong training and evaluation signal.

Trajectory length

APEX Agent tasks are genuinely long-horizon. The table below summarizes, for IB+MC baseline rollouts, the assistant turns (steps) and tool calls per rollout — shown as mean / median / p90, broken out by domain and difficulty tier:

Tier	IB steps	MC steps	IB tool calls	MC tool calls
Easy	22.1 / 19 / 45	17.2 / 15 / 30	25.6 / 22 / 50	22.3 / 20 / 37
Medium	17.0 / 15 / 30	16.9 / 16 / 28	20.3 / 20 / 32	21.7 / 20 / 34
Hard	17.1 / 16 / 30	19.1 / 18 / 30	19.6 / 18 / 32	24.4 / 24 / 38

IB easy tasks have the longest trajectories (mean 22 steps, p90 45) — reading PDFs and 10-Ks involves many page reads.
MC trajectories are more uniform across tiers (mean 17–19 steps) and shorter than IB on easy tasks, since spreadsheet navigation is more direct.
Tool calls exceed step counts throughout, since a single assistant turn can issue several tool calls in parallel.
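
For readers who want to reproduce the mean / median / p90 summaries on their own rollouts, a minimal sketch follows. The step counts are made up for illustration, and the nearest-rank definition of p90 is an assumption; the source does not say which percentile convention was used.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(values: Sequence[int]) -> Tuple[float, float, int]:
    """Return (mean, median, p90) for per-rollout step or tool-call counts.

    p90 uses the nearest-rank method: the smallest value such that at
    least 90% of observations are less than or equal to it.
    """
    ordered = sorted(values)
    rank = math.ceil(90 * len(ordered) / 100)  # 1-based nearest rank
    return statistics.mean(ordered), statistics.median(ordered), ordered[rank - 1]


# Made-up step counts for ten rollouts.
steps = [10, 12, 15, 15, 18, 20, 22, 25, 30, 45]
print(summarize(steps))  # (21.2, 19.0, 30)
```

The same function applies to tool-call counts, which are tallied per rollout rather than per turn.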

Tool usage

The IB and MC streams exercise different tools, reflecting their different source material:

IB (PDF-heavy) — rollouts are dominated by pdfs_read_pdf_pages, filesystem_search_files, and pdfs_search_pdf, with code_execution_code_exec used for numerical work. Code execution rises in prominence on harder tasks, which demand more quantitative analysis.
MC (spreadsheet-heavy) — rollouts are dominated by excel_read_tab, excel_list_tabs_in_spreadsheet, and filesystem_search_files, navigating multi-tab financial models. The document tool word_read_document_content also appears on memo-writing tasks.
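
A tool-usage profile like the one described above can be tallied with a simple counter. The sketch assumes each rollout is available as a list of tool names in call order, which is a hypothetical representation; the tool names themselves are the ones listed in this section, and the two rollouts are made up.

```python
from collections import Counter
from typing import Iterable, Sequence


def tool_histogram(rollouts: Iterable[Sequence[str]]) -> Counter:
    """Count how often each tool is called across a set of rollouts."""
    counts: Counter = Counter()
    for tool_calls in rollouts:
        counts.update(tool_calls)
    return counts


# Two made-up rollouts: one IB-style, one MC-style.
example_rollouts = [
    ["filesystem_search_files", "pdfs_search_pdf", "pdfs_read_pdf_pages",
     "pdfs_read_pdf_pages", "code_execution_code_exec"],
    ["filesystem_search_files", "excel_list_tabs_in_spreadsheet",
     "excel_read_tab", "excel_read_tab", "excel_read_tab"],
]
histogram = tool_histogram(example_rollouts)
print(histogram["excel_read_tab"])           # 3
print(histogram["filesystem_search_files"])  # 2
```

Running the tally separately per domain and per tier would expose shifts such as the rise of code execution on harder IB tasks.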

Training utility

Supervised fine-tuning (SFT) of a smaller open-weight model on successful APEX Agent trajectories yields substantial lifts on held-out tasks from the public Mercor APEX-Agents benchmark, a third-party eval set distinct from the training corpus. The IB+MC and Law slices were trained separately because they exercise different agent capabilities: IB+MC emphasizes data, formula, and numerical computation, while Law emphasizes information retrieval and text-based reasoning over long documents.

Training data. The dataset provides ~1,000 samples per domain (task + trajectory + reward verifier + environment) for SFT or RL; the hard tier additionally supports RL via its environments and verifiers. The proof-of-concept SFT run below used a 500-per-domain subset drawn from the easy and medium difficulty tiers, half the available SFT-compatible samples per slice.

Base model: Qwen3.6-27B. Evaluation sampling: pass@1, max-steps=30, temperature=1.0.

Metrics. Strict pass = fraction of rollouts where every rubric criterion was passed (reward = 1.0). Mean reward = average fraction of rubric criteria passed per rollout (captures partial credit).

IB+MC results (320 held-out tasks: 160 IB + 160 MC):

Model	IB strict pass	IB mean	MC strict pass	MC mean	Overall strict pass	Overall mean
Qwen3.6-27B base	9.4% (15/160)	0.127	4.4% (7/160)	0.132	6.9% (22/320)	0.130
+ SFT on IB+MC	12.5% (20/160)	0.168	8.8% (14/160)	0.234	10.6% (34/320)	0.201
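
The two metrics reported in the table can be computed from per-rollout rewards as sketched below. The function names are illustrative and the reward list is made up; only the metric definitions come from this section.

```python
from typing import Sequence


def strict_pass(rewards: Sequence[float]) -> float:
    """Fraction of rollouts in which every rubric criterion passed (reward == 1.0)."""
    return sum(1 for r in rewards if r == 1.0) / len(rewards)


def mean_reward(rewards: Sequence[float]) -> float:
    """Average fraction of rubric criteria passed, which captures partial credit."""
    return sum(rewards) / len(rewards)


# Made-up rewards for four rollouts: two perfect, one partial, one failed.
rewards = [1.0, 0.5, 0.0, 1.0]
print(strict_pass(rewards))  # 0.5
print(mean_reward(rewards))  # 0.625
```

Mean reward can move without strict pass changing, which is why both columns are reported.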

Strict pass improves +3.7 pp overall (≈54% relative). Within this slice, MC sees the larger lift: strict pass doubles (4.4% → 8.8%) and mean reward rises +77% relative, reflecting that MC tasks (Excel-tab navigation, multi-row trajectories) benefit most from the cohort-trained protocol.

Law results (160 held-out tasks):

Model	Strict pass	Mean reward
Qwen3.6-27B base	6.9% (11/160)	0.207
+ SFT on Law	12.5% (20/160)	0.314

Strict pass nearly doubles (6.9% → 12.5%, +5.6 pp, ≈82% relative), and mean reward rises +0.107. The baseline already produced partial credit on many Law tasks, and SFT concentrates on pushing partial-credit rollouts into fully-correct ones.

Caveats. Baseline rollouts were graded by claude-haiku-4.5; SFT rollouts by the stricter claude-sonnet-4-5 (≈2–3 pp lower strict pass on shared spot-checks), so the SFT lifts above are a lower bound. All results are pass@1; the headline deltas of 3.7–5.6 pp are robust to sampling noise.
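
The percentage-point and relative lifts quoted in this section can be re-derived from the pass counts in the two results tables. The helper below is an illustrative sketch (its name is an assumption); working from raw counts gives 3.75 pp and about 54.5% for the overall IB+MC lift, in line with the quoted +3.7 pp and ≈54%, which appear to be computed from the rounded percentages.

```python
from typing import Tuple


def strict_pass_lift(base_passes: int, sft_passes: int, n_tasks: int) -> Tuple[float, float]:
    """Return (absolute lift in percentage points, relative lift in percent)."""
    gained = sft_passes - base_passes
    return 100 * gained / n_tasks, 100 * gained / base_passes


# Pass counts copied from the results tables in this section.
for label, base, sft, n in [
    ("IB+MC overall", 22, 34, 320),
    ("MC", 7, 14, 160),
    ("Law", 11, 20, 160),
]:
    pp, rel = strict_pass_lift(base, sft, n)
    print(f"{label}: +{pp:.2f} pp, +{rel:.1f}% relative")
# IB+MC overall: +3.75 pp, +54.5% relative
# MC: +4.38 pp, +100.0% relative
# Law: +5.62 pp, +81.8% relative
```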

Access & licensing

The full APEX Agent corpus — all environments, tasks, rubrics, and trajectories — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 10-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.
