Want to try it first? A free 10-task sample is available on the Demo Samples page.
What APEX Agent is
APEX Agent is a corpus of long-horizon, tool-using tasks in finance and law. Each task is a single user prompt describing a deliverable — a number, a memo, a filled-in spreadsheet, a legal analysis — paired with a virtual workspace mounted with the relevant files (PDFs, spreadsheets, documents). The agent works the task using tools for filesystem, PDF reading, spreadsheets, document editing, and code execution, then returns a final answer that is graded against a hand-written rubric. Tasks fall into three streams:- IB (Investment Banking) — questions about company filings, 10-Ks, equity research notes, and deal memos. PDF-heavy.
- MC (Management Consulting) — questions about financial models in Excel: navigating multi-tab workbooks, applying formulas, and producing analysis. Spreadsheet-heavy.
- Law — long-document analysis: locating relevant clauses, citing them correctly, and chaining textual facts into an argument. Text-retrieval-heavy.
At a glance
| Property | Value |
|---|---|
| Domains | Investment Banking, Management Consulting, Law |
| Samples | ~1,000 per domain (≈3,000 total) — task + trajectory + reward verifier + environment, usable for SFT or RL. The IB+MC hard tier additionally provides RL-only samples (env + verifier, no gold trajectory). |
| Task format | Single prompt + virtual workspace + grading rubric |
| Workspace file types | PDF, XLSX, CSV, DOCX, PPTX |
| Agent tools | Filesystem, PDF reading, spreadsheets, documents, code execution |
| Grading | Rubric of 3–6 criteria per task; reward = criteria passed ÷ total |
What’s inside
| Component | Description |
|---|---|
| Samples | The unit of release. Each sample bundles a task, the workspace environment, and a reward-verification function. ~1,000 samples per domain additionally include a successful agent trajectory (SFT-ready, also RL-trainable). The hard-tier samples lack a gold trajectory but remain RL-trainable via the environment + verifier. |
| Environments | A self-contained virtual workspace per task — domain files organized into folders (e.g. one management-consulting environment holds 104 files across 7 categories). |
| Tasks & rubrics | Each task ships a prompt, a gold response, and 3–6 rubric criteria the answer must satisfy. |
| Trajectories | Agent rollouts with chain-of-thought reasoning and tool calls — the shortest pass-rate-1.0 rollout per task is shipped as the trajectory in the corresponding sample. |
| Reward verifiers | Executable verification functions that score rollouts against the task rubric — usable as a reward signal for RL. |
| Tool schemas | Definitions for the filesystem, spreadsheet, document, and code-execution tools available to the agent. |
Difficulty profile
To characterize the dataset’s difficulty distribution, a 2,759-task sub-sample of IB+MC tasks was difficulty-classified by running a strong open-weight baseline agent. Each task is assigned to one of three tiers — where “solve” means a perfect score, every rubric criterion passed:| Tier | Definition |
|---|---|
| Easy | The baseline agent solves it unaided. |
| Medium | The baseline fails unaided, but succeeds when given short protocol guidance. |
| Hard | Neither mode solves it. |
| Tier | IB | MC | Total | Share |
|---|---|---|---|---|
| Easy | 898 | 718 | 1,616 | 58.6% |
| Medium | 243 | 236 | 479 | 17.4% |
| Hard | 296 | 368 | 664 | 24.1% |
| Total | 1,437 | 1,322 | 2,759 | 100% |
How challenging is the data
As a reference point, frontier closed-source models were evaluated on sampled subsets of the IB+MC corpus. Balanced pilot — 100 tasks (50 IB + 50 MC):| Model | All | IB | MC | Avg reward |
|---|---|---|---|---|
| opus-4.7 | 69.0% | 82.0% | 56.0% | 0.890 |
| gpt-5.5 | 62.0% | 68.0% | 56.0% | 0.825 |
| Model | Pass rate |
|---|---|
| opus-4.7 | 30.0% |
| gpt-5.5 | 28.0% |
Trajectory length
APEX Agent tasks are genuinely long-horizon. The table below summarizes, for IB+MC baseline rollouts, the assistant turns (steps) and tool calls per rollout — shown as mean / median / p90, broken out by domain and difficulty tier:| Tier | IB steps | MC steps | IB tool calls | MC tool calls |
|---|---|---|---|---|
| Easy | 22.1 / 19 / 45 | 17.2 / 15 / 30 | 25.6 / 22 / 50 | 22.3 / 20 / 37 |
| Medium | 17.0 / 15 / 30 | 16.9 / 16 / 28 | 20.3 / 20 / 32 | 21.7 / 20 / 34 |
| Hard | 17.1 / 16 / 30 | 19.1 / 18 / 30 | 19.6 / 18 / 32 | 24.4 / 24 / 38 |
- IB easy tasks have the longest trajectories (mean 22 steps, p90 45) — reading PDFs and 10-Ks involves many page reads.
- MC trajectories are shorter and more uniform (mean 17–19 steps) — spreadsheet navigation is more direct.
- Tool calls exceed step counts throughout, since a single assistant turn can issue several tool calls in parallel.
Tool usage
The IB and MC streams exercise different tools, reflecting their different source material:- IB (PDF-heavy) — rollouts are dominated by
pdfs_read_pdf_pages,filesystem_search_files, andpdfs_search_pdf, withcode_execution_code_execused for numerical work. Code execution rises in prominence on harder tasks, which demand more quantitative analysis. - MC (spreadsheet-heavy) — rollouts are dominated by
excel_read_tab,excel_list_tabs_in_spreadsheet, andfilesystem_search_files, navigating multi-tab financial models. The document toolword_read_document_contentalso appears on memo-writing tasks.
Training utility
Supervised fine-tuning (SFT) a smaller open-weight model on successful APEX Agent trajectories yields substantial lifts on held-out tasks from the public Mercor APEX-Agents benchmark — a third-party eval set distinct from the training corpus. The IB+MC and Law slices were trained separately because they exercise different agent capabilities — IB+MC emphasizes data, formula, and numerical computation, while Law emphasizes information retrieval and text-based reasoning over long documents. Training data. The dataset provides ~1,000 samples per domain (task + trajectory + reward verifier + environment) for SFT or RL; the hard tier additionally supports RL via its environments and verifiers. The proof-of-concept SFT run below used a 500-per-domain subset drawn from the easy and medium difficulty tiers — half the available SFT-compatible samples per slice. Base model: Qwen3.6-27B. Evaluation sampling:pass@1, max-steps=30, temperature=1.0.
Metrics. Strict pass = fraction of rollouts where every rubric criterion was passed (reward = 1.0). Mean reward = average fraction of rubric criteria passed per rollout (captures partial credit).
IB+MC results (320 held-out tasks: 160 IB + 160 MC):
| Model | IB strict pass | IB mean | MC strict pass | MC mean | Overall strict pass | Overall mean |
|---|---|---|---|---|---|---|
| Qwen3.6-27B base | 9.4% (15/160) | 0.127 | 4.4% (7/160) | 0.132 | 6.9% (22/320) | 0.130 |
| + SFT on IB+MC | 12.5% (20/160) | 0.168 | 8.8% (14/160) | 0.234 | 10.6% (34/320) | 0.201 |
| Model | Strict pass | Mean reward |
|---|---|---|
| Qwen3.6-27B base | 6.9% (11/160) | 0.207 |
| + SFT on Law | 12.5% (20/160) | 0.314 |
claude-haiku-4.5; SFT rollouts by the stricter claude-sonnet-4-5 (≈2–3 pp lower strict pass on shared spot-checks), so the SFT lifts above are a lower bound. All results are pass@1; the headline deltas of 3.7–5.6 pp are robust to sampling noise.