Want to try it first? A free 10-task sample is available on the Demo Samples page.
What Tau3-Bench is
Each sample is a multi-turn banking conversation in which the agent must complete one or more high-stakes operations — filing a card dispute, activating a debit card, resetting a PIN, investigating an interest discrepancy — while satisfying a strict, knowledge-base-grounded policy. Three things make Tau3-Bench substantially harder than Tau2-Bench:- Discoverable tools. The operation the user wants is performed by a tool that is not in the base list. The agent must call
KB_searchto learn the tool exists,unlock_discoverable_agent_toolto make it available, andcall_discoverable_agent_toolto run it — with the correct arguments. - Mandatory identity verification. Before any account-specific action, the agent must look up the user, match identity fields against what the customer provides, and
log_verificationwith a timestamp fromget_current_time. - Policy grounded in a knowledge base. Rules like provisional-credit eligibility, the correct dispute category, or which activation flow matches a card’s
issue_reasonlive in the KB, not the prompt. The agent must retrieve and apply them, and obtain explicit confirmation before high-impact writes.
Task categories
Tau3-Bench is a single domain (banking), but spans a wide range of banking sub-tasks, often combined within one conversation:| Sub-task | What the agent must do |
|---|---|
| Card-transaction disputes | File a dispute with the correct category (e.g. goods/services not as described vs. fraud vs. billing-amount), determine provisional-credit eligibility from the KB, and confirm before filing |
| Debit-card activation | Pick the activation flow that matches the card’s issue_reason (new account vs. replacement vs. reissue), validate card details, normalize the expiration date, and set a PIN |
| PIN reset | Run the forgotten-PIN flow on an active card and set a new PIN |
| Account & interest inquiries | Investigate savings-interest discrepancies, confirm APY tiers, check reward points, and handle cases where the expected account does not actually exist |
| Profile & referral operations | Update contact details, look up referrals — all behind identity verification |
At a glance
| Property | Value |
|---|---|
| Domain | Banking (Rho-Bank), single domain |
| Samples | 750 — each with an intent, a full dialog, an executable evaluator, and a reference payload. Part of the 3,000-sample Tau-Bench family. |
| Task format | Multi-turn dialog with discoverable (dynamic) tools + KB-grounded policy |
| Environment | MCP-backed banking database (~24 tables) with full initial-state snapshots |
| Agent tools | 15 base tools (incl. the discoverable-tool gateway) + a runtime-unlockable tool pool |
| Grading | Per-task Python evaluator — DB-diff check + rubric check + policy check, all LLM-judged |
What’s inside
| Component | Description |
|---|---|
| Datapoints | 750 complete multi-turn dialogs — system policy, user turns, assistant messages with chain-of-thought reasoning, tool calls (including discover/unlock/call), and tool results. SFT-ready. |
| Intents | A structured spec per task — goal, customer profile, persona, motivations, and hard constraints (the rules the trajectory must respect). |
| Evaluators | An auto-generated Python grading script per task. It computes the database diff, applies a list of typed rubrics, and runs a policy check — returning a pass/fail per axis plus an overall verdict. |
| Reference payloads | The full environment snapshot per task — the initial database state and the reference diff a correct trajectory should produce — enabling deterministic, executable grading. |
| Tool schemas | The 15 base tools (KB search, identity lookups, verification, the discoverable-tool gateway) plus the pool of runtime-unlockable operations. |
Discoverable tools — the τ³ mechanic
The base tool set deliberately does not include the operations that change account state. Instead it includes a gateway:| Base tool | Role |
|---|---|
KB_search | Search the knowledge base for policy and for the name of the tool that performs an operation |
list_discoverable_agent_tools | List which hidden tools exist |
unlock_discoverable_agent_tool | Make a named hidden tool callable |
call_discoverable_agent_tool | Invoke an unlocked tool with arguments |
KB_search (find the policy + the tool name file_credit_card_transaction_dispute_4829) → unlock_discoverable_agent_tool → call_discoverable_agent_tool. The agent must discover the right tool (e.g. the new-account activation tool, not the replacement one) and supply correct arguments it has gathered from earlier lookups. This is why discovery and KB tools dominate the trajectories below.
Trajectory length
Banking dialogs are long and tool-dense. The table reports the datapoints as mean / median / p90.| Metric | Mean / Median / p90 |
|---|---|
| Messages per dialog (system + user + assistant + tool) | 36 / 35 / 52 |
| Agent tool calls | 14.2 / 13 / 22 |
| User turns | 5.3 / 5 / — |
| Characters per dialog | ~72,500 (the longest, most reasoning-heavy in the family) |
Tool usage
Banking trajectories are dominated by discovery and verification, not by the final write:KB_searchis the single most-used tool — the agent constantly grounds its policy decisions and finds tool names in the knowledge base.unlock_discoverable_agent_toolandcall_discoverable_agent_toolare next — every state change goes through the discover-then-call path.get_current_time+log_verificationappear in nearly every trajectory, anchoring the mandatory identity-verification step.- Identity lookups (
get_user_information_by_name/_by_id/_by_email) and account reads (get_credit_card_accounts_by_user,get_credit_card_transactions_by_user) gather the arguments the discovered tools need.
How challenging is the data
Banking is the hardest slice in the Tau-Bench family. A passing trajectory has to clear every gate at once: verify identity correctly, discover and unlock the right hidden tool, apply KB-grounded policy (the correct dispute category, the matching activation flow, provisional-credit eligibility), gather exact arguments from prior lookups, and obtain explicit confirmation before high-impact writes. Evaluated on the banking corpus, frontier models pass under 30% of tasks:| Model | Pass rate |
|---|---|
| opus-4.7 | < 30% |
| gpt-5.5 | < 30% |
Evaluation method
Each task’s auto-generated Python evaluator scores a trajectory on three independent axes, then combines them:| Check | What it verifies |
|---|---|
| DB check | The database mutations produced by the trajectory match the reference diff (e.g. the dispute record and the debit card flipping PENDING → ACTIVE) |
| Rubrics check | Typed rubrics pass — goal (the operation succeeded with correct parameters), process (verification done first), process_order (confirmation before filing; time fetched before logging), and anti_pattern (things that must not occur) |
| Policy check | The trajectory adheres to the domain policy throughout |
overall_pass requires all three checks to pass. Because grading is executable and snapshot-based, results are deterministic and reproducible — and usable directly as an RL reward signal.
Training utility
Supervised fine-tuning (SFT) a smaller open-weight model on Tau3-Bench banking trajectories yields strict-pass gains on the knowledge-grounded Banking domain, where success depends on retrieving and applying the right product, procedure, or policy from the large Rho-Bank knowledge base (~700 docs). Evaluation runs 97 tasks withnum_trials=1, max_steps=200, and a gpt-5.5 user simulator (temperature = 0), graded by combined database-state and per-action checks.
Training data. ~628 SFT samples (latest iteration) in multi-turn messages format, produced after expanding knowledge-base and policy coverage. The system prompt embeds the full Rho-Bank customer-service <policy> document, and each assistant turn carries an explicit reasoning field that works through policy compliance (e.g. requiring identity verification before acting on an account). Dialogues are genuinely multi-turn and long-horizon — typically ~11–19 assistant turns and ~9–21 tool calls — using banking tools such as KB_search, identity-verification, and account-lookup calls, which teaches the retrieve-then-act pattern the domain rewards. Base model: Qwen3.6-27B.
Metric. Strict pass@1 (reward = 1.0: database state and all action checks pass), temperature = 1.0.
Results. Strict pass@1 improves from 10/97 (baseline) → 11/97 (an early 189-task iteration) → 15/97 (latest) — +5 tasks, ≈+50% relative.
How it compares. On the public τ-Bench banking leaderboard (banking_knowledge, Pass^1), our ~27B model lands right among models many times its size:
| Model | Org | Pass^1 |
|---|---|---|
| Grok 4.2 | xAI | 17.6 |
| Gemini 3 Pro | 15.7 | |
| Tau3-Bench SFT — Qwen3.6-27B | EigenAI | 15.5 |
| Grok 4 fast | xAI | 14.2 |
| Gemini 2.5 Pro | 12.8 | |
| Grok 4.1 fast | xAI | 12.4 |