Full Dataset

Tau3-Bench is the hard, single-domain extension of the Tau-Bench family. It targets a realistic retail-banking agent (“Rho-Bank”) and adds the defining τ³ mechanic: dynamically discoverable tools. Instead of being handed every capability up front, the agent is given a small base tool set and must search a knowledge base, unlock the right hidden tool at runtime, and only then call it — the way real enterprise agents gate sensitive operations. Tau3-Bench contributes 750 samples to the 3,000-sample Tau-Bench family (the other 2,250 are the three-domain Tau2-Bench). Every sample is fully instrumented: a structured intent, a complete dialog, an executable Python evaluator, and a reference payload (full environment snapshot) for deterministic grading.

Want to try it first? A free 10-task sample is available on the Demo Samples page.

What Tau3-Bench is

Each sample is a multi-turn banking conversation in which the agent must complete one or more high-stakes operations — filing a card dispute, activating a debit card, resetting a PIN, investigating an interest discrepancy — while satisfying a strict, knowledge-base-grounded policy. Three things make Tau3-Bench substantially harder than Tau2-Bench:

Discoverable tools. The operation the user wants is performed by a tool that is not in the base list. The agent must call KB_search to learn the tool exists, unlock_discoverable_agent_tool to make it available, and call_discoverable_agent_tool to run it — with the correct arguments.
Mandatory identity verification. Before any account-specific action, the agent must look up the user, match identity fields against what the customer provides, and log_verification with a timestamp from get_current_time.
Policy grounded in a knowledge base. Rules like provisional-credit eligibility, the correct dispute category, or which activation flow matches a card’s issue_reason live in the KB, not the prompt. The agent must retrieve and apply them, and obtain explicit confirmation before high-impact writes.

Task categories

Tau3-Bench is a single domain (banking), but spans a wide range of banking sub-tasks, often combined within one conversation:

Sub-task	What the agent must do
Card-transaction disputes	File a dispute with the correct category (e.g. goods/services not as described vs. fraud vs. billing-amount), determine provisional-credit eligibility from the KB, and confirm before filing
Debit-card activation	Pick the activation flow that matches the card’s `issue_reason` (new account vs. replacement vs. reissue), validate card details, normalize the expiration date, and set a PIN
PIN reset	Run the forgotten-PIN flow on an active card and set a new PIN
Account & interest inquiries	Investigate savings-interest discrepancies, confirm APY tiers, check reward points, and handle cases where the expected account does not actually exist
Profile & referral operations	Update contact details, look up referrals — all behind identity verification

At a glance

Property	Value
Domain	Banking (Rho-Bank), single domain
Samples	750 — each with an intent, a full dialog, an executable evaluator, and a reference payload. Part of the 3,000-sample Tau-Bench family.
Task format	Multi-turn dialog with discoverable (dynamic) tools + KB-grounded policy
Environment	MCP-backed banking database (~24 tables) with full initial-state snapshots
Agent tools	15 base tools (incl. the discoverable-tool gateway) + a runtime-unlockable tool pool
Grading	Per-task Python evaluator — DB-diff check + rubric check + policy check, all LLM-judged

What’s inside

Component	Description
Datapoints	750 complete multi-turn dialogs — system policy, user turns, assistant messages with chain-of-thought reasoning, tool calls (including discover/unlock/call), and tool results. SFT-ready.
Intents	A structured spec per task — `goal`, customer `profile`, `persona`, `motivations`, and hard `constraints` (the rules the trajectory must respect).
Evaluators	An auto-generated Python grading script per task. It computes the database diff, applies a list of typed rubrics, and runs a policy check — returning a pass/fail per axis plus an overall verdict.
Reference payloads	The full environment snapshot per task — the initial database state and the reference diff a correct trajectory should produce — enabling deterministic, executable grading.
Tool schemas	The 15 base tools (KB search, identity lookups, verification, the discoverable-tool gateway) plus the pool of runtime-unlockable operations.

Discoverable tools — the τ³ mechanic

The base tool set deliberately does not include the operations that change account state. Instead it includes a gateway:

Base tool	Role
`KB_search`	Search the knowledge base for policy and for the name of the tool that performs an operation
`list_discoverable_agent_tools`	List which hidden tools exist
`unlock_discoverable_agent_tool`	Make a named hidden tool callable
`call_discoverable_agent_tool`	Invoke an unlocked tool with arguments

So filing a dispute looks like: KB_search (find the policy + the tool name file_credit_card_transaction_dispute_4829) → unlock_discoverable_agent_tool → call_discoverable_agent_tool. The agent must discover the right tool (e.g. the new-account activation tool, not the replacement one) and supply correct arguments it has gathered from earlier lookups. This is why discovery and KB tools dominate the trajectories below.

Trajectory length

Banking dialogs are long and tool-dense. The table reports the datapoints as mean / median / p90.

Metric	Mean / Median / p90
Messages per dialog (system + user + assistant + tool)	36 / 35 / 52
Agent tool calls	14.2 / 13 / 22
User turns	5.3 / 5 / —
Characters per dialog	~72,500 (the longest, most reasoning-heavy in the family)

With only ~5 user turns but ~14 tool calls per dialog, the agent does most of the work autonomously between user messages — searching the KB, unlocking tools, and chaining lookups — which is exactly where models go wrong.

Tool usage

Banking trajectories are dominated by discovery and verification, not by the final write:

KB_search is the single most-used tool — the agent constantly grounds its policy decisions and finds tool names in the knowledge base.
unlock_discoverable_agent_tool and call_discoverable_agent_tool are next — every state change goes through the discover-then-call path.
get_current_time + log_verification appear in nearly every trajectory, anchoring the mandatory identity-verification step.
Identity lookups (get_user_information_by_name / _by_id / _by_email) and account reads (get_credit_card_accounts_by_user, get_credit_card_transactions_by_user) gather the arguments the discovered tools need.

How challenging is the data

Banking is the hardest slice in the Tau-Bench family. A passing trajectory has to clear every gate at once: verify identity correctly, discover and unlock the right hidden tool, apply KB-grounded policy (the correct dispute category, the matching activation flow, provisional-credit eligibility), gather exact arguments from prior lookups, and obtain explicit confirmation before high-impact writes. Evaluated on the banking corpus, frontier models pass under 30% of tasks:

Model	Pass rate
opus-4.7	< 30%
gpt-5.5	< 30%

The dominant failure modes are exactly the τ³ additions: failing to discover/unlock the correct tool, choosing the wrong policy branch from the KB, skipping or mis-logging identity verification, and writing before confirmation. This is what makes the slice a strong training and evaluation signal.

Evaluation method

Each task’s auto-generated Python evaluator scores a trajectory on three independent axes, then combines them:

Check	What it verifies
DB check	The database mutations produced by the trajectory match the reference diff (e.g. the dispute record and the debit card flipping `PENDING → ACTIVE`)
Rubrics check	Typed rubrics pass — `goal` (the operation succeeded with correct parameters), `process` (verification done first), `process_order` (confirmation before filing; time fetched before logging), and `anti_pattern` (things that must not occur)
Policy check	The trajectory adheres to the domain policy throughout

overall_pass requires all three checks to pass. Because grading is executable and snapshot-based, results are deterministic and reproducible — and usable directly as an RL reward signal.

Training utility

Supervised fine-tuning (SFT) a smaller open-weight model on Tau3-Bench banking trajectories yields strict-pass gains on the knowledge-grounded Banking domain, where success depends on retrieving and applying the right product, procedure, or policy from the large Rho-Bank knowledge base (~700 docs). Evaluation runs 97 tasks with num_trials=1, max_steps=200, and a gpt-5.5 user simulator (temperature = 0), graded by combined database-state and per-action checks. Training data. ~628 SFT samples (latest iteration) in multi-turn messages format, produced after expanding knowledge-base and policy coverage. The system prompt embeds the full Rho-Bank customer-service <policy> document, and each assistant turn carries an explicit reasoning field that works through policy compliance (e.g. requiring identity verification before acting on an account). Dialogues are genuinely multi-turn and long-horizon — typically ~11–19 assistant turns and ~9–21 tool calls — using banking tools such as KB_search, identity-verification, and account-lookup calls, which teaches the retrieve-then-act pattern the domain rewards. Base model: Qwen3.6-27B. Metric. Strict pass@1 (reward = 1.0: database state and all action checks pass), temperature = 1.0. Results. Strict pass@1 improves from 10/97 (baseline) → 11/97 (an early 189-task iteration) → 15/97 (latest) — +5 tasks, ≈+50% relative. How it compares. On the public τ-Bench banking leaderboard (banking_knowledge, Pass^1), our ~27B model lands right among models many times its size:

Model	Org	Pass^1
Grok 4.2	xAI	17.6
Gemini 3 Pro	Google	15.7
Tau3-Bench SFT — Qwen3.6-27B	EigenAI	15.5
Grok 4 fast	xAI	14.2
Gemini 2.5 Pro	Google	12.8
Grok 4.1 fast	xAI	12.4

A ~27B model matches Gemini 3 Pro and xAI’s Grok-fast tier on this knowledge-grounded domain — and beats a 397B open model (Qwen3.5-397B, 9.8) by more than five points. The frontier still leads (GPT-5.5 37.4, GPT-5.4 30.7, Claude Opus 4.7 25.3), which is exactly the headroom further training targets. What drove it. The gains are on the knowledge-grounded service tasks — those where the agent must first locate the right product/procedure/policy in the knowledge base and then act on it. Two representative wins: (1) product recommendation — matching a customer’s profile (spend pattern, income, no-annual-fee preference) to the right Rho-Bank credit card by looking up product details in the knowledge base; and (2) procedure execution — walking a traveling customer through setting up a travel notification so their card isn’t blocked abroad, following the documented steps. Expanding the knowledge-base and policy coverage in the training data teaches the model to retrieve the correct policy and issue accurate, policy-compliant tool calls, so it satisfies both the database-state and the per-action checks more often.

Access & licensing

The full Tau3-Bench corpus — all banking environments, dialogs, intents, evaluators, and reference payloads — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 10-task sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.

Eigen AI

API Reference

Platform

Products

What Tau3-Bench is

Task categories

At a glance

What’s inside

Discoverable tools — the τ³ mechanic

Trajectory length

Tool usage

How challenging is the data

Evaluation method

Training utility

Access & licensing

​What Tau3-Bench is

​Task categories

​At a glance

​What’s inside

​Discoverable tools — the τ³ mechanic

​Trajectory length

​Tool usage

​How challenging is the data

​Evaluation method

​Training utility

​Access & licensing

What Tau3-Bench is

Task categories

At a glance

What’s inside

Discoverable tools — the τ³ mechanic

Trajectory length

Tool usage

How challenging is the data

Evaluation method

Training utility

Access & licensing