Overview
| Property | Value |
|---|---|
| Domain | Banking (Rho-Bank) |
| Tasks | 10 banking conversations with executable evaluators |
| Turn type | Multi-turn, with discoverable (dynamic) tools |
| Scenarios | Card disputes, debit-card activation, PIN reset, interest/APY inquiries, reward-point checks |
| Grading | Per-task Python evaluator — DB-diff + rubric + policy, all LLM-judged |
Environment
The banking environment is an MCP-backed database with ~24 tables: users, bank accounts, credit-card accounts, debit cards, transactions, disputes, verification history, and more. Each task ships a reference payload containing the full initial state plus the reference diff a correct trajectory should produce, so grading is deterministic.
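As a rough illustration of grading against a reference diff, the sketch below compares the database state before and after a trajectory and checks the changes against an expected diff. The state layout (table name to row ID to row), the row IDs, and the field values are assumptions made for this example, not the dataset's actual payload schema.

```python
from typing import Any

# Assumed layout for illustration only: table name -> {row_id: row_dict}.
State = dict[str, dict[str, dict[str, Any]]]


def diff_states(before: State, after: State) -> dict[str, dict[str, Any]]:
    """Return, per table, every row that changed (None marks a deleted row)."""
    changes: dict[str, dict[str, Any]] = {}
    for table in sorted(set(before) | set(after)):
        old_rows = before.get(table, {})
        new_rows = after.get(table, {})
        table_changes = {
            row_id: new_rows.get(row_id)
            for row_id in sorted(set(old_rows) | set(new_rows))
            if old_rows.get(row_id) != new_rows.get(row_id)
        }
        if table_changes:
            changes[table] = table_changes
    return changes


def matches_reference(before: State, after: State, reference_diff: dict) -> bool:
    """True when the trajectory produced exactly the expected changes."""
    return diff_states(before, after) == reference_diff


# Hypothetical rows: one pending debit card, no disputes yet.
initial: State = {
    "debit_cards": {"dc_1": {"status": "pending_activation"}},
    "transaction_disputes": {},
}
final: State = {
    "debit_cards": {"dc_1": {"status": "active"}},
    "transaction_disputes": {"d_1": {"category": "not_as_described"}},
}
reference = {
    "debit_cards": {"dc_1": {"status": "active"}},
    "transaction_disputes": {"d_1": {"category": "not_as_described"}},
}
print(matches_reference(initial, final, reference))
```

An exact-equality comparison like this is the simplest choice; a real evaluator may tolerate irrelevant fields such as timestamps.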
Banking environment — ~24 tables, discoverable-tool pool
User entry:
Credit-card account entry:
Debit card entry (pending activation):
The database also includes debit_cards, transaction_disputes, verification_history, credit_card_transaction_history, bank_account_transaction_history, referrals, credit_card_applications, interest_discrepancy_reports, and more. Sensitive operations (activating a card, filing a dispute, looking up card digits) are not in the base tool set; they live in a discoverable-tool pool the agent unlocks at runtime.
Data
Each task is described across four parallel files that share an index:
| Directory | File | Description |
|---|---|---|
| intent/ | 000001.json | Goal, customer profile, persona, motivations, and hard constraints |
| datapoints/ | 000001.json | The full multi-turn dialog with reasoning and tool calls |
| evaluators/ | evaluator_000001.py | The Python grading script (DB-diff + rubrics + policy) |
| reference_payloads/ | reference_payload_000001.json | Initial DB state + reference diff for deterministic grading |
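Because the four files share an index, the paths for one task can be derived from that index alone. The helper below is a minimal sketch based on the file names shown in the table; the function name, the dictionary keys, and the root directory are assumptions for illustration.

```python
from pathlib import Path


def task_files(root: Path, index: int) -> dict[str, Path]:
    """Build the four parallel file paths for one task from its shared index."""
    stem = f"{index:06d}"  # zero-padded to six digits, as in 000001
    return {
        "intent": root / "intent" / f"{stem}.json",
        "datapoint": root / "datapoints" / f"{stem}.json",
        "evaluator": root / "evaluators" / f"evaluator_{stem}.py",
        "reference_payload": root / "reference_payloads" / f"reference_payload_{stem}.json",
    }


# "data" is a placeholder root directory, not a path from the dataset.
for kind, path in task_files(Path("data"), 1).items():
    print(kind, path.as_posix())
```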
Sample
Sample tasks (goals)
| Task | Goal |
|---|---|
| File a dispute for a defective Best Buy charge (correct category, not fraud), and activate a pending new debit card with a chosen PIN | |
| Understand how Silver Account savings interest works and confirm the applicable APY tier | |
| Ask whether Rho-Bank offers business checking and whether it must be a new account | |
| Activate a replacement debit card (issued after the previous was stolen) and verify identity first | |
| Activate a new debit card ending in 2847 by providing card details and a PIN | |
| Reset a forgotten PIN on an active card (forgotten-PIN flow) to 8527 | |
| Activate a reissued debit card using the printed expiration date and CVV | |
| Investigate a missing goodwill credit tied to a savings-interest issue | |
| Activate a debit card mailed to replace an expired one, with a chosen PIN | |
| Inquire about a savings-interest correction and review reward points on two cards | |
issue_reason and discover the matching activation tool.
Full task intent (task 000001)
Full agent trajectory (task 000001 — 47 messages, 19 tool calls)
A compact view of the dispute-and-activation trajectory. Note the recurring discover → unlock → call pattern for every sensitive operation.
User:
I think I need to file a dispute? I bought something from Best Buy on my Silver Rewards Card and it arrived defective. I also got a new debit card I need to activate.
Assistant - verify identity first (KB-grounded policy):
Assistant - ground both operations in the knowledge base:
Assistant - discover and call hidden tools to gather account context:
Assistant - activate the new-account debit card (matching issue_reason), with PIN 4821:
User pushes back on the dispute category (“it’s defective, not just ‘not as described’”). The agent re-checks the KB, explains the available categories honestly, and refuses to invent one.
Assistant - retrieve the card’s last 4 digits from a tool (not from the user), per constraint:
Assistant - present the full dispute for confirmation:
Here’s what I’ll file: Best Buy charge txn_b8c9d0e1f2a3, category goods/services not as described, full refund requested, card ending 1961. Shall I file it?
User:
Yes, file the dispute.
Assistant - file only after explicit confirmation:
Both done — your debit card is activated, and the Best Buy dispute has been filed.
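The discover → unlock → call pattern noted in this trajectory can be sketched as a small registry in which tools stay hidden until the agent searches for them and unlocks them. This is a toy model under stated assumptions: the ToolPool class, its method names, and the activate_debit_card tool are hypothetical and do not reflect the benchmark's actual tool API.

```python
from typing import Any, Callable

ToolFn = Callable[..., dict[str, Any]]


class ToolPool:
    """Toy discoverable-tool pool: tools are callable only after unlocking."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, ToolFn]] = {}
        self._unlocked: set[str] = set()

    def register(self, name: str, description: str, fn: ToolFn) -> None:
        self._tools[name] = (description, fn)

    def discover(self, query: str) -> list[str]:
        """Return names of tools whose description mentions the query."""
        q = query.lower()
        return sorted(n for n, (desc, _) in self._tools.items() if q in desc.lower())

    def unlock(self, name: str) -> None:
        if name not in self._tools:
            raise KeyError(name)
        self._unlocked.add(name)

    def call(self, name: str, **kwargs: Any) -> dict[str, Any]:
        if name not in self._unlocked:
            raise PermissionError(f"tool {name!r} is not unlocked")
        return self._tools[name][1](**kwargs)


def activate_debit_card(card_id: str, pin: str) -> dict[str, Any]:
    # Hypothetical tool body; a real tool would write to the database.
    return {"card_id": card_id, "status": "active"}


pool = ToolPool()
pool.register("activate_debit_card", "Activate a pending debit card", activate_debit_card)

matches = pool.discover("activate")   # discover
pool.unlock(matches[0])               # unlock
result = pool.call(matches[0], card_id="dc_1", pin="4821")  # call
print(result)
```

Calling a tool before unlocking it raises PermissionError here, which mirrors the idea that sensitive operations are unavailable from the base tool set.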
Evaluator rubrics & grading (task 000001)
The per-task Python evaluator carries the reference diff and a list of typed rubrics, then runs three LLM-judged checks.
Sample rubrics:
Three-axis grading:
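As a rough illustration only, the three checks (DB-diff, rubric, policy) can be combined into a single pass flag. The GradeResult class, its field names, and the binary reward mapping are assumptions for this sketch; only the overall_pass name and the three axes come from the description here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeResult:
    """Outcome of the three grading axes (field names are hypothetical)."""

    db_diff_pass: bool
    rubric_pass: bool
    policy_pass: bool

    @property
    def overall_pass(self) -> bool:
        # A task passes only when every axis passes.
        return self.db_diff_pass and self.rubric_pass and self.policy_pass


def reward(result: GradeResult) -> float:
    """Assumed binary mapping from the pass flag to a scalar reward."""
    return 1.0 if result.overall_pass else 0.0


clean = GradeResult(db_diff_pass=True, rubric_pass=True, policy_pass=True)
policy_violation = GradeResult(db_diff_pass=True, rubric_pass=True, policy_pass=False)
print(reward(clean), reward(policy_violation))
```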
overall_pass requires all three checks. Because grading runs against the reference snapshot, the score is deterministic and usable as an RL reward.
Download
Browse on Hugging Face
View Tau3-Bench banking files
For the complete Tau3-Bench corpus — the discoverable-tool mechanic, benchmark difficulty, and commercial licensing — see the Full Dataset page.