Want to try it first? A free 10-task-per-domain sample is available on the Demo Samples page.
What Tau2-Bench is
Each sample is a conversation between a customer and an AI service agent. The agent is given a domain policy and a fixed tool set, and it must satisfy the customer’s request — booking a flight, troubleshooting a phone, returning an order — by calling backend tools while staying inside policy. A user simulator plays the customer across many turns, supplying details only when asked, changing its mind, and pushing back, so the agent has to drive a real conversation rather than answer a single prompt. Three properties make the tasks hard:- Policy compliance. The agent must obtain explicit confirmation before any write, refuse out-of-policy requests, and transfer to a human when (and only when) a request falls outside its tools.
- Multi-turn user simulation. Customers reveal intent gradually, batch several requests together, and frequently switch to a fallback goal when the first one is not allowed.
- Stateful environments. Tools read and write a shared database, so a correct final answer requires a correct sequence of actions, not just a correct last message.
Task categories
| Domain | Scenario coverage | Environment | Tools |
|---|---|---|---|
| Airline | Book, modify, and cancel reservations; cabin upgrades; baggage changes; passenger edits; refunds and travel-certificate compensation | Flights, users, and reservations database (4 DB variants) | 14 |
| Telecom | Device-level troubleshooting — no-service, slow data, MMS failures, roaming; plan, bill, and data-usage handling; line suspend/resume | Customers, lines, devices, plans, and bills, plus a simulated handset (airplane mode, APN, Wi-Fi, VPN, SIM, reboot) | 43 |
| Retail | Order tracking, returns, exchanges, cancellations; address and payment-method changes; product and variant lookups | Products, users, and orders database | 16 |
At a glance
| Property | Value |
|---|---|
| Domains | Airline, Telecom, Retail |
| Samples | 2,250 total — 750 per domain (500 SFT-ready dialogs + 250 RL-only tasks). Part of the 3,000-sample Tau-Bench family. |
| Task format | Multi-turn dialog driven by a user simulator + agent tool calls against a stateful backend |
| Environments | MCP-backed databases — flights/reservations, telecom plans + simulated device, orders/products |
| Agent tools | 14 (airline) / 43 (telecom) / 16 (retail) function-calling tools |
| Grading | Action- and state-level checks against per-task gold evaluation criteria |
What’s inside
| Component | Description |
|---|---|
| SFT dialogs | 500 per domain. Each is a complete multi-turn trajectory — system policy, user simulator turns, assistant messages with chain-of-thought reasoning, tool calls, and tool results — ready for supervised fine-tuning. |
| RL tasks | 250 per domain. Each ships a task purpose, a user-simulator scenario, the database to mount, and evaluation_criteria (the gold action sequence + required communicated info) — RL-trainable without a gold trajectory. |
| Environments | Self-contained domain databases that the tools read and write. The airline domain ships four DB variants; telecom ships a plans/lines/devices/bills world plus a simulated handset. |
| Tool schemas | The full function-calling tool set available to the agent in each domain (e.g. book_reservation, update_reservation_flights; check_apn_settings, reboot_device; return_delivered_order_items, modify_pending_order_items). |
| Evaluation criteria | The reference actions and communicated facts a passing trajectory must produce — usable as a reward signal. |
Trajectory length
Tau2-Bench dialogs are genuinely long-horizon. The table below summarizes the SFT dialogs per domain — messages per dialog (system + user + assistant + tool), user-simulator turns, and agent tool calls — as mean / median / p90.| Domain | Messages/dialog | User turns | Tool calls |
|---|---|---|---|
| Airline | 37 / 37 / 49 | 8.8 | 11.7 / 11 / 17 |
| Telecom | 74 / 72 / 105 | 29.5 | 6.2 / 6 / 8 |
| Retail | 36 / 34 / 45 | 9.1 | 8.7 / 8 / 11 |
- Telecom dialogs are the longest (median 72 messages, ~30 user turns) but use the fewest tool calls — troubleshooting is a slow, conversational back-and-forth where each device check is interpreted before the next.
- Airline dialogs make the most tool calls (median 11) — booking and modification require chained lookups (user → reservations → flights) before a single write.
- Retail sits in between, with multi-request dialogs that mix lookups, calculations, and writes.
Tool usage
Each domain exercises a distinct slice of its tool set, reflecting its source material:- Airline — dominated by
get_reservation_details,search_direct_flight, andget_flight_status, withupdate_reservation_baggages,book_reservation, andupdate_reservation_flightsperforming the writes after confirmation. - Telecom — dominated by
get_details_by_idandget_customer_by_phonefor identification, then device-state actions likerefuel_data,enable_roaming, andget_data_usage;transfer_to_human_agentsappears when a fix is out of scope. - Retail — dominated by
get_order_detailsandget_product_details, withfind_user_id_by_email/get_user_detailsfor authentication andexchange_delivered_order_items,return_delivered_order_items,modify_user_address, andcancel_pending_orderperforming the changes.
How challenging is the data
The difficulty is not in any single tool call — it is in doing the whole conversation correctly. A passing trajectory must authenticate the user, gather requirements across many turns, take the exact gold sequence of state-changing actions, obtain confirmation before each write, and refuse or transfer when policy requires it. A single skipped confirmation, an out-of-policy action, or a wrong fallback when the user changes their mind fails the task. Frontier models reliably handle the read-only lookups but stumble on the end-to-end requirements — correct database mutations and full policy adherence and goal completion in one trajectory. The harder, single-domain banking slice is quantified on the Tau3-Bench page, where frontier models pass under 30% of tasks.Training utility
Training studies on the Tau2-Bench corpus show that SFT bootstrap + verifiable-reward RL lifts open-weight models to frontier-competitive performance across all three domains, using EigenData’s executable per-task verifiers as the RL reward. Results are reported as Pass^k — the probability that all k independent trials of a task succeed (a strict reliability metric, not one lucky rollout). In every case the RL training tasks are generated from environments whose databases are held out from the benchmark test set (same tool schemas and operational rules, different users / orders / products). The pattern is consistent across domains: SFT delivers the first large jump (broad tool-use competence — tool selection, schema-valid arguments, multi-turn state tracking), and a second RL stage adds further gains that widen at stricter Pass^k — the agent becomes not just more often right, but more reliably right across repeated trials.The Retail comparison below ranks against the current public τ²-Bench leaderboard. The Airline and Telecom figures were graded under the τ²-bench evaluation harness at the time of measurement; the public board has since tightened its grader (the v1.0.0 “Task Quality” release) and added newer model versions, so treat those two as same-evaluation comparisons rather than current-board rankings.
Airline
Self-evolving synthetic SFT data plus GRPO RL with executable verifiers (From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents), base Qwen3-30B-A3B-Thinking-2507:| Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---|---|---|---|
| Base | 56.0 | 42.7 | 36.0 | 32.0 |
| + SFT | 60.0 | 49.0 | 42.5 | 38.0 |
| + SFT + RL | 70.5 | 61.7 | 56.0 | 52.0 |
Retail
A targeted error-driven RL recipe (base Qwen3-30B-A3B-Thinking-2507) trains on tasks synthesized from the model’s own recurring post-SFT failures — wrong product variant, mixed item identifiers across orders, missing confirmation or execution steps:| Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---|---|---|---|
| Base | 54.2 | 44.7 | 39.5 | 34.2 |
| + SFT | 75.9 | 64.5 | 57.5 | 52.6 |
| + SFT + error-driven RL | 82.5 | 73.0 | 66.2 | 61.4 |
| # | Model | Org | Pass^1 |
|---|---|---|---|
| 1 | Qwen3.5-397B-A17B | Alibaba Cloud | 84.4 |
| 2 | Tau2-Bench Retail SFT+RL — Qwen3-30B-A3B-Thinking | EigenAI | 82.5 |
| 3 | GPT-5.2 | OpenAI | 81.6 |
| 4 | Claude Opus 4.5 | Anthropic | 79.6 |
| 5 | Gemini 3 Flash | 76.8 | |
| 6 | Gemini 3 Pro | 75.9 | |
| 7 | GLM-5 | Zhipu AI | 73.7 |
| 8 | Claude Sonnet 4.5 | Anthropic | 72.4 |
Telecom
Same self-evolving SFT + RL recipe (From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents), base Qwen3-30B-A3B-Thinking-2507:| Model | Pass^1 | Pass^2 | Pass^3 | Pass^4 |
|---|---|---|---|---|
| Base | 28.5 | 20.2 | 18.4 | — |
| + SFT | 85.4 | 78.8 | 73.5 | 70.8 |
| + SFT + RL | 95.6 | 91.8 | 88.6 | 86.0 |