Overview
| Property | Value |
|---|---|
| Domains | Airline, Telecom, Retail |
| Tasks | 10 per domain, with gold evaluation criteria |
| Turn type | Multi-turn, user-simulator driven |
| Scenarios | Airline: booking, modification, cancellation, baggage, refunds · Telecom: device troubleshooting, roaming, data, billing · Retail: returns, exchanges, address/payment changes |
| Grading | Action- and state-level checks against the reference action sequence |
Environment
Each domain ships in the tau-bench four-folder layout (intent / datapoints / evaluators / reference_payloads), joined by a shared NNNNNN id (000001–000010). The self-contained backend database the agent reads and writes is embedded in each sample’s reference_payloads/ as a full initial-state snapshot, so every trajectory replays deterministically against the same world.
Airline environment — 300 flights, 500 users, 2000 reservations
User entry:
Flight entry:
Cabin classes are basic economy, economy, and business; membership levels are regular, silver, and gold. The agent must keep full price precision and obtain confirmation before any booking, modification, or cancellation.
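One way to honor the "full price precision" requirement is to carry amounts as decimal values instead of binary floats. The sketch below is illustrative only: the function name and the amounts are hypothetical and are not taken from the dataset or its tools.

```python
from decimal import Decimal

def total_price(amounts):
    """Sum price strings exactly. Name and inputs are hypothetical."""
    return sum((Decimal(a) for a in amounts), Decimal("0"))

# Binary floats can drift on cent values; Decimal keeps them exact.
float_total = 0.1 + 0.2
exact_total = total_price(["0.10", "0.20"])
```

Parsing prices from strings avoids importing float rounding error at the point of entry.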
Telecom environment — plans, lines, devices, customers, bills + a simulated handset
Plan entry (telecom_db.toml):
Device entry:
Beyond the account database (plans, lines, customers, bills), telecom uniquely exposes a simulated handset that the agent inspects and reconfigures: airplane mode, network mode, APN settings, Wi-Fi and Wi-Fi calling, VPN, SIM status, data saver, app permissions, and reboot. Troubleshooting an MMS failure or “No Service” means running this device through a diagnostic sequence.
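A diagnostic sequence over the simulated handset can be pictured as an ordered list of checks. The sketch below is a rough illustration under stated assumptions: the `Handset` class, its field names, and the check order are all hypothetical and do not reflect the dataset's actual device schema or tools.

```python
from dataclasses import dataclass

@dataclass
class Handset:
    """Hypothetical handset state; fields are illustrative, not the dataset schema."""
    airplane_mode: bool = False
    sim_active: bool = True
    vpn_connected: bool = False
    data_saver: bool = False

def failed_checks(phone: Handset) -> list[str]:
    """Run the checks in a fixed order and return the names of those that fail."""
    checks = [
        ("airplane mode is off", not phone.airplane_mode),
        ("SIM is active", phone.sim_active),
        ("VPN is disconnected", not phone.vpn_connected),
        ("data saver is off", not phone.data_saver),
    ]
    return [name for name, ok in checks if not ok]
```

An agent following this pattern would fix the first failing check, then rerun the sequence until the list is empty.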
Retail environment — 190 products, 490 users, 2000 orders
Product entry:
Order entry:
Orders move through pending → processed → delivered → cancelled, and the available action depends on status: pending orders can be modified or cancelled, delivered orders can be returned or exchanged.
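The status rule above can be written as a small lookup. This is a minimal sketch: the action labels and the function name are hypothetical, and only the pending and delivered rules come from the text; other statuses are assumed to allow no write actions.

```python
# Status-to-action rule as described in the text; identifiers are hypothetical.
ALLOWED_ACTIONS = {
    "pending": frozenset({"modify", "cancel"}),
    "delivered": frozenset({"return", "exchange"}),
}

def allowed_actions(status: str) -> frozenset:
    """Return the write actions available for an order in the given status."""
    # Assumption: statuses not listed above (processed, cancelled) allow none.
    return ALLOWED_ACTIONS.get(status, frozenset())
```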
Data
Each task is described across four parallel files that share an NNNNNN id:
| Folder | File | Description |
|---|---|---|
| intent/ | 000001.json | The task: customer goal, profile, persona, motivations, and hard constraints (the user-simulator scenario) |
| datapoints/ | 000001.json | The SFT-ready multi-turn dialog (system policy, user-simulator turns, assistant messages with reasoning, tool calls, and tool results) plus the domain tool_schema |
| evaluators/ | evaluator_000001.py | The per-task grading script: action- and state-level checks against the reference action sequence |
| reference_payloads/ | reference_payload_000001.json | The full backend database snapshot (initial state) plus the reference diff for deterministic grading |
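The naming scheme in the table can be captured in a small path helper. This is a sketch, assuming each domain lives under its own root directory; the function name, the dictionary keys, and the `airline` root are hypothetical, while the folder and file name patterns follow the table.

```python
from pathlib import Path

def sample_paths(domain_root, task_id: int) -> dict:
    """Build the four file paths for one task from its shared six-digit id."""
    sid = f"{task_id:06d}"  # e.g. 1 -> "000001"
    root = Path(domain_root)
    return {
        "intent": root / "intent" / f"{sid}.json",
        "datapoint": root / "datapoints" / f"{sid}.json",
        "evaluator": root / "evaluators" / f"evaluator_{sid}.py",
        "reference_payload": root / "reference_payloads" / f"reference_payload_{sid}.json",
    }

paths = sample_paths("airline", 1)  # "airline" is an assumed directory name
```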
Sample
Airline — sample tasks
| Scenario |
|---|
| Change reservation HKEG34 to the best nonstop option, then add two checked bags |
| Upgrade cabin from basic economy → business on an existing reservation |
| Shift all outbound legs to depart after 10:00 while keeping the return unchanged |
| Move reservation SGTEEY to the earliest LGA→PHL flight before May 20 |
| Correct a passenger name and add checked baggage |
| Book the cheapest direct morning DEN→ORD one-way in economy, 1 bag, with travel insurance |
| Upgrade from basic economy → economy on an upcoming round trip |
Telecom — sample tasks
| Scenario |
|---|
| Cannot send MMS for the past few hours; diagnose and successfully send a picture message |
| Phone shows “No Service”; restore connectivity |
| Mobile data drops or is very slow; must reach excellent speed, no Wi-Fi available |
| Roaming not working abroad; enable roaming and confirm data works |
| Refuel data after hitting the plan limit |
| Billing and data-usage inquiry, with line suspend/resume on request |
Retail — sample tasks
| Scenario |
|---|
| Swap a basketball variant (leather → composite, same size) on a pending order, pay the difference on file |
| Cancel order #W6175707; if shipped, modify the suitcase to a larger soft-shell instead |
| Update an order’s address; on learning it’s delivered, cancel it instead |
| Update a pending order’s shipping address, then switch its payment method |
| Return a Duvet; realize the mix-up and return a Pillow from a different order instead |
| Change a moisturizer to a different skin type; exchange a scarf for a shorter length |
| Cancel an order; if delivered, return the bamboo cutting board for a refund |
Full retail dialog (29 messages, 8 tool calls) — address update + product lookups + policy-correct transfer
Each assistant message includes reasoning (chain-of-thought) showing the agent’s step-by-step decisions. Reasoning is shown in italics below.
User: Hi there! I need to update my shipping address in my account. I’d like to change it to 123 Oak Street, Apartment 5B, Dallas, Texas 75201. Can you help me with that?
Assistant: *Per policy, I must authenticate the user before changing anything on their account: locate their user id via email, or name + zip.* I’d be happy to help update your default shipping address. But first I need to verify your identity. Could you provide the email address on your account?
User: Sure! My email address is amara.diallo5790@example.com.
Assistant (authenticate): Great, you’re verified. Just to confirm, I’ll set your default shipping address to 123 Oak Street, Apt 5B, Dallas, TX 75201. Shall I go ahead?
User: Yes, that’s correct! Please go ahead and update it. Also, I had a cancelled order #W0877487 for a Bird Feeder. I’m curious what the price difference would be for a different variant.
Assistant (update address, then look up the order and product; the user confirmed the write): Done. Your default address is now 123 Oak Street, Apt 5B, Dallas, TX 75201. On the Bird Feeder, the variant you mentioned is $5.76 cheaper than the one on the cancelled order.
User: Could you tell me about Tennis Rackets too? And could you place a new order for the Bird Feeder variant and a Tennis Racket?
Assistant (provide product info): Here are the available Tennis Racket variants and prices… (lists variants)
User: That’s disappointing you can’t place the order. Is there someone else who could help me place it?
Assistant: *Policy is explicit: I can cancel/modify pending orders, return/exchange delivered orders, and modify the default address, but I cannot place a brand-new order. The request is outside my scope, so I transfer to a human.* YOU ARE BEING TRANSFERRED TO A HUMAN AGENT. PLEASE HOLD ON.
This trajectory shows the full pattern: authenticate → confirm before writing → answer read-only questions → refuse-and-transfer when a request (placing a new order) falls outside the agent’s tools.
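The pattern in this dialog can be sketched as a small decision function. It is a simplified illustration, not the dataset's policy engine: the request kinds, the return labels, and the function name are all hypothetical.

```python
READ_ONLY = {"lookup_order", "lookup_product"}          # hypothetical request kinds
WRITES = {"update_address", "modify_order", "cancel_order",
          "return_order", "exchange_order"}             # hypothetical request kinds

def next_step(request: str, authenticated: bool, confirmed: bool) -> str:
    """Pick the agent's next move for one user request."""
    if not authenticated:
        return "authenticate"          # identity check comes first
    if request in READ_ONLY:
        return "answer"                # reads need no confirmation
    if request in WRITES:
        return "execute" if confirmed else "ask_confirmation"
    return "transfer_to_human"         # anything else is outside the agent's tools
```

In the dialog above, placing a new order falls through every branch and ends in a transfer.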
RL evaluation criteria (airline cabin upgrade)
RL tasks ship a user-simulator scenario and the gold action sequence a passing trajectory must reproduce.
User-simulator scenario:
Evaluation criteria (gold actions):
A trajectory passes when it executes the gold state-changing actions (with matching arguments) and communicates any required facts; it is graded against this reference rather than a free-text rubric.
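The action half of that pass condition can be approximated as a containment check. This is a minimal sketch, not the shipped evaluator: it assumes each action is a dict with `name` and `arguments` keys, ignores ordering and extra read-only calls, and does not cover the communicated-facts check. The tool names and reservation id below are hypothetical.

```python
def executes_gold_actions(trajectory, gold_actions) -> bool:
    """True if every gold action (same name and arguments) occurs in the trajectory."""
    remaining = list(trajectory)
    for action in gold_actions:
        if action not in remaining:
            return False
        remaining.remove(action)  # one executed call satisfies at most one gold action
    return True

# Hypothetical example data
gold = [{"name": "update_reservation_flights",
         "arguments": {"reservation_id": "ABC123", "cabin": "business"}}]
run = [{"name": "get_reservation_details", "arguments": {"reservation_id": "ABC123"}},
       gold[0]]
```

A trajectory that calls the right tool with a different argument, such as another cabin, fails this check.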
Download
Browse on Hugging Face
View Tau2-Bench files
For the complete Tau2-Bench corpus — all environments, SFT dialogs, RL tasks, and commercial licensing — see the Full Dataset page. For the harder single-domain banking slice, see Tau3-Bench.