Overview
| Property | Airline | Retail | Banking |
|---|---|---|---|
| Samples | 10 | 10 | 10 |
| Turn type | Multi-turn | Multi-turn | Multi-turn |
| Scenarios | Booking, cancellation, baggage, refunds, flight changes | Order tracking, returns, exchanges, address changes, product upgrades | Account management, transfers, disputes, credit cards, loan inquiries |
Environment
Each sub-domain includes a reference_payloads/ directory containing MCP server state snapshots: the full database state used during generation. These snapshots represent the simulated backend that the agent interacts with via tool calls.
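As a rough illustration of how such a snapshot might be inspected, the sketch below counts the records in each table. It assumes, without confirmation from the dataset, that a snapshot is a JSON object whose top-level keys are table names mapping to collections of records; the function name `summarize_snapshot` and the stand-in payload are hypothetical.

```python
import json
from typing import Any


def summarize_snapshot(snapshot: dict[str, Any]) -> dict[str, int]:
    """Count records per top-level table.

    Assumption: each table is a dict keyed by ID or a list of records.
    Anything else (scalars, metadata strings) is skipped.
    """
    return {
        table: len(records)
        for table, records in snapshot.items()
        if isinstance(records, (dict, list))
    }


# Tiny stand-in payload (hypothetical); real snapshots are far larger.
example = json.loads(
    '{"flights": {"F1": {}, "F2": {}}, "users": {"u1": {}}, "reservations": {}}'
)
print(summarize_snapshot(example))
# → {'flights': 2, 'users': 1, 'reservations': 0}
```

A summary like this is a quick sanity check that a loaded snapshot matches the advertised environment sizes before replaying a dialog against it.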
Airline environment — 300 flights, 500 users, 2000 reservations
Example entries: user, flight, reservation.
Retail environment — 50 products, 500 users, 1000 orders
Example entries: product, user, order.
Banking environment — 24 tables including users, accounts, credit cards, transactions
Example entries: user, account, credit card account.

Database also includes: debit_cards, referrals, credit_card_applications, verification_history, credit_card_transaction_history, cash_back_disputes, bank_account_transaction_history, transaction_disputes, payment_history, credit_card_orders, debit_card_orders, credit_card_closure_reasons, credit_card_account_flags, credit_limit_increase_requests, debit_card_disputes, interest_discrepancy_reports, and more.

Data
Each sub-domain contains:

| Directory | Description |
|---|---|
| datapoints/ | Generated multi-turn dialogs with tool calls |
| intent/ | User intent definitions — goals, profiles, personas, and motivations |
| evaluators/ | Python grading scripts that verify agent responses |
| reference_payloads/ | Full environment snapshots for deterministic replay |
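A downloaded sub-domain can be checked against this layout before use. The following is a minimal sketch, assuming the four directories sit directly under a sub-domain root; the helper name `missing_dirs` and the example path are hypothetical.

```python
from pathlib import Path

# Directory names taken from the table above.
EXPECTED_DIRS = ("datapoints", "intent", "evaluators", "reference_payloads")


def missing_dirs(subdomain_root: Path) -> list[str]:
    """Return the expected directories that are absent under subdomain_root."""
    return [
        name for name in EXPECTED_DIRS
        if not (subdomain_root / name).is_dir()
    ]


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the data was downloaded.
    print(missing_dirs(Path("retail")))
```

An empty list means all four directories are present.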
Sample
Full retail dialog (24 messages, 7 tool calls)
User: Hi — can you help me update the shipping address on my pending order W8797321? My account email is omar.johnson6791@example.com.

A:

Step 1 — Look up user and order in parallel: User wants to update the shipping address on pending order W8797321 and provided their email. I should look up their user ID and fetch the order details to confirm status and ownership before proceeding.

Step 2 — Final answer (after tool results): I have confirmed the order is pending and associated with the user. Next I need the new shipping address details to proceed with the update.

Got it! Order #W8797321 is pending, so we can definitely update the address. What’s the new shipping address you’d like to use?

User: Yep — please change it to 912 Elm Street, Suite 173, Jacksonville, FL 32228.

A:

I have the full new address from the user, but I should confirm it before making a change to a pending order.

Just to confirm, I’ll update the shipping address on order #W8797321 to:

912 Elm Street, Suite 173, Jacksonville, FL 32228, USA

Does that look right? Just say yes and I’ll make the change!

User: Yes, that’s correct — please go ahead and update it to 912 Elm Street, Suite 173, Jacksonville, FL 32228.

A:

Step 1 — Update shipping address: User confirmed the address update, so I will submit the address change for the pending order with the confirmed address fields.

Step 2 — Final answer (after tool result): The tool confirms the address update succeeded, so I should acknowledge completion and offer further help.

Done! The shipping address on order #W8797321 has been updated to 912 Elm Street, Suite 173, Jacksonville, FL 32228. Is there anything else I can help you with?

User: Thanks. Next question — on the laptop in that order, what upgrade options are available for that model, and what would the price difference be if I bumped up the processor or storage?

A:

Step 1 — Fetch product variants: To answer upgrade options and price differences for the laptop in the order, I need the product’s variant list and pricing, so I will fetch product details for product_id 4760268021.

Step 2 — Final answer (after tool result): Using the product variants, I identified the closest in-stock upgrades relative to the customer’s current 15-inch space grey configuration and computed the price deltas from $2,291.87 to each available variant, highlighting the best match that keeps size/color.

Here’s what I found for available upgrades on your laptop (currently 15-inch, i5, 16GB RAM, 512GB SSD, space grey at $2,291.87): Keeping the same 15-inch space grey look, there’s one great available option — i7 processor, 32GB RAM, 1TB SSD at $2,292.37 — that’s only $0.50 more than your current config!

User: Whoa, how is the i7/32GB/1TB space grey basically the same price? If that’s real, yes I want to swap to that upgrade. Also, I want to return the Vacuum Cleaner from my other order W2809253 — can you do that too?

A:

Step 1 — Fetch second order for return: The user wants to proceed with a return from delivered order W2809253, so I need that order’s details (item IDs and payment method) to start the return.

Step 2 — Final answer (after tool result): I should reassure the user about the upgrade being in stock and clarify the exact variant and price difference. For the return, I now have the vacuum cleaner item_id and the PayPal payment method from the delivered order, but I need explicit confirmation to proceed with both actions.

Yes, it’s real! The 15-inch i7/32GB RAM/1TB SSD in space grey (item 6017636844) is priced at $2,292.37 — just $0.50 more than your current laptop. And it is in stock. For the return of the Vacuum Cleaner from order #W2809253 (delivered), I can process that too. Which payment method should receive the refund — your gift card or PayPal?

User: Yes, please go ahead with both. Swap the laptop on order W8797321 to the 15-inch space grey i7/32GB/1TB option, and then start the return for the Vacuum Cleaner from order W2809253 with the refund going to PayPal.

A:

Step 1 — Swap laptop item and initiate return in parallel: User confirmed both actions. I will (1) modify the pending order to swap the laptop variant and charge the small price difference to the gift card on file, and (2) initiate a return for the vacuum cleaner from the delivered order with refund to PayPal.

Step 2 — Final answer (after tool results): Both tool calls succeeded, so I should confirm the laptop swap (including the $0.50 charge) and confirm the return request and refund destination/amount, then offer further assistance.

Both done! Here’s a quick summary: Your laptop on order #W8797321 has been swapped to the 15-inch i7/32GB/1TB SSD space grey — $0.50 was charged to your gift card for the difference. The return for the Vacuum Cleaner on order #W2809253 has been submitted — you’ll receive an email with return instructions, and the refund will go to your PayPal account.
User scenario for the above dialog
Environment snapshot (MCP server state)
Each reference payload contains the full MCP server database state (products, users, and orders) used during the conversation. Here is a condensed view:

Example user entry:

Example order entry:
Evaluator excerpt (~960 lines Python, three-part evaluation)
Each sample ships with an auto-generated evaluator that scores submissions on three axes:

Part 1 — Database state comparison (db_check): deep-compares the final MCP server state against the reference payload using a structured diff. Fields are classified by type: exact match for IDs, amounts, and statuses; semantic match for free text; ignore for timestamps and tokens.

Part 2 — Rubrics evaluation (rubrics_check): evaluates the trajectory against task-specific rubrics covering goal achievement, required process steps, and step ordering.

Part 3 — Domain policy compliance (policy_check): verifies the agent followed domain rules, e.g., authenticating the user before taking actions, confirming changes before executing, and only modifying orders with the correct status.

The combined verdict:

Evaluation Results
We evaluated the generated data by running frontier models as agents against the synthesized environments and grading with the auto-generated evaluators. Each evaluator scores on three axes: DB Check (correct database mutations), Rubrics (goal and process compliance), and Policy (domain policy adherence). All Three requires passing all three checks.

Airline

| Model | DB Check | Rubrics | Policy | DB + Rubrics | All Three |
|---|---|---|---|---|---|
| Gemini-3.1-Pro-Preview | 90.0 | 60.0 | 70.0 | 60.0 | 30.0 |
| Claude-Opus-4-6 | 80.0 | 60.0 | 70.0 | 60.0 | 40.0 |
| Grok-4.20 | 80.0 | 30.0 | 20.0 | 30.0 | 10.0 |
| GPT-5.3-Codex | 60.0 | 50.0 | 50.0 | 50.0 | 40.0 |

| Model | DB Check | Rubrics | Policy | DB + Rubrics | All Three |
|---|---|---|---|---|---|
| Claude-Opus-4-6 | 50.0 | 40.0 | 80.0 | 20.0 | 20.0 |
| GPT-5.3-Codex | 10.0 | 10.0 | 30.0 | 0.0 | 0.0 |

| Model | DB Check | Rubrics | Policy | DB + Rubrics | All Three |
|---|---|---|---|---|---|
| Grok-4.20 | 80.0 | 80.0 | 60.0 | 60.0 | 30.0 |
| Claude-Opus-4-6 | 70.0 | 50.0 | 90.0 | 40.0 | 30.0 |
| Gemini-3.1-Pro-Preview | 70.0 | 30.0 | 60.0 | 30.0 | 10.0 |
| GPT-5.3-Codex | 60.0 | 60.0 | 50.0 | 40.0 | 30.0 |
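The table columns can be read as pass rates over per-sample verdicts, with DB + Rubrics and All Three being conjunctions of the individual checks. The sketch below shows one way such percentages could be computed; it is not the evaluators' actual code. The `Verdict` class, the `pass_rates` function, and the ten example verdicts are all hypothetical and are not taken from the evaluation runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """Per-sample outcome of the three checks (hypothetical structure)."""
    db_check: bool
    rubrics_check: bool
    policy_check: bool

    @property
    def db_and_rubrics(self) -> bool:
        return self.db_check and self.rubrics_check

    @property
    def all_three(self) -> bool:
        return self.db_and_rubrics and self.policy_check


def pass_rates(verdicts: list[Verdict]) -> dict[str, float]:
    """Percentage of samples passing each check and each combination."""
    n = len(verdicts)
    if n == 0:
        raise ValueError("no verdicts to aggregate")

    def rate(flags) -> float:
        # Booleans sum as 0/1, so this is the share of passing samples.
        return round(100.0 * sum(flags) / n, 1)

    return {
        "DB Check": rate(v.db_check for v in verdicts),
        "Rubrics": rate(v.rubrics_check for v in verdicts),
        "Policy": rate(v.policy_check for v in verdicts),
        "DB + Rubrics": rate(v.db_and_rubrics for v in verdicts),
        "All Three": rate(v.all_three for v in verdicts),
    }


# Ten invented verdicts, matching the 10 samples per sub-domain.
example_verdicts = (
    [Verdict(True, True, True)] * 4
    + [Verdict(True, True, False)] * 2
    + [Verdict(True, False, True)] * 2
    + [Verdict(False, False, True)]
    + [Verdict(False, False, False)]
)
print(pass_rates(example_verdicts))
# → {'DB Check': 80.0, 'Rubrics': 60.0, 'Policy': 70.0, 'DB + Rubrics': 60.0, 'All Three': 40.0}
```

With 10 samples per sub-domain, every rate is a multiple of 10, and a combined column can never exceed the lowest of the individual columns it depends on.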
Download
Browse on Hugging Face
View Tau2-Bench files