Full Dataset

Inspired by τ²-bench, the Tau2-Bench dataset is a corpus of multi-turn, tool-using customer-service conversations generated by EigenData-CLI. It spans three domains — airline, telecom, and retail — each with a self-contained simulated backend, a written agent policy, a user simulator, and machine-checkable success criteria. The Tau2-Bench corpus contributes 2,250 samples to the larger 3,000-sample Tau-Bench family (the remaining 750 are the harder, single-domain Tau3-Bench). Each domain ships 750 samples: 500 SFT-ready dialogs (full agent trajectory included) plus 250 RL-only tasks (environment + evaluation criteria, no gold trajectory).

Want to try it first? A free 10-task-per-domain sample is available on the Demo Samples page.

What Tau2-Bench is

Each sample is a conversation between a customer and an AI service agent. The agent is given a domain policy and a fixed tool set, and it must satisfy the customer’s request — booking a flight, troubleshooting a phone, returning an order — by calling backend tools while staying inside policy. A user simulator plays the customer across many turns, supplying details only when asked, changing its mind, and pushing back, so the agent has to drive a real conversation rather than answer a single prompt. Three properties make the tasks hard:

Policy compliance. The agent must obtain explicit confirmation before any write, refuse out-of-policy requests, and transfer to a human when (and only when) a request falls outside its tools.
Multi-turn user simulation. Customers reveal intent gradually, batch several requests together, and frequently switch to a fallback goal when the first one is not allowed.
Stateful environments. Tools read and write a shared database, so a correct final answer requires a correct sequence of actions, not just a correct last message.
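
The confirmation rule in the first property can be pictured as a guard in front of every state-changing tool. The snippet below is a minimal sketch, not the dataset's implementation: the `WRITE_TOOLS` set and the `guard_tool_call` helper are hypothetical names introduced for illustration, and only the tool names themselves come from the tool lists on this page.

```python
# Hypothetical guard: write tools are blocked until the user has explicitly confirmed.
# WRITE_TOOLS is an illustrative subset, not the full list of state-changing tools.
WRITE_TOOLS = {
    "book_reservation",
    "update_reservation_flights",
    "cancel_pending_order",
}


def guard_tool_call(tool_name: str, user_confirmed: bool) -> str:
    """Decide whether a proposed tool call may run now.

    Returns "execute" for read-only tools and for confirmed writes,
    and "ask_confirmation" for a write the user has not yet approved.
    """
    if tool_name in WRITE_TOOLS and not user_confirmed:
        return "ask_confirmation"
    return "execute"


print(guard_tool_call("get_reservation_details", user_confirmed=False))
print(guard_tool_call("book_reservation", user_confirmed=False))
```

Read-only lookups pass straight through, while an unconfirmed write is turned into a request for confirmation instead of a database mutation.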

Task categories

Domain	Scenario coverage	Environment	Tools
Airline	Book, modify, and cancel reservations; cabin upgrades; baggage changes; passenger edits; refunds and travel-certificate compensation	Flights, users, and reservations database (4 DB variants)	14
Telecom	Device-level troubleshooting — no-service, slow data, MMS failures, roaming; plan, bill, and data-usage handling; line suspend/resume	Customers, lines, devices, plans, and bills, plus a simulated handset (airplane mode, APN, Wi-Fi, VPN, SIM, reboot)	43
Retail	Order tracking, returns, exchanges, cancellations; address and payment-method changes; product and variant lookups	Products, users, and orders database	16

At a glance

Property	Value
Domains	Airline, Telecom, Retail
Samples	2,250 total — 750 per domain (500 SFT-ready dialogs + 250 RL-only tasks). Part of the 3,000-sample Tau-Bench family.
Task format	Multi-turn dialog driven by a user simulator + agent tool calls against a stateful backend
Environments	MCP-backed databases — flights/reservations, telecom plans + simulated device, orders/products
Agent tools	14 (airline) / 43 (telecom) / 16 (retail) function-calling tools
Grading	Action- and state-level checks against per-task gold evaluation criteria

What’s inside

Component	Description
SFT dialogs	500 per domain. Each is a complete multi-turn trajectory — system policy, user simulator turns, assistant messages with chain-of-thought reasoning, tool calls, and tool results — ready for supervised fine-tuning.
RL tasks	250 per domain. Each ships a task purpose, a user-simulator scenario, the database to mount, and `evaluation_criteria` (the gold action sequence + required communicated info) — RL-trainable without a gold trajectory.
Environments	Self-contained domain databases that the tools read and write. The airline domain ships four DB variants; telecom ships a plans/lines/devices/bills world plus a simulated handset.
Tool schemas	The full function-calling tool set available to the agent in each domain (e.g. `book_reservation`, `update_reservation_flights`; `check_apn_settings`, `reboot_device`; `return_delivered_order_items`, `modify_pending_order_items`).
Evaluation criteria	The reference actions and communicated facts a passing trajectory must produce — usable as a reward signal.
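
To make the RL task description concrete, here is one way such a record could be modeled in Python. This is a sketch under stated assumptions: only `evaluation_criteria` is named on this page, so every other field name (`purpose`, `user_scenario`, `database`, `actions`, `communicate_info`) and all values are hypothetical placeholders, not the shipped schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationCriteria:
    # Gold action sequence; each entry names a tool and its arguments (assumed layout).
    actions: list = field(default_factory=list)
    # Facts the agent must communicate to the user (assumed layout).
    communicate_info: list = field(default_factory=list)


@dataclass
class RLTask:
    purpose: str            # what the task is meant to exercise
    user_scenario: str      # instructions for the user simulator
    database: str           # which environment database to mount
    evaluation_criteria: EvaluationCriteria


# Placeholder values only; they do not come from the dataset.
task = RLTask(
    purpose="Cancel a pending order",
    user_scenario="Customer asks to cancel an order and wants the refund amount stated",
    database="RETAIL_DB_PLACEHOLDER",
    evaluation_criteria=EvaluationCriteria(
        actions=[{"name": "cancel_pending_order",
                  "arguments": {"order_id": "ORDER_ID_PLACEHOLDER"}}],
        communicate_info=["REFUND_AMOUNT_PLACEHOLDER"],
    ),
)
print(task.evaluation_criteria.actions[0]["name"])
```

Note that the record carries no gold dialog: the reward comes from checking a rollout against `evaluation_criteria`, which is what makes the task RL-trainable without a reference trajectory.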

Trajectory length

Tau2-Bench dialogs are genuinely long-horizon. The table below summarizes the SFT dialogs per domain. Messages per dialog (system + user + assistant + tool) and agent tool calls are reported as mean / median / p90; user-simulator turns are reported as a single average.

Domain	Messages/dialog (mean / median / p90)	User turns (average)	Tool calls (mean / median / p90)
Airline	37 / 37 / 49	8.8	11.7 / 11 / 17
Telecom	74 / 72 / 105	29.5	6.2 / 6 / 8
Retail	36 / 34 / 45	9.1	8.7 / 8 / 11

Telecom dialogs are the longest (median 72 messages, ~30 user turns) but use the fewest tool calls — troubleshooting is a slow, conversational back-and-forth where each device check is interpreted before the next.
Airline dialogs make the most tool calls (median 11) — booking and modification require chained lookups (user → reservations → flights) before a single write.
Retail sits in between, with multi-request dialogs that mix lookups, calculations, and writes.
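
For readers who want to reproduce this kind of summary on their own dialogs, the sketch below computes mean / median / p90 for a list of per-dialog counts. It assumes a nearest-rank definition of the 90th percentile; the page does not say which percentile method produced the table, and the `summarize` helper and the sample counts are illustrative only.

```python
from statistics import mean, median


def summarize(counts: list) -> tuple:
    """Return (mean, median, p90) for per-dialog counts.

    p90 uses the nearest-rank method: the smallest value such that at
    least 90% of the observations are less than or equal to it.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    ordered = sorted(counts)
    # ceil(0.9 * n) computed in integer arithmetic to avoid float rounding.
    rank = (9 * len(ordered) + 9) // 10
    return mean(ordered), median(ordered), ordered[rank - 1]


# Illustrative tool-call counts for ten dialogs (not taken from the dataset).
print(summarize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Other percentile conventions (for example linear interpolation) can give slightly different p90 values on small samples.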

Tool usage

Each domain exercises a distinct slice of its tool set, reflecting its source material:

Airline: dominated by `get_reservation_details`, `search_direct_flight`, and `get_flight_status`, with `update_reservation_baggages`, `book_reservation`, and `update_reservation_flights` performing the writes after confirmation.
Telecom: dominated by `get_details_by_id` and `get_customer_by_phone` for identification, then device-state actions like `refuel_data`, `enable_roaming`, and `get_data_usage`; `transfer_to_human_agents` appears when a fix is out of scope.
Retail: dominated by `get_order_details` and `get_product_details`, with `find_user_id_by_email`/`get_user_details` for authentication and `exchange_delivered_order_items`, `return_delivered_order_items`, `modify_user_address`, and `cancel_pending_order` performing the changes.

How challenging is the data

The difficulty is not in any single tool call; it is in doing the whole conversation correctly. A passing trajectory must authenticate the user, gather requirements across many turns, take the exact gold sequence of state-changing actions, obtain confirmation before each write, and refuse or transfer when policy requires it. A single skipped confirmation, an out-of-policy action, or a wrong fallback when the user changes their mind fails the task. Frontier models reliably handle the read-only lookups but stumble on the end-to-end requirements: correct database mutations, full policy adherence, and goal completion, all in one trajectory. The harder, single-domain banking slice is quantified on the Tau3-Bench page, where frontier models pass under 30% of tasks.
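
The pass/fail logic described above can be sketched as a small checker. This is an illustration under stated assumptions, not the dataset's grader: the `passes_task` helper, the dict layout of tool calls, the exact-sequence comparison of writes, and the substring test for communicated information are all hypothetical simplifications.

```python
def passes_task(tool_calls, assistant_messages, gold_actions, required_info, write_tools):
    """Return True if a rollout satisfies both checks in this simplified model.

    1. The state-changing calls, in order, equal the gold action sequence.
    2. Every required fact appears somewhere in the assistant's messages.
    """
    # Keep only state-changing calls; read-only lookups do not affect this check.
    writes = [(c["name"], c["arguments"]) for c in tool_calls if c["name"] in write_tools]
    gold = [(a["name"], a["arguments"]) for a in gold_actions]
    actions_ok = writes == gold

    # Case-insensitive substring match as a stand-in for a real communication check.
    text = " ".join(assistant_messages).lower()
    info_ok = all(item.lower() in text for item in required_info)
    return actions_ok and info_ok


# Placeholder rollout; identifiers and amounts are made up for illustration.
calls = [
    {"name": "get_order_details", "arguments": {"order_id": "X"}},
    {"name": "cancel_pending_order", "arguments": {"order_id": "X"}},
]
gold = [{"name": "cancel_pending_order", "arguments": {"order_id": "X"}}]
print(passes_task(calls, ["Order X is cancelled; the refund is 10.00."],
                  gold, ["10.00"], {"cancel_pending_order"}))
```

Because the write sequence is compared as a whole, an extra, missing, or misparameterized write fails the task even when the final message sounds correct.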

Training utility

Training studies on the Tau2-Bench corpus show that SFT bootstrap + verifiable-reward RL lifts open-weight models to frontier-competitive performance across all three domains, using EigenData’s executable per-task verifiers as the RL reward. Results are reported as Pass^k — the probability that all k independent trials of a task succeed (a strict reliability metric, not one lucky rollout). In every case the RL training tasks are generated from environments whose databases are held out from the benchmark test set (same tool schemas and operational rules, different users / orders / products). The pattern is consistent across domains: SFT delivers the first large jump (broad tool-use competence — tool selection, schema-valid arguments, multi-turn state tracking), and a second RL stage adds further gains that widen at stricter Pass^k — the agent becomes not just more often right, but more reliably right across repeated trials.
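
As a worked illustration of the metric, Pass^k can be estimated from n ≥ k trials per task with the combinatorial estimator C(c, k) / C(n, k), averaged over tasks, where c is the number of successful trials for a task. The sketch below assumes that estimator; the page does not state how the reported numbers were computed, and the `pass_hat_k` helper and the success counts are illustrative.

```python
from math import comb


def pass_hat_k(successes_per_task: list, n_trials: int, k: int) -> float:
    """Estimate Pass^k: the chance that k trials drawn from n all succeed, averaged over tasks.

    successes_per_task[i] is the number of successful trials (out of n_trials) for task i.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) is 0 when c < k, so tasks with fewer than k successes contribute 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Three made-up tasks with 4 trials each: always passes, passes half the time, never passes.
counts = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n_trials=4, k=k), 4))
```

The task that succeeds in 2 of 4 trials contributes 0.5 at k = 1 but only 1/6 at k = 2 and nothing at k = 4, which is why gains that hold up at higher k indicate reliability rather than occasional success.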

The Retail comparison below ranks against the current public τ²-Bench leaderboard. The Airline and Telecom figures were graded under the τ²-bench evaluation harness at the time of measurement; the public board has since tightened its grader (the v1.0.0 “Task Quality” release) and added newer model versions, so treat those two as same-evaluation comparisons rather than current-board rankings.

Airline

Self-evolving synthetic SFT data plus GRPO RL with executable verifiers (From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents), base Qwen3-30B-A3B-Thinking-2507:

Model	Pass^1	Pass^2	Pass^3	Pass^4
Base	56.0	42.7	36.0	32.0
+ SFT	60.0	49.0	42.5	38.0
+ SFT + RL	70.5	61.7	56.0	52.0

Scaling the same recipe to Qwen3-235B-A22B-Thinking-2507 reaches 73.0 Pass^1 — matching Gemini 3.0 Pro (73.0) and exceeding GPT-5 (62.5).

Retail

A targeted error-driven RL recipe (base Qwen3-30B-A3B-Thinking-2507) trains on tasks synthesized from the model’s own recurring post-SFT failures — wrong product variant, mixed item identifiers across orders, missing confirmation or execution steps:

Model	Pass^1	Pass^2	Pass^3	Pass^4
Base	54.2	44.7	39.5	34.2
+ SFT	75.9	64.5	57.5	52.6
+ SFT + error-driven RL	82.5	73.0	66.2	61.4

RL adds +6.6 Pass^1 over SFT (+28.3 over base), and the gain widens at stricter consistency (Pass^4 52.6 → 61.4). Reward design matters: an ablation to a task-success-only reward collapses to 66.7 Pass^1 (the model exploits degenerate tool-call loops), and removing the repeated-tool-call penalty drops Pass^4 to 50.0.

How it ranks

That result places our 30B model second on the current public τ²-Bench Retail leaderboard, behind only a 397B open model and ahead of GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro:

#	Model	Org	Pass^1
1	Qwen3.5-397B-A17B	Alibaba Cloud	84.4
2	Tau2-Bench Retail SFT+RL — Qwen3-30B-A3B-Thinking	EigenAI	82.5
3	GPT-5.2	OpenAI	81.6
4	Claude Opus 4.5	Anthropic	79.6
5	Gemini 3 Flash	Google	76.8
6	Gemini 3 Pro	Google	75.9
7	GLM-5	Zhipu AI	73.7
8	Claude Sonnet 4.5	Anthropic	72.4

In short, a 30B model outranks GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro, and is second only to a model roughly 13× its size.

Telecom

Same self-evolving SFT + RL recipe (From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents), base Qwen3-30B-A3B-Thinking-2507:

Model	Pass^1	Pass^2	Pass^3	Pass^4
Base	28.5	20.2	18.4	—
+ SFT	85.4	78.8	73.5	70.8
+ SFT + RL	95.6	91.8	88.6	86.0

Telecom sees the largest swing — SFT alone lifts Pass^1 from 28.5 to 85.4, and RL pushes it to 95.6 (Pass^4 70.8 → 86.0). Scaling to Qwen3-235B-A22B-Thinking-2507 reaches 98.3 Pass^1, matching or exceeding every frontier model on the leaderboard (Claude Sonnet 4.5 and Gemini 3.0 Pro at 98.0, GPT-5 at 95.8).

Access & licensing

The full Tau2-Bench corpus — all environments, SFT dialogs, RL tasks, tool schemas, and evaluation criteria — is available for commercial licensing, including model training. For licensing, contact support@eigenai.com. A free 10-task-per-domain sample is available now under the CC BY-NC-ND 4.0 license — see Demo Samples.
