These demo samples are a free 10-task-per-domain slice of the Tau2-Bench dataset. For the complete dataset — its scale, trajectory profile, and difficulty — see the Full Dataset page. The samples below showcase all three domains — airline, telecom, and retail — where the agent must authenticate the customer, drive a multi-turn conversation against a simulated backend, and stay inside a written service policy.

Overview

| Property | Value |
|---|---|
| Domains | Airline, Telecom, Retail |
| Tasks | 10 per domain, with gold evaluation criteria |
| Turn type | Multi-turn, user-simulator driven |
| Scenarios | Airline: booking, modification, cancellation, baggage, refunds · Telecom: device troubleshooting, roaming, data, billing · Retail: returns, exchanges, address/payment changes |
| Grading | Action- and state-level checks against the reference action sequence |

Environment

Each domain ships in the tau-bench four-folder layout (intent / datapoints / evaluators / reference_payloads), joined by a shared NNNNNN id (000001 to 000010). The self-contained backend database the agent reads and writes is embedded in each sample’s reference_payloads/ as a full initial-state snapshot, so every trajectory replays deterministically against the same world.
tau2_bench/
├── tau2-airline/
│   ├── intent/              # 000001.json … : goal, profile, persona, motivations, constraints
│   ├── datapoints/          # SFT-ready multi-turn dialog (+reasoning, tool_calls) + tool_schema
│   ├── evaluators/          # evaluator_NNNNNN.py : action- & state-level grading script
│   └── reference_payloads/  # full backend DB snapshot (initial state) + reference diff
├── tau2-telecom/            # same four folders (plans, lines, devices, bills + simulated handset)
└── tau2-retail/             # same four folders (products, users, orders)
User entry:
{
  "user_id": "olivia_lewis_66ef61",
  "name": { "first_name": "Olivia", "last_name": "Lewis" },
  "address": { "address1": "456 Oak Street", "address2": "Apt 12", "city": "Chicago", "state": "IL", "zip": "60601" },
  "email": "olivia.lewis@example.com",
  "dob": "1985-07-15",
  "payment_methods": {
    "certificate_7891234": { "source": "certificate", "amount": 150.0 },
    "gift_card_1234567": { "source": "gift_card", "amount": 80.0 }
  }
}
Flight entry:
{
  "flight_number": "HAT001",
  "origin": "PHL",
  "destination": "LGA",
  "dates": {
    "2024-05-16": {
      "status": "available",
      "available_seats": { "basic_economy": 16, "economy": 10, "business": 13 },
      "prices": { "basic_economy": 87, "economy": 122, "business": 471 }
    }
  }
}
Cabin classes are basic economy, economy, and business; membership levels are regular, silver, and gold. The agent must keep full price precision and obtain confirmation before any booking, modification, or cancellation.
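As a rough illustration of how the flight entry above can be read, the sketch below looks up a cabin on a given date, checks seat availability, and totals the fare with the listed integer prices. The helper name `quote_fare` is hypothetical and is not one of the dataset's tools.

```python
from typing import Optional

# Flight entry copied from the sample above.
flight = {
    "flight_number": "HAT001",
    "origin": "PHL",
    "destination": "LGA",
    "dates": {
        "2024-05-16": {
            "status": "available",
            "available_seats": {"basic_economy": 16, "economy": 10, "business": 13},
            "prices": {"basic_economy": 87, "economy": 122, "business": 471},
        }
    },
}


def quote_fare(flight: dict, date: str, cabin: str, passengers: int) -> Optional[int]:
    """Return the total fare, or None if the flight cannot seat the party.

    Hypothetical helper for illustration only.
    """
    day = flight["dates"].get(date)
    if day is None or day.get("status") != "available":
        return None  # no flight on that date, or not open for booking
    if day["available_seats"].get(cabin, 0) < passengers:
        return None  # not enough seats left in the requested cabin
    return day["prices"][cabin] * passengers


print(quote_fare(flight, "2024-05-16", "economy", 2))   # 244
print(quote_fare(flight, "2024-05-16", "economy", 11))  # None
```

Returning `None` instead of raising keeps the "cannot book" case explicit, which mirrors the situations where the agent has to offer an alternative rather than proceed.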
Plan entry (telecom_db.toml):
[[plans]]
plan_id = "P1002"
name = "Premium Plan"
data_limit_gb = 15.0
price_per_month = 65.0
data_refueling_price_per_gb = 2.0
Device entry:
[[devices]]
device_id = "D1001"
device_type = "phone"
model = "Smartphone X"
imei = "123456789012345"
Beyond the account database (plans, lines, customers, bills), telecom uniquely exposes a simulated handset the agent inspects and reconfigures: airplane mode, network mode, APN settings, Wi-Fi and Wi-Fi calling, VPN, SIM status, data saver, app permissions, and reboot. Troubleshooting an MMS failure or “No Service” means running this device through a diagnostic sequence.
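To make the plan fields concrete, here is a minimal sketch of how a data refuel could be priced from the plan entry above. It assumes the TOML entry has already been parsed into a dict; the helper names `is_over_limit` and `refuel_cost` are hypothetical and not part of the dataset's tool set.

```python
# Plan entry from the sample above, as it would look once parsed into a dict.
plan = {
    "plan_id": "P1002",
    "name": "Premium Plan",
    "data_limit_gb": 15.0,
    "price_per_month": 65.0,
    "data_refueling_price_per_gb": 2.0,
}


def is_over_limit(plan: dict, used_gb: float) -> bool:
    """True once usage has reached the plan's monthly data limit."""
    return used_gb >= plan["data_limit_gb"]


def refuel_cost(plan: dict, gb: float) -> float:
    """Price of adding `gb` gigabytes at the plan's per-GB refuel rate."""
    if gb <= 0:
        raise ValueError("refuel amount must be positive")
    return plan["data_refueling_price_per_gb"] * gb


print(is_over_limit(plan, 15.2))  # True
print(refuel_cost(plan, 3.0))     # 6.0
```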
Product entry:
{
  "name": "Laptop",
  "product_id": "4760268021",
  "variants": {
    "2216662955": {
      "item_id": "2216662955",
      "options": { "screen size": "15-inch", "processor": "i5", "ram": "32GB", "storage": "256GB SSD", "color": "space grey" },
      "available": true,
      "price": 2520.52
    }
  }
}
Order entry:
{
  "order_id": "#W8797321",
  "user_id": "omar_johnson_2562",
  "status": "pending",
  "items": [
    { "name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31 },
    { "name": "Laptop", "item_id": "3478699712", "price": 2291.87 }
  ],
  "payment_history": [
    { "transaction_type": "payment", "amount": 2583.18, "payment_method_id": "gift_card_9532915" }
  ]
}
Orders move from pending to processed to delivered, or end up cancelled, and the available action depends on status: pending orders can be modified or cancelled, delivered orders can be returned or exchanged.
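The status rule above can be sketched as a small lookup table. The action labels (`modify`, `cancel`, `return`, `exchange`) are illustrative names rather than the dataset's tool names, and the total check simply confirms that the sample order's item prices add up to the recorded payment.

```python
from decimal import Decimal

# Status -> actions permitted by the rule described above (labels are illustrative).
ALLOWED_ACTIONS = {
    "pending": {"modify", "cancel"},
    "delivered": {"return", "exchange"},
}

# Order entry copied from the sample above.
order = {
    "order_id": "#W8797321",
    "user_id": "omar_johnson_2562",
    "status": "pending",
    "items": [
        {"name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31},
        {"name": "Laptop", "item_id": "3478699712", "price": 2291.87},
    ],
    "payment_history": [
        {"transaction_type": "payment", "amount": 2583.18, "payment_method_id": "gift_card_9532915"}
    ],
}


def allowed_actions(order: dict) -> set:
    """Actions available for the order's current status (empty set if none)."""
    return ALLOWED_ACTIONS.get(order["status"], set())


def items_total(order: dict) -> Decimal:
    """Sum item prices with Decimal so cents are not lost to float rounding."""
    return sum((Decimal(str(item["price"])) for item in order["items"]), Decimal("0"))


print(sorted(allowed_actions(order)))  # ['cancel', 'modify']
print(items_total(order))              # 2583.18
```

Using `Decimal` here is one way to honor the requirement to keep full price precision; plain floats would also work for display but can drift by fractions of a cent when summed.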

Data

Each task is described across four parallel files that share an NNNNNN id:
| Folder | File | Description |
|---|---|---|
| intent/ | 000001.json | The task: customer goal, profile, persona, motivations, and hard constraints (the user-simulator scenario) |
| datapoints/ | 000001.json | The SFT-ready multi-turn dialog (system policy, user-simulator turns, assistant messages with reasoning, tool calls, and tool results) plus the domain tool_schema |
| evaluators/ | evaluator_000001.py | The per-task grading script: action- and state-level checks against the reference action sequence |
| reference_payloads/ | reference_payload_000001.json | The full backend database snapshot (initial state) plus the reference diff for deterministic grading |
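Because the four files share one id, locating everything for a task is mostly string formatting. The sketch below builds the four paths from a domain directory and a task number, using the file-name patterns listed above; `task_files` is a hypothetical helper, and it does not check that the files exist.

```python
from pathlib import Path


def task_files(domain_dir: str, task_number: int) -> dict:
    """Map each folder role to the path of the file for one task."""
    task_id = f"{task_number:06d}"  # zero-padded NNNNNN id, e.g. 000001
    root = Path(domain_dir)
    return {
        "intent": root / "intent" / f"{task_id}.json",
        "datapoint": root / "datapoints" / f"{task_id}.json",
        "evaluator": root / "evaluators" / f"evaluator_{task_id}.py",
        "reference_payload": root / "reference_payloads" / f"reference_payload_{task_id}.json",
    }


files = task_files("tau2_bench/tau2-airline", 1)
for role, path in files.items():
    print(role, path.as_posix())
```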

Sample

Airline task scenarios:
Change reservation HKEG34 to the best nonstop option, then add two checked bags
Upgrade cabin from basic economy → business on an existing reservation
Shift all outbound legs to depart after 10:00 while keeping the return unchanged
Move reservation SGTEEY to the earliest LGA→PHL flight before May 20
Correct a passenger name and add checked baggage
Book the cheapest direct morning DEN→ORD one-way in economy, 1 bag, with travel insurance
Upgrade from basic economy → economy on an upcoming round trip
Telecom task scenarios:
Cannot send MMS for the past few hours — diagnose and successfully send a picture message
Phone shows “No Service” — restore connectivity
Mobile data drops or is very slow — must reach excellent speed, no Wi-Fi available
Roaming not working abroad — enable roaming and confirm data works
Refuel data after hitting the plan limit
Billing and data-usage inquiry, with line suspend/resume on request
Retail task scenarios:
Swap a basketball variant (leather → composite, same size) on a pending order, pay the difference on file
Cancel order #W6175707; if shipped, modify the suitcase to a larger soft-shell instead
Update an order’s address; on learning it’s delivered, cancel it instead
Update a pending order’s shipping address, then switch its payment method
Return a Duvet; realize the mix-up and return a Pillow from a different order instead
Change a moisturizer to a different skin type; exchange a scarf for a shorter length
Cancel an order; if delivered, return the bamboo cutting board for a refund
These scenarios deliberately include fallback intents — the customer’s first request is often out of policy (e.g. swapping to a completely different product, cancelling a shipped order), and the agent must recognize this and pivot to the allowed alternative.
Each assistant message includes reasoning (chain-of-thought) showing the agent’s step-by-step decisions. Reasoning is shown in italics below.
User:
Hi there! I need to update my shipping address in my account. I’d like to change it to 123 Oak Street, Apartment 5B, Dallas, Texas 75201. Can you help me with that?
Assistant:
*Per policy, I must authenticate the user before changing anything on their account: locate their user id via email, or name + zip.*
I’d be happy to help update your default shipping address. But first I need to verify your identity. Could you provide the email address on your account?
User:
Sure! My email address is amara.diallo5790@example.com.
Assistant — authenticate:
find_user_id_by_email({"email": "amara.diallo5790@example.com"})
"amara_diallo_2400"
Great, you’re verified. Just to confirm, I’ll set your default shipping address to 123 Oak Street, Apt 5B, Dallas, TX 75201 — shall I go ahead?
User:
Yes, that’s correct! Please go ahead and update it. Also, I had a cancelled order #W0877487 for a Bird Feeder — I’m curious what the price difference would be for a different variant.
Assistant — update address, then look up the order and product (the user confirmed the write):
modify_user_address({"user_id": "amara_diallo_2400", "address1": "123 Oak Street", "address2": "Apartment 5B", "city": "Dallas", "state": "TX", "country": "USA", "zip": "75201"})
get_order_details({"order_id": "#W0877487"})
get_product_details({"product_id": "8222567674"})
calculate({"expression": "15.56 - 21.32"})
Done — your default address is now 123 Oak Street, Apt 5B, Dallas, TX 75201. On the Bird Feeder, the variant you mentioned is $5.76 cheaper than the one on the cancelled order.
User:
Could you tell me about Tennis Rackets too? And could you place a new order for the Bird Feeder variant and a Tennis Racket?
Assistant — provide product info:
list_all_product_types({})
get_product_details({"product_id": "2751056608"})
Here are the available Tennis Racket variants and prices… (lists variants)
User:
That’s disappointing you can’t place the order. Is there someone else who could help me place it?
Assistant:
*Policy is explicit: I can cancel/modify pending orders, return/exchange delivered orders, and modify the default address, but I cannot place a brand-new order. The request is outside my scope, so I transfer to a human.*
transfer_to_human_agents({"summary": "Customer Amara Diallo wants to place a new order for a Bird Feeder (platform/metal/6lb) and a Tennis Racket."})
YOU ARE BEING TRANSFERRED TO A HUMAN AGENT. PLEASE HOLD ON.
This trajectory shows the full pattern: authenticate → confirm before writing → answer read-only questions → refuse-and-transfer when a request (placing a new order) falls outside the agent’s tools.
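One part of this pattern, "no write before authentication", is easy to check mechanically over the ordered tool-call names. The sketch below is an illustration only: which tools count as authentication or as writes is an assumption made here for the example, not something the dataset labels.

```python
def writes_follow_authentication(calls, auth_tools, write_tools):
    """True if no write tool is called before the first authentication tool."""
    authenticated = False
    for name in calls:
        if name in auth_tools:
            authenticated = True
        elif name in write_tools and not authenticated:
            return False  # a write happened before the user was verified
    return True


# Tool-call order taken from the trajectory above.
calls = [
    "find_user_id_by_email",
    "modify_user_address",
    "get_order_details",
    "get_product_details",
    "calculate",
    "list_all_product_types",
    "get_product_details",
    "transfer_to_human_agents",
]

# Assumed classification, for illustration only.
AUTH_TOOLS = {"find_user_id_by_email"}
WRITE_TOOLS = {"modify_user_address"}

print(writes_follow_authentication(calls, AUTH_TOOLS, WRITE_TOOLS))                    # True
print(writes_follow_authentication(["modify_user_address"], AUTH_TOOLS, WRITE_TOOLS))  # False
```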
RL tasks ship a user-simulator scenario and the gold action sequence a passing trajectory must reproduce.
User-simulator scenario:
{
  "task_instructions": "You are Noah Khan (user_id: noah_khan_8166) and you want to upgrade your upcoming economy flight (SFO → MCO on 2024-05-26) to business class. Provide reservation_id 'JU4FIC' and payment 'credit_card_5669132' when asked, and confirm 'yes' before the change is made.",
  "reason_for_call": "Upgrade cabin class from economy to business for existing reservation"
}
Evaluation criteria (gold actions):
{
  "actions": [
    { "name": "get_user_details", "arguments": { "user_id": "noah_khan_8166" } },
    { "name": "get_reservation_details", "arguments": { "reservation_id": "JU4FIC" } },
    { "name": "search_onestop_flight", "arguments": { "date": "2024-05-26", "origin": "SFO", "destination": "MCO" } },
    { "name": "update_reservation_flights", "arguments": {
        "reservation_id": "JU4FIC", "cabin": "business", "payment_id": "credit_card_5669132",
        "flights": [ { "flight_number": "HAT144", "price": 419 }, { "flight_number": "HAT181", "price": 216 } ] } }
  ],
  "communicate_info": []
}
A trajectory passes when it executes the gold state-changing actions (with matching arguments) and communicates any required facts — graded against this reference rather than a free-text rubric.
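The action-matching half of that rule can be sketched in a few lines. This is a simplified illustration, not the shipped evaluator: it ignores state-level checks and `communicate_info`, and the set of state-changing action names is an assumption supplied by the caller.

```python
def passes(trajectory_calls, gold_actions, write_action_names):
    """True if every gold state-changing action appears in the trajectory
    with exactly matching arguments. Read-only gold actions are not required."""
    required = [a for a in gold_actions if a["name"] in write_action_names]
    return all(
        any(c["name"] == g["name"] and c["arguments"] == g["arguments"]
            for c in trajectory_calls)
        for g in required
    )


# Two of the gold actions from the example above.
gold_actions = [
    {"name": "get_reservation_details", "arguments": {"reservation_id": "JU4FIC"}},
    {"name": "update_reservation_flights", "arguments": {
        "reservation_id": "JU4FIC", "cabin": "business", "payment_id": "credit_card_5669132",
        "flights": [{"flight_number": "HAT144", "price": 419},
                    {"flight_number": "HAT181", "price": 216}]}},
]

# Assumed for illustration: only this action changes backend state.
WRITE_ACTIONS = {"update_reservation_flights"}

good_run = [dict(a) for a in gold_actions]
bad_run = [{"name": "update_reservation_flights",
            "arguments": {**gold_actions[1]["arguments"], "cabin": "economy"}}]

print(passes(good_run, gold_actions, WRITE_ACTIONS))  # True
print(passes(bad_run, gold_actions, WRITE_ACTIONS))   # False
```

Exact dictionary equality is deliberately strict: a write with the right tool but a different cabin, price, or payment method counts as a miss.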

Download

# Download all Tau2-Bench demo data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau2_bench/*"

# Download a specific domain
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau2_bench/tau2-airline/*"

Browse on Hugging Face

View Tau2-Bench files
For the complete Tau2-Bench corpus — all environments, SFT dialogs, RL tasks, and commercial licensing — see the Full Dataset page. For the harder single-domain banking slice, see Tau3-Bench.