Demo Samples

These demo samples are a free 10-task-per-domain slice of the Tau2-Bench dataset. For the complete dataset — its scale, trajectory profile, and difficulty — see the Full Dataset page. The samples below showcase all three domains — airline, telecom, and retail — where the agent must authenticate the customer, drive a multi-turn conversation against a simulated backend, and stay inside a written service policy.

Overview

Property	Value
Domains	Airline, Telecom, Retail
Tasks	10 per domain, with gold evaluation criteria
Turn type	Multi-turn, user-simulator driven
Scenarios	Airline: booking, modification, cancellation, baggage, refunds · Telecom: device troubleshooting, roaming, data, billing · Retail: returns, exchanges, address/payment changes
Grading	Action- and state-level checks against the reference action sequence

Environment

Each domain ships in the tau-bench four-folder layout — intent / datapoints / evaluators / reference_payloads, joined by a shared NNNNNN id (000001–000010). The self-contained backend database the agent reads and writes is embedded in each sample’s reference_payloads/ as a full initial-state snapshot, so every trajectory replays deterministically against the same world.

tau2_bench/
├── tau2-airline/
│   ├── intent/              # 000001.json … : goal, profile, persona, motivations, constraints
│   ├── datapoints/          # SFT-ready multi-turn dialog (+reasoning, tool_calls) + tool_schema
│   ├── evaluators/          # evaluator_NNNNNN.py : action- & state-level grading script
│   └── reference_payloads/  # full backend DB snapshot (initial state) + reference diff
├── tau2-telecom/            # same four folders (plans, lines, devices, bills + simulated handset)
└── tau2-retail/             # same four folders (products, users, orders)

Airline environment — 300 flights, 500 users, 2000 reservations

User entry:

{
  "user_id": "olivia_lewis_66ef61",
  "name": { "first_name": "Olivia", "last_name": "Lewis" },
  "address": { "address1": "456 Oak Street", "address2": "Apt 12", "city": "Chicago", "state": "IL", "zip": "60601" },
  "email": "olivia.lewis@example.com",
  "dob": "1985-07-15",
  "payment_methods": {
    "certificate_7891234": { "source": "certificate", "amount": 150.0 },
    "gift_card_1234567": { "source": "gift_card", "amount": 80.0 }
  }
}

Flight entry:

{
  "flight_number": "HAT001",
  "origin": "PHL",
  "destination": "LGA",
  "dates": {
    "2024-05-16": {
      "status": "available",
      "available_seats": { "basic_economy": 16, "economy": 10, "business": 13 },
      "prices": { "basic_economy": 87, "economy": 122, "business": 471 }
    }
  }
}

Cabin classes are basic economy, economy, and business; membership levels are regular, silver, and gold. The agent must keep full price precision and obtain confirmation before any booking, modification, or cancellation.
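
To make the flight entry concrete, here is a minimal sketch of reading a fare out of it. The `quote` helper is hypothetical and not part of the benchmark's tool set; treating `status == "available"` and the seat counts as booking preconditions is an assumption. Prices are returned exactly as stored, with no rounding.

```python
# Flight entry copied from the sample above.
flight = {
    "flight_number": "HAT001",
    "origin": "PHL",
    "destination": "LGA",
    "dates": {
        "2024-05-16": {
            "status": "available",
            "available_seats": {"basic_economy": 16, "economy": 10, "business": 13},
            "prices": {"basic_economy": 87, "economy": 122, "business": 471},
        }
    },
}


def quote(flight: dict, date: str, cabin: str, passengers: int = 1):
    """Hypothetical helper: total fare, or None if the request cannot be met."""
    day = flight["dates"].get(date)
    if day is None or day["status"] != "available":
        return None  # no such date, or the flight is not bookable (assumed rule)
    if day["available_seats"].get(cabin, 0) < passengers:
        return None  # not enough seats left in this cabin (assumed rule)
    return day["prices"][cabin] * passengers  # stored price, unrounded


total = quote(flight, "2024-05-16", "economy", passengers=2)
print(total)
```

Returning `None` instead of raising keeps the sketch short; a real agent would surface the reason to the customer before asking for confirmation.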

Telecom environment — plans, lines, devices, customers, bills + a simulated handset

Plan entry (telecom_db.toml):

[[plans]]
plan_id = "P1002"
name = "Premium Plan"
data_limit_gb = 15.0
price_per_month = 65.0
data_refueling_price_per_gb = 2.0

Device entry:

[[devices]]
device_id = "D1001"
device_type = "phone"
model = "Smartphone X"
imei = "123456789012345"

Beyond the account database (plans, lines, customers, bills), telecom uniquely exposes a simulated handset the agent inspects and reconfigures: airplane mode, network mode, APN settings, Wi-Fi and Wi-Fi calling, VPN, SIM status, data saver, app permissions, and reboot. Troubleshooting an MMS failure or “No Service” means running this device through a diagnostic sequence.

Retail environment — 190 products, 490 users, 2000 orders

Product entry:

{
  "name": "Laptop",
  "product_id": "4760268021",
  "variants": {
    "2216662955": {
      "item_id": "2216662955",
      "options": { "screen size": "15-inch", "processor": "i5", "ram": "32GB", "storage": "256GB SSD", "color": "space grey" },
      "available": true,
      "price": 2520.52
    }
  }
}

Order entry:

{
  "order_id": "#W8797321",
  "user_id": "omar_johnson_2562",
  "status": "pending",
  "items": [
    { "name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31 },
    { "name": "Laptop", "item_id": "3478699712", "price": 2291.87 }
  ],
  "payment_history": [
    { "transaction_type": "payment", "amount": 2583.18, "payment_method_id": "gift_card_9532915" }
  ]
}

Order statuses include pending, processed, delivered, and cancelled, and the available action depends on the status: pending orders can be modified or cancelled, and delivered orders can be returned or exchanged.
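
The status rule can be sketched as a lookup table. The names `ALLOWED_ACTIONS`, `can_perform`, and `items_total` are illustrative, not the benchmark's API, and statuses the text does not mention as actionable are assumed to allow no changes. The second helper checks that the item prices in the sample order add up to the recorded payment, using `Decimal` to avoid floating-point drift.

```python
from decimal import Decimal

# Status -> permitted actions, as described in the text. Statuses missing
# from this table are assumed (not documented) to permit nothing.
ALLOWED_ACTIONS = {
    "pending": {"modify", "cancel"},
    "delivered": {"return", "exchange"},
}

# Order entry copied from the sample above.
order = {
    "order_id": "#W8797321",
    "user_id": "omar_johnson_2562",
    "status": "pending",
    "items": [
        {"name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31},
        {"name": "Laptop", "item_id": "3478699712", "price": 2291.87},
    ],
    "payment_history": [
        {"transaction_type": "payment", "amount": 2583.18,
         "payment_method_id": "gift_card_9532915"}
    ],
}


def can_perform(order: dict, action: str) -> bool:
    """True if the order's current status permits the action."""
    return action in ALLOWED_ACTIONS.get(order["status"], set())


def items_total(order: dict) -> Decimal:
    """Sum item prices exactly; Decimal(str(x)) keeps the printed cents."""
    return sum((Decimal(str(i["price"])) for i in order["items"]), Decimal("0"))


print(can_perform(order, "cancel"), can_perform(order, "return"))
print(items_total(order))
```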

Data

Each task is described across four parallel files that share an NNNNNN id:

Folder	File	Description
`intent/`	`000001.json`	The task: customer `goal`, `profile`, `persona`, `motivations`, and hard `constraints` (the user-simulator scenario)
`datapoints/`	`000001.json`	The SFT-ready multi-turn dialog — system policy, user-simulator turns, assistant messages with reasoning, tool calls, and tool results — plus the domain `tool_schema`
`evaluators/`	`evaluator_000001.py`	The per-task grading script — action- and state-level checks against the reference action sequence
`reference_payloads/`	`reference_payload_000001.json`	The full backend database snapshot (initial state) plus the reference diff for deterministic grading
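
Because the four files share one zero-padded id, locating everything for a task is a matter of string formatting. The following is a minimal sketch; `task_files` is a hypothetical helper, and the root directory name assumes the data was downloaded into `tau2_bench/` as in the tree above.

```python
from pathlib import Path


def task_files(root: str, domain: str, task_number: int) -> dict[str, Path]:
    """Hypothetical helper: the four per-task paths that share one id."""
    task_id = f"{task_number:06d}"  # 1 -> "000001"
    base = Path(root) / domain
    return {
        "intent": base / "intent" / f"{task_id}.json",
        "datapoint": base / "datapoints" / f"{task_id}.json",
        "evaluator": base / "evaluators" / f"evaluator_{task_id}.py",
        "reference_payload": base / "reference_payloads"
        / f"reference_payload_{task_id}.json",
    }


paths = task_files("tau2_bench", "tau2-airline", 1)
for kind, path in paths.items():
    print(kind, path.as_posix())
```

The function only builds paths and does not touch the filesystem, so it can be used to check which files exist before loading them with `json.load`.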

Sample

Airline — sample tasks

Scenario
Change reservation HKEG34 to the best nonstop option, then add two checked bags
Upgrade cabin from basic economy → business on an existing reservation
Shift all outbound legs to depart after 10:00 while keeping the return unchanged
Move reservation SGTEEY to the earliest LGA→PHL flight before May 20
Correct a passenger name and add checked baggage
Book the cheapest direct morning DEN→ORD one-way in economy, 1 bag, with travel insurance
Upgrade from basic economy → economy on an upcoming round trip

Telecom — sample tasks

Scenario
Cannot send MMS for the past few hours — diagnose and successfully send a picture message
Phone shows “No Service” — restore connectivity
Mobile data drops or is very slow — must reach excellent speed, no Wi-Fi available
Roaming not working abroad — enable roaming and confirm data works
Refuel data after hitting the plan limit
Billing and data-usage inquiry, with line suspend/resume on request

Retail — sample tasks

Scenario
Swap a basketball variant (leather → composite, same size) on a pending order, pay the difference on file
Cancel order #W6175707; if shipped, modify the suitcase to a larger soft-shell instead
Update an order’s address; on learning it’s delivered, cancel it instead
Update a pending order’s shipping address, then switch its payment method
Return a Duvet; realize the mix-up and return a Pillow from a different order instead
Change a moisturizer to a different skin type; exchange a scarf for a shorter length
Cancel an order; if delivered, return the bamboo cutting board for a refund

These scenarios deliberately include fallback intents — the customer’s first request is often out of policy (e.g. swapping to a completely different product, cancelling a shipped order), and the agent must recognize this and pivot to the allowed alternative.

Full retail dialog (29 messages, 8 tool calls) — address update + product lookups + policy-correct transfer

Each assistant message includes reasoning (chain-of-thought) showing the agent’s step-by-step decisions. Reasoning is shown in italics below.

User:

Hi there! I need to update my shipping address in my account. I’d like to change it to 123 Oak Street, Apartment 5B, Dallas, Texas 75201. Can you help me with that?

Assistant:

*Per policy, I must authenticate the user before changing anything on their account: locate their user id via email, or name + zip.*

I’d be happy to help update your default shipping address. But first I need to verify your identity. Could you provide the email address on your account?

User:

Sure! My email address is amara.diallo5790@example.com.

Assistant — authenticate:

find_user_id_by_email({"email": "amara.diallo5790@example.com"})

"amara_diallo_2400"

Great, you’re verified. Just to confirm, I’ll set your default shipping address to 123 Oak Street, Apt 5B, Dallas, TX 75201 — shall I go ahead?

User:

Yes, that’s correct! Please go ahead and update it. Also, I had a cancelled order #W0877487 for a Bird Feeder — I’m curious what the price difference would be for a different variant.

Assistant — update address, then look up the order and product (the user confirmed the write):

modify_user_address({"user_id": "amara_diallo_2400", "address1": "123 Oak Street", "address2": "Apartment 5B", "city": "Dallas", "state": "TX", "country": "USA", "zip": "75201"})
get_order_details({"order_id": "#W0877487"})
get_product_details({"product_id": "8222567674"})
calculate({"expression": "15.56 - 21.32"})

Done — your default address is now 123 Oak Street, Apt 5B, Dallas, TX 75201. On the Bird Feeder, the variant you mentioned is $5.76 cheaper than the one on the cancelled order.

User:

Could you tell me about Tennis Rackets too? And could you place a new order for the Bird Feeder variant and a Tennis Racket?

Assistant — provide product info:

list_all_product_types({})
get_product_details({"product_id": "2751056608"})

Here are the available Tennis Racket variants and prices… (lists variants)

User:

That’s disappointing you can’t place the order. Is there someone else who could help me place it?

Assistant:

*Policy is explicit: I can cancel/modify pending orders, return/exchange delivered orders, and modify the default address, but I cannot place a brand-new order. The request is outside my scope, so I transfer to a human.*

transfer_to_human_agents({"summary": "Customer Amara Diallo wants to place a new order for a Bird Feeder (platform/metal/6lb) and a Tennis Racket."})

YOU ARE BEING TRANSFERRED TO A HUMAN AGENT. PLEASE HOLD ON.

This trajectory shows the full pattern: authenticate → confirm before writing → answer read-only questions → refuse-and-transfer when a request (placing a new order) falls outside the agent’s tools.
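
One part of that pattern, authentication before any write, can be checked mechanically over the tool-call sequence. The sketch below lists the eight calls from the dialog above in order. Treating `find_user_id_by_email` as the authentication step follows the dialog; classifying a tool as a write when its name starts with `modify_` is an assumption made for this example, not a rule stated by the dataset.

```python
# The eight tool calls from the retail dialog, in order.
TOOL_CALLS = [
    "find_user_id_by_email",
    "modify_user_address",
    "get_order_details",
    "get_product_details",
    "calculate",
    "list_all_product_types",
    "get_product_details",
    "transfer_to_human_agents",
]

AUTH_TOOLS = {"find_user_id_by_email"}


def is_write(name: str) -> bool:
    """Assumed naming convention: tools that mutate state start with 'modify_'."""
    return name.startswith("modify_")


def authenticated_before_writes(calls: list[str]) -> bool:
    """True if no write tool is called before an authentication tool."""
    authenticated = False
    for name in calls:
        if name in AUTH_TOOLS:
            authenticated = True
        elif is_write(name) and not authenticated:
            return False
    return True


print(len(TOOL_CALLS), authenticated_before_writes(TOOL_CALLS))
```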

RL evaluation criteria (airline cabin upgrade)

RL tasks ship a user-simulator scenario and the gold action sequence a passing trajectory must reproduce.

User-simulator scenario:

{
  "task_instructions": "You are Noah Khan (user_id: noah_khan_8166) and you want to upgrade your upcoming economy flight (SFO → MCO on 2024-05-26) to business class. Provide reservation_id 'JU4FIC' and payment 'credit_card_5669132' when asked, and confirm 'yes' before the change is made.",
  "reason_for_call": "Upgrade cabin class from economy to business for existing reservation"
}

Evaluation criteria (gold actions):

{
  "actions": [
    { "name": "get_user_details", "arguments": { "user_id": "noah_khan_8166" } },
    { "name": "get_reservation_details", "arguments": { "reservation_id": "JU4FIC" } },
    { "name": "search_onestop_flight", "arguments": { "date": "2024-05-26", "origin": "SFO", "destination": "MCO" } },
    { "name": "update_reservation_flights", "arguments": {
        "reservation_id": "JU4FIC", "cabin": "business", "payment_id": "credit_card_5669132",
        "flights": [ { "flight_number": "HAT144", "price": 419 }, { "flight_number": "HAT181", "price": 216 } ] } }
  ],
  "communicate_info": []
}

A trajectory passes when it executes the gold state-changing actions (with matching arguments) and communicates any required facts — graded against this reference rather than a free-text rubric.
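
The action half of that rule can be sketched as follows. This is not the shipped `evaluator_NNNNNN.py`; `passes` is a hypothetical function, the set of state-changing tool names is passed in as an assumption, and the check ignores call order and `communicate_info` (which is empty for this task).

```python
# The state-changing gold action from the criteria above.
GOLD_UPDATE = {
    "name": "update_reservation_flights",
    "arguments": {
        "reservation_id": "JU4FIC",
        "cabin": "business",
        "payment_id": "credit_card_5669132",
        "flights": [
            {"flight_number": "HAT144", "price": 419},
            {"flight_number": "HAT181", "price": 216},
        ],
    },
}
GOLD_ACTIONS = [
    {"name": "get_user_details", "arguments": {"user_id": "noah_khan_8166"}},
    {"name": "get_reservation_details", "arguments": {"reservation_id": "JU4FIC"}},
    GOLD_UPDATE,
]


def passes(gold_actions, trajectory_calls, state_changing) -> bool:
    """True if each gold state-changing action occurs with identical arguments."""
    required = [a for a in gold_actions if a["name"] in state_changing]
    return all(
        any(c["name"] == a["name"] and c["arguments"] == a["arguments"]
            for c in trajectory_calls)
        for a in required
    )


WRITES = {"update_reservation_flights"}  # assumed classification

good_run = [GOLD_ACTIONS[1], GOLD_UPDATE]
bad_run = [{"name": "update_reservation_flights",
            "arguments": {**GOLD_UPDATE["arguments"], "cabin": "economy"}}]

print(passes(GOLD_ACTIONS, good_run, WRITES), passes(GOLD_ACTIONS, bad_run, WRITES))
```

Read-only gold actions are skipped here, so a trajectory that omits a lookup still passes as long as the write matches.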

Download

# Download all Tau2-Bench demo data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau2_bench/*"

# Download a specific domain
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau2_bench/tau2-airline/*"

Browse on Hugging Face

View Tau2-Bench files

For the complete Tau2-Bench corpus — all environments, SFT dialogs, RL tasks, and commercial licensing — see the Full Dataset page. For the harder single-domain banking slice, see Tau3-Bench.
