Tau2-Bench

This demo contains multi-turn function-calling dialogs generated for the Tau2-Bench benchmark. Each sub-domain simulates a realistic customer service environment with tool calls to backend APIs.

Overview

Property	Airline	Retail	Banking
Samples	10	10	10
Turn type	Multi-turn	Multi-turn	Multi-turn
Scenarios	Booking, cancellation, baggage, refunds, flight changes	Order tracking, returns, exchanges, address changes, product upgrades	Account management, transfers, disputes, credit cards, loan inquiries

Environment

Each sub-domain includes a reference_payloads/ directory containing MCP server state snapshots — the full database state used during generation. These represent the simulated backend that the agent interacts with via tool calls.

tau2-bench/
├── tau2-airline/
│   └── reference_payloads/    # MCP server state (flights, reservations, users)
├── tau2-retail/
│   └── reference_payloads/    # MCP server state (orders, products, customers)
└── tau3-banking/
    └── reference_payloads/    # MCP server state (accounts, transactions, credit cards)

Airline environment — 300 flights, 500 users, 2000 reservations

User entry:

{
  "user_id": "emma_kim_9957",
  "name": { "first_name": "Emma", "last_name": "Kim" },
  "email": "emma.kim3947@example.com",
  "dob": "1977-09-23",
  "membership": "gold",
  "payment_methods": {
    "credit_card_5832574": { "source": "credit_card", "last_four": "5241", "brand": "visa" },
    "gift_card_9562694": { "source": "gift_card", "amount": 1114 }
  },
  "saved_passengers": [
    { "first_name": "Mason", "last_name": "Gonzalez", "dob": "1952-04-21" }
  ]
}

Flight entry:

{
  "origin": "PHL",
  "destination": "LGA",
  "flight_number": "HAT001",
  "scheduled_departure_time_est": "06:00:00",
  "scheduled_arrival_time_est": "07:00:00",
  "dates": {
    "2024-05-16": {
      "status": "available",
      "available_seats": { "basic_economy": 16, "economy": 10, "business": 13 },
      "prices": { "basic_economy": 87, "economy": 122, "business": 471 }
    }
  }
}

Reservation entry:

{
  "reservation_id": "4WQ150",
  "user_id": "chen_jackson_3290",
  "origin": "DFW",
  "destination": "LAX",
  "flight_type": "round_trip",
  "cabin": "business",
  "passengers": [ /* passenger list */ ],
  "payment_history": [
    { "payment_id": "gift_card_3576581", "amount": 4986 }
  ]
}

Retail environment — 50 products, 500 users, 1000 orders

Product entry:

{
  "name": "Laptop",
  "product_id": "4760268021",
  "variants": {
    "2216662955": {
      "item_id": "2216662955",
      "options": { "screen size": "15-inch", "processor": "i5", "ram": "32GB", "storage": "256GB SSD", "color": "space grey" },
      "available": true,
      "price": 2520.52
    }
  }
}

User entry:

{
  "user_id": "omar_johnson_2562",
  "name": { "first_name": "Omar", "last_name": "Johnson" },
  "address": { "address1": "349 Cedar Street", "address2": "Suite 322", "city": "Denver", "state": "CO", "zip": "80266" },
  "email": "omar.johnson6791@example.com",
  "payment_methods": {
    "gift_card_9532915": { "source": "gift_card", "balance": 61.0 },
    "paypal_6053880": { "source": "paypal" }
  },
  "orders": ["#W2809253", "#W8516166", "#W8797321"]
}

Order entry:

{
  "order_id": "#W8797321",
  "user_id": "omar_johnson_2562",
  "address": { "address1": "349 Cedar Street", "address2": "Suite 322", "city": "Denver", "state": "CO", "zip": "80266" },
  "status": "pending",
  "items": [
    { "name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31, "options": { "resolution": "2K", "field of view": "130 degrees", "connectivity": "Ethernet" } },
    { "name": "Laptop", "item_id": "3478699712", "price": 2291.87, "options": { "screen size": "15-inch", "processor": "i5", "ram": "16GB", "storage": "512GB SSD", "color": "space grey" } }
  ],
  "payment_history": [
    { "transaction_type": "payment", "amount": 2583.18, "payment_method_id": "gift_card_9532915" }
  ]
}
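
The order record above is internally consistent: the single payment equals the sum of the two item prices. A minimal sketch of that check follows, using only fields shown in the entry. The helper names (`to_cents`, `items_total_cents`, `payments_total_cents`) are illustrative and not part of the dataset or its evaluators.

```python
from decimal import Decimal

# Trimmed copy of the order entry shown above (only the fields the check needs).
order = {
    "order_id": "#W8797321",
    "items": [
        {"name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31},
        {"name": "Laptop", "item_id": "3478699712", "price": 2291.87},
    ],
    "payment_history": [
        {"transaction_type": "payment", "amount": 2583.18,
         "payment_method_id": "gift_card_9532915"},
    ],
}


def to_cents(amount) -> int:
    """Convert a JSON number to integer cents to avoid float rounding drift."""
    return int((Decimal(str(amount)) * 100).to_integral_value())


def items_total_cents(order: dict) -> int:
    """Sum of all item prices, in cents."""
    return sum(to_cents(item["price"]) for item in order["items"])


def payments_total_cents(order: dict) -> int:
    """Sum of all entries whose transaction_type is "payment", in cents."""
    return sum(
        to_cents(entry["amount"])
        for entry in order["payment_history"]
        if entry["transaction_type"] == "payment"
    )


print(items_total_cents(order), payments_total_cents(order))  # 258318 258318
```

Working in integer cents keeps the comparison exact: 291.31 + 2291.87 = 2583.18.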

Banking environment — 24 tables including users, accounts, credit cards, transactions

User entry:

{
  "name": "Mia Smith",
  "user_id": "123",
  "address": "742 Maple Street, Austin, TX 78701",
  "email": "mia.smith8574@gmail.com",
  "phone_number": "512-555-0147",
  "date_of_birth": "03/15/1992"
}

Account entry:

{
  "account_id": "01",
  "user_id": "123",
  "class": "checking",
  "level": "Blue Account",
  "date_opened": "11/08/2025",
  "status": "OPEN",
  "current_holdings": "0"
}

Credit card account entry:

{
  "account_id": "cc_76ad9cc60e_gold",
  "user_id": "76ad9cc60e",
  "card_type": "Gold Rewards Card",
  "date_of_account_open": "09/15/2024",
  "current_balance": "$127.43",
  "reward_points": "8421 points"
}

Database also includes: debit_cards, referrals, credit_card_applications, verification_history, credit_card_transaction_history, cash_back_disputes, bank_account_transaction_history, transaction_disputes, payment_history, credit_card_orders, debit_card_orders, credit_card_closure_reasons, credit_card_account_flags, credit_limit_increase_requests, debit_card_disputes, interest_discrepancy_reports, and more.
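
The banking records shown above store balances, reward points, and dates as formatted strings rather than numbers. The sketch below shows one way to normalize them before comparing values. It assumes dates are `MM/DD/YYYY` (consistent with `09/15/2024` and `03/15/1992` in the samples); the helper names are hypothetical and not taken from the evaluators.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_money(text: str) -> Decimal:
    """Turn a string such as "$127.43" into a Decimal amount."""
    return Decimal(text.replace("$", "").replace(",", ""))


def parse_points(text: str) -> int:
    """Turn a string such as "8421 points" into an integer."""
    return int(text.split()[0])


def parse_us_date(text: str) -> date:
    """Parse a date assumed to be in MM/DD/YYYY form."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# Credit card account entry copied from the sample above.
card = {
    "account_id": "cc_76ad9cc60e_gold",
    "user_id": "76ad9cc60e",
    "card_type": "Gold Rewards Card",
    "date_of_account_open": "09/15/2024",
    "current_balance": "$127.43",
    "reward_points": "8421 points",
}

normalized = {
    "current_balance": parse_money(card["current_balance"]),
    "reward_points": parse_points(card["reward_points"]),
    "date_of_account_open": parse_us_date(card["date_of_account_open"]),
}
print(normalized)
```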

Data

Each sub-domain contains:

Directory	Description
`datapoints/`	Generated multi-turn dialogs with tool calls
`intent/`	User intent definitions — goals, profiles, personas, and motivations
`evaluators/`	Python grading scripts that verify agent responses
`reference_payloads/`	Full environment snapshots for deterministic replay

Sample

Full retail dialog (24 messages, 7 tool calls)

User:

Hi — can you help me update the shipping address on my pending order W8797321? My account email is omar.johnson6791@example.com.

Assistant:

Step 1 — Look up user and order in parallel: User wants to update the shipping address on pending order W8797321 and provided their email. I should look up their user ID and fetch the order details to confirm status and ownership before proceeding.

find_user_id_by_email({"email": "omar.johnson6791@example.com"})
get_order_details({"order_id": "#W8797321"})

Tool results:

{"status": "success", "result": {"user_id": "omar_johnson_2562"}}

{"order_id": "#W8797321", "user_id": "omar_johnson_2562", "status": "pending", "address": {"address1": "349 Cedar Street", "address2": "Suite 322", "city": "Denver", ...}, "items": [{"name": "Indoor Security Camera", ...}, {"name": "Laptop", "item_id": "3478699712", "price": 2291.87, ...}], ...}

Step 2 — Final answer: I have confirmed the order is pending and associated with the user. Next I need the new shipping address details to proceed with the update.

Got it! Order #W8797321 is pending, so we can definitely update the address. What’s the new shipping address you’d like to use?

User:

Yep — please change it to 912 Elm Street, Suite 173, Jacksonville, FL 32228.

Assistant:

I have the full new address from the user, but I should confirm it before making a change to a pending order.

Just to confirm, I’ll update the shipping address on order #W8797321 to: 912 Elm Street, Suite 173, Jacksonville, FL 32228, USA. Does that look right? Just say yes and I’ll make the change!

User:

Yes, that’s correct — please go ahead and update it to 912 Elm Street, Suite 173, Jacksonville, FL 32228.

Assistant:

Step 1 — Update shipping address: User confirmed the address update, so I will submit the address change for the pending order with the confirmed address fields.

modify_pending_order_address({"order_id": "#W8797321", "address1": "912 Elm Street", "address2": "Suite 173", "city": "Jacksonville", "state": "FL", "country": "USA", "zip": "32228"})

Tool result:

{"order_id": "#W8797321", "address": {"address1": "912 Elm Street", "address2": "Suite 173", "city": "Jacksonville", "state": "FL", ...}, "status": "pending", ...}

Step 2 — Final answer: The tool confirms the address update succeeded, so I should acknowledge completion and offer further help.

Done! The shipping address on order #W8797321 has been updated to 912 Elm Street, Suite 173, Jacksonville, FL 32228. Is there anything else I can help you with?

User:

Thanks. Next question — on the laptop in that order, what upgrade options are available for that model, and what would the price difference be if I bumped up the processor or storage?

Assistant:

Step 1 — Fetch product variants: To answer upgrade options and price differences for the laptop in the order, I need the product’s variant list and pricing, so I will fetch product details for product_id 4760268021.

get_product_details({"product_id": "4760268021"})

Tool result:

{"name": "Laptop", "product_id": "4760268021", "variants": {"6017636844": {"item_id": "6017636844", "options": {"screen size": "15-inch", "processor": "i7", "ram": "32GB", "storage": "1TB SSD", "color": "space grey"}, "available": true, "price": 2292.37}, ...}}

Step 2 — Final answer: Using the product variants, I identified the closest in-stock upgrades relative to the customer’s current 15-inch space grey configuration and computed the price deltas from $2,291.87 to each available variant, highlighting the best match that keeps size/color.

Here’s what I found for available upgrades on your laptop (currently 15-inch, i5, 16GB RAM, 512GB SSD, space grey at $2,291.87): Keeping the same 15-inch space grey look, there’s one great available option — i7 processor, 32GB RAM, 1TB SSD at $2,292.37 — that’s only $0.50 more than your current config!

User:

Whoa, how is the i7/32GB/1TB space grey basically the same price? If that’s real, yes I want to swap to that upgrade. Also, I want to return the Vacuum Cleaner from my other order W2809253 — can you do that too?

Assistant:

Step 1 — Fetch second order for return: The user wants to proceed with a return from delivered order W2809253, so I need that order’s details (item IDs and payment method) to start the return.

get_order_details({"order_id": "#W2809253"})

Tool result:

{"order_id": "#W2809253", "status": "delivered", "items": [{"name": "Vacuum Cleaner", "item_id": "7958300294", ...}], ...}

Step 2 — Final answer: I should reassure the user about the upgrade being in stock and clarify the exact variant and price difference. For the return, I now have the vacuum cleaner item_id and the PayPal payment method from the delivered order, but I need explicit confirmation to proceed with both actions.

Yes, it’s real! The 15-inch i7/32GB RAM/1TB SSD in space grey (item 6017636844) is priced at $2,292.37 — just $0.50 more than your current laptop. And it is in stock. For the return of the Vacuum Cleaner from order #W2809253 (delivered), I can process that too. Which payment method should receive the refund — your gift card or PayPal?

User:

Yes, please go ahead with both. Swap the laptop on order W8797321 to the 15-inch space grey i7/32GB/1TB option, and then start the return for the Vacuum Cleaner from order W2809253 with the refund going to PayPal.

Assistant:

Step 1 — Swap laptop item and initiate return in parallel: User confirmed both actions. I will (1) modify the pending order to swap the laptop variant and charge the small price difference to the gift card on file, and (2) initiate a return for the vacuum cleaner from the delivered order with refund to PayPal.

modify_pending_order_items({"order_id": "#W8797321", "item_ids": ["3478699712"], "new_item_ids": ["6017636844"], "payment_method_id": "gift_card_9532915"})
return_delivered_order_items({"order_id": "#W2809253", "item_ids": ["7958300294"], "payment_method_id": "paypal_6053880"})

Tool results:

{"order_id": "#W8797321", "status": "pending (items modified)", "items": [{"name": "Indoor Security Camera", ...}, {"name": "Laptop", "item_id": "6017636844", "price": 2292.37, "options": {"processor": "i7", "ram": "32GB", "storage": "1TB SSD", ...}}], ...}

{"order_id": "#W2809253", "status": "return requested", "return_items": ["7958300294"], "return_payment_method_id": "paypal_6053880", ...}

Step 2 — Final answer: Both tool calls succeeded, so I should confirm the laptop swap (including the $0.50 charge) and confirm the return request and refund destination/amount, then offer further assistance.

Both done! Here’s a quick summary: Your laptop on order #W8797321 has been swapped to the 15-inch i7/32GB/1TB SSD space grey — $0.50 was charged to your gift card for the difference. The return for the Vacuum Cleaner on order #W2809253 has been submitted — you’ll receive an email with return instructions, and the refund will go to your PayPal account.

User scenario for the above dialog

{
  "goal": "Identify Omar Johnson's account by email, change the shipping address on pending order #W8797321 from Denver to Jacksonville, upgrade the laptop to the best available 15-inch space grey variant with the highest RAM and largest storage, paying any price difference with the gift card, and initiate a return for the Vacuum Cleaner from delivered order #W2809253 with the refund going to PayPal.",
  "profile": "Omar Johnson (user_id: omar_johnson_2562) can be found via omar.johnson6791@example.com and has payment methods gift_card_9532915 (balance 61.0) and paypal_6053880. He has three orders: #W8797321 (pending, includes Laptop item 3478699712 and Indoor Security Camera item 8470360507), #W2809253 (delivered, includes Vacuum Cleaner item 7958300294), and #W8516166 (cancelled/refunded).",
  "persona": "Busy, practical online shopper; comfortable using chat support and expects quick, concrete answers about order changes and costs.",
  "motivations": [
    "Prevent the pending order from shipping to the wrong address.",
    "Upgrade the laptop to the best available 15-inch space grey option (maximizing RAM and storage) without changing screen size or color, and return an unwanted item from a delivered order."
  ],
  "constraints": [
    "Only update the shipping address for order #W8797321; do not modify or cancel the order unless explicitly confirmed.",
    "For the laptop upgrade, prefer the available 15-inch space grey variant with the most RAM and most storage; pay any price difference using the gift card.",
    "For the return, only return the Vacuum Cleaner from order #W2809253, not any other items; refund must go to paypal_6053880.",
    "Use valid payment methods on file: gift_card_9532915 or paypal_6053880 as appropriate.",
    "No relative-time language should be required to complete the task; proceed based on order statuses (pending vs delivered)."
  ]
}
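
The scenario's upgrade rule (stay within available 15-inch space grey variants, then prefer the most RAM and the largest storage) can be sketched as a filter followed by a ranking. The two variants below are the ones shown elsewhere on this page; the real product has more. The function names and the GB/TB conversion are assumptions made for illustration and are not how the benchmark itself selects a variant.

```python
import re

# Assumed conversion so that "1TB SSD" ranks above "256GB SSD".
UNIT_GB = {"GB": 1, "TB": 1024}


def size_in_gb(text: str) -> int:
    """Parse strings such as "32GB" or "1TB SSD" into a size in GB."""
    match = re.match(r"(\d+)\s*(GB|TB)", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * UNIT_GB[match.group(2)]


# Variants of product 4760268021 that appear in the samples on this page.
variants = {
    "2216662955": {
        "item_id": "2216662955",
        "options": {"screen size": "15-inch", "processor": "i5", "ram": "32GB",
                    "storage": "256GB SSD", "color": "space grey"},
        "available": True,
        "price": 2520.52,
    },
    "6017636844": {
        "item_id": "6017636844",
        "options": {"screen size": "15-inch", "processor": "i7", "ram": "32GB",
                    "storage": "1TB SSD", "color": "space grey"},
        "available": True,
        "price": 2292.37,
    },
}


def best_upgrade(variants: dict, screen_size: str, color: str):
    """Return the available variant with the most RAM, then the most storage."""
    candidates = [
        v for v in variants.values()
        if v["available"]
        and v["options"]["screen size"] == screen_size
        and v["options"]["color"] == color
    ]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda v: (size_in_gb(v["options"]["ram"]),
                       size_in_gb(v["options"]["storage"])),
    )


choice = best_upgrade(variants, "15-inch", "space grey")
print(choice["item_id"])  # 6017636844
```

Both variants have 32GB of RAM, so storage breaks the tie in favor of item 6017636844, the variant the agent offers in the dialog.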

Environment snapshot (MCP server state)

Each reference payload contains the full MCP server database state — products, users, and orders — used during the conversation. Here is a condensed view:

{
  "environment_snapshots": {
    "mcp_8002": {
      "initial_state": {
        "database": {
          "products": { /* 50 product entries */ },
          "users": { /* 500 user entries */ },
          "orders": { /* 1000 order entries */ }
        }
      }
    }
  }
}

Example user entry:

{
  "user_id": "omar_johnson_2562",
  "name": { "first_name": "Omar", "last_name": "Johnson" },
  "email": "omar.johnson6791@example.com",
  "address": { "address1": "349 Cedar Street", "address2": "Suite 322", "city": "Denver", "state": "CO", "zip": "80266" },
  "payment_methods": {
    "gift_card_9532915": { "source": "gift_card", "balance": 61.0 },
    "paypal_6053880": { "source": "paypal" }
  },
  "orders": ["#W2809253", "#W8516166", "#W8797321"]
}

Example order entry:

{
  "order_id": "#W8797321",
  "user_id": "omar_johnson_2562",
  "address": { "address1": "349 Cedar Street", "address2": "Suite 322", "city": "Denver", "state": "CO", "zip": "80266" },
  "status": "pending",
  "items": [
    { "name": "Indoor Security Camera", "item_id": "8470360507", "price": 291.31 },
    { "name": "Laptop", "item_id": "3478699712", "price": 2291.87, "options": { "screen size": "15-inch", "processor": "i5", "ram": "16GB", "storage": "512GB SSD", "color": "space grey" } }
  ],
  "payment_history": [
    { "transaction_type": "payment", "amount": 2583.18, "payment_method_id": "gift_card_9532915" }
  ]
}
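
Given a payload shaped like the condensed view above, the database can be reached by walking the nested keys. The sketch below assumes that `users` and `orders` are dictionaries keyed by `user_id` and `order_id` (the dotted paths in the evaluator's reference diff suggest this for orders; for users it is an assumption). The helper names are hypothetical.

```python
# Tiny stand-in for a reference payload, trimmed to one user and one order.
payload = {
    "environment_snapshots": {
        "mcp_8002": {
            "initial_state": {
                "database": {
                    "products": {},
                    "users": {
                        "omar_johnson_2562": {
                            "user_id": "omar_johnson_2562",
                            "orders": ["#W2809253", "#W8516166", "#W8797321"],
                        }
                    },
                    "orders": {
                        "#W8797321": {
                            "order_id": "#W8797321",
                            "user_id": "omar_johnson_2562",
                            "status": "pending",
                        }
                    },
                }
            }
        }
    }
}


def get_database(payload: dict, server: str = "mcp_8002") -> dict:
    """Return the initial database state for one MCP server."""
    return payload["environment_snapshots"][server]["initial_state"]["database"]


def orders_for_user(db: dict, user_id: str) -> list:
    """Resolve a user's order IDs to order records, skipping IDs not in the snapshot."""
    order_ids = db["users"][user_id]["orders"]
    return [db["orders"][oid] for oid in order_ids if oid in db["orders"]]


db = get_database(payload)
pending = [o["order_id"] for o in orders_for_user(db, "omar_johnson_2562")
           if o["status"] == "pending"]
print(pending)  # ['#W8797321']
```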

Evaluator excerpt (~960 lines Python, three-part evaluation)

Each sample ships with an auto-generated evaluator that scores submissions on three axes:

Part 1 — Database state comparison (db_check): deep-compares the final MCP server state against the reference payload using a structured diff. Fields are classified by type — exact match for IDs, amounts, and statuses; semantic match for free-text; ignore for timestamps and tokens.

REFERENCE_DIFF = {
  "mcp_8002": {
    "changed": {
      "orders.#W8797321.address.address1": { "before": "349 Cedar Street", "after": "912 Elm Street" },
      "orders.#W8797321.address.city": { "before": "Denver", "after": "Jacksonville" },
      "orders.#W8797321.address.state": { "before": "CO", "after": "FL" },
      "orders.#W8797321.items": { /* laptop swapped from i5/16GB/512GB to i7/32GB/1TB */ },
      "orders.#W2809253.status": { "before": "delivered", "after": "return requested" },
      "orders.#W2809253.return_items": { "before": null, "after": ["7958300294"] },
      "orders.#W2809253.return_payment_method_id": { "before": null, "after": "paypal_6053880" }
    }
  }
}
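
A diff with dotted paths like the one above can be produced by flattening both states and comparing them key by key. The following is a minimal sketch of that idea, not the evaluator's actual implementation: lists are compared as whole values, missing keys are reported as `None`, and `ignore_keys` stands in for the "ignore" field class. The `updated_at` field is hypothetical and is included only to show a field being ignored.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b.c": value}; lists and scalars are leaves."""
    flat = {}
    if isinstance(obj, dict) and obj:
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, path))
    else:
        flat[prefix] = obj
    return flat


def diff_states(before: dict, after: dict, ignore_keys=frozenset()) -> dict:
    """Return {path: {"before": ..., "after": ...}} for every changed path."""
    flat_before, flat_after = flatten(before), flatten(after)
    changed = {}
    for path in sorted(set(flat_before) | set(flat_after)):
        if path.rsplit(".", 1)[-1] in ignore_keys:
            continue  # e.g. timestamps and tokens
        old, new = flat_before.get(path), flat_after.get(path)
        if old != new:
            changed[path] = {"before": old, "after": new}
    return changed


before = {"orders": {"#W2809253": {"status": "delivered"}}}
after = {
    "orders": {
        "#W2809253": {
            "status": "return requested",
            "return_items": ["7958300294"],
            "return_payment_method_id": "paypal_6053880",
            "updated_at": "hypothetical-timestamp",  # not a field from the dataset
        }
    }
}

for path, change in diff_states(before, after, ignore_keys={"updated_at"}).items():
    print(path, change)
```

For this input the result contains the same three `orders.#W2809253.*` entries as the reference diff.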

Part 2 — Rubrics evaluation (rubrics_check): evaluates the trajectory against task-specific rubrics covering goal achievement, required process steps, and step ordering.

Part 3 — Domain policy compliance (policy_check): verifies the agent followed domain rules — e.g., authenticating the user before taking actions, confirming changes before executing, only modifying orders with the correct status.

The combined verdict:

def evaluate_submission(reference_payload, eval_payload, llm_client) -> dict:
    """Returns db_check, rubrics_check, policy_check, overall_pass."""
    result = {}
    # Each check returns a dict that carries a "passed" flag.
    result["db_check"] = db_check(reference_payload, eval_payload, llm_client)
    result["rubrics_check"] = rubrics_check(eval_payload, llm_client)
    result["policy_check"] = policy_check(eval_payload, llm_client)
    # Overall: all three must pass
    result["overall_pass"] = all(
        result[k].get("passed", False) for k in ("db_check", "rubrics_check", "policy_check")
    )
    return result
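
To make one of the policy rules concrete, the sketch below checks "authenticate the user before taking actions" over an ordered list of tool-call names. It is a deterministic simplification for illustration only: the real `policy_check` takes an `llm_client`, and the split of tools into `AUTH_TOOLS` and `MUTATING_TOOLS` is an assumption based on the tool names in the sample dialog.

```python
# Assumed grouping of the retail tools seen in the sample dialog.
AUTH_TOOLS = {"find_user_id_by_email"}
MUTATING_TOOLS = {
    "modify_pending_order_address",
    "modify_pending_order_items",
    "return_delivered_order_items",
}


def authenticated_before_mutation(tool_calls: list) -> bool:
    """True if no mutating tool is called before an authentication tool."""
    authenticated = False
    for name in tool_calls:
        if name in AUTH_TOOLS:
            authenticated = True
        elif name in MUTATING_TOOLS and not authenticated:
            return False
    return True


# The seven tool calls from the sample dialog, in order.
sample_calls = [
    "find_user_id_by_email",
    "get_order_details",
    "modify_pending_order_address",
    "get_product_details",
    "get_order_details",
    "modify_pending_order_items",
    "return_delivered_order_items",
]

print(authenticated_before_mutation(sample_calls))  # True
```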

Evaluation Results

We evaluated the generated data by running frontier models as agents against the synthesized environments and grading with the auto-generated evaluators. Each evaluator scores on three axes: DB Check (correct database mutations), Rubrics (goal and process compliance), and Policy (domain policy adherence). DB + Rubrics requires passing both of those checks, and All Three requires passing all three. All figures are pass rates in percent.

Airline

Model	DB Check	Rubrics	Policy	DB + Rubrics	All Three
Gemini-3.1-Pro-Preview	90.0	60.0	70.0	60.0	30.0
Claude-Opus-4-6	80.0	60.0	70.0	60.0	40.0
Grok-4.20	80.0	30.0	20.0	30.0	10.0
GPT-5.3-Codex	60.0	50.0	50.0	50.0	40.0

Banking

Model	DB Check	Rubrics	Policy	DB + Rubrics	All Three
Claude-Opus-4-6	50.0	40.0	80.0	20.0	20.0
GPT-5.3-Codex	10.0	10.0	30.0	0.0	0.0

Retail

Model	DB Check	Rubrics	Policy	DB + Rubrics	All Three
Grok-4.20	80.0	80.0	60.0	60.0	30.0
Claude-Opus-4-6	70.0	50.0	90.0	40.0	30.0
Gemini-3.1-Pro-Preview	70.0	30.0	60.0	30.0	10.0
GPT-5.3-Codex	60.0	60.0	50.0	40.0	30.0
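
The columns in these tables can be derived from per-sample verdicts. The sketch below assumes each verdict has the shape returned by `evaluate_submission` (a `passed` flag under each check) and computes the five pass rates. The ten verdicts are toy inputs made up for illustration; they are not the results behind any row above.

```python
def verdict(db: bool, rubrics: bool, policy: bool) -> dict:
    """Build a toy verdict in the assumed evaluate_submission shape."""
    return {
        "db_check": {"passed": db},
        "rubrics_check": {"passed": rubrics},
        "policy_check": {"passed": policy},
    }


def pass_rates(verdicts: list) -> dict:
    """Percentage of samples passing each check and each combination."""
    def rate(*checks: str) -> float:
        passed = sum(1 for v in verdicts if all(v[c]["passed"] for c in checks))
        return round(100.0 * passed / len(verdicts), 1)

    return {
        "DB Check": rate("db_check"),
        "Rubrics": rate("rubrics_check"),
        "Policy": rate("policy_check"),
        "DB + Rubrics": rate("db_check", "rubrics_check"),
        "All Three": rate("db_check", "rubrics_check", "policy_check"),
    }


# Ten toy verdicts (illustrative only).
toy = [
    verdict(True, True, True),
    verdict(True, True, True),
    verdict(True, True, False),
    verdict(True, False, True),
    verdict(True, False, True),
    verdict(True, False, False),
    verdict(False, True, True),
    verdict(False, False, True),
    verdict(False, False, False),
    verdict(True, True, True),
]

print(pass_rates(toy))
```

With ten samples per domain, each sample moves a rate by 10 points, and a combined column can never exceed the lowest of the columns it combines.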

Download

# Download all Tau2-Bench data
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau2_bench/*"

# Download a specific sub-domain
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau2_bench/tau2-airline/*"

Browse on Hugging Face

View Tau2-Bench files
