Demo Samples

These demo samples are a free 10-task slice of the Tau3-Bench dataset. For the complete dataset (its scale, the discoverable-tool mechanic, and benchmark difficulty), see the Full Dataset page. The samples below showcase the banking domain (“Rho-Bank”), where the agent must verify the customer’s identity, search a knowledge base, unlock hidden tools at runtime, apply KB-grounded policy, and confirm before any high-impact write. Each run is then graded by an executable per-task evaluator.

Overview

Property	Value
Domain	Banking (Rho-Bank)
Tasks	10 banking conversations with executable evaluators
Turn type	Multi-turn, with discoverable (dynamic) tools
Scenarios	Card disputes, debit-card activation, PIN reset, interest/APY inquiries, reward-point checks
Grading	Per-task Python evaluator — DB-diff + rubric + policy, all LLM-judged

Environment

The banking environment is an MCP-backed database with ~24 tables — users, bank accounts, credit-card accounts, debit cards, transactions, disputes, verification history, and more. Each task ships a reference payload containing the full initial state plus the reference diff a correct trajectory should produce, so grading is deterministic.

tau3_bench/tau3-banking/
├── intent/                # task spec: goal, profile, persona, motivations, constraints
├── datapoints/            # SFT-ready multi-turn dialogs (with discover/unlock/call tool calls)
├── evaluators/            # per-task Python grading scripts (DB + rubrics + policy)
└── reference_payloads/    # full environment snapshots (initial state + reference diff)

Banking environment — ~24 tables, discoverable-tool pool

User entry:

{
  "name": "Alex Riviera",
  "user_id": "126",
  "address": "24 Beacon Street, Boston, MA, 02114",
  "email": "alex_alex@gmail.com",
  "phone_number": "930-102-1332",
  "date_of_birth": "09/21/1996"
}

Credit-card account entry:

{
  "account_id": "cc_126_silver",
  "user_id": "126",
  "card_type": "Silver Rewards Card",
  "current_balance": "$1,247.83"
}

Debit card entry (pending activation):

{
  "card_id": "dbc_ar126_blue",
  "account_id": "chk_ar126",
  "status": "PENDING",
  "issue_reason": "new_account"
}

The database also includes debit_cards, transaction_disputes, verification_history, credit_card_transaction_history, bank_account_transaction_history, referrals, credit_card_applications, interest_discrepancy_reports, and more. Sensitive operations (activating a card, filing a dispute, looking up card digits) are not in the base tool set — they live in a discoverable-tool pool the agent unlocks at runtime.

Data

Each task is described across four parallel files (sharing an index):

Directory	File	Description
`intent/`	`000001.json`	Goal, customer profile, persona, motivations, and hard constraints
`datapoints/`	`000001.json`	The full multi-turn dialog with reasoning and tool calls
`evaluators/`	`evaluator_000001.py`	The Python grading script (DB-diff + rubrics + policy)
`reference_payloads/`	`reference_payload_000001.json`	Initial DB state + reference diff for deterministic grading
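
As a minimal sketch of the layout above, the four files for one task can be located from its shared index. The helper name `task_files` is hypothetical, and the sketch assumes every task follows the zero-padded naming shown for `000001`:

```python
from pathlib import Path


def task_files(root: Path, index: int) -> dict[str, Path]:
    """Return the four per-task paths for a given task index (hypothetical helper)."""
    stem = f"{index:06d}"  # assumes six-digit zero padding, as in 000001
    return {
        "intent": root / "intent" / f"{stem}.json",
        "datapoint": root / "datapoints" / f"{stem}.json",
        "evaluator": root / "evaluators" / f"evaluator_{stem}.py",
        "reference_payload": root / "reference_payloads" / f"reference_payload_{stem}.json",
    }


paths = task_files(Path("tau3_bench/tau3-banking"), 1)
for kind, path in paths.items():
    print(f"{kind}: {path.as_posix()}")
```

In the RL-only variant there is no `datapoints/` directory, so the `datapoint` entry would not exist there.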

Sample

Sample tasks (goals)

Task	Goal
File a dispute for a defective Best Buy charge (correct category, not fraud), and activate a pending new debit card with a chosen PIN
Understand how Silver Account savings interest works and confirm the applicable APY tier
Ask whether Rho-Bank offers business checking and whether it must be a new account
Activate a replacement debit card (issued after the previous was stolen) and verify identity first
Activate a new debit card ending in 2847 by providing card details and a PIN
Reset a forgotten PIN on an active card (forgotten-PIN flow) to 8527
Activate a reissued debit card using the printed expiration date and CVV
Investigate a missing goodwill credit tied to a savings-interest issue
Activate a debit card mailed to replace an expired one, with a chosen PIN
Inquire about a savings-interest correction and review reward points on two cards

Several tasks share the same surface operation (debit-card activation) but require a different flow (new account, replacement, reissue, or expired card), so the agent must read issue_reason and discover the matching activation tool.

Full task intent (task 000001)

{
  "goal": "Alex Riviera (user_id 126) wants to file a dispute for the Best Buy credit card transaction (txn_b8c9d0e1f2a3) on the Silver Rewards Card for merchandise that arrived defective, using the dispute category that corresponds to defective merchandise rather than fraud or a billing-amount issue, and also wants to activate the pending new debit card ending in 7293 by setting the PIN to 4821.",
  "persona": "A young professional in their late 20s, comfortable with online banking but tends to over-explain when something goes wrong.",
  "constraints": [
    "Identity verification must be completed before any account-specific operations",
    "The dispute is for defective merchandise and must be categorized as 'item received but unusable', not fraud or incorrect amount",
    "The assistant must retrieve the Silver Rewards Card last 4 digits from system tools rather than asking Alex to guess them",
    "Filing the dispute requires explicit user confirmation as it is a high-impact action",
    "The debit card has issue_reason 'new_account' so the correct new-card activation tool must be used",
    "The debit card must be activated with a 4-digit PIN of 4821"
  ]
}

Full agent trajectory (task 000001 — 47 messages, 19 tool calls)

A compact view of the dispute-and-activation trajectory. Note the recurring discover → unlock → call pattern for every sensitive operation.

User:

I think I need to file a dispute? I bought something from Best Buy on my Silver Rewards Card and it arrived defective. I also got a new debit card I need to activate.

Assistant — verify identity first (KB-grounded policy):

get_user_information_by_name({"customer_name": "Alex Riviera"})   // → user_id 126
get_current_time({})                                             // → 2025-11-14 03:40 EST
log_verification({"name": "Alex Riviera", "user_id": "126", ...}) // 2+ identity fields matched

Assistant — ground both operations in the knowledge base:

KB_search({"query": "credit card transaction dispute defective merchandise provisional credit"})
KB_search({"query": "debit card activation issue_reason new card activate_debit_card"})

Assistant — discover and call hidden tools to gather account context:

unlock_discoverable_agent_tool({"agent_tool_name": "get_all_user_accounts_by_user_id_3847"})
call_discoverable_agent_tool({"agent_tool_name": "get_all_user_accounts_by_user_id_3847", "arguments": "{\"user_id\":\"126\"}"})
unlock_discoverable_agent_tool({"agent_tool_name": "get_debit_cards_by_account_id_7823"})
call_discoverable_agent_tool({"agent_tool_name": "get_debit_cards_by_account_id_7823", "arguments": "{\"account_id\":\"chk_ar126\"}"})

Assistant — activate the new-account debit card (matching issue_reason), with PIN 4821:

unlock_discoverable_agent_tool({"agent_tool_name": "activate_debit_card_8291"})
call_discoverable_agent_tool({"agent_tool_name": "activate_debit_card_8291",
  "arguments": "{\"card_id\":\"dbc_ar126_blue\",\"last_4_digits\":\"7293\",\"pin\":\"4821\", ...}"})
// → Status: ACTIVE

User pushes back on the dispute category (“it’s defective, not just ‘not as described’”). The agent re-checks the KB, explains the available categories honestly, and refuses to invent one.

Assistant — retrieve the card’s last 4 digits from a tool (not from the user), per constraint:

unlock_discoverable_agent_tool({"agent_tool_name": "get_card_last_4_digits"})
call_discoverable_agent_tool({"agent_tool_name": "get_card_last_4_digits", "arguments": "{\"credit_card_account_id\":\"cc_126_silver\"}"})
// → 1961

Assistant — present the full dispute for confirmation:

Here’s what I’ll file: Best Buy charge txn_b8c9d0e1f2a3, category goods/services not as described, full refund requested, card ending 1961. Shall I file it?

User:

Yes, file the dispute.

Assistant — file only after explicit confirmation:

unlock_discoverable_agent_tool({"agent_tool_name": "file_credit_card_transaction_dispute_4829"})
call_discoverable_agent_tool({"agent_tool_name": "file_credit_card_transaction_dispute_4829",
  "arguments": "{\"transaction_id\":\"txn_b8c9d0e1f2a3\",\"dispute_reason\":\"goods_services_not_as_described\", ...}"})
// → Dispute filed (status SUBMITTED)

Both done — your debit card is activated, and the Best Buy dispute has been filed.
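
The unlock-then-call pattern in the trajectory above can be illustrated with a toy tool pool. This is a sketch under stated assumptions, not the benchmark's implementation: the class `DiscoverableToolPool` and the stub `fake_get_card_last_4_digits` are hypothetical, and the only details taken from the trajectory are the tool name, the account ID, the returned value, and the fact that `arguments` is passed as a JSON-encoded string.

```python
import json
from typing import Any, Callable


class DiscoverableToolPool:
    """Toy stand-in for a pool of hidden tools that must be unlocked before use."""

    def __init__(self, tools: dict[str, Callable[..., Any]]) -> None:
        self._tools = tools
        self._unlocked: set[str] = set()

    def unlock(self, agent_tool_name: str) -> None:
        if agent_tool_name not in self._tools:
            raise KeyError(f"unknown tool: {agent_tool_name}")
        self._unlocked.add(agent_tool_name)

    def call(self, agent_tool_name: str, arguments: str) -> Any:
        if agent_tool_name not in self._unlocked:
            raise PermissionError(f"tool not unlocked: {agent_tool_name}")
        kwargs = json.loads(arguments)  # arguments arrive as a JSON-encoded string
        return self._tools[agent_tool_name](**kwargs)


def fake_get_card_last_4_digits(credit_card_account_id: str) -> str:
    # Hypothetical stub; returns the value seen in the sample trajectory.
    return {"cc_126_silver": "1961"}[credit_card_account_id]


pool = DiscoverableToolPool({"get_card_last_4_digits": fake_get_card_last_4_digits})
args = json.dumps({"credit_card_account_id": "cc_126_silver"})

try:
    pool.call("get_card_last_4_digits", args)  # rejected: not unlocked yet
except PermissionError as exc:
    print(exc)

pool.unlock("get_card_last_4_digits")
last4 = pool.call("get_card_last_4_digits", args)
print(last4)
```

Encoding `arguments` with `json.dumps` rather than writing the escaped string by hand avoids quoting mistakes in the nested JSON.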

Evaluator rubrics & grading (task 000001)

The per-task Python evaluator carries the reference diff and a list of typed rubrics, then runs three LLM-judged checks.

Sample rubrics:

[
  { "id": "R1", "type": "goal",
    "description": "The Best Buy dispute was filed with correct parameters",
    "criteria": "call file_credit_card_transaction_dispute_4829 with transaction_id, dispute_reason='goods_services_not_as_described', resolution_requested='full_refund', card_last_4_digits='1961', eligible_for_provisional_credit=false; dispute record appears in DB diff" },
  { "id": "R2", "type": "goal",
    "description": "The pending debit card was activated with the new-account tool and PIN 4821",
    "criteria": "call activate_debit_card_8291 (NEW-card flow, matching issue_reason 'new_account'); DB diff shows status PENDING → ACTIVE" },
  { "id": "R3", "type": "process",
    "description": "Identity verified before any account-specific operation" },
  { "id": "R5", "type": "process_order",
    "description": "Explicit confirmation obtained before filing the dispute" },
  { "id": "R6", "type": "process_order",
    "description": "get_current_time returned before log_verification used its value" }
]
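
To make the `process_order` rubric type concrete, the sketch below checks ordering over a list of tool-call names. It is only an illustration: the actual rubric checks are LLM-judged, the helper names are hypothetical, and treating R3 as "`log_verification` precedes the first `call_discoverable_agent_tool`" is an assumption made for this example.

```python
from typing import Optional


def first_index(calls: list[str], name: str) -> Optional[int]:
    """Position of the first call to `name`, or None if it was never called."""
    return calls.index(name) if name in calls else None


def called_before(calls: list[str], earlier: str, later: str) -> bool:
    """True only if both tools were called and `earlier` first appears before `later`."""
    i, j = first_index(calls, earlier), first_index(calls, later)
    return i is not None and j is not None and i < j


# Tool-call names in order, abridged from the task 000001 trajectory.
calls = [
    "get_user_information_by_name",
    "get_current_time",
    "log_verification",
    "KB_search",
    "KB_search",
    "unlock_discoverable_agent_tool",
    "call_discoverable_agent_tool",
]

# R6: get_current_time must come before log_verification.
r6_ok = called_before(calls, "get_current_time", "log_verification")
# R3 (assumed reading): verification is logged before the first hidden-tool call.
r3_ok = called_before(calls, "log_verification", "call_discoverable_agent_tool")
print(r6_ok, r3_ok)
```

Both checks hold for this abridged sequence; swapping the first two verification calls would make the R6 check fail.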

Three-axis grading:

def evaluate_submission(reference_payload, eval_payload, llm_client):
    result = {
        "db_check":      db_check(reference_payload, eval_payload, llm_client),   # DB diff matches reference
        "rubrics_check": rubrics_check(eval_payload, llm_client),                # typed rubrics pass
        "policy_check":  policy_check(eval_payload, llm_client),                 # policy adhered to
    }
    result["overall_pass"] = all(result[k]["passed"] for k in
        ("db_check", "rubrics_check", "policy_check"))
    return result

overall_pass requires all three checks. Because grading runs against the reference snapshot, the score is deterministic and usable as an RL reward.
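
The DB-diff idea behind `db_check` can be sketched as a field-level comparison between two database snapshots. The payload schema is not shown on this page, so the nested-dict layout (`table -> row_id -> row`) and the function `diff_db` are assumptions for illustration; the real check also takes an `llm_client`, which this sketch omits.

```python
from typing import Any

Table = dict[str, dict[str, Any]]  # row_id -> row (assumed layout)
Database = dict[str, Table]        # table name -> table (assumed layout)


def diff_db(before: Database, after: Database) -> dict:
    """Return {table: {row_id: {field: (old, new)}}} for every changed field."""
    changes: dict = {}
    for table in sorted(set(before) | set(after)):
        old_rows, new_rows = before.get(table, {}), after.get(table, {})
        for row_id in sorted(set(old_rows) | set(new_rows)):
            old_row, new_row = old_rows.get(row_id, {}), new_rows.get(row_id, {})
            fields = {
                f: (old_row.get(f), new_row.get(f))
                for f in sorted(set(old_row) | set(new_row))
                if old_row.get(f) != new_row.get(f)
            }
            if fields:
                changes.setdefault(table, {})[row_id] = fields
    return changes


initial = {"debit_cards": {"dbc_ar126_blue": {"account_id": "chk_ar126", "status": "PENDING"}}}
final = {"debit_cards": {"dbc_ar126_blue": {"account_id": "chk_ar126", "status": "ACTIVE"}}}

# Reference diff for the activation step only, written in the same assumed shape.
reference_diff = {"debit_cards": {"dbc_ar126_blue": {"status": ("PENDING", "ACTIVE")}}}

observed = diff_db(initial, final)
db_matches = observed == reference_diff
print(db_matches)
```

A trajectory that never activates the card yields an empty diff, which does not match the reference and so fails this check.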

Download

Two variants are available — the full demo (SFT trajectories and RL environments) and an RL-only variant (the same tasks with environments + per-task evaluators, but no SFT trajectories):

# Full demo — SFT trajectories + RL environments (intent / datapoints / evaluators / reference_payloads)
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau3_bench/tau3-banking/*"

# RL-only variant — environments + evaluators, no SFT trajectories (intent / evaluators / reference_payloads)
hf download jindidi/eigendata-demo-data --repo-type dataset --include "tau3_bench/tau3-banking-rl-only/*"
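
For scripted setups, the same download can be done from Python with `huggingface_hub.snapshot_download`. This is a sketch assuming the `huggingface_hub` package is installed; the wrapper `download_demo` and the variant keys `"full"` and `"rl-only"` are hypothetical names chosen for this example.

```python
REPO_ID = "jindidi/eigendata-demo-data"

# Glob patterns copied from the CLI commands above; the dict keys are hypothetical labels.
VARIANTS = {
    "full": "tau3_bench/tau3-banking/*",
    "rl-only": "tau3_bench/tau3-banking-rl-only/*",
}


def download_demo(variant: str = "full") -> str:
    """Download one variant and return the local snapshot directory."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=[VARIANTS[variant]],
    )


# Example (performs network access):
#   local_dir = download_demo("rl-only")
```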

Browse — full demo (SFT + RL)

intent · datapoints · evaluators · reference_payloads

Browse — RL-only variant

intent · evaluators · reference_payloads

For the complete Tau3-Bench corpus — the discoverable-tool mechanic, benchmark difficulty, and commercial licensing — see the Full Dataset page.
