Overview
| Property | Single-Turn | Multi-Turn |
|---|---|---|
| Samples | 50 | 50 |
| Task type | One-shot tool calls | Conversational tool-use sequences |
| Scenarios | Check calendar events, send emails, look up contacts, update sheets | Multi-step workflows combining email, calendar, sheets, slides, and contacts |
Environment
The environment is a JSON database (db.json, ~2.4 MB) representing the full Google Workspace state — 24 tables covering 100 user accounts with their emails, calendar events, drive files, contacts, and more. This database serves as the backend that the agent queries and modifies via tool calls.
Database structure — 24 tables, 100 accounts, 5,000+ entities
| Table | Entries | Table | Entries |
|---|---|---|---|
| google_account | 100 | drive_file | 901 |
| calendar | 200 | drive_permission | 367 |
| calendar_event | 276 | gmail_message | 279 |
| calendar_event_attendee | 374 | gmail_thread | 202 |
| contact | 203 | gmail_draft | 242 |
| contact_email | 281 | gmail_label | 229 |
| contact_phone | 285 | sheets_spreadsheet | 243 |
| contact_address | 284 | sheets_tab | 342 |
| contact_organization | 282 | slides_presentation | 213 |
| contact_url | 275 | slides_slide | 304 |
| docs_document | 284 | slides_object | 373 |
| drive_file_parent | 200 | gmail_message_label | 230 |
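The per-table counts above could be reproduced with a short script. This is a minimal sketch that assumes `db.json` is a single JSON object mapping each table name to a collection of rows (a list, or a dict keyed by ID). The actual layout of the file is not documented here, so the loader may need adjusting. `count_entries` is a hypothetical helper, not part of the dataset.

```python
import json
from pathlib import Path


def count_entries(db_path):
    """Return {table_name: number_of_rows} for a JSON database file.

    Assumption: the top level is an object whose keys are table names and
    whose values are lists of rows or dicts keyed by row ID.
    """
    with Path(db_path).open(encoding="utf-8") as fh:
        db = json.load(fh)
    counts = {}
    for table, rows in db.items():
        # len() works for both list-shaped and dict-shaped tables;
        # anything else (e.g. a scalar metadata value) counts as 0.
        counts[table] = len(rows) if isinstance(rows, (list, dict)) else 0
    return counts


if __name__ == "__main__":
    path = Path("db.json")  # assumed location of the downloaded file
    if path.exists():
        for table, n in sorted(count_entries(path).items()):
            print(f"{table:28s} {n}")
```

If the assumption about the layout holds, the printed counts should line up with the table above.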
Data
Each format (single-turn and multi-turn) contains:

| Directory | Description |
|---|---|
| datapoints/ | Generated tasks/dialogs with tool calls |
| intent/ | Task intent definitions: goals, user profiles, and expected actions |
| evaluators/ | Python grading scripts for automated evaluation |
| reference_payloads/ | Per-task environment state snapshots (initial + final) |
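After downloading, a quick sanity check can confirm that a format directory contains the four subdirectories listed above. This is a small sketch; the root directory name passed in (for example `single_turn`) is an assumption, and `missing_dirs` is a hypothetical helper.

```python
from pathlib import Path

# Subdirectory names taken from the table above.
EXPECTED_DIRS = ("datapoints", "intent", "evaluators", "reference_payloads")


def missing_dirs(format_root):
    """Return the expected subdirectories that are absent under format_root."""
    root = Path(format_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
```

An empty list from `missing_dirs("single_turn")` would mean the layout matches the description.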
Single-Turn Sample
User scenario
Full dialog (13 messages, 7 tool calls)
User:

Hi — I’m logged in as Family Hub (acct_00000091), family.hub@example.com. Can you first double-check the details for my calendar event called “Interview prep” (just confirm the time and the location for me), and then add a quick note about it into my spreadsheet “Pet Care Tracker” (something like: confirmed Interview prep details and that I’ll be out during that window). After you log it, can you look up Devon Price in my contacts and email them a short heads-up summarizing what you added to the sheet?

A:

Step 1 — Search calendar, spreadsheet, and contacts in parallel: The user wants me to: 1) Get details for ‘Interview prep’ calendar event, 2) Find the ‘Pet Care Tracker’ spreadsheet, 3) Look up Devon Price in contacts. I can do all three in parallel since they’re independent.

Tool results:

Step 2 — Fetch spreadsheet and contact details in parallel: I have the event details: Interview prep is on 2025-05-15 from 20:15 to 22:15 (Europe/London), location is ‘Meeting room’. I need to get the spreadsheet details to know which sheet to append to, and I also need Devon Price’s contact details (email) to send them an email.

Tool results:

Step 3 — Append note to spreadsheet: I have all the info I need. Now I need to: 1) Append a note to the Pet Care Tracker spreadsheet, 2) Email Devon Price at devon.price@example.com with a summary. I’ll do both — append to the spreadsheet first.

Tool result:

Step 4 — Send email to Devon Price: The note has been appended to the spreadsheet. Now I need to email Devon Price at devon.price@example.com with a short heads-up summarizing what was added to the sheet.

Tool result:

Step 5 — Final answer: All tasks have been completed successfully. I should now summarize everything for the user in a friendly, natural response.

All done! The Interview prep event is on May 15, 2025, from 8:15 PM to 10:15 PM (Europe/London), and it’s located in the Meeting room. I also added a note to your “Pet Care Tracker” spreadsheet (Care - Summary tab) confirming those details and that you’ll be out during that window, and I emailed Devon Price (devon.price@example.com) a quick heads-up summarizing what I logged.
Evaluator excerpt (~1,000 lines Python)
Each sample ships with an auto-generated evaluator that scores submissions on two axes:

- Part A, environment state comparison: deep-compares the final MCP server state (all 24 database tables) against the reference payload, field by field. Pattern-based classification handles non-deterministic fields (timestamps, tokens), and an LLM performs semantic matching for free-text values.
- Part B, LLM trajectory judge: prompts an LLM to holistically assess whether the agent achieved the user’s goal. The criteria are task-specific.

The two parts are then combined into a single verdict.
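To make Part A more concrete, here is a minimal sketch of a field-by-field state comparison that skips non-deterministic fields by name pattern. The patterns in `VOLATILE`, the `diff_states` helper, and the `{table: {row_id: {field: value}}}` state shape are all assumptions for illustration. The shipped evaluators are far larger and also use an LLM for free-text matching, which this sketch omits.

```python
import re

# Assumed name patterns for fields whose values vary between runs.
VOLATILE = re.compile(r"(_at$|_time$|timestamp|token|etag)", re.IGNORECASE)


def diff_states(actual, reference):
    """Compare two {table: {row_id: {field: value}}} states.

    Returns a list of human-readable mismatch strings; an empty list means
    the states agree on every field not classified as volatile.
    """
    mismatches = []
    for table in sorted(set(actual) | set(reference)):
        a_rows = actual.get(table, {})
        r_rows = reference.get(table, {})
        for row_id in sorted(set(a_rows) | set(r_rows)):
            if row_id not in a_rows:
                mismatches.append(f"{table}/{row_id}: missing in actual")
                continue
            if row_id not in r_rows:
                mismatches.append(f"{table}/{row_id}: unexpected row")
                continue
            a_row, r_row = a_rows[row_id], r_rows[row_id]
            for field in sorted(set(a_row) | set(r_row)):
                if VOLATILE.search(field):
                    continue  # skip timestamps, tokens, and similar fields
                if a_row.get(field) != r_row.get(field):
                    mismatches.append(
                        f"{table}/{row_id}.{field}: "
                        f"{a_row.get(field)!r} != {r_row.get(field)!r}"
                    )
    return mismatches
```

Matching rows by ID is a simplification: rows created by the agent during a run may receive generated IDs, so a fuller comparison would need to pair new rows by content instead.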
Multi-Turn Sample
User scenario
Full dialog (35 messages, 6 user turns, 13 tool calls)
Turn 1 — User:

Hey — can you check my Gmail thread “Household checklist for this week” and tell me if anyone replied?

A:

Step 1 — Clarify account: The user wants to check a Gmail thread titled ‘Household checklist for this week’. I need their account_id first. I don’t have it, so I should ask for their email address to look up their account.

Sure! What’s your email address so I can look up your account?

User:

Yep — it’s ivy.bryant@example.net.

A:

Step 1 — Look up account: I need to look up the user’s account by email first, then search for the Gmail thread.

Tool result:

Step 2 — List threads: I have the account_id. Now I need to search for the Gmail thread ‘Household checklist for this week’ to check if anyone replied.

Tool result:

Step 3 — Search messages in the matched thread: I have the thread IDs but no titles. I need to search Gmail messages with that query to find the right thread and check for replies.

Tool result:

Step 4 — Final answer: I have the thread messages. The original email was sent by Ivy. There’s a reply from pat.rivera@example.com. So yes, someone replied — Pat Rivera replied to the thread.

Yes, someone replied! Pat Rivera (pat.rivera@example.com) responded to your “Household checklist for this week” thread on Feb 23rd. They said they reviewed the details and everything looks good on their end, and asked you to let them know if there’s anything else needed. No reply from Alex Taylor yet though.

Turn 2 — User:

Got it, thanks. Next — can you open my Slides file “Presentation Draft” and change one line? I just need the title to say “Weekly Revenue Update” instead of whatever it is now.

A:

Step 1 — List presentations: User wants to update the title in their ‘Presentation Draft’ slides file. I need to find the presentation first.

Tool result:

Step 2 — Get presentation details: I found the presentation. Now I need to get its details to find the title text box object ID so I can update it.

Tool result:

Step 3 — Create text box with new title: The slide has no existing objects — no title text box to update. I’ll create a text box with ‘Weekly Revenue Update’ on the first slide.

Tool result:

Step 4 — Final answer: The slide had no existing objects, so I created a new text box with ‘Weekly Revenue Update’. The task is done.

Done! The first slide in “Presentation Draft” didn’t have an existing title text box, so I created one with “Weekly Revenue Update” on it. The presentation is all set!

Turn 3 — User:

Perfect. Next can you open my Sheet “Holiday Budget Planner” and tell me what the total budget and total spent numbers are? And after that I need to share the spreadsheet with Parker Chen — can you pull up Parker’s email first and confirm it with me before you send anything?

A:

Step 1 — Search spreadsheet and contacts in parallel: User wants two things: 1) Get values from ‘Holiday Budget Planner’ spreadsheet, and 2) Look up Parker Chen’s contact info. These are independent so I can batch them.

Tool results:

Step 2 — Fetch full details in parallel: I found the spreadsheet ID. Now I need to get the values from the spreadsheet to find total budget and total spent. I also need Parker Chen’s full contact details to find the email. Let me do both in parallel.

Tool results:

Step 3 — Final answer: Total budgeted: $21,300. Total spent: $20,845. Parker has two emails. I should present the budget numbers and ask which email to use for sharing.

Here’s what I found in the Holiday Budget Planner:

Total Budget: $21,300
Total Spent: $20,845
Variance: -$455 under budget overall — looking pretty good!

Now for Parker Chen, I found two email addresses on file:

- parker.chen@example.com (home)
- parker.chen@example.org (work)

Which one should I use to share the spreadsheet? And what access level — viewer, commenter, or editor?

Turn 4 — User:

Use parker.chen@example.org (work) please — can you share it as viewer access? Also after you do that, can you check my calendar for any 30-min free slots next week for a follow-up call and then draft an email to Parker proposing a couple options (show me the draft before sending).

A:

Step 1 — Share file and check calendar in parallel: I can do the share and find free time in parallel since they’re independent. I’ll do those first, then draft the email based on the free slots found.

Tool results:

Step 2 — Final answer: The spreadsheet has been shared. The calendar is wide open next week — no busy periods at all. I’ll pick two reasonable 30-min slots and draft an email to Parker proposing those times. I need to show the draft to the user before sending.

Done — the Holiday Budget Planner is now shared with parker.chen@example.org as a viewer! Your calendar looks wide open next week, so I picked two convenient 30-minute slots:

Option 1: Tuesday, March 4 at 10:00–10:30 AM
Option 2: Thursday, March 6 at 2:00–2:30 PM

Here’s the draft email for your review:

To: parker.chen@example.org
Subject: Follow-Up Call — Holiday Budget Planner

Hi Parker, I wanted to follow up on the Holiday Budget Planner I just shared with you. Could we hop on a quick 30-minute call to review the numbers? Here are two options…

Turn 5 — User:

Looks good — can you tweak the subject to “Follow-up call next week” and change the first line to “I just shared the Holiday Budget Planner with you — could we do a quick 30-min call to review?” Everything else can stay. Then go ahead and send it.

A:

Step 1 — Send the email: User approved the draft with two tweaks: subject changed to ‘Follow-up call next week’ and first line changed. I’ll send the email now with the updated content.

Tool result:

Step 2 — Final answer: The email was sent successfully. I should confirm this to the user and let them know everything is done.

Email sent! Parker will receive it at parker.chen@example.org with the updated subject and opening line. All done — the spreadsheet is shared, the calendar slots are identified, and the email is on its way!
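The summary figures quoted for a dialog (35 messages, 6 user turns, 13 tool calls for this sample) could be recomputed from a datapoint with a few lines of Python. This sketch assumes each dialog is a list of message dicts with a `role` key and that assistant messages may carry a `tool_calls` list, as in OpenAI-style chat logs. The real datapoint schema is not described here, and `dialog_stats` is a hypothetical helper.

```python
from collections import Counter


def dialog_stats(messages):
    """Summarise a dialog given as a list of message dicts.

    Assumption: each message has a "role" key, and assistant messages may
    carry a "tool_calls" list (OpenAI-style chat format).
    """
    roles = Counter(m.get("role", "unknown") for m in messages)
    # Treat a missing or null "tool_calls" entry as zero calls.
    tool_calls = sum(len(m.get("tool_calls") or []) for m in messages)
    return {
        "messages": len(messages),
        "user_turns": roles["user"],
        "tool_calls": tool_calls,
    }
```

Counting user messages rather than numbered turns matters here: Turn 1 contains two user messages, which is why the sample reports 6 user turns across 5 numbered turns.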
Evaluator excerpt (~950 lines Python)
Same two-axis evaluation as single-turn, but with multi-turn-specific criteria:

- Part A, environment state comparison: identical deep-compare of the final MCP server state against the reference.
- Part B, LLM trajectory judge with task-specific criteria.
Download
Browse on Hugging Face
View Google Workspace files