Data Generate creates synthetic data from scratch based on a schema you provide (a local schema file or an MCP server), producing records that conform to the defined structure and constraints. You describe the data you need, and the system generates realistic conversations tailored to your specifications.

Two-Phase Generation Strategy

Generation uses a two-phase workflow to ensure quality:
  1. Phase 1: Pilot Optimization: Runs several iterations, each generating a small batch for validation and refinement. You can provide feedback to improve or customize the generation system.
  2. Phase 2: Large-Scale Generation with Online Monitoring: Uses the optimized generation system from Phase 1 to generate your complete dataset at scale.

Parameters

| Parameter | Required | Description |
| domain | Yes | The problem space or topic area (see Domain) |
| request | Yes | Description of data to generate |
| final_samples | Yes | Number of samples to generate |
| schema_file | One of | Path to a local schema file |
| mcp_server_url | One of | URL of an MCP server providing the schema |
| data_language | No | Language mix specification |
| reference_doc | No | Path to reference documentation |
You must provide either schema_file or mcp_server_url as the function schema source.
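
The parameter rules above can be sketched as a small pre-flight check. This is an illustrative sketch only, not part of the tool: the function name check_parameters and its error messages are hypothetical, and the check that final_samples is a positive integer is an assumption. The source does not say whether supplying both schema sources is allowed, so the sketch only rejects the case where neither is present.

```python
from typing import Any, Dict, List

# Names taken from the parameter table above.
REQUIRED = ("domain", "request", "final_samples")
SCHEMA_SOURCES = ("schema_file", "mcp_server_url")


def check_parameters(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper for illustration, not part of the CLI.
    """
    problems: List[str] = []

    # Every parameter marked "Yes" must be present and non-empty.
    for name in REQUIRED:
        if params.get(name) in (None, ""):
            problems.append(f"missing required parameter: {name}")

    # At least one schema source is needed.
    if not any(params.get(name) for name in SCHEMA_SOURCES):
        problems.append("provide either schema_file or mcp_server_url")

    # Assumption: a sample count only makes sense as a positive integer.
    samples = params.get("final_samples")
    if samples is not None:
        is_int = isinstance(samples, int) and not isinstance(samples, bool)
        if not is_int or samples < 1:
            problems.append("final_samples must be a positive integer")

    return problems


if __name__ == "__main__":
    example = {
        "domain": "airline customer service",
        "request": "Generate multi-turn function-calling dialogs.",
        "final_samples": 10,
        "mcp_server_url": "http://127.0.0.1:8009",
    }
    print(check_parameters(example))
```

Returning a list of problems rather than raising on the first one lets a caller report every missing parameter at once.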
This feature uses the standard chatbot interaction flow. Describe what data you want to generate in natural language, and the CLI guides you through parameter collection and confirmation.

Phase 1: Pilot Optimization

After you confirm the configuration, Phase 1 begins. The system runs several iterations, each producing a small batch of samples for validation. The CLI displays a live progress tracker showing the current iteration, elapsed time, and workflow pipeline status.
[Screenshot: live progress tracker showing iteration, elapsed time, and pipeline status]

Provide Feedback

During each Phase 1 iteration, the CLI opens a browser-based review interface displaying the generated samples. You can:
  • Review each generated conversation
  • Add per-sample feedback in the provided text areas
  • Submit with no feedback if the samples look good
[Screenshot: browser-based review interface for providing feedback on generated samples]
Your feedback is summarized and used to refine the generation strategy for subsequent iterations, progressively improving output quality.

Phase 2: Large-Scale Generation with Online Monitoring

Phase 2 starts automatically once Phase 1 completes. The system uses the optimized generation strategy to produce your full dataset at the target scale, processing in batches. The CLI continues showing progress until the run finishes.
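
The control flow of the two phases can be pictured with a toy sketch. It is a conceptual illustration only and does not reflect the system's internals: Strategy, pilot, scale_up, and the generate and review callbacks are all hypothetical names, and the "strategy" here is simply a list of accumulated feedback notes.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Strategy:
    """Hypothetical stand-in for whatever Phase 1 optimizes."""
    guidance: List[str] = field(default_factory=list)


# generate(strategy, n) returns n samples; review(batch) returns feedback notes.
Generate = Callable[[Strategy, int], List[str]]
Review = Callable[[List[str]], List[str]]


def pilot(generate: Generate, review: Review,
          iterations: int, batch_size: int) -> Strategy:
    """Phase 1: small batches, each followed by a review step."""
    strategy = Strategy()
    for _ in range(iterations):
        batch = generate(strategy, batch_size)
        # Blank feedback means the samples looked fine; keep only real notes.
        notes = [note.strip() for note in review(batch) if note.strip()]
        strategy.guidance.extend(notes)
    return strategy


def scale_up(generate: Generate, strategy: Strategy,
             final_samples: int, batch_size: int) -> List[str]:
    """Phase 2: produce the full dataset in batches with the tuned strategy."""
    samples: List[str] = []
    while len(samples) < final_samples:
        remaining = final_samples - len(samples)
        new = generate(strategy, min(batch_size, remaining))
        if not new:  # Stop rather than loop forever if nothing is produced.
            break
        samples.extend(new)
    return samples
```

In this sketch the strategy returned by pilot is passed unchanged into scale_up, which mirrors how Phase 2 reuses the result of Phase 1 without further review rounds.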

Completion

Upon completion:
[Screenshot: data generation completion summary with output paths]
The sampling config file can be reused for future resample operations to skip Phase 1.

Resample: Quick Generation with Saved Config

If you are satisfied with a previous generation, use resample to skip Phase 1:
[Screenshot: resample mode using a saved configuration to skip Phase 1]

Output

After a run completes, results are saved under outputs/ as a new run directory, for example:
  • outputs/generated_data_<run_id>/
Inside the run directory:
  • generated_data.jsonl - The generated dataset in JSONL format
  • datapoints/ - One JSON file per sample (expanded view of generated_data.jsonl)
  • metadata.json - Run metadata (task type, parameters, primary files, timestamps)
The viewer can browse and render these outputs.
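
If you prefer to inspect a run programmatically, the files listed above can be read with the Python standard library. The sketch below assumes only the file names given in this section; the helper name load_run is hypothetical, and no assumption is made about the fields inside each record or inside metadata.json.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_run(run_dir: str) -> Dict[str, Any]:
    """Load the contents of one run directory (hypothetical helper)."""
    root = Path(run_dir)

    # generated_data.jsonl holds one JSON document per line.
    records: List[Any] = []
    with (root / "generated_data.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # Tolerate blank lines.
                records.append(json.loads(line))

    # metadata.json is read as-is; its fields are not interpreted here.
    metadata_path = root / "metadata.json"
    metadata = None
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))

    # datapoints/ contains one JSON file per sample.
    datapoints_dir = root / "datapoints"
    datapoint_files: List[str] = []
    if datapoints_dir.is_dir():
        datapoint_files = sorted(p.name for p in datapoints_dir.glob("*.json"))

    return {
        "records": records,
        "metadata": metadata,
        "datapoint_files": datapoint_files,
    }
```

A call such as load_run("outputs/generated_data_<run_id>"), with the placeholder replaced by a real run directory, returns the parsed records together with the metadata and the names of the per-sample files.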

Example Output: Tau2-Bench Airline Domain

Below is an excerpt from a set of 20 multi-turn function-calling dialogs generated for the Tau2-Bench airline customer service domain. Each sample is a realistic conversation involving booking, modification, cancellation, refunds, and baggage handling, with tool calls to airline backend APIs.
User:
Hi, this is Emma Smith, user id emma_smith_9363. I have three things I need handled across my bookings: 1) Please cancel reservation E9TZTU with a full refund. 2) Change the flights for reservation 1YVFB9 to travel on 2024-05-28 instead of 2024-05-27 AND upgrade to business class. 3) Add 2 checked bags to reservation …
Assistant: get_user_details({"user_id": "emma_smith_9363"})
Tool response:
{"user_id": "emma_smith_9363", "name": {"first_name": "Emma", "last_name": "Smith"}, "membership": "silver", ...}
Assistant: get_reservation_details({"reservation_id": "E9TZTU"})
Tool response:
{"reservation_id": "E9TZTU", "origin": "PHL", "destination": "DTW", "cabin": "basic_economy", ...}
Assistant:
I’ve pulled up your details. Let me walk through each request. For reservation E9TZTU — this is a basic economy round-trip flight from PHL to DTW. I need to let you know that basic economy reservations can only receive a refund as travel certificates, not a full cash refund …
Dataset statistics:
  • 20 samples, averaging 37 messages per dialog
  • 4–20 tool calls per conversation across functions like get_user_details, get_reservation_details, cancel_reservation, update_reservation_flights, book_reservation, and more
  • Covers diverse scenarios: cancellation, flight modification, baggage updates, refunds, payment handling, insurance, and membership-related requests

Using /execute

You can also run data-generate non-interactively via /execute with a YAML config.

Prerequisites

  • You have a YAML configuration file available.
  • You provide a schema source (schema_file or mcp_server_url).
Example configuration:
task: data-generate
domain: airline customer service
request: Generate 10 multi-turn function-calling dialogs with diverse tool usage.
final_samples: 10
mcp_server_url: http://127.0.0.1:8009
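
If the configuration is produced by a script rather than written by hand, a flat config like the one above can be emitted without a YAML library. The sketch below is an illustration under stated assumptions: the helper name to_flat_yaml is hypothetical, and it relies on JSON scalars (double-quoted strings, bare numbers) being readable as YAML values, which YAML 1.2 was designed to allow. How the resulting file is passed to /execute is not covered here.

```python
import json
from typing import Any, Dict


def to_flat_yaml(config: Dict[str, Any]) -> str:
    """Serialize a flat mapping as YAML text, one "key: value" pair per line.

    Hypothetical helper. Values are encoded with json.dumps, so strings are
    double-quoted and integers stay bare.
    """
    lines = []
    for key, value in config.items():
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    config = {
        "task": "data-generate",
        "domain": "airline customer service",
        "request": "Generate 10 multi-turn function-calling dialogs "
                   "with diverse tool usage.",
        "final_samples": 10,
        "mcp_server_url": "http://127.0.0.1:8009",
    }
    print(to_flat_yaml(config), end="")
```

Quoting every string is more verbose than the hand-written example, but it avoids surprises with values that contain colons or other characters that carry meaning in YAML.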