Agent RL trains a language model using reinforcement learning (RL). Instead of learning from labeled examples, the model learns by attempting tasks and receiving reward signals — allowing it to optimize for goals that are difficult to capture with supervised data alone.
Prerequisites
- An EigenAI account with available credits.
- A prompt dataset in JSONL format (each line must contain a messages array).
Create an RL job
Click Fine-Tune a Model on the Fine-tuning page, then select the RL tab to open the 5-step wizard.
Step 1 — Model
Select a Base model to initialize the agent policy network.
| Model | Price |
|---|---|
| Qwen3-4B-Instruct-2507 | $0.4 / M tokens |
| Qwen3-4B-Thinking-2507 | $0.4 / M tokens |
| Qwen3-30B-A3B-Instruct-2507 | $2.8 / M tokens |
| Qwen3-30B-A3B-Thinking-2507 | $2.8 / M tokens |
Step 2 — Dataset
Upload a prompt dataset. Each line in your JSONL file must contain a messages array.
| Option | Description |
|---|---|
| File upload | Drag and drop a .jsonl file or click to browse. |
| Select existing dataset | Reuse a dataset you have already uploaded. |
Your file should follow one of these formats depending on the reward type you plan to use:
// Basic — prompts only
{"messages": [{"role": "user", "content": "Search for AI news"}]}
// With expected answer — for Task Completion / JSON Validation reward
{"messages": [...], "metadata": {"expected_answer": "4"}}
// With rubric — for LLM-as-a-Judge reward
{"messages": [...], "metadata": {"rubric": "Rate quality"}}
// With expected tools — for Tool Usage reward
{"messages": [...], "metadata": {"expected_tools": ["search", "calc"]}}
Fields you put in metadata become available in your reward function as item["metadata"]["your_field"]. Design your dataset with your reward logic in mind.
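For example, a dataset for a Task Completion-style reward can be generated with a few lines of Python (a sketch; the file name and field values are illustrative):

```python
import json

# Sketch of building a Step 2 dataset: each metadata field written here
# becomes item["metadata"]["<field>"] inside the reward function.
rows = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"}],
     "metadata": {"expected_answer": "4"}},
    {"messages": [{"role": "user", "content": "What is 10 / 2?"}],
     "metadata": {"expected_answer": "5"}},
]

# Write one JSON object per line, as the upload format requires.
with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```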
EigenAI also provides sample datasets to help you get started quickly:
| Sample Dataset | Description |
|---|---|
| [Sample] DeepCoder 1K | Coding tasks for code generation reward |
| [Sample] DAPO-Math 17K | Math problems for equivalence checking |
| [Sample] NQ + HotpotQA 1K | Search and QA tasks |
| [Sample] SimpleQA 1K | Short-answer factual questions |
Step 3 — Environment
Configure how the agent interacts during training and which external tools it can access.
Conversation Mode
| Mode | Description |
|---|---|
| Single-turn | Agent completes the task in one response (User → Agent → Done) |
| Multi-turn | Agent and a simulated user have a back-and-forth conversation. (Coming soon) |
Add MCP servers to give your agent access to external tools during training. Click Add MCP Server and provide the server configuration. You can skip this section if your training doesn’t require tools.
Training Parameters
| Parameter | Default | Description |
|---|---|---|
| Batch Size | 4 | Number of prompts per rollout batch. Options: 1, 2, 4, 8, 16, 32. |
| Epochs | 1 | Number of passes over the dataset. |
| Samples per Prompt | 8 | Number of rollouts generated per prompt per step. |
| Temperature | 0.8 | Sampling temperature for agent rollouts. |
| Max Response Length | 8192 | Maximum token length for each agent response. |
| Save Checkpoint Every | — | Save a checkpoint every N rollouts. Leave empty to save once per epoch (default). |
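These parameters determine how much sampling a job does. A rough back-of-envelope estimate with the defaults above (the relationships here are assumptions for sizing, not the scheduler's exact accounting):

```python
import math

# Rough sizing from the default training parameters.
batch_size = 4          # prompts per rollout batch
samples_per_prompt = 8  # rollouts per prompt per step
epochs = 1
dataset_size = 1000     # e.g., one of the 1K sample datasets

rollouts_per_step = batch_size * samples_per_prompt      # 32
steps_per_epoch = math.ceil(dataset_size / batch_size)   # 250
total_rollouts = steps_per_epoch * rollouts_per_step * epochs
print(total_rollouts)  # 8000
```

Larger batch sizes and more samples per prompt increase rollout volume (and cost) proportionally.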
Step 4 — Reward
Define one or more Python reward functions that score the agent’s output. You can add multiple reward functions with different weights to combine signals.
Preset Reward Suites
Select a preset to auto-generate a starting reward function:
| Preset | Description |
|---|---|
| Math (Recommended) | Math equivalence checking — compares the agent’s final answer to expected_answer in metadata. |
| Coding | Python unit test execution — runs the agent’s code output against test cases. |
| Search Agent | LLM-as-a-Judge for search and QA correctness. |
| Tool-use Agent | LLM-as-a-Judge for evaluating tool usage quality. |
| Custom | Write your own reward function from scratch. |
Writing a Reward Function
Each reward function must implement a grade function with the following signature:
from typing import Any

def grade(sample: Any, item: Any) -> float:
    """
    Evaluate agent trajectory and return a reward score.

    Args:
        sample: Agent output generated during training.
            - trajectories: list of steps, each with:
                - response: str — agent's text output
                - tool_calls: list[dict] — functions the agent called
                - tool_results: list — results from tool execution
        item: One row from your uploaded JSONL dataset.
            - prompt: the original user message
            - metadata: your custom fields (e.g. expected_answer, rubric)

    Returns:
        float: reward score for this trajectory
    """
    ...
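A minimal working example of this signature is an exact-match check against expected_answer (a sketch assuming dict-style access to sample and item; adapt the field access to your actual payloads):

```python
from typing import Any

def grade(sample: Any, item: Any) -> float:
    """Return 1.0 if the expected answer appears in the agent's final response."""
    expected = str(item["metadata"].get("expected_answer", "")).strip().lower()
    if not expected:
        return 0.0  # no ground truth to check against
    # Treat the last trajectory step as the agent's final output.
    final = sample["trajectories"][-1]["response"].strip().lower()
    return 1.0 if expected in final else 0.0
```

Substring matching like this is lenient; for math tasks, the Math preset's equivalence checking is usually more robust.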
EigenMagic Create
Describe what you want to reward in plain language, and EigenMagic will generate the Python code for you. For example:
“Compare the final answer in the agent’s response to the expected_answer in my dataset.”
EigenMagic Review
Before proceeding, click EigenMagic Review to validate your reward function. The reviewer checks for issues such as missing imports, runtime errors, or logic problems. You may continue regardless of the verdict.
Environment Variables
If your reward function calls an external API (e.g., an LLM judge), add the required API keys as environment variables. They will be available via os.environ at runtime.
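A sketch of the pattern (the judge helper below is a placeholder, and JUDGE_API_KEY is a hypothetical variable name; substitute your provider's client and the key you actually configure in this step):

```python
import os

def call_judge(api_key: str, rubric: str, response: str) -> float:
    """Placeholder for a real LLM-judge API call; substitute your
    provider's SDK here. This stub just scores non-empty responses."""
    return 1.0 if api_key and response.strip() else 0.0

def grade(sample, item):
    # Keys added under Environment Variables are readable at runtime.
    api_key = os.environ.get("JUDGE_API_KEY", "")
    rubric = item["metadata"]["rubric"]
    response = sample["trajectories"][-1]["response"]
    return call_judge(api_key, rubric, response)
```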
Step 5 — Review
Review the full job configuration before launching.
| Field | Description |
|---|---|
| Base Model | The selected base policy model. |
| Dataset | The prompt dataset used for training. |
| Mode | Conversation mode (e.g., Single-turn). |
| MCP Servers | Number of configured MCP servers and tools. |
| Reward Type | The reward function type (e.g., Python Function). |
| Epochs | Number of training epochs. |
| Batch Size | Number of prompts per rollout batch. |
| Samples per Prompt | Number of rollouts per prompt. |
| Temperature | Sampling temperature. |
Check the acknowledgment box and click Confirm & Create to launch the job.
RL job details
Click any RL job in the fine-tuning list to open its detail page.
Configuration
Shows the full job configuration:
- Status, Training mode, Base model, Training dataset, Evaluation dataset
- Batch size, Learning rate, Queue position
- Created / Started / Completed timestamps
RL Configuration
Shows RL-specific settings used for the job:
- Samples/prompt, Temperature, Max response len, Reward function (Python code), Reward sources
Progress
Shows live training progress: completion percentage, current step out of total steps, and a summary of the latest Loss, Grad norm, and Learning rate values.
Evaluation Snapshot
Shows the latest rollout evaluation metrics at the most recent logged step.
| Metric | Description |
|---|---|
| Advantages | Mean advantage value across rollouts. |
| Eval Response Lengths | Average token length of evaluation responses. |
| Eval Reward | Average reward score on evaluation rollouts. |
Use the Step Range inputs to filter metrics to a specific step window.
Rollout Metrics
Real-time charts updated as training progresses. All charts plot values over rollout steps.
| Chart | Description |
|---|---|
| Reward | Mean reward score per rollout step. |
| Response Lengths | Average token length of agent responses per rollout step. |
| KL Divergence | KL divergence between the policy and reference model per rollout step. |
| Advantages | Mean advantage value per rollout step. |
Training Metrics
| Chart | Description |
|---|---|
| Loss | Training loss over steps. Includes EMA smoothing control. |
| Entropy Loss | Entropy of the policy output over steps. |
| Gradient Norm | Gradient norm over steps. |
| Learning Rate | Learning rate schedule over steps. |
The latest values are shown as a summary line (e.g., loss 4.11e-5 • grad 0.505 • lr 2.00e-6 @ step 101). Source data comes from training.log.
Rollout Explorer
Browse individual rollout steps. Each step shows the number of prompts, total samples, average reward, and completion status. Click any step to inspect individual rollout trajectories.
Model Checkpoints
Checkpoints are saved according to the Save Checkpoint Every setting (or once per epoch by default). Each checkpoint is labeled by the training step at which it was saved.
| Field | Description |
|---|---|
| Step N | Checkpoint label (e.g., Step 1, Step 3, …). |
| HuggingFace | Format of the saved weights. |
| Files / Size / Step | Number of files, total size, and the training step. |
Each checkpoint has two buttons:
- Details — View the full list of files in the checkpoint.
- Deploy — Create an inference deployment directly from this checkpoint. See Deployments for details.
Additional files
| File | Description |
|---|---|
| checkpoint_status.json | Metadata about checkpoint state. |
| training.log | Full training log file. |
| raw_output.log | Raw process output log. |
| rollouts.jsonl | All rollout trajectories in JSONL format. |
| global_dataset_state_dict_*.pt | Dataset state snapshots for resuming training. |
Click Download next to any file to save it locally.
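Downloaded files can be analyzed offline. For instance, a quick pass over rollouts.jsonl (the "reward" field name here is an assumption; inspect one line of your downloaded file to confirm the schema):

```python
import json

def mean_reward(path: str) -> float:
    """Average a per-rollout reward field across a rollouts.jsonl file.
    The field name is assumed; verify it against your downloaded file."""
    rewards = []
    with open(path) as f:
        for line in f:
            if line.strip():
                rewards.append(float(json.loads(line)["reward"]))
    return sum(rewards) / len(rewards) if rewards else 0.0
```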
Logs
The Logs section displays the last 200 lines of real-time training output. Click Refresh to fetch the latest lines.