Agent RL trains a language model using reinforcement learning (RL). Instead of learning from labeled examples, the model learns by attempting tasks and receiving reward signals — allowing it to optimize for goals that are difficult to capture with supervised data alone.
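Conceptually, each training step samples several attempts (rollouts) for a batch of prompts, scores them with a reward function, and uses those scores to update the policy. A minimal sketch of that loop — the function names here are illustrative placeholders, not part of the EigenAI API:

```python
# Illustrative rollout loop; `policy`, `reward_fn`, and the return value
# are placeholders, not EigenAI's actual API.
def collect_rollouts(policy, prompts, reward_fn, samples_per_prompt=8):
    rollouts = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response = policy(prompt)            # agent attempts the task
            score = reward_fn(prompt, response)  # reward signal
            rollouts.append((prompt, response, score))
    return rollouts  # used to compute advantages and update the policy
```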

Prerequisites

  • An EigenAI account with available credits.
  • A prompt dataset in JSONL format (each line must contain a messages array).

Create an RL job

Click Fine-Tune a Model on the Fine-tuning page, then select the RL tab to open the 5-step wizard.

Step 1 — Model

Select a Base model to initialize the agent policy network.
| Model | Price |
| --- | --- |
| Qwen3-4B-Instruct-2507 | $0.4 / M tokens |
| Qwen3-4B-Thinking-2507 | $0.4 / M tokens |
| Qwen3-30B-A3B-Instruct-2507 | $2.8 / M tokens |
| Qwen3-30B-A3B-Thinking-2507 | $2.8 / M tokens |

Step 2 — Dataset

Upload a prompt dataset. Each line in your JSONL file must contain a messages array.
| Option | Description |
| --- | --- |
| File upload | Drag and drop a .jsonl file or click to browse. |
| Select existing dataset | Reuse a dataset you have already uploaded. |
Your file should follow one of these formats depending on the reward type you plan to use:
// Basic — prompts only
{"messages": [{"role": "user", "content": "Search for AI news"}]}

// With expected answer — for Task Completion / JSON Validation reward
{"messages": [...], "metadata": {"expected_answer": "4"}}

// With rubric — for LLM-as-a-Judge reward
{"messages": [...], "metadata": {"rubric": "Rate quality"}}

// With expected tools — for Tool Usage reward
{"messages": [...], "metadata": {"expected_tools": ["search", "calc"]}}
Fields you put in metadata become available in your reward function as item["metadata"]["your_field"]. Design your dataset with your reward logic in mind.
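Before uploading, it can help to sanity-check that every line parses as JSON, contains a messages array, and carries the metadata keys your reward logic will read. A quick validation sketch (the helper name is our own, not part of the platform):

```python
import json

def validate_dataset(path, required_metadata=()):
    """Check each JSONL line has a `messages` array and the required metadata keys."""
    with open(path) as f:
        for n, line in enumerate(f, 1):
            item = json.loads(line)  # raises on malformed JSON
            assert isinstance(item.get("messages"), list), f"line {n}: missing messages array"
            for key in required_metadata:
                assert key in item.get("metadata", {}), f"line {n}: missing metadata.{key}"
```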
EigenAI also provides sample datasets to help you get started quickly:
| Sample Dataset | Description |
| --- | --- |
| [Sample] DeepCoder 1K | Coding tasks for code generation reward |
| [Sample] DAPO-Math 17K | Math problems for equivalence checking |
| [Sample] NQ + HotpotQA 1K | Search and QA tasks |
| [Sample] SimpleQA 1K | Short-answer factual questions |

Step 3 — Tools

Configure how the agent interacts during training and which external tools it can access.

Conversation Mode

| Mode | Description |
| --- | --- |
| Single-turn | Agent completes the task in one response (User → Agent → Done). |
| Multi-turn | Agent and a simulated user have a back-and-forth conversation. (Coming soon) |

MCP Tool Servers (Optional)

Add MCP servers to give your agent access to external tools during training. Click Add MCP Server and provide the server configuration. You can skip this section if your training doesn’t require tools.
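The exact configuration fields depend on the server you add; a typical MCP server entry looks something like the following (all values here are hypothetical — substitute your own server's command, arguments, and credentials):

```json
{
  "name": "search",
  "command": "npx",
  "args": ["-y", "@modelcontextprotocol/server-brave-search"],
  "env": { "BRAVE_API_KEY": "your-key" }
}
```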

Training Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| Batch Size | 4 | Number of prompts per rollout batch. Options: 1, 2, 4, 8, 16, 32. |
| Epochs | 1 | Number of passes over the dataset. |
| Samples per Prompt | 8 | Number of rollouts generated per prompt per step. |
| Temperature | 0.8 | Sampling temperature for agent rollouts. |
| Max Response Length | 8192 | Maximum token length for each agent response. |
| Save Checkpoint Every | (empty) | Save a checkpoint every N rollouts. Leave empty to save once per epoch (default). |
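Note that Batch Size and Samples per Prompt multiply: each rollout step generates batch_size × samples_per_prompt trajectories, which drives both the training signal and your token cost per step. With the defaults:

```python
# Rollouts generated per training step (using the wizard defaults).
batch_size = 4          # prompts per rollout batch
samples_per_prompt = 8  # rollouts per prompt
rollouts_per_step = batch_size * samples_per_prompt
print(rollouts_per_step)  # 32 trajectories scored per step
```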

Step 4 — Reward

Define one or more Python reward functions that score the agent’s output. You can add multiple reward functions with different weights to combine signals.

Preset Reward Suites

Select a preset to auto-generate a starting reward function:
| Preset | Description |
| --- | --- |
| Math (Recommended) | Math equivalence checking — compares the agent's final answer to expected_answer in metadata. |
| Coding | Python unit test execution — runs the agent's code output against test cases. |
| Search Agent | LLM-as-a-Judge for search and QA correctness. |
| Tool-use Agent | LLM-as-a-Judge for evaluating tool usage quality. |
| Custom | Write your own reward function from scratch. |

Writing a Reward Function

Each reward function must implement a grade function with the following signature:
from typing import Any

def grade(sample: Any, item: Any) -> float:
    """
    Evaluate agent trajectory and return a reward score.

    Args:
        sample: Agent output generated during training.
            - trajectories: list of steps, each with:
                - response: str — agent's text output
                - tool_calls: list[dict] — functions the agent called
                - tool_results: list — results from tool execution
        item: One row from your uploaded JSONL dataset.
            - prompt: the original user message
            - metadata: your custom fields (e.g. expected_answer, rubric)

    Returns:
        float: reward score for this trajectory
    """
    ...
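For example, a minimal exact-match reward against expected_answer might look like this. It assumes dict-style access to sample and that the final trajectory step's response contains the answer — adapt the extraction to your own output format:

```python
from typing import Any

def grade(sample: Any, item: Any) -> float:
    """Reward 1.0 if the agent's final response contains the expected answer.

    Assumes `sample` and `item` support dict-style access; adjust if your
    runtime exposes attributes instead.
    """
    expected = str(item["metadata"]["expected_answer"]).strip()
    final_response = sample["trajectories"][-1]["response"]
    return 1.0 if expected in final_response else 0.0
```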

EigenMagic Create

Describe what you want to reward in plain language, and EigenMagic will generate the Python code for you. For example:
“Compare the final answer in the agent’s response to the expected_answer in my dataset.”

EigenMagic Review

Before proceeding, click EigenMagic Review to validate your reward function. The reviewer checks for issues such as missing imports, runtime errors, or logic problems. You may continue regardless of the verdict.

Environment Variables

If your reward function calls an external API (e.g., an LLM judge), add the required API keys as environment variables. They will be available via os.environ at runtime.
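For example, a reward function that calls an LLM judge might read its key like this. JUDGE_API_KEY is an illustrative variable name — use whatever name you configured in this step:

```python
import os

def get_judge_key() -> str:
    """Read the judge API key set in the Environment Variables step."""
    key = os.environ.get("JUDGE_API_KEY")
    if not key:
        raise RuntimeError("JUDGE_API_KEY is not set")
    return key
```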

Step 5 — Review

Review the full job configuration before launching.
| Field | Description |
| --- | --- |
| Base Model | The selected base policy model. |
| Dataset | The prompt dataset used for training. |
| Mode | Conversation mode (e.g., Single-turn). |
| MCP Servers | Number of configured MCP servers and tools. |
| Reward Type | The reward function type (e.g., Python Function). |
| Epochs | Number of training epochs. |
| Batch Size | Number of prompts per rollout batch. |
| Samples per Prompt | Number of rollouts per prompt. |
| Temperature | Sampling temperature. |
Check the acknowledgment box and click Confirm & Create to launch the job.

RL job details

Click any RL job in the fine-tuning list to open its detail page.

Configuration

Shows the full job configuration:
  • Status, Training mode, Base model, Training dataset, Evaluation dataset
  • Batch size, Learning rate, Queue position
  • Created / Started / Completed timestamps

RL Configuration

Shows RL-specific settings used for the job:
  • Samples/prompt, Temperature, Max response len, Reward function (Python code), Reward sources

Progress

Shows live training progress: completion percentage, current step out of total steps, and a summary of the latest Loss, Grad norm, and Learning rate values.

Evaluation Snapshot

Shows the latest rollout evaluation metrics at the most recent logged step.
| Metric | Description |
| --- | --- |
| Advantages | Mean advantage value across rollouts. |
| Eval Response Lengths | Average token length of evaluation responses. |
| Eval Reward | Average reward score on evaluation rollouts. |
Use the Step Range inputs to filter metrics to a specific step window.
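The exact advantage computation depends on the training algorithm. With multiple samples per prompt, a common scheme normalizes each rollout's reward against its prompt group — a GRPO-style sketch, not necessarily EigenAI's implementation:

```python
def group_advantages(rewards):
    """Normalize rewards within one prompt's group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```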

Rollout Metrics

Real-time charts updated as training progresses. All charts plot values over rollout steps.
| Chart | Description |
| --- | --- |
| Reward | Mean reward score per rollout step. |
| Response Lengths | Average token length of agent responses per rollout step. |
| KL Divergence | KL divergence between the policy and reference model per rollout step. |
| Advantages | Mean advantage value per rollout step. |

Training Metrics

| Chart | Description |
| --- | --- |
| Loss | Training loss over steps. Includes EMA smoothing control. |
| Entropy Loss | Entropy of the policy output over steps. |
| Gradient Norm | Gradient norm over steps. |
| Learning Rate | Learning rate schedule over steps. |
The latest values are shown as a summary line (e.g., loss 4.11e-5 • grad 0.505 • lr 2.00e-6 @ step 101). Source data comes from training.log.
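The EMA smoothing control on the Loss chart applies a standard exponential moving average; a sketch of how the smoothing factor transforms the raw series (higher alpha means a smoother curve):

```python
def ema(values, alpha=0.9):
    """Exponential moving average: each point blends the previous smoothed
    value with the new raw value."""
    smoothed = []
    prev = values[0]
    for v in values:
        prev = alpha * prev + (1 - alpha) * v
        smoothed.append(prev)
    return smoothed
```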

Rollout Explorer

Browse individual rollout steps. Each step shows the number of prompts, total samples, average reward, and completion status. Click any step to inspect individual rollout trajectories.

Model Checkpoints

Checkpoints are saved according to the Save Checkpoint Every setting (or once per epoch by default). Each checkpoint is labeled by the training step at which it was saved.
| Field | Description |
| --- | --- |
| Step N | Checkpoint label (e.g., Step 1, Step 3, …). |
| HuggingFace | Format of the saved weights. |
| Files / Size / Step | Number of files, total size, and the training step. |
Each checkpoint has two buttons:
  • Details — View the full list of files in the checkpoint.
  • Deploy — Create an inference deployment directly from this checkpoint. See Deployments for details.

Additional files

| File | Description |
| --- | --- |
| checkpoint_status.json | Metadata about checkpoint state. |
| training.log | Full training log file. |
| raw_output.log | Raw process output log. |
| rollouts.jsonl | All rollout trajectories in JSONL format. |
| global_dataset_state_dict_*.pt | Dataset state snapshots for resuming training. |
Click Download next to any file to save it locally.

Logs

The Logs section displays the last 200 lines of real-time training output. Click Refresh to fetch the latest lines.