Agent RL trains a language model using reinforcement learning (RL). Instead of learning from labeled examples, the model learns by attempting tasks and receiving reward signals — allowing it to optimize for goals that are difficult to capture with supervised data alone.
Prerequisites
- An EigenAI account with available credits.
- A prompt dataset in JSONL format (each line must contain a messages array).
Create an RL job
Click Fine-Tune a Model on the Fine-tuning page, then select the RL tab to open the 5-step wizard.
Step 1 — Model
Select a Base model to initialize the agent policy network.
| Model | Price |
|---|---|
| Qwen3-4B-Instruct-2507 | $0.4 / M tokens |
| Qwen3-4B-Thinking-2507 | $0.4 / M tokens |
| Qwen3-30B-A3B-Instruct-2507 | $2.8 / M tokens |
| Qwen3-30B-A3B-Thinking-2507 | $2.8 / M tokens |
Step 2 — Dataset
Upload a prompt dataset. Each line in your JSONL file must contain a messages array.
| Option | Description |
|---|---|
| File upload | Drag and drop a .jsonl file or click to browse. |
| Select existing dataset | Reuse a dataset you have already uploaded. |
Your file should follow one of these formats depending on the reward type you plan to use:
// Basic — prompts only
{"messages": [{"role": "user", "content": "Search for AI news"}]}
// With expected answer — for Task Completion / JSON Validation reward
{"messages": [...], "metadata": {"expected_answer": "4"}}
// With rubric — for LLM-as-a-Judge reward
{"messages": [...], "metadata": {"rubric": "Rate quality"}}
// With expected tools — for Tool Usage reward
{"messages": [...], "metadata": {"expected_tools": ["search", "calc"]}}
Fields you put in metadata become available in your reward function as item["metadata"]["your_field"]. Design your dataset with your reward logic in mind.
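For example, a dataset for a Task Completion-style reward can be generated with a few lines of Python (a sketch; the file name and field values are illustrative):

```python
import json

# Sketch of building a Step 2 dataset: each metadata field written here
# becomes item["metadata"]["<field>"] inside the reward function.
rows = [
    {"messages": [{"role": "user", "content": "What is 2 + 2?"}],
     "metadata": {"expected_answer": "4"}},
    {"messages": [{"role": "user", "content": "What is 10 / 2?"}],
     "metadata": {"expected_answer": "5"}},
]

# Write one JSON object per line, as the upload format requires.
with open("train.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
```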
EigenAI also provides sample datasets to help you get started quickly:
| Sample Dataset | Description |
|---|---|
| [Sample] DeepCoder 1K | Coding tasks for code generation reward |
| [Sample] DAPO-Math 17K | Math problems for equivalence checking |
| [Sample] NQ + HotpotQA 1K | Search and QA tasks |
| [Sample] SimpleQA 1K | Short-answer factual questions |
Step 3 — Environment
Configure how the agent interacts during training and which external tools it can access.
Conversation Mode
| Mode | Description |
|---|---|
| Single-turn | Agent completes the task in one response (User → Agent → Done) |
| Multi-turn | Agent and a simulated user have a back-and-forth conversation. (Coming soon) |
Add MCP servers to give your agent access to external tools during training. Click Add MCP Server and provide the server configuration. You can skip this section if your training doesn’t require tools.
Training Parameters
| Parameter | Default | Description |
|---|---|---|
| Batch Size | 4 | Number of prompts per rollout batch. Options: 1, 2, 4, 8, 16, 32. |
| Epochs | 1 | Number of passes over the dataset. |
| Samples per Prompt | 8 | Number of rollouts generated per prompt per step. |
| Temperature | 0.8 | Sampling temperature for agent rollouts. |
| Max Response Length | 8192 | Maximum token length for each agent response. |
| Save Checkpoint Every | — | Save a checkpoint every N rollouts. Leave empty to save once per epoch (default). |
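These parameters determine how much sampling a job does. A rough back-of-envelope estimate with the defaults above (the relationships here are assumptions for sizing, not the scheduler's exact accounting):

```python
import math

# Rough sizing from the default training parameters.
batch_size = 4          # prompts per rollout batch
samples_per_prompt = 8  # rollouts per prompt per step
epochs = 1
dataset_size = 1000     # e.g., one of the 1K sample datasets

rollouts_per_step = batch_size * samples_per_prompt      # 32
steps_per_epoch = math.ceil(dataset_size / batch_size)   # 250
total_rollouts = steps_per_epoch * rollouts_per_step * epochs
print(total_rollouts)  # 8000
```

Larger batch sizes and more samples per prompt increase rollout volume (and cost) proportionally.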
Step 4 — Reward
Define one or more Python reward functions that score the agent’s output. You can add multiple reward functions with different weights to combine signals.
Preset Reward Suites
Select a preset to auto-generate a starting reward function:
| Preset | Description |
|---|---|
| Math (Recommended) | Math equivalence checking — compares the agent’s final answer to expected_answer in metadata. |
| Coding | Python unit test execution — runs the agent’s code output against test cases. |
| Search Agent | LLM-as-a-Judge for search and QA correctness. |
| Tool-use Agent | LLM-as-a-Judge for evaluating tool usage quality. |
| Custom | Write your own reward function from scratch. |
Writing a Reward Function
Each reward function must implement a grade function with the following signature:
from typing import Any

def grade(sample: Any, item: Any) -> float:
    """
    Evaluate agent trajectory and return a reward score.

    Args:
        sample: Agent output generated during training.
            - trajectories: list of steps, each with:
                - response: str — agent's text output
                - tool_calls: list[dict] — functions the agent called
                - tool_results: list — results from tool execution
        item: One row from your uploaded JSONL dataset.
            - prompt: the original user message
            - metadata: your custom fields (e.g. expected_answer, rubric)

    Returns:
        float: reward score for this trajectory
    """
    ...
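A minimal working example of this signature is an exact-match check against expected_answer (a sketch assuming dict-style access to sample and item; adapt the field access to your actual payloads):

```python
from typing import Any

def grade(sample: Any, item: Any) -> float:
    """Return 1.0 if the expected answer appears in the agent's final response."""
    expected = str(item["metadata"].get("expected_answer", "")).strip().lower()
    if not expected:
        return 0.0  # no ground truth to check against
    # Treat the last trajectory step as the agent's final output.
    final = sample["trajectories"][-1]["response"].strip().lower()
    return 1.0 if expected in final else 0.0
```

Substring matching like this is lenient; for math tasks, the Math preset's equivalence checking is usually more robust.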
EigenMagic Create
Describe what you want to reward in plain language, and EigenMagic will generate the Python code for you. For example:
“Compare the final answer in the agent’s response to the expected_answer in my dataset.”
EigenMagic Review
Before proceeding, click EigenMagic Review to validate your reward function. The reviewer checks for issues such as missing imports, runtime errors, or logic problems. You may continue regardless of the verdict.
Environment Variables
If your reward function calls an external API (e.g., an LLM judge), add the required API keys as environment variables. They will be available via os.environ at runtime.
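A sketch of the pattern (the judge helper below is a placeholder, and JUDGE_API_KEY is a hypothetical variable name; substitute your provider's client and the key you actually configure in this step):

```python
import os

def call_judge(api_key: str, rubric: str, response: str) -> float:
    """Placeholder for a real LLM-judge API call; substitute your
    provider's SDK here. This stub just scores non-empty responses."""
    return 1.0 if api_key and response.strip() else 0.0

def grade(sample, item):
    # Keys added under Environment Variables are readable at runtime.
    api_key = os.environ.get("JUDGE_API_KEY", "")
    rubric = item["metadata"]["rubric"]
    response = sample["trajectories"][-1]["response"]
    return call_judge(api_key, rubric, response)
```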
Step 5 — Review
Review the full job configuration before launching.
| Field | Description |
|---|---|
| Base Model | The selected base policy model. |
| Dataset | The prompt dataset used for training. |
| Mode | Conversation mode (e.g., Single-turn). |
| MCP Servers | Number of configured MCP servers and tools. |
| Reward Type | The reward function type (e.g., Python Function). |
| Epochs | Number of training epochs. |
| Batch Size | Number of prompts per rollout batch. |
| Samples per Prompt | Number of rollouts per prompt. |
| Temperature | Sampling temperature. |
Check the acknowledgment box and click Confirm & Create to launch the job.
RL job details
Click any RL job in the fine-tuning list to open its detail page.
Configuration
Shows the full job configuration:
- Status, Training mode, Base model, Training dataset, Evaluation dataset
- Batch size, Learning rate, Queue position
- Created / Started / Completed timestamps
RL Configuration
Shows RL-specific settings used for the job:
- Samples/prompt, Temperature, Max response len, Reward function (Python code), Reward sources
Progress
Shows live training progress: completion percentage, current step out of total steps, and a summary of the latest Loss, Grad norm, and Learning rate values.
Evaluation Snapshot
Shows the latest rollout evaluation metrics at the most recent logged step.
| Metric | Description |
|---|---|
| Advantages | Mean advantage value across rollouts. |
| Eval Response Lengths | Average token length of evaluation responses. |
| Eval Reward | Average reward score on evaluation rollouts. |
Use the Step Range inputs to filter metrics to a specific step window.
Rollout Metrics
Real-time charts updated as training progresses. All charts plot values over rollout steps.
| Chart | Description |
|---|---|
| Reward | Mean reward score per rollout step. |
| Response Lengths | Average token length of agent responses per rollout step. |
| KL Divergence | KL divergence between the policy and reference model per rollout step. |
| Advantages | Mean advantage value per rollout step. |
Training Metrics
| Chart | Description |
|---|---|
| Loss | Training loss over steps. Includes EMA smoothing control. |
| Entropy Loss | Entropy of the policy output over steps. |
| Gradient Norm | Gradient norm over steps. |
| Learning Rate | Learning rate schedule over steps. |
The latest values are shown as a summary line (e.g., loss 4.11e-5 • grad 0.505 • lr 2.00e-6 @ step 101). Source data comes from training.log.
Rollout Explorer
Browse individual rollout steps. Each step shows the number of prompts, total samples, average reward, and completion status. Click any step to inspect individual rollout trajectories.
Model Checkpoints
Checkpoints are saved according to the Save Checkpoint Every setting (or once per epoch by default). Each checkpoint is labeled by the training step at which it was saved.
| Field | Description |
|---|---|
| Step N | Checkpoint label (e.g., Step 1, Step 3, …). |
| HuggingFace | Format of the saved weights. |
| Files / Size / Step | Number of files, total size, and the training step. |
Each checkpoint has two buttons:
- Details — View the full list of files in the checkpoint.
- Deploy — Create an inference deployment directly from this checkpoint. See Deployments for details.
Additional files
| File | Description |
|---|---|
| checkpoint_status.json | Metadata about checkpoint state. |
| training.log | Full training log file. |
| raw_output.log | Raw process output log. |
| rollouts.jsonl | All rollout trajectories in JSONL format. |
| global_dataset_state_dict_*.pt | Dataset state snapshots for resuming training. |
Click Download next to any file to save it locally.
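Downloaded files can be analyzed offline. For instance, a quick pass over rollouts.jsonl (the "reward" field name here is an assumption; inspect one line of your downloaded file to confirm the schema):

```python
import json

def mean_reward(path: str) -> float:
    """Average a per-rollout reward field across a rollouts.jsonl file.
    The field name is assumed; verify it against your downloaded file."""
    rewards = []
    with open(path) as f:
        for line in f:
            if line.strip():
                rewards.append(float(json.loads(line)["reward"]))
    return sum(rewards) / len(rewards) if rewards else 0.0
```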
Logs
The Logs section displays the last 200 lines of real-time training output. Click Refresh to fetch the latest lines.