<h1 align="center">RAGEN: Training Agents by Reinforcing Reasoning</h1> <h3 align="center"><em>Diagnose agent failure modes. Make your RL training better.</em></h3> <p align="center"><img src="public/ragen_logo.jpeg" width="300px" alt="RAGEN icon" /></p> <p align="center"> <strong>RAGEN</strong> (<b>R</b>easoning <b>AGEN</b>T) is a flexible RL framework for training reasoning agents. </p> <p align="center"> We develop <strong>diagnostics to understand <i>how</i> agent RL training works </strong>, and how to fix hidden issues. </p> <p align="center"> <a href="https://ragen-ai.github.io/v2/pdf/RAGEN-v2.pdf"><img src="https://img.shields.io/badge/📄_V2_Paper-DC143C?style=for-the-badge&logoColor=white" alt="V2 Paper"></a> <a href="https://arxiv.org/abs/2504.20073"><img src="https://img.shields.io/badge/📄_v1_Paper-FF8C00?style=for-the-badge&logoColor=white" alt="v1 Paper"></a> <a href="https://ragen-ai.github.io/"><img src="https://img.shields.io/badge/📝_HomePage-FF5722?style=for-the-badge&logoColor=white" alt="Blog"></a> <a href="https://ragen-doc.readthedocs.io/"><img src="https://img.shields.io/badge/📚_Documentation-4285F4?style=for-the-badge&logoColor=white" alt="Documentation"></a> <a href="https://x.com/wzihanw/status/1915052871474712858"><img src="https://img.shields.io/badge/🔍_Post-34A853?style=for-the-badge&logoColor=white" alt="Post"></a> <a href="https://api.wandb.ai/links/zihanwang-ai-northwestern-university/a8er8l7b"><img src="https://img.shields.io/badge/🧪_Experiment_Log-AB47BC?style=for-the-badge&logoColor=white" alt="Experiment Log"></a> </p>

Looking for the V1 README? Please take a look here.

News

  • 2026.3.12. We are excited to release <font color="#DC143C">RAGEN V2</font>! We introduce a systematic study of reasoning collapse in agent RL and lightweight interventions for stable training. See the <font color="#DC143C">v2 paper</font>.
  • 2025.4.20. RAGEN V1 paper published on arXiv.
  • 2025.1.27. Initial RAGEN release. Post.

About

RAGEN is built around StarPO (State-Thinking-Actions-Reward Policy Optimization), a unified RL framework for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures.

RAGEN provides:

  • StarPO framework. Unified optimization for multi-turn agents, supporting both trajectory-level and turn-wise training.
  • 10 built-in environments. Sokoban, FrozenLake, WebShop, DeepCoder, SearchQA, Lean, Bandit, Countdown, MetaMathQA, Sudoku.
  • Gym-compatible interface. Easy to add custom environments.
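To illustrate what a Gym-compatible environment looks like in practice, here is a minimal toy environment sketch. The class name, observation strings, and the exact `reset`/`step` signatures are assumptions for illustration; RAGEN's actual environment base class may differ.

```python
import random
from typing import Tuple

class CoinFlipEnv:
    """Toy Gym-style environment: the agent guesses a biased coin flip.

    Illustrative sketch only -- RAGEN's real environment interface and
    method signatures may differ from this.
    """

    def __init__(self, bias: float = 0.7, seed: int = 0):
        self.bias = bias
        self.rng = random.Random(seed)
        self.outcome = None

    def reset(self) -> str:
        # Sample the hidden outcome and return a textual observation.
        self.outcome = "heads" if self.rng.random() < self.bias else "tails"
        return "A biased coin was flipped. Guess: heads or tails?"

    def step(self, action: str) -> Tuple[str, float, bool, dict]:
        # Reward 1.0 for a correct guess; single-turn episode ends immediately.
        reward = 1.0 if action.strip().lower() == self.outcome else 0.0
        return "Episode over.", reward, True, {"outcome": self.outcome}
```

A custom environment following this shape can then be registered and rolled out like the built-in ones.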

<font color="#DC143C">RAGEN V2</font> additionally introduces:

  • SNR-Adaptive Filtering (<font color="#DC143C">V2</font>). Lightweight rollout filtering based on reward variance to mitigate noisy gradient updates.
  • Reasoning collapse diagnostics (<font color="#DC143C">V2</font>). Mutual information proxy metrics to detect and monitor template collapse during training.

Algorithm

StarPO: Reinforcing Reasoning via Trajectory-Level Optimization

<p align="center"><img src="public/starpo_logo.png" width="800px" alt="StarPO Framework" /></p> <p align="center" style="font-size: 16px; max-width: 800px; margin: 0 auto;"> The StarPO (State-Thinking-Actions-Reward Policy Optimization) framework with two interleaved stages: <b>rollout stage</b> and <b>update stage</b>. The LLM generates reasoning-guided actions to interact with the environment, collecting trajectory-level rewards to jointly optimize reasoning and action strategies. </p>

MDP Formulation. Agent-environment interactions are formulated as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics. The objective is to maximize expected cumulative rewards across multiple interaction turns.
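In standard notation, this objective can be written as (a generic trajectory-level RL objective; the paper's exact notation may differ):

```latex
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right],
```

where $\tau = (s_0, a_0, \dots, s_T, a_T)$ is a full interaction trajectory, each state $s_t$ and action $a_t$ is a token sequence, and $r$ is the environment reward.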

Rollout Stage. Given an initial state, the LLM generates multiple trajectories. At each step, the model produces a reasoning-guided action: <think>...</think><ans> action </ans>. The environment returns feedback (reward and next state).
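A minimal sketch of parsing this turn format, assuming the `<think>...</think><ans>...</ans>` structure described above (the function name and the handling of malformed output are illustrative, not RAGEN's actual parser):

```python
import re

# Match one reasoning segment followed by one action segment.
ACTION_RE = re.compile(r"<think>(.*?)</think>\s*<ans>(.*?)</ans>", re.DOTALL)

def parse_turn(text: str):
    """Split one model turn into (reasoning, action).

    Returns (None, None) when the format is violated, which a trainer
    could treat as an invalid action (e.g., zero reward).
    """
    m = ACTION_RE.search(text)
    if m is None:
        return None, None
    return m.group(1).strip(), m.group(2).strip()
```

For example, `parse_turn("<think>The box is left of the goal.</think><ans>push left</ans>")` yields the reasoning string and the action `"push left"`.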

Update Stage. StarPO optimizes entire trajectories using importance sampling. It supports:

  • PPO. Token-level advantage estimation via a value function over trajectories.
  • GRPO. Normalized reward assigned to the full trajectory.
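The GRPO-style trajectory advantage can be sketched as a group-relative z-score of total rewards; this is a common GRPO formulation, and the details of RAGEN's implementation (e.g., the epsilon term, std estimator) may differ:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Assign each trajectory in a rollout group an advantage equal to
    its total reward z-scored against the group (a sketch of the usual
    GRPO normalization)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # Every token in a trajectory shares this single trajectory-level advantage.
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Because every trajectory in the group shares the same baseline, the advantages sum to zero across the group, which avoids learning a separate value function.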

<font color="#DC143C">V2</font>: Diagnosing Template Collapse

Entropy alone cannot detect template collapse, where reasoning appears diverse within a single input but becomes input-agnostic across inputs. <font color="#DC143C">RAGEN V2</font> decomposes reasoning quality into two axes:

  • Within-input diversity: Conditional Entropy H(Z|X)
  • Cross-input distinguishability: Mutual Information I(X;Z)
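These two axes relate through the standard mutual-information identity:

```latex
I(X;Z) \;=\; H(Z) \;-\; H(Z \mid X),
```

so a policy can keep the marginal entropy $H(Z)$ of its reasoning traces high while $I(X;Z)$ drops toward zero: the traces look diverse in aggregate but no longer depend on the input $X$. This is exactly the template-collapse regime that entropy-only monitoring misses.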

SNR-Adaptive Filtering uses reward variance as a lightweight proxy to select high-signal prompts each iteration, directly addressing the root cause of template collapse.
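A hedged sketch of such a filter, interpreting `top_p` as "retain the top fraction `p` of prompts ranked by per-prompt reward variance" (the function name and this interpretation of `p` are assumptions; see the Rollout Filtering Guide for RAGEN's actual selection rule):

```python
import statistics

def snr_filter_top_p(prompt_rewards, p=0.9):
    """Keep the fraction p of prompts with the highest reward variance
    across their rollouts.

    prompt_rewards: dict mapping prompt_id -> list of per-rollout rewards.
    Returns the retained prompt_ids, highest-variance first.
    """
    variances = {pid: statistics.pvariance(rs)
                 for pid, rs in prompt_rewards.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    keep = max(1, round(p * len(ranked)))
    return ranked[:keep]
```

Prompts where every rollout gets the same reward (variance zero) carry no gradient signal under group-relative baselines, so dropping them concentrates updates on informative prompts.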

Update Log

2026.3.12. <font color="#DC143C">RAGEN V2</font> is released! Check out our <font color="#DC143C">v2 paper</font>.

<details> <summary>Older updates</summary>

2025.5.8. Official Documentation released.

2025.5.2. A tracking document for logging minor codebase updates is released.

2025.4.20. RAGEN V1 paper published. Codebase restructured: veRL integrated as a submodule; architecture decomposed into three modules — Environment State Manager, Context Manager, and Agent Proxy.

2025.3.13. RAGEN codebase refactoring underway. See the developing branch.

2025.3.8. KL term issue in veRL fixed. Default advantage estimator changed to GAE (PPO) for more stable training.

2025.1.27. Initial RAGEN release. Post.

</details>

Getting Started

git clone https://github.com/mll-lab-nu/RAGEN.git
cd RAGEN
conda create -n ragen python=3.12 -y && conda activate ragen
bash scripts/setup_ragen.sh

Use bash scripts/setup_ragen.sh --with-search to include the search environment. For WebShop, see docs/experiment_webshop_release.md.

The Four Reasoning Regimes

<font color="#DC143C">RAGEN V2</font> diagnoses agent behavior along two axes — within-input diversity (Conditional Entropy) and cross-input distinguishability (Mutual Information) — yielding four distinct reasoning regimes:

<p align="center"><img src="public/teaser.png" width="800px" alt="Four reasoning regimes: diverse reasoning, template collapse, compressed reasoning, low-entropy collapse" /></p> <p align="center" style="font-size: 15px; max-width: 800px; margin: 0 auto;"> <b>Left:</b> Input-driven reasoning adapts to the current state; templated reasoning produces nearly identical responses across different inputs. <b>Right:</b> Four reasoning regimes along two axes — conditional entropy H(Z|X) (within-input diversity) and mutual information I(X;Z) (input dependence). Template collapse (high entropy, low MI) is invisible to existing entropy-based metrics. </p>

Train (no filter, default):

python train.py --config-name _2_sokoban

Train with SNR-Adaptive Filtering (<font color="#DC143C">V2</font>, Top-p):

python train.py --config-name _2_sokoban \
  actor_rollout_ref.rollout_filter_strategy=top_p \
  actor_rollout_ref.rollout.rollout_filter_value=0.9

Evaluate:

python -m ragen.llm_agent.agent_proxy --config-name _2_sokoban

SNR-Adaptive Filtering consistently improves training across algorithms, model scales, and modalities (green = gain from filtering):

<p align="center"><img src="public/main_results.png" width="800px" alt="Main results: filtering vs no filtering" /></p>

See the Rollout Filtering Guide for more filtering strategies (Top-k, linear mode, etc.).

Future Plans

We are actively developing the next generation of RAGEN infrastructure and diagnostics, targeting a release in late March 2026.

Infrastructure

  • [ ] Async rollout engine
  • [ ] HTTP-based environment interface
  • [ ] Layered Env Wrapper
  • [ ] Optional environment dependencies

Diagnostics & Training Quality

  • [ ] Expanded benchmark suite to stress-test diagnostics across diverse, real-world agent tasks
  • [ ] Extended MI diagnostic dashboard, including richer WandB visualizations for entropy, MI proxy, and gradient decomposition over training
  • [ ] RL training metrics guide, including a practitioner's blog on how to read training signals (reward distribution, entropy, MI, gradient norms) and act on them before committing to a full run

Framework

  • [ ] Update full documentation for <font color="#DC143C">RAGEN V2</font>
  • [ ] Multi-modal agent support (building upon VAGEN)
  • [ ] Public leaderboard for benchmark results
