<h1 align="center">RAGEN: Training Agents by Reinforcing Reasoning</h1> <h3 align="center"><em>Diagnose agent failure modes. Make your RL training better.</em></h3> <p align="center"><img src="public/ragen_logo.jpeg" width="300px" alt="RAGEN icon" /></p> <p align="center"> <strong>RAGEN</strong> (<b>R</b>easoning <b>AGEN</b>T) is a flexible RL framework for training reasoning agents. </p> <p align="center"> We develop <strong>diagnostics to understand <i>how</i> agent RL training works </strong>, and how to fix hidden issues. </p> <p align="center"> <a href="https://ragen-ai.github.io/v2/pdf/RAGEN-v2.pdf"><img src="https://img.shields.io/badge/📄_V2_Paper-DC143C?style=for-the-badge&logoColor=white" alt="V2 Paper"></a> <a href="https://arxiv.org/abs/2504.20073"><img src="https://img.shields.io/badge/📄_v1_Paper-FF8C00?style=for-the-badge&logoColor=white" alt="v1 Paper"></a> <a href="https://ragen-ai.github.io/"><img src="https://img.shields.io/badge/📝_HomePage-FF5722?style=for-the-badge&logoColor=white" alt="Blog"></a> <a href="https://ragen-doc.readthedocs.io/"><img src="https://img.shields.io/badge/📚_Documentation-4285F4?style=for-the-badge&logoColor=white" alt="Documentation"></a> <a href="https://x.com/wzihanw/status/1915052871474712858"><img src="https://img.shields.io/badge/🔍_Post-34A853?style=for-the-badge&logoColor=white" alt="Post"></a> <a href="https://api.wandb.ai/links/zihanwang-ai-northwestern-university/a8er8l7b"><img src="https://img.shields.io/badge/🧪_Experiment_Log-AB47BC?style=for-the-badge&logoColor=white" alt="Experiment Log"></a> </p>

Looking for the V1 README? Please take a look here.

News

  • 2026.3.12. We are excited to release <font color="#DC143C">RAGEN V2</font>! We introduce a systematic study of reasoning collapse in agent RL and lightweight interventions for stable training. See the <font color="#DC143C">v2 paper</font>.
  • 2025.4.20. RAGEN V1 paper published on arXiv.
  • 2025.1.27. Initial RAGEN release. Post.

About

RAGEN is built around StarPO (State-Thinking-Actions-Reward Policy Optimization), a unified RL framework for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures.

RAGEN provides:

  • StarPO framework. Unified optimization for multi-turn agents, supporting both trajectory-level and turn-wise training.
  • 10 built-in environments. Sokoban, FrozenLake, WebShop, DeepCoder, SearchQA, Lean, Bandit, Countdown, MetaMathQA, Sudoku.
  • Gym-compatible interface. Easy to add custom environments.
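To illustrate what a Gym-compatible environment looks like in practice, here is a minimal toy environment sketch. The class name, observation strings, and the exact `reset`/`step` signatures are assumptions for illustration; RAGEN's actual environment base class may differ.

```python
import random
from typing import Tuple

class CoinFlipEnv:
    """Toy Gym-style environment: the agent guesses a biased coin flip.

    Illustrative sketch only -- RAGEN's real environment interface and
    method signatures may differ from this.
    """

    def __init__(self, bias: float = 0.7, seed: int = 0):
        self.bias = bias
        self.rng = random.Random(seed)
        self.outcome = None

    def reset(self) -> str:
        # Sample the hidden outcome and return a textual observation.
        self.outcome = "heads" if self.rng.random() < self.bias else "tails"
        return "A biased coin was flipped. Guess: heads or tails?"

    def step(self, action: str) -> Tuple[str, float, bool, dict]:
        # Reward 1.0 for a correct guess; single-turn episode ends immediately.
        reward = 1.0 if action.strip().lower() == self.outcome else 0.0
        return "Episode over.", reward, True, {"outcome": self.outcome}
```

A custom environment following this shape can then be registered and rolled out like the built-in ones.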

<font color="#DC143C">RAGEN V2</font> additionally introduces:

  • SNR-Adaptive Filtering (<font color="#DC143C">V2</font>). Lightweight rollout filtering based on reward variance to mitigate noisy gradient updates.
  • Reasoning collapse diagnostics (<font color="#DC143C">V2</font>). Mutual information proxy metrics to detect and monitor template collapse during training.

Algorithm

StarPO: Reinforcing Reasoning via Trajectory-Level Optimization

<p align="center"><img src="public/starpo_logo.png" width="800px" alt="StarPO Framework" /></p> <p align="center" style="font-size: 16px; max-width: 800px; margin: 0 auto;"> The StarPO (State-Thinking-Actions-Reward Policy Optimization) framework with two interleaved stages: <b>rollout stage</b> and <b>update stage</b>. The LLM generates reasoning-guided actions to interact with the environment, collecting trajectory-level rewards to jointly optimize reasoning and action strategies. </p>

MDP Formulation. Agent-environment interactions are formulated as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics. The objective is to maximize expected cumulative rewards across multiple interaction turns.
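In standard notation, this objective can be written as (a generic trajectory-level RL objective; the paper's exact notation may differ):

```latex
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} r(s_t, a_t)\right],
```

where $\tau = (s_0, a_0, \dots, s_T, a_T)$ is a full interaction trajectory, each state $s_t$ and action $a_t$ is a token sequence, and $r$ is the environment reward.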

Rollout Stage. Given an initial state, the LLM generates multiple trajectories. At each step, the model produces a reasoning-guided action: <think>...</think><ans> action </ans>. The environment returns feedback (reward and next state).
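A minimal sketch of parsing this turn format, assuming the `<think>...</think><ans>...</ans>` structure described above (the function name and the handling of malformed output are illustrative, not RAGEN's actual parser):

```python
import re

# Match one reasoning segment followed by one action segment.
ACTION_RE = re.compile(r"<think>(.*?)</think>\s*<ans>(.*?)</ans>", re.DOTALL)

def parse_turn(text: str):
    """Split one model turn into (reasoning, action).

    Returns (None, None) when the format is violated, which a trainer
    could treat as an invalid action (e.g., zero reward).
    """
    m = ACTION_RE.search(text)
    if m is None:
        return None, None
    return m.group(1).strip(), m.group(2).strip()
```

For example, `parse_turn("<think>The box is left of the goal.</think><ans>push left</ans>")` yields the reasoning string and the action `"push left"`.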

Update Stage. StarPO optimizes entire trajectories using importance sampling. It supports:

  • PPO. Token-level advantage estimation via a value function over trajectories.
  • GRPO. Normalized reward assigned to the full trajectory.
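The GRPO-style trajectory advantage can be sketched as a group-relative z-score of total rewards; this is a common GRPO formulation, and the details of RAGEN's implementation (e.g., the epsilon term, std estimator) may differ:

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Assign each trajectory in a rollout group an advantage equal to
    its total reward z-scored against the group (a sketch of the usual
    GRPO normalization)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    # Every token in a trajectory shares this single trajectory-level advantage.
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

Because every trajectory in the group shares the same baseline, the advantages sum to zero across the group, which avoids learning a separate value function.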

<font color="#DC143C">V2</font>: Diagnosing Template Collapse

Entropy alone cannot detect template collapse, where reasoning appears diverse within a single input but becomes input-agnostic across inputs. <font color="#DC143C">RAGEN V2</font> decomposes reasoning quality into two axes:

  • Within-input diversity: Conditional Entropy H(Z|X)
  • Cross-input distinguishability: Mutual Information I(X;Z)
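These two axes relate through the standard mutual-information identity:

```latex
I(X;Z) \;=\; H(Z) \;-\; H(Z \mid X),
```

so a policy can keep the marginal entropy $H(Z)$ of its reasoning traces high while $I(X;Z)$ drops toward zero: the traces look diverse in aggregate but no longer depend on the input $X$. This is exactly the template-collapse regime that entropy-only monitoring misses.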

SNR-Adaptive Filtering uses reward variance as a lightweight proxy to select high-signal prompts each iteration, directly addressing the root cause of template collapse.
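A hedged sketch of such a filter, interpreting `top_p` as "retain the top fraction `p` of prompts ranked by per-prompt reward variance" (the function name and this interpretation of `p` are assumptions; see the Rollout Filtering Guide for RAGEN's actual selection rule):

```python
import statistics

def snr_filter_top_p(prompt_rewards, p=0.9):
    """Keep the fraction p of prompts with the highest reward variance
    across their rollouts.

    prompt_rewards: dict mapping prompt_id -> list of per-rollout rewards.
    Returns the retained prompt_ids, highest-variance first.
    """
    variances = {pid: statistics.pvariance(rs)
                 for pid, rs in prompt_rewards.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    keep = max(1, round(p * len(ranked)))
    return ranked[:keep]
```

Prompts where every rollout gets the same reward (variance zero) carry no gradient signal under group-relative baselines, so dropping them concentrates updates on informative prompts.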

Update Log

2026.3.12. <font color="#DC143C">RAGEN V2</font> is released! Check out our <font color="#DC143C">v2 paper</font>.

<details> <summary>Older updates</summary>

2025.5.8. Official Documentation released.

2025.5.2. A tracking document for logging minor codebase updates is released.

2025.4.20. RAGEN V1 paper published. Codebase restructured: veRL integrated as a submodule; architecture decomposed into three modules — Environment State Manager, Context Manager, and Agent Proxy.

2025.3.13. RAGEN codebase refactoring underway. See the developing branch.

2025.3.8. KL term issue in veRL fixed. Default advantage estimator changed to GAE (PPO) for more stable training.

2025.1.27. Initial RAGEN release. Post.

</details>

Getting Started

git clone https://github.com/mll-lab-nu/RAGEN.git
cd RAGEN
conda create -n ragen python=3.12 -y && conda activate ragen
bash scripts/setup_ragen.sh

Use bash scripts/setup_ragen.sh --with-search to include the search environment. For WebShop, see docs/experiment_webshop_release.md.

The Four Reasoning Regimes

<font color="#DC143C">RAGEN V2</font> diagnoses agent behavior along two axes — within-input diversity (Conditional Entropy) and cross-input distinguishability (Mutual Information) — yielding four distinct reasoning regimes:

<p align="center"><img src="public/teaser.png" width="800px" alt="Four reasoning regimes: diverse reasoning, template collapse, compressed reasoning, low-entropy collapse" /></p> <p align="center" style="font-size: 15px; max-width: 800px; margin: 0 auto;"> <b>Left:</b> Input-driven reasoning adapts to the current state; templated reasoning produces nearly identical responses across different inputs. <b>Right:</b> Four reasoning regimes along two axes — conditional entropy H(Z|X) (within-input diversity) and mutual information I(X;Z) (input dependence). Template collapse (high entropy, low MI) is invisible to existing entropy-based metrics. </p>

Train (no filter, default):

python train.py --config-name _2_sokoban

Train with SNR-Adaptive Filtering (<font color="#DC143C">V2</font>, Top-p):

python train.py --config-name _2_sokoban \
  actor_rollout_ref.rollout_filter_strategy=top_p \
  actor_rollout_ref.rollout.rollout_filter_value=0.9

Evaluate:

python -m ragen.llm_agent.agent_proxy --config-name _2_sokoban

SNR-Adaptive Filtering consistently improves training across algorithms, model scales, and modalities (green = gain from filtering):

<p align="center"><img src="public/main_results.png" width="800px" alt="Main results: filtering vs no filtering" /></p>

See the Rollout Filtering Guide for more filtering strategies (Top-k, linear mode, etc.).

Future Plans

We are actively developing the next generation of RAGEN infrastructure and diagnostics, targeting a release in late March 2026.

Infrastructure

  • [ ] Async rollout engine
  • [ ] HTTP-based environment interface
  • [ ] Layered Env Wrapper
  • [ ] Optional environment dependencies

Diagnostics & Training Quality

  • [ ] Expanded benchmark suite to stress-test diagnostics across diverse, real-world agent tasks
  • [ ] Extended MI diagnostic dashboard, including richer WandB visualizations for entropy, MI proxy, and gradient decomposition over training
  • [ ] RL training metrics guide, including a practitioner's blog on how to read training signals (reward distribution, entropy, MI, gradient norms) and act on them before committing to a full run

Framework

  • [ ] Update full documentation for <font color="#DC143C">RAGEN V2</font>
  • [ ] Multi-modal agent support (building upon VAGEN)
  • [ ] Public leaderboard for benchmark results
