# EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
<p align="center"> <a href="https://arxiv.org/abs/2509.22576"> <img src="https://img.shields.io/badge/arXiv-2509.22576-b31b1b?style=flat-square&logo=arxiv" alt="arXiv Paper"></a> <a href="https://api.wandb.ai/links/ruwujiang-rutgers-university/bhmi2ly5"> <img src="https://img.shields.io/badge/W%26B-Experiments-FFBE00?style=flat-square&logo=weightsandbiases&logoColor=white" alt="Weights & Biases"></a> <a href="#"> <img src="https://img.shields.io/badge/GitHub-Project-181717?style=flat-square&logo=github" alt="GitHub Project"></a> </p>Abstract and Results
This repository contains the implementation of EPO (Entropy-regularized Policy Optimization), a novel approach for training large language model (LLM) agents with reinforcement learning that introduces entropy regularization to improve training stability and generalization in multi-turn agent environments.

Our experiments demonstrate significant improvements over baseline methods, with EPO-enhanced methods accumulating substantially higher reward while remaining more stable. On ScienceWorld, PPO+EPO reaches approximately 2x higher training rewards (15 vs. 8) with smooth, monotonic trajectories, while on ALFWorld GRPO+EPO maintains a steady upward trend throughout training. Validation performance converges rapidly to high success rates (>0.8 on both IID and OOD splits) within 40 training steps, whereas baseline methods struggle to exceed 0.4 even after 100 steps.
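For context, entropy-regularized objectives augment the expected return with a policy-entropy bonus. The form below is the generic textbook objective, shown only to fix notation; it is not necessarily EPO's exact regularizer (see the paper for the precise formulation):

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t r(s_t, a_t)\right] + \alpha \, \mathbb{E}_{s_t}\!\left[\mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big)\right]
$$

where $\mathcal{H}$ is the Shannon entropy of the policy's action distribution and $\alpha$ controls the strength of the regularization.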
<p align="center"> <img src="./fig/ppo_sciworld_PPO_vs_+EPO_episode_success_rate.png" alt="Training Success Rate" width="30%"> <img src="./fig/ppo_sciworld_PPO_vs_+EPO_val_l0_success_rate.png" alt="IID Validation" width="30%"> <img src="./fig/ppo_sciworld_PPO_vs_+EPO_val_l1_success_rate.png" alt="OOD Validation" width="30%"> </p>(a-c) ScienceWorld experimental results contrasting PPO and PPO+EPO performance across training reward accumulation, IID validation, and OOD validation metrics.
## Key Features
- Entropy-regularized Policy Optimization: Novel RL algorithm that incorporates entropy regularization for improved training dynamics (a minimal sketch follows this list)
- Multi-turn Agent Training: Supports long-horizon, multi-step agent-environment interactions
- Enhanced Generalization: Achieves superior performance on both in-distribution (IID) and out-of-distribution (OOD) evaluation settings
- Stable Training Dynamics: Provides smooth, monotonic training trajectories with improved convergence
- Environment Support: Compatible with ScienceWorld, ALFWorld, and other agent environments
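To make the first bullet concrete, here is a minimal sketch of how an entropy bonus is typically folded into a clipped PPO-style policy loss. The names (`entropy_coef`, `clip_eps`) and the plain entropy bonus are illustrative assumptions, not the exact EPO regularizer from the paper:

```python
import torch

def ppo_loss_with_entropy(logp_new, logp_old, advantages, entropy,
                          clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate plus a generic entropy bonus (illustrative only)."""
    ratio = torch.exp(logp_new - logp_old)          # pi_new / pi_old per token
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()   # standard clipped objective
    # Subtracting the entropy term rewards higher-entropy policies,
    # which helps delay premature entropy collapse during training.
    return policy_loss - entropy_coef * entropy.mean()
```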
## Installation

### Prerequisites
- Python 3.10 (ScienceWorld setup) or 3.12 (ALFWorld setup)
- CUDA-compatible GPU
- Conda or virtual environment
### Environment Setup

#### For ScienceWorld

Follow the instructions in `instruct_to_run_sciworld.sh`:
```bash
# Step 1: Create and activate environment
python3.10 -m venv /common/users/cj574/env/verl-agent-sciworld/
source /common/users/cj574/env/verl-agent-sciworld/bin/activate

# Step 2: Install ScienceWorld environment
pip3 install scienceworld
pip install gym==0.23.1
pip install selenium

# Step 3: Install verl-agent dependencies
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install wheel
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5
```
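Once the install finishes, you can sanity-check the environment with a short interaction loop. This is a sketch assuming the standard `scienceworld` Python API; `"boil"` is just one of the stock tasks:

```python
from scienceworld import ScienceWorldEnv

env = ScienceWorldEnv("", envStepLimit=100)  # launches the bundled simulator
env.load("boil", 0)                          # pick a task and variation index
obs, info = env.reset()
print(obs[:200])                             # initial room description
obs, reward, done, info = env.step("look around")
print(f"reward={reward}, done={done}")
```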
#### For ALFWorld

Follow the instructions in `instruct_to_run_alfworld.sh`:
```bash
# Step 1: Create and activate environment
conda create -n verl-agent-alfworld python==3.12 -y
conda activate verl-agent-alfworld

# Step 2: Install ALFWorld environment
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
alfworld-download -f

# Step 3: Install verl-agent dependencies
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5
pip3 install pandas==2.2.3
```
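A quick import check confirms the stack resolved correctly (illustrative; running the full ALFWorld environment additionally requires one of the config files shipped with `alfworld`):

```python
import torch
import alfworld
import vllm

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("alfworld installed at:", alfworld.__file__)
print("vllm:", vllm.__version__)
```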
## Usage

### Running Experiments

The repository provides example scripts for running EPO-enhanced training:
```bash
# ScienceWorld PPO + EPO
bash examples/general_running_server.sh \
  --environment sciworld --rl_algorithm "ppo" --seed 0 \
  --lr 3e-6 --lr_warmup_steps_ratio 0.1 --min_lr_ratio 0.2 --warmup_style cosine \
  --entropy_smooth True --enable_smooth_weights True \
  --entropy_smooth_mask_mode "token" \
  --entropy_smooth_min_ratio 0 --entropy_smooth_max_ratio 2.0 \
  --entropy_smooth_out_range_penalty 0.05 \
  --model_path "/local_path/7b_model" --model_load_method "local" \
  --log_prob_micro_batch_size_per_gpu 8 --ppo_micro_batch_size_per_gpu 8 \
  --ppo_mini_batch_size 64 --total_epochs 125

# ALFWorld PPO + EPO
bash examples/general_running_server.sh \
  --environment alfworld --rl_algorithm "ppo" --seed 1 \
  --lr 5e-6 --lr_warmup_steps_ratio 0.1 --min_lr_ratio 0.2 --warmup_style cosine \
  --entropy_smooth True --entropy_smooth_mask_mode "token" \
  --entropy_smooth_min_ratio 0 --entropy_smooth_max_ratio 2.0 \
  --entropy_smooth_out_range_penalty 0.1 \
  --model_path "/local_path/3b_model" --model_load_method "local" \
  --enable_smooth_weights True
```
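The `entropy_smooth_*` flags appear to define a target band on entropy together with a penalty for leaving it. The sketch below is a hypothetical reading based only on the flag names (`min_ratio`, `max_ratio`, `out_range_penalty`); consult the training code for the actual semantics:

```python
import torch

def entropy_band_penalty(entropy, ref_entropy, min_ratio=0.0, max_ratio=2.0,
                         out_range_penalty=0.05):
    """Hypothetical interpretation of the entropy_smooth_* flags: penalize
    tokens whose entropy drifts outside [min_ratio, max_ratio] * ref_entropy."""
    lo, hi = min_ratio * ref_entropy, max_ratio * ref_entropy
    below = torch.clamp(lo - entropy, min=0.0)   # shortfall under the band
    above = torch.clamp(entropy - hi, min=0.0)   # excess above the band
    return out_range_penalty * (below + above).mean()
```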
### Experiment Tracking

We use Weights & Biases to track our experiments. Our experimental results and training metrics are available in [this W&B report](https://api.wandb.ai/links/ruwujiang-rutgers-university/bhmi2ly5).
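To log your own runs in the same way, a minimal W&B setup looks like the following; the project name and logged metric here are placeholders, not the exact names used in our report:

```python
import wandb

run = wandb.init(project="epo-agents", config={"algo": "ppo+epo", "seed": 0})
for step in range(100):  # stand-in for the real training loop
    wandb.log({"episode_success_rate": min(1.0, 0.01 * step)}, step=step)
run.finish()
```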
## Citation
If you use this code or find our work helpful, please cite:
```bibtex
@article{xu2025epo,
  title={EPO: Entropy-Regularized Policy Optimization for LLM Agents Reinforcement Learning},
  author={Xu, Wujiang and Zhao, Wentian and Wang, Zhenting and Li, Yu-Jhe and Jin, Can and Jin, Mingyu and Mei, Kai and Wan, Kun and Metaxas, Dimitris N.},
  journal={arXiv preprint arXiv:2509.22576},
  year={2025}
}
```
## Acknowledgments
We gratefully acknowledge the following projects that made this research possible:
- [veRL](https://github.com/volcengine/verl): Volcano Engine Reinforcement Learning framework for LLMs
- [verl-agent](https://github.com/langfengQ/verl-agent): Scalable training framework for long-horizon LLM/VLM agents
- [ScienceWorld](https://github.com/allenai/ScienceWorld): Interactive text-based science environment for agent training
- [ALFWorld](https://github.com/alfworld/alfworld): Text-based embodied AI environment for household tasks
- [vLLM](https://github.com/vllm-project/vllm): High-throughput and memory-efficient inference engine for LLMs
