EPO

The code for paper "EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning"

EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Abstract and Results

This repository contains the implementation of EPO (Entropy-regularized Policy Optimization), a reinforcement learning approach for training large language model (LLM) agents. EPO adds entropy regularization to improve training stability and generalization in multi-turn agent environments.

In our experiments, EPO-enhanced methods accumulate more reward than their baselines and train more stably. In ScienceWorld, PPO+EPO reaches approximately 2x the training reward of PPO (15 vs. 8) along a smooth, monotonic trajectory. In ALFWorld, GRPO+EPO shows a steady upward trend throughout training. On validation, success rates exceed 0.8 on both IID and OOD settings within 40 training steps, whereas baseline methods struggle to exceed 0.4 even after 100 steps.

(a-c) ScienceWorld experimental results contrasting PPO and PPO+EPO performance across training reward accumulation, IID validation, and OOD validation metrics.

Key Features

Entropy-regularized Policy Optimization: Novel RL algorithm that incorporates entropy regularization for improved training dynamics
Multi-turn Agent Training: Supports long-horizon, multi-step agent-environment interactions
Enhanced Generalization: Achieves superior performance on both in-distribution (IID) and out-of-distribution (OOD) evaluation settings
Stable Training Dynamics: Provides smooth, monotonic training trajectories with improved convergence
Environment Support: Compatible with ScienceWorld, ALFWorld, and other agent environments
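
The feature list names entropy regularization without showing the term itself. The sketch below is a generic textbook form, written in plain Python for illustration: it computes per-token Shannon entropy from logits and subtracts a scaled mean-entropy bonus from a policy loss. It is not taken from this repository; the function names and the `entropy_coef` value are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_regularized_loss(policy_loss, per_token_logits, entropy_coef=0.01):
    """Subtract a scaled mean-entropy bonus from a policy loss.

    entropy_coef is a hypothetical value chosen for illustration only.
    Returns (regularized_loss, mean_entropy).
    """
    entropies = [token_entropy(logits) for logits in per_token_logits]
    mean_entropy = sum(entropies) / len(entropies)
    return policy_loss - entropy_coef * mean_entropy, mean_entropy

# Example: two tokens, each with a uniform distribution over two actions.
loss, mean_h = entropy_regularized_loss(1.0, [[0.0, 0.0], [0.0, 0.0]])
```

Because the bonus is subtracted, minimizing the loss favors higher entropy, which discourages the policy from collapsing onto a few actions early in training.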

Installation

Prerequisites

Python 3.10 (used in the ScienceWorld setup) or Python 3.12 (used in the ALFWorld setup)
CUDA-compatible GPU
Conda or virtual environment

Environment Setup

For ScienceWorld

Follow the instructions in instruct_to_run_sciworld.sh:

# Step 1: Create and activate environment
# (the venv path below is site-specific; substitute a directory you can write to)
python3.10 -m venv /common/users/cj574/env/verl-agent-sciworld/
source /common/users/cj574/env/verl-agent-sciworld/bin/activate

# Step 2: Install ScienceWorld environment
pip3 install scienceworld
pip install gym==0.23.1
pip install selenium

# Step 3: Install verl-agent dependencies
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install wheel
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5

For ALFWorld

Follow the instructions in instruct_to_run_alfworld.sh:

# Step 1: Create and activate environment
conda create -n verl-agent-alfworld python==3.12 -y
conda activate verl-agent-alfworld

# Step 2: Install ALFWorld environment
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
alfworld-download -f

# Step 3: Install verl-agent dependencies
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5
pip3 install pandas==2.2.3
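
After either setup, it can help to confirm that the pinned versions actually landed in the active environment. The following is a minimal sketch using only the standard library; the pins are copied from the install steps above, while the helper name `check_pins` and the report format are hypothetical and not part of this repository.

```python
from importlib import metadata

# Pins copied from the install steps above (shared by both setups).
EXPECTED = {
    "torch": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.4.post1",
}

def check_pins(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu124" when comparing.
        ok = installed is not None and installed.split("+")[0] == pin
        report[name] = (installed, ok)
    return report

if __name__ == "__main__":
    for name, (installed, ok) in check_pins().items():
        print(f"{name}: installed={installed} matches_pin={ok}")
```

A package that is missing shows up as `installed=None`, which usually means the wrong environment is active.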

Usage

Running Experiments

The repository provides example scripts for running EPO-enhanced training:

# ScienceWorld PPO + EPO
bash examples/general_running_server.sh \
  --environment sciworld --rl_algorithm "ppo" --seed 0 \
  --lr 3e-6 --lr_warmup_steps_ratio 0.1 --min_lr_ratio 0.2 --warmup_style cosine \
  --entropy_smooth True --enable_smooth_weights True \
  --entropy_smooth_mask_mode "token" \
  --entropy_smooth_min_ratio 0 --entropy_smooth_max_ratio 2.0 \
  --entropy_smooth_out_range_penalty 0.05 \
  --model_path "/local_path/7b_model" --model_load_method "local" \
  --log_prob_micro_batch_size_per_gpu 8 --ppo_micro_batch_size_per_gpu 8 \
  --ppo_mini_batch_size 64 --total_epochs 125

# ALFWorld PPO + EPO
bash examples/general_running_server.sh \
  --environment alfworld --rl_algorithm "ppo" --seed 1 \
  --lr 5e-6 --lr_warmup_steps_ratio 0.1 --min_lr_ratio 0.2 --warmup_style cosine \
  --entropy_smooth True \
  --entropy_smooth_mask_mode "token" \
  --entropy_smooth_min_ratio 0 --entropy_smooth_max_ratio 2.0 \
  --entropy_smooth_out_range_penalty 0.1 \
  --model_path "/local_path/3b_model" --model_load_method "local" \
  --enable_smooth_weights True
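
The `--entropy_smooth_*` flags are not documented in this README. One plausible reading, stated here as an assumption rather than a description of the actual code, is that token-level entropy is compared against a window `[min_ratio, max_ratio]` scaled by a reference entropy (for example a running average), and tokens outside the window incur a fixed penalty. The sketch below illustrates that reading only; `entropy_window_penalty` and `reference_entropy` are hypothetical names, and the defaults mirror the ScienceWorld command.

```python
def entropy_window_penalty(token_entropies, reference_entropy,
                           min_ratio=0.0, max_ratio=2.0,
                           out_range_penalty=0.05):
    """Penalty scaled by the fraction of tokens whose entropy leaves the window.

    ASSUMPTION: the window is [min_ratio, max_ratio] * reference_entropy.
    This is an illustrative guess at what the flags control, not the
    repository's implementation.
    """
    if not token_entropies:
        return 0.0
    low = min_ratio * reference_entropy
    high = max_ratio * reference_entropy
    outside = sum(1 for h in token_entropies if h < low or h > high)
    return out_range_penalty * outside / len(token_entropies)

# With reference entropy 1.0 the window is [0.0, 2.0];
# only the token with entropy 3.0 falls outside it.
penalty = entropy_window_penalty([0.5, 1.0, 3.0, 0.1], reference_entropy=1.0)
```

Under this reading, a `min_ratio` of 0 means only upward entropy spikes are penalized, since entropy is never negative.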

Experiment Tracking

We use Weights & Biases to track our experiments. View our experimental results and training metrics:

📊 EPO Experiments Dashboard

Citation

If you use this code or find our work helpful, please cite:

@article{xu2025epo,
  title={EPO: Entropy-Regularized Policy Optimization for LLM Agents Reinforcement Learning},
  author={Xu, Wujiang and Zhao, Wentian and Wang, Zhenting and Li, Yu-Jhe and Jin, Can and Jin, Mingyu and Mei, Kai and Wan, Kun and Metaxas, Dimitris N.},
  journal={arXiv preprint arXiv:2509.22576},
  year={2025}
}

Acknowledgments

We gratefully acknowledge the following projects that made this research possible:

veRL — Volcano Engine Reinforcement Learning framework for LLMs
verl-agent — Scalable training framework for long-horizon LLM/VLM agents
ScienceWorld — Interactive text-based science environment for agent training
ALFWorld — Text-based embodied AI environment for household tasks
vLLM — High-throughput and memory-efficient inference engine for LLMs
