LightRFT
<div align="center"> <img src="assets/logo.png" alt="LightRFT Logo" width="600"/>Light, Efficient, Omni-modal & Reward-model Driven Reinforcement Fine-Tuning Framework
English | 简体中文
</div>

📖 Introduction
LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning fine-tuning framework designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It provides efficient, scalable RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) training, and supports multiple state-of-the-art algorithms and distributed training strategies.
✨ Key Features
- 🚀 High-Performance Inference Engines
- Integrated vLLM and SGLang for efficient sampling and inference
- FP8 inference optimization for significantly reduced latency and memory usage
- Flexible engine sleep/wake mechanisms for optimal resource utilization
- 🧠 Rich Algorithm Ecosystem
- Policy Optimization: GRPO, GSPO, GMPO, Dr.GRPO
- Advantage Estimation: REINFORCE++, CPGD
- Reward Processing: Reward Norm/Clip
- Sampling Strategy: FIRE Sampling, Token-Level Policy
- Stability Enhancement: DAPO, select_high_entropy_tokens
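To make the first item concrete: GRPO's group-normalized advantage estimation scores each sampled response against the other responses to the same prompt. The sketch below is an illustrative minimal implementation (function name and epsilon are ours, not LightRFT's actual code):

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    In GRPO, each prompt is sampled G times; the advantage of each
    response is its reward normalized within that group of G samples.
    Illustrative sketch -- not LightRFT's actual implementation.
    """
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# One prompt sampled 4 times, with 0/1 verifiable rewards:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # symmetric around 0; above-mean responses get positive advantage
```

Responses better than their group's average are reinforced and worse ones suppressed, without training a separate value network.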
- 🔧 Flexible Training Strategies
- FSDP (Fully Sharded Data Parallel) v2 support
- DeepSpeed ZeRO (Stage 1/2/3) support
- Gradient checkpointing and mixed precision training (BF16/FP16)
- Adam Offload and memory optimization techniques
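As a reference point for combining these options, a DeepSpeed configuration enabling ZeRO Stage 2, BF16, and CPU optimizer (Adam) offload might look like the fragment below. The keys follow DeepSpeed's standard config schema; the batch-size values are placeholders, and this is not a file shipped with LightRFT:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 1.0,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu", "pin_memory": true }
  }
}
```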
- 🎯 Innovative Resource Collaboration
- Colocate Anything: Co-locate reward models with training models to maximize GPU utilization
- Support multiple reward models for parallel inference on the same device
- Dynamic memory management with automatic training/inference phase switching
- Reduced cross-device communication overhead for improved end-to-end training efficiency
- Balance Anything 🚧 (Under Development): Intelligent load balancing system
- Adaptive task scheduling and resource allocation
- Automatic load balancing for multi-node training
- Performance optimization for heterogeneous hardware environments
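The sleep/wake idea behind colocation can be pictured as a simple phase switch: before a training step the colocated engine releases its device memory, and it reloads before the next rollout. The toy sketch below only models the bookkeeping (no real GPU memory is involved); the class and method names are illustrative, not LightRFT's API:

```python
class ColocatedEngine:
    """Toy model of an inference engine sharing a GPU with the trainer.

    Illustrative only: tracks a notional memory budget instead of real
    GPU allocations; not LightRFT's actual engine API.
    """
    def __init__(self, weights_gb: float):
        self.weights_gb = weights_gb
        self.resident = True  # weights currently on-device

    def sleep(self) -> float:
        """Release device memory before a training phase; returns GB freed."""
        if self.resident:
            self.resident = False
            return self.weights_gb
        return 0.0

    def wake(self) -> float:
        """Reload weights before a rollout phase; returns GB reclaimed."""
        if not self.resident:
            self.resident = True
            return self.weights_gb
        return 0.0

reward_model = ColocatedEngine(weights_gb=14.0)
freed = reward_model.sleep()   # trainer can now use the freed memory
print(f"freed {freed} GB for the training step")
reward_model.wake()            # back online for the next rollout
```

The real benefit is that reward models and the policy never need separate GPU pools, which is what removes the cross-device communication overhead mentioned above.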
- 🌐 Comprehensive Multimodal Support
- Native Vision-Language Model (VLM) Training
- Support for mainstream VLMs like Qwen-VL
- Parallel processing of multimodal image-text data
- Efficient multimodal tokenization and batching
- Multimodal Reward Modeling
- Support for multiple visual reward models working in collaboration
- Joint optimization of image understanding and text generation
- Complete Vision-Language Alignment Training Pipeline
- Optimized for multimodal RLVR/RLHF training
- Built-in support for vision-language model fine-tuning
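At its core, batching multimodal image-text data reduces to right-padding variable-length token sequences into a rectangular batch with an attention mask, while per-sample image features ride along. A minimal, framework-free sketch (names are ours, not LightRFT's collator):

```python
def pad_batch(token_ids_list, pad_id=0):
    """Right-pad variable-length token sequences into a rectangular batch,
    returning padded ids plus an attention mask (1 = real token, 0 = pad).
    Illustrative sketch, not LightRFT's actual collator."""
    max_len = max(len(ids) for ids in token_ids_list)
    batch, mask = [], []
    for ids in token_ids_list:
        pad = max_len - len(ids)
        batch.append(ids + [pad_id] * pad)
        mask.append([1] * len(ids) + [0] * pad)
    return batch, mask

# Two image-text samples whose text parts tokenized to different lengths:
ids, mask = pad_batch([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```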
- 📊 Complete Experimental Toolkit
- Weights & Biases (W&B) integration
- Math capability benchmarking (GSM8K, Geo3K, etc.)
- Trajectory saving and analysis tools
- Automatic checkpoint management
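Automatic checkpoint management typically amounts to a retention policy: keep only the N most recent checkpoints and delete the rest. A minimal sketch of that idea (the function, the `.pt` suffix, and the `keep` parameter are illustrative, not LightRFT's checkpoint manager):

```python
import os
import tempfile

def prune_checkpoints(ckpt_dir: str, keep: int = 3):
    """Delete all but the `keep` most recently modified checkpoint files.
    Illustrative retention policy, not LightRFT's checkpoint manager."""
    ckpts = [os.path.join(ckpt_dir, f) for f in os.listdir(ckpt_dir)
             if f.endswith(".pt")]
    ckpts.sort(key=os.path.getmtime, reverse=True)  # newest first
    for stale in ckpts[keep:]:
        os.remove(stale)
    return sorted(os.path.basename(p) for p in ckpts[:keep])

# Demo in a temp dir with five fake checkpoints at increasing mtimes:
with tempfile.TemporaryDirectory() as d:
    for step in range(5):
        path = os.path.join(d, f"step_{step}.pt")
        with open(path, "w") as f:
            f.write("fake")
        os.utime(path, (step, step))  # force distinct, ordered mtimes
    kept = prune_checkpoints(d, keep=2)
    print(kept)  # ['step_3.pt', 'step_4.pt']
```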
🎯 Supported Algorithms
For detailed algorithm descriptions, implementation details, and usage guide, see Algorithm Documentation.
| Algorithm | Type | Key Improvement | Paper |
|-----------|------|-----------------|-------|
| GRPO | Policy Optimization | Group normalized advantage estimation | arXiv:2402.03300 |
| GSPO | Policy Optimization | Group sequence policy optimization | arXiv:2507.18071 |
| GMPO (WIP) | Policy Optimization | Geometric-mean policy optimization | arXiv:2507.20673 |
| Dr.GRPO | Policy Optimization | Length bias mitigation | arXiv:2503.20783 |
| DAPO | Policy Optimization | Decoupled clip and dynamic sampling policy optimization | arXiv:2503.14476 |
| REINFORCE++ | Advantage Estimation | Improved baseline estimation | arXiv:2501.03262 |
| CPGD | Advantage Estimation | KL-based drift constraint | arXiv:2505.12504 |
| FIRE Sampling | Sampling Strategy | High-temperature first-token sampling for improved diversity | arXiv:2410.21236 |
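FIRE Sampling's key move, drawing the first token at a high temperature and the rest at a normal temperature to diversify rollouts, can be sketched with a plain softmax sampler. The temperatures and the toy vocabulary below are illustrative, not values from the paper or from LightRFT:

```python
import math
import random

def softmax_sample(logits, temperature, rng):
    """Sample an index from temperature-scaled softmax probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r, acc = rng.random(), 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r <= acc:
            return i
    return len(exps) - 1

def fire_sample(logits_per_step, first_temp=2.0, temp=0.8, seed=0):
    """FIRE-style decoding sketch: hot first token, normal afterwards."""
    rng = random.Random(seed)
    tokens = []
    for step, logits in enumerate(logits_per_step):
        t = first_temp if step == 0 else temp
        tokens.append(softmax_sample(logits, t, rng))
    return tokens

# Three decoding steps over a 4-token toy vocabulary:
out = fire_sample([[2.0, 1.0, 0.5, 0.1]] * 3)
print(out)
```

Flattening the first-token distribution makes parallel rollouts for the same prompt diverge early, which increases the diversity of the sampled group.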
🚀 Quick Start
Requirements
- Python >= 3.12
- CUDA >= 12.8
- PyTorch >= 2.9.1
Docker Images
We provide pre-built Docker images for easy deployment and consistent environments. You can also build your own images using the provided Dockerfile and Makefile.
Using Pre-built Images
The official Docker images are available on Docker Hub. You can pull the latest version using:
docker pull opendilab/lightrft:v0.1.0
To run a container with GPU support:
docker run --gpus all -it --rm \
-v /path/to/your/data:/app/data \
-v /path/to/your/checkpoints:/app/checkpoints \
opendilab/lightrft:v0.1.0 /bin/bash
Building Custom Images
If you need to customize the environment or build from a specific branch, you can use the provided Makefile to build the image locally.
- Prerequisites: Ensure you have Docker and the NVIDIA Container Toolkit installed.
- Build the image:
  # Build the image with the default name (opendilab/lightrft:v${VERSION})
  make dbuild
  The IMAGE_NAME is automatically determined from the current version of the project. You can also override it: make dbuild IMAGE_NAME=your-custom-tag:latest
- Technical Details:
  - Base Image: nvcr.io/nvidia/pytorch:25.01-py3 (includes PyTorch 2.5+ and CUDA 12.8).
  - Dependencies: The build process installs essential components, including vLLM, DeepSpeed, Flash-Attention, and SGLang, in a specific order to ensure stability.
  - Optimization: The Dockerfile uses multi-layer optimization and environment variables for non-interactive installation.
Installation
Standard Installation
LightRFT uses SGLang as the default inference backend with Flash-Attention for optimized performance.
# Clone the repository
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT
# Install LightRFT with all core dependencies
pip install -e .
What gets installed: PyTorch, SGLang, Flash-Attention, Transformers, DeepSpeed, and other core dependencies.
Optional: Install vLLM Backend
If you want to use vLLM instead of (or alongside) SGLang:
# Install vLLM backend
pip install ".[vllm]"
# Or install vLLM directly (quote the spec so the shell doesn't treat >= as redirection)
pip install "vllm>=0.13.3"
Troubleshooting Flash-Attention Installation
Flash-Attention is included by default but may fail on some systems due to CUDA compatibility. If installation fails, try:
Option 1: Use pre-built wheels (recommended)
# Download the appropriate wheel from https://github.com/Dao-AILab/flash-attention/releases
# Example for CUDA 12.x and PyTorch 2.9:
pip install flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl
Option 2: Use Docker (easiest)
# The official Docker images include all dependencies
docker pull opendilab/lightrft:v0.1.0
📚 Usage Guide
Basic Example: GRPO Training
# Single node, 8 GPU training example
cd LightRFT
# Run GRPO training (GSM8K math reasoning task)
bash examples/gsm8k_geo3k/run_grpo_gsm8k_qwen2.5_0.5b.sh
# Or run Geo3K geometry problem training (VLM multimodal)
bash examples/gsm8k_geo3k/run_grpo_geo3k_qwen2.5_vl_7b.sh
🏗️ Project Structure
LightRFT/
├── lightrft/ # Core library
│ ├── strategy/ # Training & inference strategies
│ │ ├── fsdp/ # FSDP implementation
│ │ ├── deepspeed/ # DeepSpeed implementation
│ │ ├── vllm_utils/ # vLLM utilities
│ │ ├── sglang_utils/ # SGLang utilities
│ │ └── utils/ # Strategy utilities
│ ├── models/ # Model definitions
│ │ ├── actor_al.py # Audio-language model actor
│ │ ├── actor_language.py # Language model actor
│ │ ├── actor_vl.py # Vision-language model actor
│ │ ├── grm_vl.py # Generative reward model (Vision-Language)
│ │ ├── srm_al.py # Scalar reward model (Audio-Language)
│ │ ├── srm_vl.py # Scalar reward model (Vision-Language)
│ │ ├── loss.py # Loss functions
│ │ ├── monkey_patch/ # Model adaptation patches for distributed training
│ │ ├── tests/ # Model tests
│ │ └── utils.py # Model utilities
│ ├── trainer/ # Trainer implementations
│ │ ├── ppo_trainer.py # LLM PPO trainer
│ │ ├── ppo_trainer_vl.py # VLM PPO trainer
│ │ ├── spmd_ppo_trainer.py # SPMD PPO trainer extension (**Core**)
│ │ ├── grm_trainer_vl.py # Generative reward model trainer (Vision-Language)
│ │ ├── srm_trainer_al.py # Scalar reward model trainer (Audio-Language)
│ │ ├── srm_trainer_vl.py # Scalar reward model trainer (Vision-Language)
│ │ ├── fast_exp_maker.py # Fast experience generator (**Core**)
│ │ ├── experience_maker.py # Base experience generator
│ │ ├── experience_maker_vl.py # Base experience generator for VLM
│ │ ├── replay_buffer.py # Replay buffer
│ │ ├── replay_buffer_vl.py # VLM replay buffer
│ │ ├── replay_buffer_utils.py # Replay buffer utilities
