LightRFT

<div align="center"> <img src="assets/logo.png" alt="LightRFT Logo" width="600"/>

Light, Efficient, Omni-modal & Reward-model Driven Reinforcement Fine-Tuning Framework

Version Python PyTorch License

English | 简体中文

</div>

📖 Introduction

LightRFT (Light Reinforcement Fine-Tuning) is an advanced reinforcement learning fine-tuning framework designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). The framework provides efficient, scalable RLHF (Reinforcement Learning from Human Feedback) and RLVR (Reinforcement Learning with Verifiable Rewards) training, supporting multiple state-of-the-art algorithms and distributed training strategies.

✨ Key Features

  • 🚀 High-Performance Inference Engines

    • Integrated vLLM and SGLang for efficient sampling and inference
    • FP8 inference optimization for significantly reduced latency and memory usage
    • Flexible engine sleep/wake mechanisms for optimal resource utilization
  • 🧠 Rich Algorithm Ecosystem

    • Policy Optimization: GRPO, GSPO, GMPO, Dr.GRPO
    • Advantage Estimation: REINFORCE++, CPGD
    • Reward Processing: Reward Norm/Clip
    • Sampling Strategy: FIRE Sampling, Token-Level Policy
    • Stability Enhancement: DAPO, select_high_entropy_tokens
  • 🔧 Flexible Training Strategies

    • FSDP (Fully Sharded Data Parallel) v2 support
    • DeepSpeed ZeRO (Stage 1/2/3) support
    • Gradient checkpointing and mixed precision training (BF16/FP16)
    • Adam Offload and memory optimization techniques
  • 🎯 Innovative Resource Collaboration

    • Colocate Anything: Co-locate reward models with training models to maximize GPU utilization
      • Support multiple reward models for parallel inference on the same device
      • Dynamic memory management with automatic training/inference phase switching
      • Reduced cross-device communication overhead for improved end-to-end training efficiency
    • Balance Anything 🚧 (Under Development): Intelligent load balancing system
      • Adaptive task scheduling and resource allocation
      • Automatic load balancing for multi-node training
      • Performance optimization for heterogeneous hardware environments
  • 🌐 Comprehensive Multimodal Support

    • Native Vision-Language Model (VLM) Training
      • Support for mainstream VLMs like Qwen-VL
      • Parallel processing of multimodal image-text data
      • Efficient multimodal tokenization and batching
    • Multimodal Reward Modeling
      • Support for multiple visual reward models working in collaboration
      • Joint optimization of image understanding and text generation
    • Complete Vision-Language Alignment Training Pipeline
      • Optimized for multimodal RLVR/RLHF training
      • Built-in support for vision-language model fine-tuning
  • 📊 Complete Experimental Toolkit

    • Weights & Biases (W&B) integration
    • Math capability benchmarking (GSM8K, Geo3K, etc.)
    • Trajectory saving and analysis tools
    • Automatic checkpoint management
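The "Reward Processing: Reward Norm/Clip" feature above is a standard stabilization step: standardize rewards across a batch, then clip extreme values before advantage estimation. As a rough illustration (a minimal sketch in plain Python, not LightRFT's actual implementation; the function name and defaults are hypothetical):

```python
def normalize_and_clip(rewards, clip=5.0, eps=1e-6):
    """Standardize rewards to zero mean / unit std across the batch,
    then clip to [-clip, clip] to blunt outlier rewards."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # eps guards against division by zero when all rewards are equal
    return [max(-clip, min(clip, (r - mu) / (std + eps))) for r in rewards]
```

In a real pipeline this runs on tensors per rollout batch; the sketch only conveys the order of operations (normalize first, then clip).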

🎯 Supported Algorithms

For detailed algorithm descriptions, implementation details, and usage guide, see Algorithm Documentation.

| Algorithm | Type | Key Improvement | Paper |
|-----------|------|-----------------|-------|
| GRPO | Policy Optimization | Group normalized advantage estimation | arXiv:2402.03300 |
| GSPO | Policy Optimization | Group sequence policy optimization | arXiv:2507.18071 |
| GMPO (WIP) | Policy Optimization | Geometric-mean policy optimization | arXiv:2507.20673 |
| Dr.GRPO | Policy Optimization | Length bias mitigation | arXiv:2503.20783 |
| DAPO | Policy Optimization | Decoupled clip and dynamic sampling policy optimization | arXiv:2503.14476 |
| REINFORCE++ | Advantage Estimation | Improved baseline estimation | arXiv:2501.03262 |
| CPGD | Advantage Estimation | KL-based drift constraint | arXiv:2505.12504 |
| FIRE Sampling | Sampling Strategy | High-temperature first token sampling for improved diversity | arXiv:2410.21236 |
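GRPO's "group normalized advantage estimation" means: sample several completions per prompt, score each with a scalar reward, and use the reward's deviation from the group mean (divided by the group std) as the advantage, with no learned value function. A minimal sketch of that computation (illustrative only, not LightRFT's code):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: A_i = (r_i - mean) / (std + eps),
    computed over the rewards of completions sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, four sampled completions with binary correctness rewards:
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Correct completions get positive advantages and incorrect ones negative, so the policy gradient pushes toward the better samples within each group.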


🚀 Quick Start

Requirements

  • Python >= 3.12
  • CUDA >= 12.8
  • PyTorch >= 2.9.1
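Before installing, it can save time to confirm the interpreter and (if present) PyTorch build meet these requirements. A quick check script (the helper below is illustrative, not shipped with LightRFT):

```python
import sys

def check_python(min_version=(3, 12)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version

if not check_python():
    print(f"Python {sys.version.split()[0]} is too old; LightRFT needs >= 3.12")

# If PyTorch is already installed, report its version and CUDA build:
try:
    import torch
    print("PyTorch:", torch.__version__, "| CUDA build:", torch.version.cuda)
except ImportError:
    pass  # PyTorch not installed yet; the Docker images below include it
```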

Docker Images

We provide pre-built Docker images for easy deployment and consistent environments. You can also build your own images using the provided Dockerfile and Makefile.

Using Pre-built Images

The official Docker images are available on Docker Hub. You can pull the latest version using:

docker pull opendilab/lightrft:v0.1.0

To run a container with GPU support:

docker run --gpus all -it --rm \
    -v /path/to/your/data:/app/data \
    -v /path/to/your/checkpoints:/app/checkpoints \
    opendilab/lightrft:v0.1.0 /bin/bash

Building Custom Images

If you need to customize the environment or build from a specific branch, you can use the provided Makefile to build the image locally.

  1. Prerequisites: Ensure you have Docker and NVIDIA Container Toolkit installed.

  2. Build the image:

    # Build the image with the default name (opendilab/lightrft:v${VERSION})
    make dbuild
    

    The IMAGE_NAME is automatically determined based on the current version of the project. You can also override it:

    make dbuild IMAGE_NAME=your-custom-tag:latest
    
  3. Technical Details:

    • Base Image: nvcr.io/nvidia/pytorch:25.01-py3 (includes PyTorch 2.5+ and CUDA 12.8).
    • Dependencies: The build process installs essential components including vLLM, DeepSpeed, Flash-Attention, and SGLang in a specific order to ensure stability.
    • Optimization: The Dockerfile uses multi-layer optimization and environment variables for non-interactive installation.

Installation

Standard Installation

LightRFT uses SGLang as the default inference backend with Flash-Attention for optimized performance.

# Clone the repository
git clone https://github.com/opendilab/LightRFT.git
cd LightRFT

# Install LightRFT with all core dependencies
pip install -e .

What gets installed: PyTorch, SGLang, Flash-Attention, Transformers, DeepSpeed, and other core dependencies.

Optional: Install vLLM Backend

If you want to use vLLM instead of (or alongside) SGLang:

# Install vLLM backend
pip install ".[vllm]"

# Or install vLLM directly
pip install "vllm>=0.13.3"

Troubleshooting Flash-Attention Installation

Flash-Attention is included by default but may fail on some systems due to CUDA compatibility. If installation fails, try:

Option 1: Use pre-built wheels (recommended)

# Download the appropriate wheel from https://github.com/Dao-AILab/flash-attention/releases
# Example for CUDA 12.x and PyTorch 2.9:
pip install flash_attn-2.8.3+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl

Option 2: Use Docker (easiest)

# The official Docker images include all dependencies
docker pull opendilab/lightrft:v0.1.0

📚 Usage Guide

Basic Example: GRPO Training

# Single node, 8 GPU training example
cd LightRFT

# Run GRPO training (GSM8K math reasoning task)
bash examples/gsm8k_geo3k/run_grpo_gsm8k_qwen2.5_0.5b.sh

# Or run Geo3K geometry problem training (VLM multimodal)
bash examples/gsm8k_geo3k/run_grpo_geo3k_qwen2.5_vl_7b.sh

🏗️ Project Structure

LightRFT/
├── lightrft/                      # Core library
│   ├── strategy/                  # Training & inference strategies
│   │   ├── fsdp/                  # FSDP implementation
│   │   ├── deepspeed/             # DeepSpeed implementation
│   │   ├── vllm_utils/            # vLLM utilities
│   │   ├── sglang_utils/          # SGLang utilities
│   │   └── utils/                 # Strategy utilities
│   ├── models/                    # Model definitions
│   │   ├── actor_al.py            # Audio-language model actor
│   │   ├── actor_language.py      # Language model actor
│   │   ├── actor_vl.py            # Vision-language model actor
│   │   ├── grm_vl.py              # Generative reward model (Vision-Language)
│   │   ├── srm_al.py              # Scalar reward model (Audio-Language)
│   │   ├── srm_vl.py              # Scalar reward model (Vision-Language)
│   │   ├── loss.py                # Loss functions
│   │   ├── monkey_patch/          # Model adaptation patches for distributed training
│   │   ├── tests/                 # Model tests
│   │   └── utils.py               # Model utilities
│   ├── trainer/                   # Trainer implementations
│   │   ├── ppo_trainer.py         # LLM PPO trainer
│   │   ├── ppo_trainer_vl.py      # VLM PPO trainer
│   │   ├── spmd_ppo_trainer.py    # SPMD PPO trainer Extension (**Core**)
│   │   ├── grm_trainer_vl.py      # Generative reward model trainer (Vision-Language)
│   │   ├── srm_trainer_al.py      # Scalar reward model trainer (Audio-Language)
│   │   ├── srm_trainer_vl.py      # Scalar reward model trainer (Vision-Language)
│   │   ├── fast_exp_maker.py      # Fast experience generator (**Core**)
│   │   ├── experience_maker.py    # Base experience generator
│   │   ├── experience_maker_vl.py # Base experience generator for VLM
│   │   ├── replay_buffer.py       # Replay buffer
│   │   ├── replay_buffer_vl.py    # VLM replay buffer
│   │   ├── replay_buffer_utils.py # Replay buffer utilities
No findings