A modular, primitive-first, python-first PyTorch library for Reinforcement Learning.
<a href="https://pypi.org/project/torchrl"><img src="https://img.shields.io/pypi/v/torchrl" alt="pypi version"></a>
<a href="https://pypi.org/project/torchrl-nightly"><img src="https://img.shields.io/pypi/v/torchrl-nightly?label=nightly" alt="pypi nightly version"></a>
# TorchRL

<p align="center"> <img src="docs/source/_static/img/icon.png" width="200" > </p>

What's New | LLM API | Getting Started | Documentation | TensorDict | Features | Examples, tutorials and demos | Citation | Installation | Asking a question | Contributing
TorchRL is an open-source Reinforcement Learning (RL) library for PyTorch.
## 🚀 What's New
### 🚀 Command-Line Training Interface - Train RL Agents Without Writing Code! (Experimental)
TorchRL now provides a powerful command-line interface that lets you train state-of-the-art RL agents with simple bash commands! No Python scripting required - just run training with customizable parameters:
- 🎯 One-Command Training: `python sota-implementations/ppo_trainer/train.py`
- ⚙️ Full Customization: Override any parameter via command line: `trainer.total_frames=2000000 optimizer.lr=0.0003`
- 🌍 Multi-Environment Support: Switch between Gym, Brax, DM Control, and more with `env=gym training_env.create_env_fn.base_env.env_name=HalfCheetah-v4`
- 📊 Built-in Logging: TensorBoard, Weights & Biases, CSV logging out of the box
- 🔧 Hydra-Powered: Leverages Hydra's powerful configuration system for maximum flexibility
- 🏃‍♂️ Production Ready: Same robust training pipeline as our SOTA implementations
Perfect for: Researchers, practitioners, and anyone who wants to train RL agents without diving into implementation details.
⚠️ Note: This is an experimental feature. The API may change in future versions. We welcome feedback and contributions to help improve this implementation!
📋 Prerequisites: The training interface requires Hydra for configuration management. Install with:

```bash
pip install "torchrl[utils]"
# or manually:
pip install hydra-core omegaconf
```
Check out the complete CLI documentation to get started!
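To make the override syntax above concrete: Hydra maps each dotted key onto a path in a nested config. The following is a simplified stdlib sketch of that idea (not Hydra's actual implementation; `apply_override` is a hypothetical helper for illustration):

```python
import ast

# Simplified illustration of Hydra-style dotted overrides (not Hydra itself).
def apply_override(config: dict, override: str) -> None:
    """Apply a single 'dotted.key=value' override to a nested config dict."""
    dotted_key, raw_value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    # Best-effort literal parsing: numbers and lists stay typed, the rest stays a string.
    try:
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value
    node[leaf] = value

cfg = {"trainer": {"total_frames": 1_000_000}, "optimizer": {"lr": 0.001}}
apply_override(cfg, "trainer.total_frames=2000000")
apply_override(cfg, "optimizer.lr=0.0003")
print(cfg["trainer"]["total_frames"])  # 2000000
print(cfg["optimizer"]["lr"])  # 0.0003
```

Hydra adds much more on top (config composition, groups such as `env=gym`, type checking via OmegaConf), but the dotted-path mechanics are the core of the override syntax.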
### 🚀 vLLM Revamp - Major Enhancement to LLM Infrastructure (v0.10)
This release introduces a comprehensive revamp of TorchRL's vLLM integration, delivering significant improvements in performance, scalability, and usability for large language model inference and training workflows:
- 🔥 AsyncVLLM Service: Production-ready distributed vLLM inference with multi-replica scaling and automatic Ray actor management
- ⚖️ Multiple Load Balancing Strategies: Routing strategies including prefix-aware, request-based, and KV-cache load balancing for optimal performance
- 🏗️ Unified vLLM Architecture: New `RLvLLMEngine` interface standardizing all vLLM backends, with a simplified `vLLMUpdaterV2` for seamless weight updates
- 🌐 Distributed Data Loading: New `RayDataLoadingPrimer` for shared, distributed data loading across multiple environments
- 📈 Enhanced Performance: Native vLLM batching, concurrent request processing, and optimized resource allocation via Ray placement groups
```python
# Simple AsyncVLLM usage - production ready!
from torchrl.modules.llm import AsyncVLLM, vLLMWrapper

# Create distributed vLLM service with load balancing
service = AsyncVLLM.from_pretrained(
    "Qwen/Qwen2.5-7B",
    num_devices=2,   # Tensor parallel across 2 GPUs
    num_replicas=4,  # 4 replicas for high throughput
    max_model_len=4096,
)

# Use with TorchRL's LLM wrappers
wrapper = vLLMWrapper(service, input_mode="history")

# Simplified weight updates
from torchrl.collectors.llm import vLLMUpdaterV2

updater = vLLMUpdaterV2(service)  # Auto-configures from engine
```
This revamp positions TorchRL as the leading platform for scalable LLM inference and training, providing production-ready tools for both research and deployment scenarios.
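To give intuition for one of the routing strategies listed above, here is a pure-Python sketch of request-based (least-loaded) load balancing across replicas. This is a conceptual model only; the `LeastLoadedRouter` class is hypothetical, and TorchRL's `AsyncVLLM` handles routing internally:

```python
# Conceptual sketch of request-based (least-loaded) routing across replicas.
# Illustrative only; not TorchRL's actual implementation.
class LeastLoadedRouter:
    def __init__(self, num_replicas: int):
        self.in_flight = [0] * num_replicas  # outstanding requests per replica

    def acquire(self) -> int:
        """Pick the replica with the fewest in-flight requests."""
        replica = min(range(len(self.in_flight)), key=self.in_flight.__getitem__)
        self.in_flight[replica] += 1
        return replica

    def release(self, replica: int) -> None:
        """Mark a request on `replica` as finished."""
        self.in_flight[replica] -= 1

router = LeastLoadedRouter(num_replicas=4)
first = router.acquire()   # replica 0 (all idle, lowest index wins)
second = router.acquire()  # replica 1, since replica 0 is now busy
print(first, second)  # 0 1
```

Prefix-aware and KV-cache-aware strategies refine this idea by also considering which replica already holds the prompt prefix or has spare KV-cache capacity, trading perfect load balance for cache reuse.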
### 🧪 PPOTrainer (Experimental) - High-Level Training Interface
TorchRL now includes an experimental PPOTrainer that provides a complete, configurable PPO training solution! This prototype feature combines TorchRL's modular components into a cohesive training system with sensible defaults:
- 🎯 Complete Training Pipeline: Handles environment setup, data collection, loss computation, and optimization automatically
- ⚙️ Extensive Configuration: Comprehensive Hydra-based config system for easy experimentation and hyperparameter tuning
- 📊 Built-in Logging: Automatic tracking of rewards, actions, episode completion rates, and training statistics
- 🔧 Modular Design: Built on existing TorchRL components (collectors, losses, replay buffers) for maximum flexibility
- 📝 Minimal Code: Complete SOTA implementation in just ~20 lines!
Working Example: See sota-implementations/ppo_trainer/ for a complete, working PPO implementation that trains on Pendulum-v1 with full Hydra configuration support.
Prerequisites: Requires Hydra for configuration management: `pip install "torchrl[utils]"`
<details><summary>Quick Start Example</summary>

```python
import hydra
from torchrl.trainers.algorithms.configs import *


@hydra.main(config_path="config", config_name="config", version_base="1.1")
def main(cfg):
    trainer = hydra.utils.instantiate(cfg.trainer)
    trainer.train()


if __name__ == "__main__":
    main()
```

Complete PPO training in ~20 lines with full configurability.

</details>

<details><summary>API Usage Examples</summary>

```bash
# Basic usage - train PPO on Pendulum-v1 with default settings
python sota-implementations/ppo_trainer/train.py

# Custom configuration with command-line overrides
python sota-implementations/ppo_trainer/train.py \
  trainer.total_frames=2000000 \
  training_env.create_env_fn.base_env.env_name=HalfCheetah-v4 \
  networks.policy_network.num_cells=[256,256] \
  optimizer.lr=0.0003

# Use different environment and logger
python sota-implementations/ppo_trainer/train.py \
  env=gym \
  training_env.create_env_fn.base_env.env_name=Walker2d-v4 \
  logger=tensorboard

# See all available options
python sota-implementations/ppo_trainer/train.py --help
```

</details>
Future Plans: Additional algorithm trainers (SAC, TD3, DQN) and full integration of all TorchRL components within the configuration system are planned for upcoming releases.
## LLM API - Complete Framework for Language Model Fine-tuning
TorchRL includes a comprehensive LLM API for post-training and fine-tuning of language models! This framework provides everything you need for RLHF, supervised fine-tuning, and tool-augmented training:
- 🤖 Unified LLM Wrappers: Seamless integration with Hugging Face models and vLLM inference engines
- 💬 Conversation Management: Advanced `History` class for multi-turn dialogue with automatic chat template detection
- 🛠️ Tool Integration: Built-in support for Python code execution, function calling, and custom tool transforms
- 🎯 Specialized Objectives: GRPO (Group Relative Policy Optimization) and SFT loss functions optimized for language models
- ⚡ High-Performance Collectors: Async data collection with distributed training support
- 🔄 Flexible Environments: Transform-based architecture for reward computation, data loading, and conversation augmentation
The LLM API follows TorchRL's modular design principles, allowing you to mix and match components for your specific use case. Check out the complete documentation and GRPO implementation example to get started.
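The tool-integration and transform-based reward ideas can be illustrated with a plain-Python sketch. Everything here is hypothetical for illustration: the `<tool>` tag format, the `run_python_tool` helper, and the pass/fail reward are not TorchRL APIs, which instead implement tools and rewards as environment transforms over `History` objects:

```python
import re

# Conceptual sketch of a code-execution "tool" with a pass/fail reward.
# Illustrative only; TorchRL's LLM API implements tools as environment transforms.
TOOL_CALL = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def run_python_tool(reply: str) -> float:
    """Execute the first <tool>...</tool> block in a reply; reward 1.0 on success."""
    match = TOOL_CALL.search(reply)
    if match is None:
        return 0.0  # no tool call in this turn
    try:
        exec(match.group(1), {})  # sandboxing omitted in this sketch
        return 1.0
    except Exception:
        return 0.0  # tool call raised, e.g. a runtime error

good = "Here you go: <tool>x = 2 + 2</tool>"
bad = "Try this: <tool>1 / 0</tool>"
print(run_python_tool(good))  # 1.0
print(run_python_tool(bad))   # 0.0
```

In the real API, a transform like this would append the tool's output back into the conversation history so the model can react to it on the next turn, which is what makes tool-augmented multi-turn training possible.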
