# KernelGYM: A Gym for Kernel Generations
KernelGYM is a GPU-distributed environment designed for evaluating and training AI models on GPU kernel generation tasks. This repository also includes our validated RL training methods from our paper:
> **Dr.Kernel: Reinforcement Learning Done Right for Triton Kernel Generations**

See `drkernel/` for training implementation details and the recipe.
## What Can You Do with KernelGYM?
- **Long-Horizon RL Training**: Train kernel generation models with multi-turn rollouts, reward hacking detection, and detailed profiling metrics
- **Agentic Trajectory Collection**: Collect high-quality training data from agent interactions with the kernel evaluation environment
- **Large-Scale Kernel Optimization**: Deploy your own agents to optimize kernel implementations across thousands of tasks in parallel
- **Parallel Kernel Evaluation**: Evaluate kernel correctness and performance across distributed GPU clusters with automatic error recovery
- **Extend to New GPU Tasks**: Build on our abstractions to support custom GPU workloads beyond kernel generation (physics simulation, rendering, etc.)
## Table of Contents
- Overview
- Key Challenges & Highlights
- Architecture
- Installation
- Quick Start
- Usage Examples
- Supported Backends & Toolkits
- Extending KernelGYM
- Training Recipe
- API Reference
## Overview
Training AI models to generate optimized GPU kernels presents unique challenges that standard code generation environments cannot address. KernelGYM provides:
- **GPU-Distributed Task Scheduling**: Efficiently manage and distribute kernel evaluation tasks across multiple GPUs and nodes
- **Subprocess Isolation**: CUDA error isolation through subprocess worker pools prevents GPU crashes from affecting the main evaluation pipeline
- **Multi-Backend Support**: Evaluate kernels written in CUDA, Triton, and other frameworks
- **RL Training Integration**: Seamlessly integrate with RL training pipelines (VERL framework) for reward computation
- **Extensible Architecture**: Abstract interfaces for backends, toolkits, and workflows enable easy extension to new GPU tasks
## Key Challenges & Highlights

### Challenges in Kernel Generation Evaluation
- **GPU Resource Management**: Unlike CPU-bound code execution, kernel evaluation requires careful GPU memory management and device isolation. A single CUDA error can crash an entire evaluation pipeline.
- **Performance Measurement Complexity**: Accurate kernel timing requires proper warmup, multiple trials, and CUDA event synchronization, not just wall-clock timing.
- **Correctness Verification**: Kernels must produce numerically correct results within floating-point tolerances, requiring reference implementations for comparison.
- **Scalability**: Training requires evaluating thousands of kernel samples per batch, necessitating distributed evaluation across multiple GPUs and nodes.
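The timing pitfall above can be illustrated with a small, generic sketch. Nothing here is KernelGYM's API (`bench` is a hypothetical helper); on a GPU you would record CUDA events around the call and synchronize before reading the elapsed time, because kernel launches are asynchronous and wall-clock timing alone is misleading.

```python
import statistics
import time

def bench(fn, warmup=10, trials=50):
    """Time fn with warmup runs discarded; returns median seconds per call.

    Hypothetical helper for illustration. On a GPU, replace perf_counter
    with CUDA event recording plus synchronization, since the host-side
    clock returns before the kernel has actually finished.
    """
    for _ in range(warmup):              # warm caches / JIT before measuring
        fn()
    samples = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)    # median is robust to outlier runs

median_s = bench(lambda: sum(range(10_000)))
print(f"{median_s * 1e6:.1f} us per call")
```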
### KernelGYM's Solutions
- **Subprocess Worker Pool**: Each GPU worker maintains a pool of subprocess workers. CUDA errors are isolated to subprocesses and automatically recovered, preventing cascading failures.
- **Precise Timing Infrastructure**: Built-in CUDA event-based timing with configurable warmup and trial counts ensures accurate performance measurements.
- **Flexible Correctness Checking**: Support for custom tolerance levels (rtol/atol), multiple test cases, and decoy kernel detection.
- **Horizontal Scalability**: Redis-based task queue enables seamless scaling from single-GPU to multi-node deployments.
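The subprocess-isolation idea can be sketched with Python's standard `multiprocessing` module. This is a simplified stand-in for KernelGYM's worker pool, not its actual implementation (`evaluate_isolated` and the crash simulation are illustrative); it assumes the default `fork` start method on Linux. The point is that a hard crash in the child surfaces as a nonzero exit code, and the parent recovers instead of dying.

```python
import multiprocessing as mp

def _run_candidate(code_id, queue):
    """Child process: 'evaluates' a kernel; a hard crash only kills this process."""
    if code_id == "bad":
        raise SystemExit(134)  # simulate an aborted worker (e.g. a fatal CUDA error)
    queue.put((code_id, "ok"))

def evaluate_isolated(code_id, timeout=5.0):
    """Parent: run one evaluation in a fresh subprocess and recover on failure."""
    queue = mp.Queue()
    proc = mp.Process(target=_run_candidate, args=(code_id, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():          # hung evaluation: kill it and report a timeout
        proc.terminate()
        proc.join()
        return (code_id, "timeout")
    if proc.exitcode != 0:       # crashed subprocess: the parent survives
        return (code_id, "crashed")
    return queue.get()           # healthy result reported back via the queue

print(evaluate_isolated("good"))   # ('good', 'ok')
print(evaluate_isolated("bad"))    # ('bad', 'crashed')
```

A real pool would keep subprocesses warm and reuse them between tasks; spawning one per evaluation, as here, trades throughput for simplicity.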
## Architecture
<p align="center"> <img src="assets/architecture_figure1_hd.png" width="980" alt="KernelGYM architecture"> </p>

*High-level architecture of KernelGYM: API server + task manager + distributed GPU workers with subprocess isolation.*
```
┌─────────────────────────────────────────────────────────────────────┐
│                       KernelGYM Architecture                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │    Client    │───▶│  API Server  │───▶│     Task Manager     │   │
│  │  (Training)  │    │  (FastAPI)   │    │    (Redis Queue)     │   │
│  └──────────────┘    └──────────────┘    └──────────┬───────────┘   │
│                                                     │               │
│  ┌──────────────────────────────────────────────────┴────────────┐  │
│  │                         Worker Layer                          │  │
│  │                                                               │  │
│  │   ┌─────────────────────┐       ┌─────────────────────┐       │  │
│  │   │    GPU Worker 0     │       │    GPU Worker N     │       │  │
│  │   │  ┌───────────────┐  │       │  ┌───────────────┐  │       │  │
│  │   │  │  Subprocess   │  │  ...  │  │  Subprocess   │  │       │  │
│  │   │  │  Pool (CUDA   │  │       │  │  Pool (CUDA   │  │       │  │
│  │   │  │  Isolation)   │  │       │  │  Isolation)   │  │       │  │
│  │   │  └───────────────┘  │       │  └───────────────┘  │       │  │
│  │   └─────────────────────┘       └─────────────────────┘       │  │
│  │                                                               │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                        Core Components                        │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐    │  │
│  │  │  Backends   │  │  Toolkits   │  │ Workflow Controllers│    │  │
│  │  │  (Compile/  │  │ (Evaluate)  │  │    (Orchestrate)    │    │  │
│  │  │  Load/Run)  │  │             │  │                     │    │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘    │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Core Components
| Component | Description |
|-----------|-------------|
| Backend | Handles kernel compilation, loading, and execution. Abstracts different kernel frameworks (CUDA, Triton). |
| Toolkit | Implements evaluation logic: correctness checking, performance measurement, profiling. |
| Workflow Controller | Orchestrates multi-step evaluation workflows (e.g., reference timing + kernel evaluation). |
| Task Manager | Redis-based queue management with priority scheduling and worker assignment. |
| GPU Worker | Manages task execution on a specific GPU with subprocess isolation for error recovery. |
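As a rough illustration of the Toolkit role, here is a hypothetical correctness check exposing the rtol/atol knobs mentioned above. Real evaluation would compare GPU tensors from the candidate kernel against a reference implementation over multiple test cases; this sketch uses plain Python floats, and `check_correctness` is not KernelGYM's actual API.

```python
import math

def check_correctness(candidate, reference, inputs, rtol=1e-3, atol=1e-3):
    """Compare a candidate 'kernel' against a reference within float tolerances.

    Illustrative Toolkit-style helper: every test input must agree within
    the relative/absolute tolerance, mirroring the rtol/atol configuration.
    """
    for x in inputs:
        got, want = candidate(x), reference(x)
        if not math.isclose(got, want, rel_tol=rtol, abs_tol=atol):
            return False
    return True

reference = lambda x: x * x
noisy = lambda x: x * x * (1 + 1e-5)   # tiny numerical drift, within tolerance
wrong = lambda x: x * x + 1            # systematically off, should be rejected

print(check_correctness(noisy, reference, inputs=[0.0, 1.0, 3.5]))  # True
print(check_correctness(wrong, reference, inputs=[0.0, 1.0, 3.5]))  # False
```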
## Installation

### Prerequisites
- Python 3.10+
- CUDA 11.8+ with compatible GPU
- Redis server
- Linux (Ubuntu 20.04+ recommended)
### Setup
```bash
# Clone the repository
git clone https://github.com/hkust-nlp/KernelGYM.git
cd KernelGYM

# Run the setup script
bash setup.sh
```
The setup script will:
- Install Python dependencies from `requirements.txt`
- Install additional packages (pydantic-settings)
- Install system utilities (iproute2)
- Install Redis for local deployment
### Configuration
You have two options:

- **Recommended**: let KernelGYM auto-generate `.env` on first startup.
- **Manual**: create `.env` yourself with explicit values.

Manual `.env` example:
```bash
# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=  # Optional

# API Server Configuration
API_HOST=0.0.0.0
API_PORT=10907

# GPU Configuration
GPU_DEVICES=[0,1,2,3,4,5,6,7]  # List of GPU indices to use

# Optional: Node identification for multi-node setup
NODE_ID=node-1
```
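KernelGYM reads this configuration via pydantic-settings (installed by `setup.sh`). Purely as a dependency-free illustration of what loading these variables amounts to, here is a stdlib sketch; the helper names are hypothetical and the defaults mirror the example values above.

```python
import os

def load_gpu_devices(default="[0]"):
    """Parse GPU_DEVICES, e.g. '[0,1,2,3]' -> [0, 1, 2, 3] (illustrative helper)."""
    raw = os.environ.get("GPU_DEVICES", default)
    inner = raw.strip().strip("[]")
    return [int(tok) for tok in inner.split(",") if tok.strip()]

def load_config():
    """Read the .env-style variables shown above from the process environment."""
    return {
        "redis_host": os.environ.get("REDIS_HOST", "localhost"),
        "redis_port": int(os.environ.get("REDIS_PORT", "6379")),
        "api_host": os.environ.get("API_HOST", "0.0.0.0"),
        "api_port": int(os.environ.get("API_PORT", "10907")),
        "gpu_devices": load_gpu_devices(),
    }

os.environ["GPU_DEVICES"] = "[0,1,2,3]"
print(load_config()["gpu_devices"])  # [0, 1, 2, 3]
```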
## Quick Start

### Fastest Path (Single Node)
If your goal is to get a local KernelGYM service up quickly:
```bash
# 1) Install dependencies
bash setup.sh

# 2) Auto-configure .env (or skip; the start script can do this automatically)
bash scripts/auto_configure.sh

# 3) Start API + workers
./start_all_with_monitor.sh

# 4) Verify the service
curl http://localhost:10907/health
curl http://localhost:10907/workers/status
```
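The verification step can also be scripted, for example to block a training job until the service is up. The `/health` endpoint is the one shown above; the polling helpers themselves are hypothetical, not part of KernelGYM.

```python
import time
import urllib.error
import urllib.request

def check_health(base_url, timeout=2.0):
    """Return True if the API answers /health with HTTP 200 (illustrative helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet, or unreachable

def wait_healthy(base_url, retries=10, delay=1.0):
    """Poll until the service responds, e.g. right after start_all_with_monitor.sh."""
    for _ in range(retries):
        if check_health(base_url):
            return True
        time.sleep(delay)
    return False

print(check_health("http://localhost:10907"))
```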
`scripts/auto_configure.sh` includes a practical example strategy for fast setup:

- Detect the host/IP and GPU list
- Find available ports for Redis/API/Metrics from candidate port pools

Treat this as a reference implementation, not a fixed policy. If your machine or cluster has different networking constraints, edit the port-selection logic in `scripts/auto_configure.sh` (for example, custom port ranges, reserved ports, or service-specific pinning) to match your environment.
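The "candidate port pool" idea can be sketched as follows. Both the helper and the pool below are illustrative, not the actual logic in `scripts/auto_configure.sh`: we simply try to bind each candidate port and return the first one that succeeds.

```python
import socket

def first_free_port(candidates):
    """Return the first port in candidates we can bind on localhost, else None.

    Mirrors the port-pool strategy described above; the candidate list is
    supplied by the caller, so reserved ports can simply be left out of it.
    """
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("127.0.0.1", port))
                return port          # bind succeeded: the port is free right now
            except OSError:
                continue             # already in use; try the next candidate

print(first_free_port([10907, 10908, 10909]))
```

Note the inherent race: a port reported free here can be taken by another process before the service binds it, which is one reason to keep a pool of candidates rather than a single pinned port.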
