# KernelGYM: A Gym for Kernel Generations
KernelGYM is a GPU-distributed environment designed for evaluating and training AI models on GPU kernel generation tasks. This repository also includes our validated RL training methods from our paper:
> **Dr.Kernel: Reinforcement Learning Done Right for Triton Kernel Generations**

See `drkernel/` for training implementation details and the recipe.
## What Can You Do with KernelGYM?
- **Long-Horizon RL Training**: Train kernel generation models with multi-turn rollouts, reward hacking detection, and detailed profiling metrics
- **Agentic Trajectory Collection**: Collect high-quality training data from agent interactions with the kernel evaluation environment
- **Large-Scale Kernel Optimization**: Deploy your own agents to optimize kernel implementations across thousands of tasks in parallel
- **Parallel Kernel Evaluation**: Evaluate kernel correctness and performance across distributed GPU clusters with automatic error recovery
- **Extend to New GPU Tasks**: Build on our abstractions to support custom GPU workloads beyond kernel generation (physics simulation, rendering, etc.)
## Table of Contents
- Overview
- Key Challenges & Highlights
- Architecture
- Installation
- Quick Start
- Usage Examples
- Supported Backends & Toolkits
- Extending KernelGYM
- Training Recipe
- API Reference
## Overview
Training AI models to generate optimized GPU kernels presents unique challenges that standard code generation environments cannot address. KernelGYM provides:
- **GPU-Distributed Task Scheduling**: Efficiently manage and distribute kernel evaluation tasks across multiple GPUs and nodes
- **Subprocess Isolation**: CUDA error isolation through subprocess worker pools prevents GPU crashes from affecting the main evaluation pipeline
- **Multi-Backend Support**: Evaluate kernels written in CUDA, Triton, and other frameworks
- **RL Training Integration**: Seamlessly integrate with RL training pipelines (VERL framework) for reward computation
- **Extensible Architecture**: Abstract interfaces for backends, toolkits, and workflows enable easy extension to new GPU tasks
## Key Challenges & Highlights

### Challenges in Kernel Generation Evaluation
- **GPU Resource Management**: Unlike CPU-bound code execution, kernel evaluation requires careful GPU memory management and device isolation. A single CUDA error can crash an entire evaluation pipeline.
- **Performance Measurement Complexity**: Accurate kernel timing requires proper warmup, multiple trials, and CUDA event synchronization, not just wall-clock timing.
- **Correctness Verification**: Kernels must produce numerically correct results within floating-point tolerances, requiring reference implementations for comparison.
- **Scalability**: Training requires evaluating thousands of kernel samples per batch, necessitating distributed evaluation across multiple GPUs and nodes.
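The timing pitfall above can be illustrated with a small, generic sketch. Nothing here is KernelGYM's API (`bench` is a hypothetical helper); on a GPU you would record CUDA events around the call and synchronize before reading the elapsed time, because kernel launches are asynchronous and wall-clock timing alone is misleading.

```python
import statistics
import time

def bench(fn, warmup=10, trials=50):
    """Time fn with warmup runs discarded; returns median seconds per call.

    Hypothetical helper for illustration. On a GPU, replace perf_counter
    with CUDA event recording plus synchronization, since the host-side
    clock returns before the kernel has actually finished.
    """
    for _ in range(warmup):              # warm caches / JIT before measuring
        fn()
    samples = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)    # median is robust to outlier runs

median_s = bench(lambda: sum(range(10_000)))
print(f"{median_s * 1e6:.1f} us per call")
```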
### KernelGYM's Solutions
- **Subprocess Worker Pool**: Each GPU worker maintains a pool of subprocess workers. CUDA errors are isolated to subprocesses and automatically recovered, preventing cascading failures.
- **Precise Timing Infrastructure**: Built-in CUDA event-based timing with configurable warmup and trial counts ensures accurate performance measurements.
- **Flexible Correctness Checking**: Support for custom tolerance levels (rtol/atol), multiple test cases, and decoy kernel detection.
- **Horizontal Scalability**: Redis-based task queue enables seamless scaling from single-GPU to multi-node deployments.
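The subprocess-isolation idea can be sketched with Python's standard `multiprocessing` module. This is a simplified stand-in for KernelGYM's worker pool, not its actual implementation (`evaluate_isolated` and the crash simulation are illustrative); it assumes the default `fork` start method on Linux. The point is that a hard crash in the child surfaces as a nonzero exit code, and the parent recovers instead of dying.

```python
import multiprocessing as mp

def _run_candidate(code_id, queue):
    """Child process: 'evaluates' a kernel; a hard crash only kills this process."""
    if code_id == "bad":
        raise SystemExit(134)  # simulate an aborted worker (e.g. a fatal CUDA error)
    queue.put((code_id, "ok"))

def evaluate_isolated(code_id, timeout=5.0):
    """Parent: run one evaluation in a fresh subprocess and recover on failure."""
    queue = mp.Queue()
    proc = mp.Process(target=_run_candidate, args=(code_id, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():          # hung evaluation: kill it and report a timeout
        proc.terminate()
        proc.join()
        return (code_id, "timeout")
    if proc.exitcode != 0:       # crashed subprocess: the parent survives
        return (code_id, "crashed")
    return queue.get()           # healthy result reported back via the queue

print(evaluate_isolated("good"))   # ('good', 'ok')
print(evaluate_isolated("bad"))    # ('bad', 'crashed')
```

A real pool would keep subprocesses warm and reuse them between tasks; spawning one per evaluation, as here, trades throughput for simplicity.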
## Architecture
<p align="center"> <img src="assets/architecture_figure1_hd.png" width="980" alt="KernelGYM architecture"> </p>

*High-level architecture of KernelGYM: API server + task manager + distributed GPU workers with subprocess isolation.*
```
┌─────────────────────────────────────────────────────────────────────┐
│                       KernelGYM Architecture                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │    Client    │───▶│  API Server  │───▶│     Task Manager     │   │
│  │  (Training)  │    │  (FastAPI)   │    │    (Redis Queue)     │   │
│  └──────────────┘    └──────────────┘    └──────────┬───────────┘   │
│                                                     │               │
│  ┌──────────────────────────────────────────────────┴────────────┐  │
│  │                         Worker Layer                          │  │
│  │                                                               │  │
│  │   ┌─────────────────────┐       ┌─────────────────────┐       │  │
│  │   │    GPU Worker 0     │       │    GPU Worker N     │       │  │
│  │   │  ┌───────────────┐  │       │  ┌───────────────┐  │       │  │
│  │   │  │  Subprocess   │  │  ...  │  │  Subprocess   │  │       │  │
│  │   │  │  Pool (CUDA   │  │       │  │  Pool (CUDA   │  │       │  │
│  │   │  │  Isolation)   │  │       │  │  Isolation)   │  │       │  │
│  │   │  └───────────────┘  │       │  └───────────────┘  │       │  │
│  │   └─────────────────────┘       └─────────────────────┘       │  │
│  │                                                               │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
│  ┌───────────────────────────────────────────────────────────────┐  │
│  │                        Core Components                        │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐    │  │
│  │  │  Backends   │  │  Toolkits   │  │ Workflow Controllers│    │  │
│  │  │  (Compile/  │  │ (Evaluate)  │  │    (Orchestrate)    │    │  │
│  │  │  Load/Run)  │  │             │  │                     │    │  │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘    │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
### Core Components
| Component | Description |
|-----------|-------------|
| Backend | Handles kernel compilation, loading, and execution. Abstracts different kernel frameworks (CUDA, Triton). |
| Toolkit | Implements evaluation logic: correctness checking, performance measurement, profiling. |
| Workflow Controller | Orchestrates multi-step evaluation workflows (e.g., reference timing + kernel evaluation). |
| Task Manager | Redis-based queue management with priority scheduling and worker assignment. |
| GPU Worker | Manages task execution on a specific GPU with subprocess isolation for error recovery. |
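As a rough illustration of the Toolkit role, here is a hypothetical correctness check exposing the rtol/atol knobs mentioned above. Real evaluation would compare GPU tensors from the candidate kernel against a reference implementation over multiple test cases; this sketch uses plain Python floats, and `check_correctness` is not KernelGYM's actual API.

```python
import math

def check_correctness(candidate, reference, inputs, rtol=1e-3, atol=1e-3):
    """Compare a candidate 'kernel' against a reference within float tolerances.

    Illustrative Toolkit-style helper: every test input must agree within
    the relative/absolute tolerance, mirroring the rtol/atol configuration.
    """
    for x in inputs:
        got, want = candidate(x), reference(x)
        if not math.isclose(got, want, rel_tol=rtol, abs_tol=atol):
            return False
    return True

reference = lambda x: x * x
noisy = lambda x: x * x * (1 + 1e-5)   # tiny numerical drift, within tolerance
wrong = lambda x: x * x + 1            # systematically off, should be rejected

print(check_correctness(noisy, reference, inputs=[0.0, 1.0, 3.5]))  # True
print(check_correctness(wrong, reference, inputs=[0.0, 1.0, 3.5]))  # False
```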
## Installation

### Prerequisites
- Python 3.10+
- CUDA 11.8+ with compatible GPU
- Redis server
- Linux (Ubuntu 20.04+ recommended)
### Setup
```bash
# Clone the repository
git clone https://github.com/hkust-nlp/KernelGYM.git
cd KernelGYM

# Run the setup script
bash setup.sh
```
The setup script will:
- Install Python dependencies from `requirements.txt`
- Install additional packages (pydantic-settings)
- Install system utilities (iproute2)
- Install Redis for local deployment
### Configuration
You have two options:

- **Recommended**: let KernelGYM auto-generate `.env` on first startup.
- **Manual**: create `.env` yourself with explicit values.

Manual `.env` example:
```bash
# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=  # Optional

# API Server Configuration
API_HOST=0.0.0.0
API_PORT=10907

# GPU Configuration
GPU_DEVICES=[0,1,2,3,4,5,6,7]  # List of GPU indices to use

# Optional: Node identification for multi-node setup
NODE_ID=node-1
```
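KernelGYM reads this configuration via pydantic-settings (installed by `setup.sh`). Purely as a dependency-free illustration of what loading these variables amounts to, here is a stdlib sketch; the helper names are hypothetical and the defaults mirror the example values above.

```python
import os

def load_gpu_devices(default="[0]"):
    """Parse GPU_DEVICES, e.g. '[0,1,2,3]' -> [0, 1, 2, 3] (illustrative helper)."""
    raw = os.environ.get("GPU_DEVICES", default)
    inner = raw.strip().strip("[]")
    return [int(tok) for tok in inner.split(",") if tok.strip()]

def load_config():
    """Read the .env-style variables shown above from the process environment."""
    return {
        "redis_host": os.environ.get("REDIS_HOST", "localhost"),
        "redis_port": int(os.environ.get("REDIS_PORT", "6379")),
        "api_host": os.environ.get("API_HOST", "0.0.0.0"),
        "api_port": int(os.environ.get("API_PORT", "10907")),
        "gpu_devices": load_gpu_devices(),
    }

os.environ["GPU_DEVICES"] = "[0,1,2,3]"
print(load_config()["gpu_devices"])  # [0, 1, 2, 3]
```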
## Quick Start

### Fastest Path (Single Node)
If your goal is to get a local KernelGYM service up quickly:
```bash
# 1) Install dependencies
bash setup.sh

# 2) Auto-configure .env (or skip; the start script can do this automatically)
bash scripts/auto_configure.sh

# 3) Start API + workers
./start_all_with_monitor.sh

# 4) Verify the service
curl http://localhost:10907/health
curl http://localhost:10907/workers/status
```
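The verification step can also be scripted, for example to block a training job until the service is up. The `/health` endpoint is the one shown above; the polling helpers themselves are hypothetical, not part of KernelGYM.

```python
import time
import urllib.error
import urllib.request

def check_health(base_url, timeout=2.0):
    """Return True if the API answers /health with HTTP 200 (illustrative helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # server not up yet, or unreachable

def wait_healthy(base_url, retries=10, delay=1.0):
    """Poll until the service responds, e.g. right after start_all_with_monitor.sh."""
    for _ in range(retries):
        if check_health(base_url):
            return True
        time.sleep(delay)
    return False

print(check_health("http://localhost:10907"))
```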
`scripts/auto_configure.sh` includes a practical example strategy for fast setup:

- Detect the host/IP and GPU list
- Find available ports for Redis/API/Metrics from candidate port pools

Treat this as a reference implementation, not a fixed policy. If your machine or cluster has different networking constraints, edit the port-selection logic in `scripts/auto_configure.sh` (for example, custom port ranges, reserved ports, or service-specific pinning) to match your environment.
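The "candidate port pool" idea can be sketched as follows. Both the helper and the pool below are illustrative, not the actual logic in `scripts/auto_configure.sh`: we simply try to bind each candidate port and return the first one that succeeds.

```python
import socket

def first_free_port(candidates):
    """Return the first port in candidates we can bind on localhost, else None.

    Mirrors the port-pool strategy described above; the candidate list is
    supplied by the caller, so reserved ports can simply be left out of it.
    """
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("127.0.0.1", port))
                return port          # bind succeeded: the port is free right now
            except OSError:
                continue             # already in use; try the next candidate

print(first_free_port([10907, 10908, 10909]))
```

Note the inherent race: a port reported free here can be taken by another process before the service binds it, which is one reason to keep a pool of candidates rather than a single pinned port.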
