
KernelGYM

[KernelGYM & Dr. Kernel] A distributed GPU environment and a collection of RL training methods to support RL for kernel generation

Install / Use

/learn @hkust-nlp/KernelGYM

README

KernelGYM: A Gym for Kernel Generations


KernelGYM is a GPU-distributed environment designed for evaluating and training AI models on GPU kernel generation tasks. This repository also includes our validated RL training methods from our paper:

Dr.Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

See drkernel/ for training implementation details and recipe.

What Can You Do with KernelGYM?

  • Long-Horizon RL Training - Train kernel generation models with multi-turn rollouts, reward hacking detection, and detailed profiling metrics
  • Agentic Trajectory Collection - Collect high-quality training data from agent interactions with the kernel evaluation environment
  • Large-Scale Kernel Optimization - Deploy your own agents to optimize kernel implementations across thousands of tasks in parallel
  • Parallel Kernel Evaluation - Evaluate kernel correctness and performance across distributed GPU clusters with automatic error recovery
  • Extend to New GPU Tasks - Build on our abstractions to support custom GPU workloads beyond kernel generation (physics simulation, rendering, etc.)

Overview

Training AI models to generate optimized GPU kernels presents unique challenges that standard code generation environments cannot address. KernelGYM provides:

  • GPU-Distributed Task Scheduling: Efficiently manage and distribute kernel evaluation tasks across multiple GPUs and nodes
  • Subprocess Isolation: CUDA error isolation through subprocess worker pools prevents GPU crashes from affecting the main evaluation pipeline
  • Multi-Backend Support: Evaluate kernels written in CUDA, Triton, and other frameworks
  • RL Training Integration: Seamlessly integrate with RL training pipelines (VERL framework) for reward computation
  • Extensible Architecture: Abstract interfaces for backends, toolkits, and workflows enable easy extension to new GPU tasks
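To make the RL integration point concrete, here is a hedged sketch of how an evaluation result might be mapped to a scalar reward. The `EvalResult` fields and the speedup-based shaping are illustrative assumptions, not KernelGYM's or Dr.Kernel's actual reward definition; see drkernel/ for the real recipe.

```python
# Hypothetical reward shaping for an RL pipeline. The result schema
# (compiled/correct/speedup) is illustrative, not KernelGYM's actual API.
from dataclasses import dataclass

@dataclass
class EvalResult:
    compiled: bool      # kernel compiled and loaded successfully
    correct: bool       # outputs matched the reference within tolerance
    speedup: float      # reference_time / kernel_time (1.0 = parity)

def kernel_reward(result: EvalResult) -> float:
    """Map an evaluation result to a scalar reward for RL training."""
    if not result.compiled:
        return -1.0     # penalize uncompilable code
    if not result.correct:
        return 0.0      # no reward for wrong answers
    # Reward correct kernels, with a bonus that grows with speedup.
    return 1.0 + max(0.0, result.speedup - 1.0)

print(kernel_reward(EvalResult(True, True, 2.0)))  # → 2.0
```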

Key Challenges & Highlights

Challenges in Kernel Generation Evaluation

  1. GPU Resource Management: Unlike CPU-bound code execution, kernel evaluation requires careful GPU memory management and device isolation. A single CUDA error can crash an entire evaluation pipeline.

  2. Performance Measurement Complexity: Accurate kernel timing requires proper warmup, multiple trials, and CUDA event synchronization - not just wall-clock timing.

  3. Correctness Verification: Kernels must produce numerically correct results within floating-point tolerances, requiring reference implementations for comparison.

  4. Scalability: Training requires evaluating thousands of kernel samples per batch, necessitating distributed evaluation across multiple GPUs and nodes.
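Challenge 2 can be made concrete with CUDA events. The sketch below uses PyTorch's event API; the warmup and trial counts are arbitrary examples, not KernelGYM's defaults.

```python
import torch

def time_kernel(fn, warmup: int = 10, trials: int = 100) -> float:
    """Time a GPU callable in milliseconds using CUDA events.

    Wall-clock timing would include host overhead and miss asynchronous
    launches; CUDA events record on the GPU timeline, and synchronize()
    ensures the measured work has actually finished.
    """
    for _ in range(warmup):          # warm up caches / JIT compilation
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(trials):
        fn()
    end.record()
    torch.cuda.synchronize()         # wait for all trials to complete
    return start.elapsed_time(end) / trials  # average ms per call
```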

KernelGYM's Solutions

  • Subprocess Worker Pool: Each GPU worker maintains a pool of subprocess workers. CUDA errors are isolated to subprocesses and automatically recovered, preventing cascading failures.

  • Precise Timing Infrastructure: Built-in CUDA event-based timing with configurable warmup and trial counts ensures accurate performance measurements.

  • Flexible Correctness Checking: Support for custom tolerance levels (rtol/atol), multiple test cases, and decoy kernel detection.

  • Horizontal Scalability: Redis-based task queue enables seamless scaling from single-GPU to multi-node deployments.
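The tolerance-based correctness check can be sketched with `torch.allclose`; the rtol/atol defaults below are illustrative, not KernelGYM's configured values.

```python
import torch

def check_correctness(out: torch.Tensor, ref: torch.Tensor,
                      rtol: float = 1e-3, atol: float = 1e-3) -> bool:
    """Compare kernel output against a reference within FP tolerances."""
    if out.shape != ref.shape or out.dtype != ref.dtype:
        return False     # shape/dtype mismatch is an outright failure
    return torch.allclose(out, ref, rtol=rtol, atol=atol)

# Small perturbations within tolerance pass; larger ones fail.
ref = torch.ones(4)
assert check_correctness(ref + 1e-5, ref)
assert not check_correctness(ref + 1.0, ref)
```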

Architecture

<p align="center"> <img src="assets/architecture_figure1_hd.png" width="980" alt="KernelGYM architecture"> </p>

High-level architecture of KernelGYM: API server + task manager + distributed GPU workers with subprocess isolation.

┌─────────────────────────────────────────────────────────────────────┐
│                          KernelGYM Architecture                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐   │
│  │   Client     │───▶│  API Server  │───▶│   Task Manager       │   │
│  │  (Training)  │    │  (FastAPI)   │    │   (Redis Queue)      │   │
│  └──────────────┘    └──────────────┘    └──────────┬───────────┘   │
│                                                      │               │
│                      ┌───────────────────────────────┴───────────┐   │
│                      │              Worker Layer                  │   │
│  ┌───────────────────┼───────────────────────────────────────────┤   │
│  │                   ▼                                           │   │
│  │  ┌─────────────────────┐  ┌─────────────────────┐             │   │
│  │  │   GPU Worker 0      │  │   GPU Worker N      │             │   │
│  │  │  ┌───────────────┐  │  │  ┌───────────────┐  │             │   │
│  │  │  │ Subprocess    │  │  │  │ Subprocess    │  │    ...      │   │
│  │  │  │ Pool (CUDA    │  │  │  │ Pool (CUDA    │  │             │   │
│  │  │  │ Isolation)    │  │  │  │ Isolation)    │  │             │   │
│  │  │  └───────────────┘  │  │  └───────────────┘  │             │   │
│  │  └─────────────────────┘  └─────────────────────┘             │   │
│  │                                                               │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
│  ┌───────────────────────────────────────────────────────────────┐   │
│  │                     Core Components                            │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐    │   │
│  │  │  Backends   │  │  Toolkits   │  │  Workflow Controllers│    │   │
│  │  │  (Compile/  │  │  (Evaluate) │  │  (Orchestrate)       │    │   │
│  │  │   Load/Run) │  │             │  │                      │    │   │
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘    │   │
│  └───────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Core Components

| Component | Description |
|-----------|-------------|
| Backend | Handles kernel compilation, loading, and execution. Abstracts different kernel frameworks (CUDA, Triton). |
| Toolkit | Implements evaluation logic - correctness checking, performance measurement, profiling. |
| Workflow Controller | Orchestrates multi-step evaluation workflows (e.g., reference timing + kernel evaluation). |
| Task Manager | Redis-based queue management with priority scheduling and worker assignment. |
| GPU Worker | Manages task execution on a specific GPU with subprocess isolation for error recovery. |
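A minimal sketch of how the Backend and Toolkit abstractions might compose. Class and method names here are assumptions for illustration; consult the source for the actual interfaces.

```python
# Hypothetical interface sketch; not KernelGYM's actual class definitions.
from abc import ABC, abstractmethod
from typing import Any

class Backend(ABC):
    """Compiles, loads, and runs kernels for one framework (CUDA, Triton, ...)."""

    @abstractmethod
    def compile(self, source: str) -> Any:
        """Compile kernel source and return a runnable handle."""

    @abstractmethod
    def run(self, kernel: Any, *inputs: Any) -> Any:
        """Execute the compiled kernel on the given inputs."""

class Toolkit(ABC):
    """Evaluation logic: correctness checking and performance measurement."""

    @abstractmethod
    def evaluate(self, backend: Backend, source: str,
                 test_cases: list) -> dict:
        """Return e.g. {'correct': bool, 'runtime_ms': float} for one kernel."""
```

A Workflow Controller would then sequence calls across these interfaces (reference timing first, then candidate evaluation), which is what makes extension to non-kernel GPU tasks possible.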

Installation

Prerequisites

  • Python 3.10+
  • CUDA 11.8+ with compatible GPU
  • Redis server
  • Linux (Ubuntu 20.04+ recommended)

Setup

# Clone the repository
git clone https://github.com/hkust-nlp/KernelGYM.git
cd KernelGYM

# Run the setup script
bash setup.sh

The setup script will:

  1. Install Python dependencies from requirements.txt
  2. Install additional packages (pydantic-settings)
  3. Install system utilities (iproute2)
  4. Install Redis for local deployment

Configuration

You have two options:

  1. Recommended: let KernelGYM auto-generate .env on first startup.
  2. Manual: create .env yourself with explicit values.

Manual .env example:

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=  # Optional

# API Server Configuration
API_HOST=0.0.0.0
API_PORT=10907

# GPU Configuration
GPU_DEVICES=[0,1,2,3,4,5,6,7]  # List of GPU indices to use

# Optional: Node identification for multi-node setup
NODE_ID=node-1

Quick Start

Fastest Path (Single Node)

If your goal is to get a local KernelGYM service up quickly:

# 1) Install dependencies
bash setup.sh

# 2) Auto-configure .env (or skip; start script can do this automatically)
bash scripts/auto_configure.sh

# 3) Start API + workers
./start_all_with_monitor.sh

# 4) Verify service
curl http://localhost:10907/health
curl http://localhost:10907/workers/status
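From a training client, the same checks can be done programmatically. The endpoint paths come from the curl commands above; the response schema is an assumption, so the sketch only inspects the HTTP status.

```python
import json
import urllib.request

def service_healthy(base_url: str = "http://localhost:10907") -> bool:
    """Return True if the KernelGYM API server responds 200 on /health."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False   # connection refused, timeout, DNS failure, ...

if service_healthy():
    with urllib.request.urlopen("http://localhost:10907/workers/status") as resp:
        print(json.dumps(json.load(resp), indent=2))
```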

scripts/auto_configure.sh includes a practical example strategy for fast setup:

  • Detect host/IP and GPU list
  • Find available ports for Redis/API/Metrics from candidate port pools

Treat this as a reference implementation, not a fixed policy. If your machine or cluster has different networking constraints, edit the port-selection logic in scripts/auto_configure.sh (for example, custom port ranges, reserved ports, or service-specific pinning) to match your environment.

