# TritonForge

🔥 Forging Optimal GPU Kernels through SFT + RL

Transform PyTorch Operations into Optimized GPU Kernels with LLMs

📖 Documentation | 🏗️ Architecture | 🚀 Quick Start | 📊 Results | 🗺️ Roadmap | 🤝 Contributing
</div>

## 🚀 Highlights

<div align="center">

| Feature | Description |
|---------|-------------|
| 🎯 Two-Stage Training | SFT on high-quality datasets followed by RL optimization |
| 🔄 Multi-Turn Refinement | Iterative kernel improvement through compilation feedback |
| ⚡ Cross-Platform | Support for both NVIDIA CUDA and AMD ROCm GPUs |
| 📊 Performance Metrics | Comprehensive evaluation of correctness and speedup |
| 🧪 200+ Benchmarks | Extensive test suite across multiple difficulty levels |

</div>

## 📰 News
- [2025/10/09] 🎤 We gave a guest talk about TritonForge for Li Lab @ CMU! Slides are available here if you're interested.
- [2025/09/29] 📝 We released the TritonForge Tech Blog in both English and Chinese! English version | Chinese version
## 🎯 Overview
TritonForge is an advanced machine learning framework that trains Large Language Models (LLMs) to automatically convert PyTorch operations into optimized Triton GPU kernels. By combining supervised fine-tuning (SFT) with reinforcement learning (RL), TritonForge achieves state-of-the-art performance in automated kernel generation.
🏗️ Architecture Deep Dive: For a comprehensive understanding of our server-based SFT + RL framework, evaluation infrastructure, and cross-platform support, see our Architecture Documentation.
## 🌐 Fully Open-Source Initiative
We believe in complete transparency and community collaboration. Everything is open-source:
- 📊 Training Data: Custom-curated datasets (GPUMODE/KernelBook)
- 🤗 Model Checkpoints: All intermediate and final models (HuggingFace)
- 🏗️ Training Framework: Complete SLIME RL implementation (fixed version with improvements)
- 🐳 Environment Setup: Docker images and configurations for both NVIDIA and AMD
- 📝 Training Recipes: Detailed scripts and hyperparameters for reproduction
We invite the community to join us in advancing automated kernel generation together!
<div align="center"> <table> <tr> <td align="center" width="50%">🔧 SLIME
Reinforcement Learning Framework
Note: This is a fixed and improved version of the original SLIME framework. We believe in being honest and transparent - this is essentially SLIME with bug fixes and optimizations that enable multi-turn iterative kernel improvement through compilation feedback and performance metrics.
</td> <td align="center" width="50%">📊 KBenchEval
Comprehensive Benchmark Suite
Based on ScalingIntelligence/KernelBench, evaluating GPU kernel generation quality and performance across 200+ problems with varying difficulty levels
</td> </tr> </table> </div>

## 🚀 Quick Start

### Prerequisites

<div align="center">

| Requirement | NVIDIA | AMD |
|-------------|--------|-----|
| Verified GPU | H100 | MI300X |
| Memory | 80GB | 192GB |
| Docker | ✅ Required | ✅ Required |
| Python | 3.10+ | 3.10+ |
| CUDA/ROCm | 12.6.1 | 6.3.4 |

</div>

### Installation
Choose your platform and follow the setup guide:
<div align="center"><img src="https://img.shields.io/badge/NVIDIA-Setup-76B900?style=for-the-badge&logo=nvidia&logoColor=white" height="40">&nbsp;&nbsp;&nbsp;&nbsp;<img src="https://img.shields.io/badge/AMD-Setup-ED1C24?style=for-the-badge&logo=amd&logoColor=white" height="40">
</div>

<details id="nvidia-setup">
<summary><b>🟢 NVIDIA Setup</b></summary>

1. Launch Docker Container
```bash
docker pull zhuzilin/slime:20250706-v2

docker run --rm --gpus all --ipc=host --shm-size=128g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $HOME:$HOME \
  -it zhuzilin/slime:20250706-v2 /bin/bash
```

2. Clone Repository

```bash
git clone https://github.com/RLsys-Foundation/TritonForge.git
cd TritonForge
```
3. Setup KBenchEval
```bash
cd KBenchEval

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
deactivate
```
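To confirm the environment was created correctly before moving on, a quick sanity check can help (a sketch of our own, run from the `KBenchEval` directory; the message text is not from the upstream docs):

```shell
# Verify the KBenchEval virtualenv exists and is runnable
if [ -x .venv/bin/python ]; then
  .venv/bin/python --version
else
  echo "venv missing: run 'python -m venv .venv' first"
fi
```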
4. Setup SLIME
```bash
cd ../SLIME
pip install -e .
```
5. Download Models
```bash
# Create the models directory
mkdir -p models

# Fine-tuned Qwen3-8B model, Hugging Face format (for evaluation)
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-HF --local-dir models/Qwen3-8B-Kernelbook-SFT-HF

# Fine-tuned Qwen3-8B model, Megatron format (for continued training)
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-filtered --local-dir models/Qwen3-8B-Kernelbook-SFT-filtered

# Base Qwen3-8B model (Hugging Face format)
huggingface-cli download Qwen/Qwen3-8B --local-dir models/Qwen3-8B

# Base Qwen3-8B model (Megatron format)
huggingface-cli download zyzshishui0627/Qwen3-8B_torch_dist --local-dir models/Qwen3-8B_torch_dist
```
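The four downloads can also be scripted in one loop. This sketch (our own, not part of the repo) only prints one `huggingface-cli` command per checkpoint, deriving each local dir from the repo basename; pipe the output to `sh` to actually run the downloads:

```shell
# Emit a download command per checkpoint; local dir = repo basename under models/
for repo in \
  JinnP/Qwen3-8B-Kernelbook-SFT-HF \
  JinnP/Qwen3-8B-Kernelbook-SFT-filtered \
  Qwen/Qwen3-8B \
  zyzshishui0627/Qwen3-8B_torch_dist
do
  echo huggingface-cli download "$repo" --local-dir "models/${repo##*/}"
done
```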
</details>
<details id="amd-setup">
<summary><b>🔴 AMD Setup</b></summary>
1. Launch Docker Container
```bash
docker pull rlsys/tritonforge:stable

docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME/.ssh:/root/.ssh:ro" \
  -v "$HOME:$HOME" \
  -e HF_HOME="$HOME/.cache/huggingface" \
  -e TRANSFORMERS_CACHE="$HOME/.cache/huggingface" \
  -e XDG_CACHE_HOME="$HOME/.cache" \
  -w "$PWD" \
  -p 127.0.0.1:18265:8265 \
  --name tritonforge_dev \
  rlsys/tritonforge:stable \
  /bin/bash
```
2. Clone Repository
```bash
git clone https://github.com/RLsys-Foundation/TritonForge.git
cd TritonForge
```
3. Setup SLIME
```bash
# From the TritonForge repo root
cd SLIME
pip install -e .
```
4. Set AMD Environment Variables
```bash
# Set AMD environment variables
# gfx942 targets the MI300X
export ROCM_HOME=/opt/rocm
export HIP_PLATFORM=amd
export PYTORCH_ROCM_ARCH=gfx942
export PATH=$ROCM_HOME/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_HOME/lib:$LD_LIBRARY_PATH
export SGLANG_API_KEY=local-key
export PYTHONPATH=/workspace/KernelBench:$PYTHONPATH

# AMD optimizations
export HSA_ENABLE_SDMA=0

# Prevent GPU core dumps
export HSA_ENABLE_COREDUMP=0
export AMD_LOG_LEVEL=0
export ROCM_DISABLE_CRASH_DUMP=1
export HIP_ENABLE_COREDUMP=0
export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2:0
export GPU_MAX_HW_QUEUES=1
```
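Since these exports only apply to the current shell, one option is to write them to a small env file once and `source` it in every new session. A minimal sketch (the file path is our own choice, trimmed to a few representative variables):

```shell
# Persist the ROCm exports so any new shell can source them
cat > /tmp/tritonforge_amd_env.sh <<'EOF'
export ROCM_HOME=/opt/rocm
export HIP_PLATFORM=amd
export PYTORCH_ROCM_ARCH=gfx942
export HSA_ENABLE_SDMA=0
export GPU_MAX_HW_QUEUES=1
EOF

. /tmp/tritonforge_amd_env.sh
echo "arch: $PYTORCH_ROCM_ARCH"
```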
5. Set up KBenchEval for MI300X
```bash
cd KBenchEval

# Install packages missing from the image
pip install pydra_config==0.0.15
# The code imports the package as "pydra", so alias the installed name
cd /usr/local/lib/python3.12/dist-packages && ln -sf pydra_config pydra
pip install together
pip install google-generativeai

# No virtual environment here: we use the Docker image's system Python
cd /root/TritonForge/KBenchEval
pip install -e .
```
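The hardcoded `python3.12` path above can break on images with a different Python. This variant (a defensive sketch of our own, not from the upstream docs) probes the actual site-packages location via `sysconfig` and only creates the `pydra` alias when it is missing:

```shell
# Locate site-packages instead of hardcoding the Python version
SITE=$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')
if [ -d "$SITE/pydra_config" ] && [ ! -e "$SITE/pydra" ]; then
  ln -sf "$SITE/pydra_config" "$SITE/pydra"
fi
echo "site-packages: $SITE"
```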
6. Download Models
```bash
# Download the same models as in the NVIDIA setup
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-HF --local-dir /root/Qwen3-8B-Kernelbook-SFT-HF
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-filtered --local-dir /root/Qwen3-8B-Kernelbook-SFT-filtered
huggingface-cli download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
huggingface-cli download zyzshishui0627/Qwen3-8B_torch_dist --local-dir /root/Qwen3-8B_torch_dist
```
</details>
## 📈 Training Pipeline

<div align="center">
<img src="SLIME/imgs/tf_training_pipeline.png" alt="TritonForge Training Pipeline" width="100%">
</div>

📖 Detailed Architecture: See our comprehensive Architecture Documentation for the complete server-based SFT + RL framework design.

### Stage 1: Supervised Fine-Tuning (SFT)
We leverage the same SLIME framework for both SFT and RL stages, providing a unified training pipeline. The SFT stage fine-tunes the base Qwen3-8B model using:
- GPUMODE/KernelBook: 18.2k curated PyTorch-to-Triton code pairs (filtered to ~17k)
- Custom data augmentations: Multi-turn conversations, thinking tags, and length filtering
Training Configuration (`SLIME/scripts/run-qwen3-8B-kernelbook-sft.sh`):
| Parameter | Value | Purpose |
|-----------|-------|---------|
| Tensor Parallel (TP) | 2 | Splits model across 2 GPUs for memory efficiency |
| Context Parallel (CP) | 4 | Handles long sequences by splitting context |
| Pipeline Parallel (PP) | 1 | No pipeline parallelism |
| Data Parallel (DP) | 1 | Single data parallel replica |
| Batch Size
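As a quick cross-check of the configuration above, the per-run GPU count follows from multiplying the parallelism degrees, assuming all four dimensions multiply into the world size as in Megatron-style launchers: TP × CP × PP × DP = 2 × 4 × 1 × 1 = 8 GPUs.

```shell
# GPUs implied by the parallelism config: TP * CP * PP * DP
TP=2; CP=4; PP=1; DP=1
echo "world size: $((TP * CP * PP * DP))"   # world size: 8
```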