# TritonForge

🔥 Forging Optimal GPU Kernels through SFT + RL

Transform PyTorch Operations into Optimized GPU Kernels with LLMs

📖 Documentation | 🏗️ Architecture | 🚀 Quick Start | 📊 Results | 🗺️ Roadmap | 🤝 Contributing
</div>

## 🚀 Highlights

<div align="center">

| Feature | Description |
|---------|-------------|
| 🎯 Two-Stage Training | SFT on high-quality datasets followed by RL optimization |
| 🔄 Multi-Turn Refinement | Iterative kernel improvement through compilation feedback |
| ⚡ Cross-Platform | Support for both NVIDIA CUDA and AMD ROCm GPUs |
| 📊 Performance Metrics | Comprehensive evaluation of correctness and speedup |
| 🧪 200+ Benchmarks | Extensive test suite across multiple difficulty levels |

</div>

## 📰 News
- [2025/10/09] 🎤 We gave a guest talk about TritonForge for Li Lab @ CMU! Slides are available here if you're interested.
- [2025/09/29] 📝 We released the TritonForge Tech Blog in both English and Chinese! English version | Chinese version
## 🎯 Overview
TritonForge is an advanced machine learning framework that trains Large Language Models (LLMs) to automatically convert PyTorch operations into optimized Triton GPU kernels. By combining supervised fine-tuning (SFT) with reinforcement learning (RL), TritonForge achieves state-of-the-art performance in automated kernel generation.
🏗️ Architecture Deep Dive: For a comprehensive understanding of our server-based SFT + RL framework, evaluation infrastructure, and cross-platform support, see our Architecture Documentation.
## 🌐 Fully Open-Source Initiative
We believe in complete transparency and community collaboration. Everything is open-source:
- 📊 Training Data: Custom-curated datasets (GPUMODE/KernelBook)
- 🤗 Model Checkpoints: All intermediate and final models (HuggingFace)
- 🏗️ Training Framework: Complete SLIME RL implementation (fixed version with improvements)
- 🐳 Environment Setup: Docker images and configurations for both NVIDIA and AMD
- 📝 Training Recipes: Detailed scripts and hyperparameters for reproduction
We invite the community to join us in advancing automated kernel generation together!
<div align="center"> <table> <tr> <td align="center" width="50%">🔧 SLIME
Reinforcement Learning Framework
Note: This is a fixed and improved version of the original SLIME framework. We believe in being honest and transparent - this is essentially SLIME with bug fixes and optimizations that enable multi-turn iterative kernel improvement through compilation feedback and performance metrics.
</td> <td align="center" width="50%">📊 KBenchEval
Comprehensive Benchmark Suite
Based on ScalingIntelligence/KernelBench, evaluating GPU kernel generation quality and performance across 200+ problems with varying difficulty levels
</td> </tr> </table> </div>

## 🚀 Quick Start

### Prerequisites

<div align="center">

| Requirement | NVIDIA | AMD |
|-------------|--------|-----|
| Verified GPU | H100 | MI300X |
| Memory | 80GB | 192GB |
| Docker | ✅ Required | ✅ Required |
| Python | 3.10+ | 3.10+ |
| CUDA/ROCm | 12.6.1 | 6.3.4 |

</div>

### Installation
Choose your platform and follow the setup guide:
<div align="center"><img src="https://img.shields.io/badge/NVIDIA-Setup-76B900?style=for-the-badge&logo=nvidia&logoColor=white" height="40">&nbsp;&nbsp;&nbsp;&nbsp;<img src="https://img.shields.io/badge/AMD-Setup-ED1C24?style=for-the-badge&logo=amd&logoColor=white" height="40">
</div>

<details id="nvidia-setup">
<summary><b>🟢 NVIDIA Setup</b></summary>

1. Launch Docker Container
```bash
docker pull zhuzilin/slime:20250706-v2

docker run --rm --gpus all --ipc=host --shm-size=128g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $HOME:$HOME \
  -it zhuzilin/slime:20250706-v2 /bin/bash
```

2. Clone Repository

```bash
git clone https://github.com/RLsys-Foundation/TritonForge.git
cd TritonForge
```
3. Setup KBenchEval
```bash
cd KBenchEval

# Create a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
pip install -e .
deactivate
```
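To confirm the environment was created correctly before moving on, a quick sanity check can help (a sketch of our own, run from the `KBenchEval` directory; the message text is not from the upstream docs):

```shell
# Verify the KBenchEval virtualenv exists and is runnable
if [ -x .venv/bin/python ]; then
  .venv/bin/python --version
else
  echo "venv missing: run 'python -m venv .venv' first"
fi
```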
4. Setup SLIME
```bash
cd ../SLIME
pip install -e .
```
5. Download Models
```bash
# Create the models directory
mkdir -p models

# Fine-tuned Qwen3-8B model, Hugging Face format (for evaluation)
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-HF --local-dir models/Qwen3-8B-Kernelbook-SFT-HF

# Fine-tuned Qwen3-8B model, Megatron format (for continued training)
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-filtered --local-dir models/Qwen3-8B-Kernelbook-SFT-filtered

# Base Qwen3-8B model (Hugging Face format)
huggingface-cli download Qwen/Qwen3-8B --local-dir models/Qwen3-8B

# Base Qwen3-8B model (Megatron format)
huggingface-cli download zyzshishui0627/Qwen3-8B_torch_dist --local-dir models/Qwen3-8B_torch_dist
```
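The four downloads can also be scripted in one loop. This sketch (our own, not part of the repo) only prints one `huggingface-cli` command per checkpoint, deriving each local dir from the repo basename; pipe the output to `sh` to actually run the downloads:

```shell
# Emit a download command per checkpoint; local dir = repo basename under models/
for repo in \
  JinnP/Qwen3-8B-Kernelbook-SFT-HF \
  JinnP/Qwen3-8B-Kernelbook-SFT-filtered \
  Qwen/Qwen3-8B \
  zyzshishui0627/Qwen3-8B_torch_dist
do
  echo huggingface-cli download "$repo" --local-dir "models/${repo##*/}"
done
```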
</details>
<details id="amd-setup">
<summary><b>🔴 AMD Setup</b></summary>
1. Launch Docker Container
```bash
docker pull rlsys/tritonforge:stable

docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME/.ssh:/root/.ssh:ro" \
  -v "$HOME:$HOME" \
  -e HF_HOME="$HOME/.cache/huggingface" \
  -e TRANSFORMERS_CACHE="$HOME/.cache/huggingface" \
  -e XDG_CACHE_HOME="$HOME/.cache" \
  -w "$PWD" \
  -p 127.0.0.1:18265:8265 \
  --name tritonforge_dev \
  rlsys/tritonforge:stable \
  /bin/bash
```
2. Clone Repository
```bash
git clone https://github.com/RLsys-Foundation/TritonForge.git
cd TritonForge
```
3. Setup SLIME
```bash
# From the TritonForge repo root
cd SLIME
pip install -e .
```
4. Set AMD Environment Variables
```bash
# Set AMD environment variables
# gfx942 targets the MI300X
export ROCM_HOME=/opt/rocm
export HIP_PLATFORM=amd
export PYTORCH_ROCM_ARCH=gfx942
export PATH=$ROCM_HOME/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_HOME/lib:$LD_LIBRARY_PATH
export SGLANG_API_KEY=local-key
export PYTHONPATH=/workspace/KernelBench:$PYTHONPATH

# AMD optimizations
export HSA_ENABLE_SDMA=0

# Prevent GPU core dumps
export HSA_ENABLE_COREDUMP=0
export AMD_LOG_LEVEL=0
export ROCM_DISABLE_CRASH_DUMP=1
export HIP_ENABLE_COREDUMP=0
export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2:0
export GPU_MAX_HW_QUEUES=1
```
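Since these exports only apply to the current shell, one option is to write them to a small env file once and `source` it in every new session. A minimal sketch (the file path is our own choice, trimmed to a few representative variables):

```shell
# Persist the ROCm exports so any new shell can source them
cat > /tmp/tritonforge_amd_env.sh <<'EOF'
export ROCM_HOME=/opt/rocm
export HIP_PLATFORM=amd
export PYTORCH_ROCM_ARCH=gfx942
export HSA_ENABLE_SDMA=0
export GPU_MAX_HW_QUEUES=1
EOF

. /tmp/tritonforge_amd_env.sh
echo "arch: $PYTORCH_ROCM_ARCH"
```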
5. Set up KBenchEval for MI300X
```bash
cd KBenchEval

# Install packages missing from the image
pip install pydra_config==0.0.15
# The code imports the package as "pydra", so alias the installed name
cd /usr/local/lib/python3.12/dist-packages && ln -sf pydra_config pydra
pip install together
pip install google-generativeai

# No virtual environment here: we use the Docker image's system Python
cd /root/TritonForge/KBenchEval
pip install -e .
```
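The hardcoded `python3.12` path above can break on images with a different Python. This variant (a defensive sketch of our own, not from the upstream docs) probes the actual site-packages location via `sysconfig` and only creates the `pydra` alias when it is missing:

```shell
# Locate site-packages instead of hardcoding the Python version
SITE=$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')
if [ -d "$SITE/pydra_config" ] && [ ! -e "$SITE/pydra" ]; then
  ln -sf "$SITE/pydra_config" "$SITE/pydra"
fi
echo "site-packages: $SITE"
```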
6. Download Models
```bash
# Download the same models as in the NVIDIA setup
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-HF --local-dir /root/Qwen3-8B-Kernelbook-SFT-HF
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-filtered --local-dir /root/Qwen3-8B-Kernelbook-SFT-filtered
huggingface-cli download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
huggingface-cli download zyzshishui0627/Qwen3-8B_torch_dist --local-dir /root/Qwen3-8B_torch_dist
```
</details>
## 📈 Training Pipeline

<div align="center">
<img src="SLIME/imgs/tf_training_pipeline.png" alt="TritonForge Training Pipeline" width="100%">
</div>

📖 Detailed Architecture: See our comprehensive Architecture Documentation for the complete server-based SFT + RL framework design.

### Stage 1: Supervised Fine-Tuning (SFT)
We leverage the same SLIME framework for both SFT and RL stages, providing a unified training pipeline. The SFT stage fine-tunes the base Qwen3-8B model using:
- GPUMODE/KernelBook: 18.2k curated PyTorch-to-Triton code pairs (filtered to ~17k)
- Custom data augmentations: Multi-turn conversations, thinking tags, and length filtering
Training Configuration (`SLIME/scripts/run-qwen3-8B-kernelbook-sft.sh`):
| Parameter | Value | Purpose |
|-----------|-------|---------|
| Tensor Parallel (TP) | 2 | Splits model across 2 GPUs for memory efficiency |
| Context Parallel (CP) | 4 | Handles long sequences by splitting context |
| Pipeline Parallel (PP) | 1 | No pipeline parallelism |
| Data Parallel (DP) | 1 | Single data parallel replica |
| Batch Size
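As a quick cross-check of the configuration above, the per-run GPU count follows from multiplying the parallelism degrees, assuming all four dimensions multiply into the world size as in Megatron-style launchers: TP × CP × PP × DP = 2 × 4 × 1 × 1 = 8 GPUs.

```shell
# GPUs implied by the parallelism config: TP * CP * PP * DP
TP=2; CP=4; PP=1; DP=1
echo "world size: $((TP * CP * PP * DP))"   # world size: 8
```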