
<div align="center"> <img src="docs/assets/TritonForge_logo.png" alt="TritonForge Logo" width="400"/>

TritonForge

πŸ”₯ Forging Optimal GPU Kernels through SFT + RL

License Python CUDA ROCm Ask DeepWiki

Transform PyTorch Operations into Optimized GPU Kernels with LLMs

πŸ“š Documentation | πŸ—οΈ Architecture | πŸš€ Quick Start | πŸ“Š Results | πŸ—ΊοΈ Roadmap | 🀝 Contributing

</div>

🌟 Highlights

<div align="center">

| Feature | Description |
|---------|-------------|
| πŸŽ“ Two-Stage Training | SFT on high-quality datasets followed by RL optimization |
| πŸ”„ Multi-Turn Refinement | Iterative kernel improvement through compilation feedback |
| ⚑ Cross-Platform | Support for both NVIDIA CUDA and AMD ROCm GPUs |
| πŸ“ˆ Performance Metrics | Comprehensive evaluation of correctness and speedup |
| πŸ§ͺ 200+ Benchmarks | Extensive test suite across multiple difficulty levels |

</div>
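The correctness-plus-speedup scoring mentioned above can be sketched as a small harness. This is an illustrative stand-in, not TritonForge's actual evaluator: `run_reference`, `run_candidate`, the tolerance, and the trial count are all hypothetical.

```python
import time

def evaluate_kernel(run_reference, run_candidate, inputs, atol=1e-4, trials=10):
    """Score a candidate kernel on correctness and speedup (illustrative sketch).

    run_reference / run_candidate are callables returning a list of floats.
    """
    ref_out = run_reference(inputs)
    cand_out = run_candidate(inputs)
    # Correctness: elementwise comparison against the reference within a tolerance.
    correct = all(abs(a - b) <= atol for a, b in zip(ref_out, cand_out))
    if not correct:
        return {"correct": False, "speedup": 0.0}

    def avg_time(fn):
        start = time.perf_counter()
        for _ in range(trials):
            fn(inputs)
        return (time.perf_counter() - start) / trials

    # Speedup: reference latency divided by candidate latency.
    return {"correct": True, "speedup": avg_time(run_reference) / avg_time(run_candidate)}
```

A real evaluator would additionally compare GPU tensors across randomized inputs and use CUDA/HIP events for timing; the shape of the result dict here is only a sketch.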

πŸ“° News

🎯 Overview

TritonForge is an advanced machine learning framework that trains Large Language Models (LLMs) to automatically convert PyTorch operations into optimized Triton GPU kernels. By combining supervised fine-tuning (SFT) with reinforcement learning (RL), TritonForge achieves state-of-the-art performance in automated kernel generation.
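Concretely, each training example pairs a PyTorch reference implementation with a Triton target kernel. The snippet below is a hand-written illustration of that shape, not an actual KernelBook record, and the prompt wording is an assumption.

```python
# Illustrative PyTorch-to-Triton pair (hypothetical, not taken from KernelBook).
pytorch_source = """
import torch

class Model(torch.nn.Module):
    def forward(self, x, y):
        return x + y
"""

triton_target = """
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)
"""

def build_prompt(src: str) -> str:
    # The model is asked to rewrite the PyTorch module as a Triton kernel.
    return f"Convert the following PyTorch module into an optimized Triton kernel:\n{src}"
```

The sources are carried as strings here because during training the model only sees and emits text; compilation and execution happen later in the evaluation harness.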

πŸ—οΈ Architecture Deep Dive: For a comprehensive understanding of our server-based SFT + RL framework, evaluation infrastructure, and cross-platform support, see our Architecture Documentation.

🌍 Fully Open-Source Initiative

We believe in complete transparency and community collaboration. Everything is open-source:

  • πŸ“š Training Data: Custom-curated datasets (GPUMODE/KernelBook)
  • πŸ€– Model Checkpoints: All intermediate and final models (HuggingFace)
  • πŸ—οΈ Training Framework: Complete SLIME RL implementation (fixed version with improvements)
  • 🐳 Environment Setup: Docker images and configurations for both NVIDIA and AMD
  • πŸ“– Training Recipes: Detailed scripts and hyperparameters for reproduction

We invite the community to join us in advancing automated kernel generation together!

<div align="center"> <table> <tr> <td align="center" width="50%">

🧠 SLIME

Reinforcement Learning Framework

Note: This is a fixed and improved version of the original SLIME framework. We believe in being honest and transparent: this is essentially SLIME with bug fixes and optimizations that enable multi-turn, iterative kernel improvement driven by compilation feedback and performance metrics.

Learn More β†’

</td> <td align="center" width="50%">

πŸ“Š KBenchEval

Comprehensive Benchmark Suite

Based on ScalingIntelligence/KernelBench, evaluating GPU kernel generation quality and performance across 200+ problems with varying difficulty levels

Learn More β†’

</td> </tr> </table> </div>
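The multi-turn refinement that SLIME enables can be sketched as a generate-compile-repair loop. In this sketch, `generate` and `compile_and_test` are hypothetical stand-ins for the LLM call and the KBenchEval harness, not real TritonForge APIs.

```python
def refine_kernel(prompt, generate, compile_and_test, max_turns=4):
    """Iteratively repair a generated kernel using compiler/test feedback.

    generate(history) -> candidate kernel source (stand-in for the LLM).
    compile_and_test(src) -> (ok: bool, feedback: str) (stand-in for the harness).
    """
    history = [prompt]
    for turn in range(max_turns):
        candidate = generate(history)
        ok, feedback = compile_and_test(candidate)
        if ok:
            return candidate, turn + 1
        # Feed the compiler/test error back so the next attempt can fix it.
        history.append(f"Attempt failed:\n{feedback}\nPlease fix the kernel.")
    return None, max_turns
```

During RL training, the same loop structure also produces the reward signal: a kernel that compiles, passes correctness checks, and runs faster earns a higher return.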

πŸš€ Quick Start

Prerequisites

<div align="center">

| Requirement | NVIDIA | AMD |
|------------|--------|-----|
| Verified GPU | H100 | MI300X |
| Memory | 80GB | 192GB |
| Docker | βœ… Required | βœ… Required |
| Python | 3.10+ | 3.10+ |
| CUDA/ROCm | 12.6.1 | 6.3.4 |

</div>

Installation

Choose your platform and follow the setup guide:

<div align="center">

<img src="https://img.shields.io/badge/NVIDIA-Setup-76B900?style=for-the-badge&logo=nvidia&logoColor=white" height="40"> Β Β Β Β  <img src="https://img.shields.io/badge/AMD-Setup-ED1C24?style=for-the-badge&logo=amd&logoColor=white" height="40">

</div> <details id="nvidia-setup"> <summary><b>πŸ“— NVIDIA Setup</b></summary>

1. Launch Docker Container

docker pull zhuzilin/slime:20250706-v2

docker run --rm --gpus all --ipc=host --shm-size=128g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v $HOME:$HOME \
  -it zhuzilin/slime:20250706-v2 /bin/bash

2. Clone Repository

git clone https://github.com/RLsys-Foundation/TritonForge.git
cd TritonForge

3. Setup KBenchEval

cd KBenchEval

# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

pip install -e .

deactivate

4. Setup SLIME

cd ../SLIME
pip install -e .

5. Download Models

# Create models directory
mkdir -p models

# Hugging Face format of fine-tuned Qwen3-8B model (for evaluation)
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-HF --local-dir models/Qwen3-8B-Kernelbook-SFT-HF

# Megatron format of fine-tuned Qwen3-8B model (for continued training)
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-filtered --local-dir models/Qwen3-8B-Kernelbook-SFT-filtered

# Base Qwen3-8B model (HuggingFace format)
huggingface-cli download Qwen/Qwen3-8B --local-dir models/Qwen3-8B

# Base Qwen3-8B model (Megatron format)
huggingface-cli download zyzshishui0627/Qwen3-8B_torch_dist --local-dir models/Qwen3-8B_torch_dist
</details> <details id="amd-setup"> <summary><b>πŸ“• AMD Setup</b></summary>

1. Launch Docker Container

docker pull rlsys/tritonforge:stable

docker run -it \
  --device /dev/dri \
  --device /dev/kfd \
  --group-add video \
  --cap-add SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --shm-size 128G \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v "$HOME/.ssh:/root/.ssh:ro" \
  -v "$HOME:$HOME" \
  -e HF_HOME="$HOME/.cache/huggingface" \
  -e TRANSFORMERS_CACHE="$HOME/.cache/huggingface" \
  -e XDG_CACHE_HOME="$HOME/.cache" \
  -w "$PWD" \
  -p 127.0.0.1:18265:8265 \
  --name tritonforge_dev \
  rlsys/tritonforge:stable \
  /bin/bash

2. Clone Repository

git clone https://github.com/RLsys-Foundation/TritonForge.git
cd TritonForge

3. Setup SLIME

cd SLIME
pip install -e .

4. Set AMD Environment Variables

# Set AMD environment variables
# gfx942 is the GPU architecture of the MI300X
export ROCM_HOME=/opt/rocm
export HIP_PLATFORM=amd
export PYTORCH_ROCM_ARCH=gfx942
export PATH=$ROCM_HOME/bin:$PATH
export LD_LIBRARY_PATH=$ROCM_HOME/lib:$LD_LIBRARY_PATH
export SGLANG_API_KEY=local-key
export PYTHONPATH=/workspace/KernelBench:$PYTHONPATH

# AMD optimizations
export HSA_ENABLE_SDMA=0

# Prevent GPU core dumps
export HSA_ENABLE_COREDUMP=0
export AMD_LOG_LEVEL=0
export ROCM_DISABLE_CRASH_DUMP=1
export HIP_ENABLE_COREDUMP=0
export HSA_TOOLS_LIB=/opt/rocm/lib/librocm-debug-agent.so.2:0
export GPU_MAX_HW_QUEUES=1

5. Set up KBenchEval for MI300X

cd KBenchEval

# Install missing packages
pip install pydra_config==0.0.15
# pydra_config installs under its own package name; symlink it so `import pydra` works
cd /usr/local/lib/python3.12/dist-packages && ln -sf pydra_config pydra
pip install together
pip install google-generativeai

# No virtual environment here; we use the Docker image's Python directly
# Install dependencies
cd /root/TritonForge/KBenchEval
pip install -e .

6. Download Models

# Download the same models as NVIDIA setup
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-HF --local-dir /root/Qwen3-8B-Kernelbook-SFT-HF
huggingface-cli download JinnP/Qwen3-8B-Kernelbook-SFT-filtered --local-dir /root/Qwen3-8B-Kernelbook-SFT-filtered
huggingface-cli download Qwen/Qwen3-8B --local-dir /root/Qwen3-8B
huggingface-cli download zyzshishui0627/Qwen3-8B_torch_dist --local-dir /root/Qwen3-8B_torch_dist
</details>

πŸŽ“ Training Pipeline

<div align="center"> <img src="SLIME/imgs/tf_training_pipeline.png" alt="TritonForge Training Pipeline" width="100%"> </div>

πŸ“– Detailed Architecture: See our comprehensive Architecture Documentation for the complete server-based SFT + RL framework design.

Stage 1: Supervised Fine-Tuning (SFT)

We leverage the same SLIME framework for both SFT and RL stages, providing a unified training pipeline. The SFT stage fine-tunes the base Qwen3-8B model using:

  • GPUMODE/KernelBook: 18.2k curated PyTorch-to-Triton code pairs (filtered to ~17k)
  • Custom data augmentations: Multi-turn conversations, thinking tags, and length filtering

Training Configuration (SLIME/scripts/run-qwen3-8B-kernelbook-sft.sh):

<div align="center">

| Parameter | Value | Purpose |
|-----------|-------|---------|
| Tensor Parallel (TP) | 2 | Splits model across 2 GPUs for memory efficiency |
| Context Parallel (CP) | 4 | Handles long sequences by splitting context |
| Pipeline Parallel (PP) | 1 | No pipeline parallelism |
| Data Parallel (DP) | 1 | Single data parallel replica |
| Batch Size
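Assuming the Megatron-style convention that world size is the product of the parallelism degrees, these factors imply the GPU count for one training replica; a quick check:

```python
# Parallelism degrees from the configuration table above.
tp, cp, pp, dp = 2, 4, 1, 1

# Megatron-style layout: each model replica spans TP x CP x PP GPUs,
# and DP replicates that group.
gpus_per_replica = tp * cp * pp
world_size = gpus_per_replica * dp
print(world_size)  # 8 GPUs for this layout
```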

No findings