AutoKernel
Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton or CUDA C++ kernels.

Inspired by @karpathy/autoresearch, which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: the agent modifies one file, runs a fixed evaluation, keeps or reverts the change, and repeats indefinitely.
How It Works
Give AutoKernel any PyTorch model. It will:
- Profile the model to find which GPU kernels are bottlenecks
- Extract each bottleneck as a standalone Triton or CUDA C++ kernel
- Optimize each kernel autonomously (edit, benchmark, keep/revert -- forever)
- Verify end-to-end correctness and report the total speedup
The agent reads program.md -- the "research org code" -- which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl's law.
Each experiment takes ~90 seconds. That's ~40 experiments/hour, or ~320 over an 8-hour night, across all kernels.
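The edit, benchmark, keep/revert cycle described above can be sketched as a greedy loop. This is a minimal illustration, not the code in program.md or bench.py: the `keep_or_revert` function, the `BenchResult` tuple, and the lookup table that stands in for a real GPU benchmark are all hypothetical, and the 8-hour night is an assumed duration.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical result type: (passed_correctness, latency_in_ms).
BenchResult = Tuple[bool, float]


def keep_or_revert(
    baseline: str,
    candidates: Iterable[str],
    bench: Callable[[str], BenchResult],
) -> Tuple[str, float, List[tuple]]:
    """Greedy loop: accept a candidate only if it is correct AND faster."""
    ok, best_ms = bench(baseline)
    if not ok:
        raise ValueError("baseline kernel fails the correctness check")
    best, log = baseline, []
    for cand in candidates:
        ok, ms = bench(cand)
        keep = ok and ms < best_ms  # correctness gates the speed comparison
        log.append((cand, ok, ms, "keep" if keep else "revert"))
        if keep:
            best, best_ms = cand, ms
    return best, best_ms, log


# Toy stand-in for a benchmark run: a lookup table instead of a GPU.
FAKE_RESULTS = {
    "v0": (True, 10.0),
    "v1": (True, 8.0),   # correct and faster -> keep
    "v2": (False, 2.0),  # fastest but wrong -> revert
    "v3": (True, 9.0),   # correct but slower than v1 -> revert
}

best, best_ms, log = keep_or_revert(
    "v0", ["v1", "v2", "v3"], FAKE_RESULTS.__getitem__
)
print(best, best_ms)
for row in log:
    print(row)

# Throughput arithmetic from the text: ~90 s per experiment.
per_hour = 3600 // 90
print(per_hour, per_hour * 8)  # 8 hours is an assumed "overnight" length
```

Note that `v2` is rejected even though it has the lowest latency, which is the behavior the correctness checks are meant to enforce.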
Quick Start
Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.
# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync
# One-time setup: test data + baselines
uv run prepare.py
# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
--input-shape 1,512 --dtype float16
# Extract top bottleneck kernels
uv run extract.py --top 5
# Verify benchmark works
uv run bench.py
Running the Agent
Spin up Claude, Codex, or any coding agent in this directory:
Read program.md and let's kick off a new experiment. Start with setup.
The agent will:
- Profile your model and present the optimization plan
- Create a branch (e.g., autokernel/mar10-llama7b)
- Optimize each bottleneck kernel in priority order
- Verify end-to-end correctness and report total speedup
program.md is intentionally comprehensive so the agent can run 10+ hours without getting stuck. It includes a 6-tier optimization playbook, decision framework, crash handling, and Amdahl's law reasoning.
The Pipeline
                profile.py       extract.py              bench.py (loop)     verify.py
Any PyTorch ──> Rank kernels ──> Generate baseline   ──> Optimize each   ──> End-to-end
model           by GPU time      Triton/CUDA kernels     kernel (agent)      verification
| Tool | What it does |
|------|-------------|
| profile.py | Profiles any PyTorch model with torch.profiler, ranks kernels by GPU time, classifies as compute/memory-bound |
| extract.py | Extracts top-N bottleneck kernels into standalone Triton or CUDA C++ kernel files (--backend triton\|cuda) |
| orchestrate.py | Multi-kernel scheduler: decides which kernel to optimize next using Amdahl's law, tracks aggregate progress |
| bench.py | Fixed benchmark: 5-stage correctness (smoke, shape sweep, numerical stability, determinism, edge cases) + performance + roofline |
| verify.py | Plugs optimized kernels back into the model, checks end-to-end correctness, reports total speedup |
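The compute-bound versus memory-bound classification mentioned in the table follows from the roofline model: compare a kernel's arithmetic intensity (FLOPs per byte moved) with the device's ridge point (peak FLOP/s divided by peak bytes/s). The sketch below is an assumption about how such a check could look, not the logic in profile.py or bench.py; the function name, the FLOP counts, and the peak numbers are illustrative placeholders rather than the specs of any real GPU.

```python
def classify_roofline(flops: float, bytes_moved: float,
                      peak_flops: float, peak_bytes_per_s: float) -> dict:
    """Classify a kernel with the roofline model."""
    intensity = flops / bytes_moved        # FLOPs per byte
    ridge = peak_flops / peak_bytes_per_s  # intensity where the two roofs meet
    bound = "compute" if intensity >= ridge else "memory"
    # Attainable throughput is capped by whichever roof is lower.
    attainable = min(peak_flops, intensity * peak_bytes_per_s)
    return {"intensity": intensity, "ridge": ridge,
            "bound": bound, "attainable_flops": attainable}


# Placeholder device: 100 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

n = 4096
fp16 = 2  # bytes per element

# Square matmul: 2*n^3 FLOPs, three n x n matrices moved once each.
matmul = classify_roofline(2 * n**3, 3 * n * n * fp16, PEAK_FLOPS, PEAK_BW)

# Row softmax: assume ~5 FLOPs per element, one read and one write.
softmax = classify_roofline(5 * n * n, 2 * n * n * fp16, PEAK_FLOPS, PEAK_BW)

print("matmul :", matmul["bound"], round(matmul["intensity"], 2))
print("softmax:", softmax["bound"], round(softmax["intensity"], 2))
```

With these placeholder numbers the matmul lands well above the ridge point and the softmax well below it, which matches the TFLOPS versus GB/s split used for the kernel metrics.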
Supported Kernels
9 kernel types covering the core operations of modern deep learning:
| Kernel | Description | Key Metric |
|--------|-------------|------------|
| matmul | Dense matrix multiplication (M x K) @ (K x N) | TFLOPS |
| softmax | Row-parallel numerically stable softmax | GB/s |
| layernorm | Layer normalization with affine transform | GB/s |
| rmsnorm | RMS normalization (LLaMA-style) | GB/s |
| flash_attention | Scaled dot-product attention with causal masking | TFLOPS |
| fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |
| cross_entropy | Fused cross entropy loss | GB/s |
| rotary_embedding | Rotary position embeddings (RoPE) | GB/s |
| reduce | Parallel reduction (sum) | GB/s |
Each has a PyTorch reference in reference.py, a starter Triton kernel in kernels/, and a starter CUDA C++ kernel in kernels/cuda/.
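As a rough guide to the two key metrics in the table, the sketch below shows how achieved TFLOPS and GB/s are conventionally derived from a measured kernel time. The helper names and the timings are hypothetical, and the byte count assumes each tensor is read or written exactly once; bench.py may count traffic differently.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for (M x K) @ (K x N): one multiply and one add per MAC."""
    return 2 * m * n * k / seconds / 1e12


def bandwidth_gbs(bytes_moved: int, seconds: float) -> float:
    """Achieved memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


# Hypothetical timings, not measurements from any real GPU.
tflops = matmul_tflops(4096, 4096, 4096, seconds=1e-3)

# fp16 softmax over a 4096 x 4096 tensor: one read plus one write.
softmax_bytes = 2 * 4096 * 4096 * 2
gbs = bandwidth_gbs(softmax_bytes, seconds=50e-6)

print(f"matmul:  {tflops:.1f} TFLOPS")
print(f"softmax: {gbs:.1f} GB/s")
```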
Example Models
Self-contained model definitions ship with AutoKernel (no transformers library needed):
| Model | File | Params | Usage |
|-------|------|--------|-------|
| GPT-2 Small | models/gpt2.py | 124M | --class-name GPT2 --input-shape 1,1024 |
| LLaMA (compact) | models/llama_7b.py | 160M | --class-name LlamaModel --input-shape 1,512 |
| LLaMA 7B | models/llama_7b.py | 7B | --class-name LlamaModel7B --input-shape 1,2048 |
| BERT-base | models/bert_base.py | 110M | --class-name BertModel --input-shape 8,512 |
| Custom | models/custom.py | -- | Template for your own model |
For HuggingFace models (uv sync --extra models):
uv run profile.py --module transformers --class-name AutoModelForCausalLM \
--pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16
KernelBench Integration
AutoKernel integrates with KernelBench, the standard benchmark for evaluating AI-generated GPU kernels (250+ problems across 4 difficulty levels). While most KernelBench evaluations use one-shot LLM generation, AutoKernel runs 50-300+ iterative refinement experiments per problem -- systematically exploring the optimization space instead of guessing.
# Install KernelBench dependencies
uv sync --extra kernelbench
# Fetch Level 1 problems from HuggingFace
uv run kernelbench/bridge.py fetch --source hf --level 1
# Set up a specific problem for optimization
uv run kernelbench/bridge.py setup --level 1 --problem 1 --source hf
# Evaluate (correctness + speedup vs PyTorch reference)
uv run kernelbench/bench_kb.py
# Batch score an entire level (computes fast_p metric)
uv run kernelbench/scorer.py --level 1
The agent reads kernelbench/program_kb.md for KernelBench-specific optimization instructions:
how to write ModelNew classes, when to use CUDA C++ vs Triton, fusion strategies per problem
level, and the edit-bench-keep/revert loop adapted for the KernelBench fast_p metric.
| Tool | What it does |
|------|-------------|
| kernelbench/bridge.py | Loads problems from HuggingFace or local repo, caches them, generates starter kernel.py |
| kernelbench/bench_kb.py | Evaluates ModelNew vs Model: 5-trial correctness + CUDA event timing + stability + determinism |
| kernelbench/scorer.py | Batch evaluation across a level, computes fast_p at thresholds (1.0x, 1.5x, 2.0x, 3.0x, 5.0x) |
| kernelbench/program_kb.md | Agent instructions for KernelBench mode |
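For orientation, fast_p in KernelBench is the fraction of problems whose generated kernel is both correct and more than p times faster than the PyTorch reference. The sketch below illustrates that definition with made-up results; it is not the code in kernelbench/scorer.py, and the `fast_p` helper and the sample data are hypothetical.

```python
from typing import Iterable, Tuple


def fast_p(results: Iterable[Tuple[bool, float]], p: float) -> float:
    """Fraction of problems that are correct AND have speedup > p.

    Each result is (is_correct, speedup_vs_reference).
    """
    results = list(results)
    if not results:
        return 0.0
    hits = sum(1 for correct, speedup in results if correct and speedup > p)
    return hits / len(results)


# Made-up per-problem outcomes for illustration only.
SAMPLE = [
    (True, 1.2),
    (True, 2.5),
    (False, 4.0),  # fast but incorrect: never counts
    (True, 0.8),   # correct but slower than the reference
    (True, 6.0),
]

# Thresholds listed in the scorer description above.
scores = {p: fast_p(SAMPLE, p) for p in (1.0, 1.5, 2.0, 3.0, 5.0)}
for p, score in scores.items():
    print(f"fast_{p}: {score:.2f}")
```

Because incorrect kernels never count, a large speedup on a wrong answer (the third entry) contributes nothing at any threshold.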
HuggingFace Kernels Export
Export optimized kernels to the HuggingFace Hub for easy distribution. Users can then load your kernels with a single line:
from kernels import get_kernel
module = get_kernel("your-username/kernel-name")
# Export an optimized CUDA kernel
uv run export_hf.py --name my_matmul
# Upload to Hub (requires `pip install kernels` and `huggingface-cli login`)
cd workspace/hf_export/my_matmul
kernels upload . --repo_id your-username/my_matmul
Project Structure
autokernel/
  kernel.py         the file the agent modifies (one kernel at a time)
  program.md        agent instructions -- the "research org code"
  bench.py          fixed benchmark + 5-stage correctness harness
  reference.py      PyTorch reference implementations (ground truth)
  prepare.py        one-time setup: test data, baselines
  profile.py        profile any PyTorch model, rank kernels by GPU time
  extract.py        extract bottleneck kernels into workspace/
  orchestrate.py    multi-kernel scheduler (Amdahl's law)
  verify.py         end-to-end model verification + speedup report
  export_hf.py      export optimized kernels to HuggingFace Kernels format
  analysis.py       experiment visualization (generates progress.png)
  kernels/          starter Triton kernels (9 types)
  kernels/cuda/     starter CUDA C++ kernels (9 types, tensor core accelerated)
  kernelbench/      KernelBench integration (bridge, eval harness, scorer)
  models/           self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/        runtime artifacts (gitignored)
Design Choices
Dual backend: Triton + CUDA C++. Triton for fast iteration (Python-like syntax, compiles in seconds). CUDA C++ for maximum performance (direct access to tensor cores via wmma, PTX intrinsics, bank-conflict-free shared memory layouts). Triton regularly reaches 80-95% of cuBLAS; CUDA C++ can match or exceed it. Both backends share the same kernel_fn() interface -- bench.py runs identically on either.
Correctness first. The benchmark checks kernel output against PyTorch before measuring performance. A fast but wrong kernel is immediately reverted. This prevents the agent from "optimizing" by producing garbage.
Amdahl's law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a 60% kernel (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when diminishing returns set in.
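The two figures above follow directly from Amdahl's law: if a kernel accounts for fraction f of total GPU time and is sped up by a factor s, the end-to-end speedup is 1 / ((1 - f) + f / s). The sketch below reproduces them; the function names are illustrative and are not taken from orchestrate.py.

```python
def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of runtime gets faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def ceiling(fraction: float) -> float:
    """Upper bound on end-to-end speedup if the kernel cost dropped to zero."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


big = end_to_end_speedup(0.60, 1.5)    # modest gain on a dominant kernel
small = end_to_end_speedup(0.05, 3.0)  # large gain on a minor kernel

print(f"1.5x on a 60% kernel -> {big:.2f}x end-to-end")
print(f"3.0x on a  5% kernel -> {small:.2f}x end-to-end")
print(f"ceiling for the 60% kernel: {ceiling(0.60):.2f}x")
print(f"ceiling for the  5% kernel: {ceiling(0.05):.2f}x")
```

The ceiling also shows why moving on makes sense: no amount of work on the 5% kernel can yield more than about 1.05x end-to-end.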
Single file to modify. The agent only touches kernel.py. Scope stays manageable, diffs reviewable, reverts clean.
TSV logging. Results go to a plain results.tsv file. Human-readable, git-friendly, trivially parseable, no infrastructure.
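As an illustration of "trivially parseable", a tab-separated log can be read with Python's standard csv module alone. The column names and rows below are hypothetical, since the actual schema of results.tsv is not given here, and the sample is parsed from an in-memory string so the sketch runs without any files.

```python
import csv
import io

# Hypothetical log contents; the real columns in results.tsv may differ.
SAMPLE_TSV = (
    "experiment\tkernel\tcorrect\tlatency_us\tdecision\n"
    "1\tmatmul\tPASS\t412.0\tkeep\n"
    "2\tmatmul\tFAIL\t198.5\trevert\n"
    "3\tmatmul\tPASS\t371.2\tkeep\n"
    "4\tmatmul\tPASS\t390.4\trevert\n"
)


def load_rows(handle):
    """Parse a TSV file-like object into a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = load_rows(io.StringIO(SAMPLE_TSV))
kept = [r for r in rows if r["decision"] == "keep"]
best = min(kept, key=lambda r: float(r["latency_us"]))

print(f"{len(kept)}/{len(rows)} experiments kept")
print(f"best kept latency: {best['latency_us']} us (experiment {best['experiment']})")
```

To read a real log, pass an open file handle to `load_rows` instead of the `io.StringIO` wrapper.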
