AutoKernel

Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton or CUDA C++ kernels.

AutoKernel Progress

Inspired by @karpathy/autoresearch -- which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: agent modifies one file, runs a fixed evaluation, keeps or reverts, repeats forever.

How It Works

Give AutoKernel any PyTorch model. It will:

Profile the model to find which GPU kernels are bottlenecks
Extract each bottleneck as a standalone Triton or CUDA C++ kernel
Optimize each kernel autonomously (edit, benchmark, keep/revert -- forever)
Verify end-to-end correctness and report the total speedup

The agent reads program.md -- the "research org code" -- which contains comprehensive instructions for autonomous operation. It edits kernel.py one kernel at a time, runs bench.py (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl's law.

Each experiment takes ~90 seconds. That's ~40 experiments/hour, ~320 overnight, across all kernels.

Quick Start

Requirements: NVIDIA GPU (tested on H100/A100/RTX 4090), Python 3.10+, uv.

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/RightNow-AI/autokernel.git
cd autokernel
uv sync

# One-time setup: test data + baselines
uv run prepare.py

# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)
uv run profile.py --model models/llama_7b.py --class-name LlamaModel \
 --input-shape 1,512 --dtype float16

# Extract top bottleneck kernels
uv run extract.py --top 5

# Verify benchmark works
uv run bench.py

Running the Agent

Spin up Claude, Codex, or any coding agent in this directory:

Read program.md and let's kick off a new experiment. Start with setup.

The agent will:

Profile your model and present the optimization plan
Create a branch (e.g., autokernel/mar10-llama7b)
Optimize each bottleneck kernel in priority order
Verify end-to-end correctness and report total speedup

program.md is intentionally comprehensive so the agent can run 10+ hours without getting stuck. It includes a 6-tier optimization playbook, decision framework, crash handling, and Amdahl's law reasoning.

The Pipeline

                 profile.py              extract.py           bench.py (loop)         verify.py
Any PyTorch  ──>  Rank kernels  ──>  Generate baseline  ──>  Optimize each  ──>  End-to-end
   model          by GPU time       Triton/CUDA kernels     kernel (agent)       verification

| Tool | What it does | |------|-------------| | profile.py | Profiles any PyTorch model with torch.profiler, ranks kernels by GPU time, classifies as compute/memory-bound | | extract.py | Extracts top-N bottleneck kernels into standalone Triton or CUDA C++ kernel files (--backend triton\|cuda) | | orchestrate.py | Multi-kernel scheduler: decides which kernel to optimize next using Amdahl's law, tracks aggregate progress | | bench.py | Fixed benchmark: 5-stage correctness (smoke, shape sweep, numerical stability, determinism, edge cases) + performance + roofline | | verify.py | Plugs optimized kernels back into the model, checks end-to-end correctness, reports total speedup |

Supported Kernels

9 kernel types covering the core operations of modern deep learning:

| Kernel | Description | Key Metric | |--------|-------------|------------| | matmul | Dense matrix multiplication (M x K) @ (K x N) | TFLOPS | | softmax | Row-parallel numerically stable softmax | GB/s | | layernorm | Layer normalization with affine transform | GB/s | | rmsnorm | RMS normalization (LLaMA-style) | GB/s | | flash_attention | Scaled dot-product attention with causal masking | TFLOPS | | fused_mlp | SwiGLU-style fused MLP (gate + up + down) | TFLOPS | | cross_entropy | Fused cross entropy loss | GB/s | | rotary_embedding | Rotary position embeddings (RoPE) | GB/s | | reduce | Parallel reduction (sum) | GB/s |

Each has a PyTorch reference in reference.py, a starter Triton kernel in kernels/, and a starter CUDA C++ kernel in kernels/cuda/.

Example Models

Self-contained model definitions ship with AutoKernel (no transformers library needed):

| Model | File | Params | Usage | |-------|------|--------|-------| | GPT-2 Small | models/gpt2.py | 124M | --class-name GPT2 --input-shape 1,1024 | | LLaMA (compact) | models/llama_7b.py | 160M | --class-name LlamaModel --input-shape 1,512 | | LLaMA 7B | models/llama_7b.py | 7B | --class-name LlamaModel7B --input-shape 1,2048 | | BERT-base | models/bert_base.py | 110M | --class-name BertModel --input-shape 8,512 | | Custom | models/custom.py | -- | Template for your own model |

For HuggingFace models (uv sync --extra models):

uv run profile.py --module transformers --class-name AutoModelForCausalLM \
 --pretrained meta-llama/Llama-2-7b-hf --input-shape 1,2048 --dtype float16

KernelBench Integration

AutoKernel integrates with KernelBench, the standard benchmark for evaluating AI-generated GPU kernels (250+ problems across 4 difficulty levels). While most KernelBench evaluations use one-shot LLM generation, AutoKernel runs 50-300+ iterative refinement experiments per problem -- systematically exploring the optimization space instead of guessing.

# Install KernelBench dependencies
uv sync --extra kernelbench

# Fetch Level 1 problems from HuggingFace
uv run kernelbench/bridge.py fetch --source hf --level 1

# Set up a specific problem for optimization
uv run kernelbench/bridge.py setup --level 1 --problem 1 --source hf

# Evaluate (correctness + speedup vs PyTorch reference)
uv run kernelbench/bench_kb.py

# Batch score an entire level (computes fast_p metric)
uv run kernelbench/scorer.py --level 1

The agent reads kernelbench/program_kb.md for KernelBench-specific optimization instructions: how to write ModelNew classes, when to use CUDA C++ vs Triton, fusion strategies per problem level, and the edit-bench-keep/revert loop adapted for the KernelBench fast_p metric.

| Tool | What it does | |------|-------------| | kernelbench/bridge.py | Loads problems from HuggingFace or local repo, caches them, generates starter kernel.py | | kernelbench/bench_kb.py | Evaluates ModelNew vs Model: 5-trial correctness + CUDA event timing + stability + determinism | | kernelbench/scorer.py | Batch evaluation across a level, computes fast_p at thresholds (1.0x, 1.5x, 2.0x, 3.0x, 5.0x) | | kernelbench/program_kb.md | Agent instructions for KernelBench mode |

HuggingFace Kernels Export

Export optimized kernels to the HuggingFace Hub for easy distribution. Users can then load your kernels with a single line:

from kernels import get_kernel
module = get_kernel("your-username/kernel-name")

# Export an optimized CUDA kernel
uv run export_hf.py --name my_matmul

# Upload to Hub (requires `pip install kernels` and `huggingface-cli login`)
cd workspace/hf_export/my_matmul
kernels upload . --repo_id your-username/my_matmul

Project Structure

autokernel/
  kernel.py             the file the agent modifies (one kernel at a time)
  program.md            agent instructions -- the "research org code"

  bench.py              fixed benchmark + 5-stage correctness harness
  reference.py          PyTorch reference implementations (ground truth)
  prepare.py            one-time setup: test data, baselines

  profile.py            profile any PyTorch model, rank kernels by GPU time
  extract.py            extract bottleneck kernels into workspace/
  orchestrate.py        multi-kernel scheduler (Amdahl's law)
  verify.py             end-to-end model verification + speedup report
  export_hf.py          export optimized kernels to HuggingFace Kernels format
  analysis.py           experiment visualization (generates progress.png)

  kernels/              starter Triton kernels (9 types)
  kernels/cuda/         starter CUDA C++ kernels (9 types, tensor core accelerated)
  kernelbench/          KernelBench integration (bridge, eval harness, scorer)
  models/               self-contained model definitions (GPT-2, LLaMA, BERT)
  workspace/            runtime artifacts (gitignored)

Design Choices

Dual backend: Triton + CUDA C++. Triton for fast iteration (Python-like syntax, compiles in seconds). CUDA C++ for maximum performance (direct access to tensor cores via wmma, PTX intrinsics, shared memory bank-conflict-free layouts). Triton regularly reaches 80-95% of cuBLAS; CUDA C++ can match or exceed it. Both backends share the same kernel_fn() interface -- bench.py runs identically on either.

Correctness first. The benchmark checks kernel output against PyTorch before measuring performance. A fast but wrong kernel is immediately reverted. This prevents the agent from "optimizing" by producing garbage.

Amdahl's law orchestration. The orchestrator prioritizes by impact. A 1.5x speedup on a 60% kernel (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when diminishing returns set in.

Single file to modify. The agent only touches kernel.py. Scope stays manageable, diffs reviewable, reverts clean.

TSV logging. Results go to a plain results.tsv file. Human-readable, git-friendly, trivially parseable, no infrastructure.

Autokernel

Install / Use

README