KernelAgent
Autonomous GPU Kernel Generation & Optimization via Deep Agents
KernelAgent — Multi‑Agent GPU Kernel Synthesis and Optimization
KernelAgent turns PyTorch programs into verified Triton kernels and optimizes their performance. It was designed around KernelBench workloads and combines:
- Static problem analysis to decide whether to run a lightweight path or a full pipeline
- LLM‑assisted refactoring that isolates fusable subgraphs
- Parallel Triton kernel generation with strict runtime verification
- End‑to‑end composition that rebuilds the original forward pass using only the synthesized kernels
- Hardware‑guided optimization pipeline that iteratively improves performance
GPU Kernel Synthesis Blog post: PyTorch KernelFalcon
GPU Kernel Optimization Blog post: PyTorch KernelAgent
Kernel Generation Pipeline Overview
Every stage writes artifacts to a run directory under .fuse/<run_id>/, including the fused PyTorch code, subgraphs.json, individual KernelAgent sessions, and the final compose_out/composed_kernel.py.
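As a minimal sketch of how a finished run could be inspected, the snippet below searches a run directory for two of the artifacts named above. The helper names (`find_artifacts`, `missing_artifacts`) are hypothetical and not part of KernelAgent, and the search is recursive because the exact nesting inside the run directory is not assumed here.

```python
from pathlib import Path
from typing import Dict, List

# Artifact file names taken from the description above. Their exact location
# inside the run directory is not assumed, so the search is recursive.
ARTIFACT_NAMES = ("subgraphs.json", "composed_kernel.py")


def find_artifacts(run_dir: Path) -> Dict[str, List[Path]]:
    """Map each known artifact name to the matching files under `run_dir`."""
    return {name: sorted(run_dir.rglob(name)) for name in ARTIFACT_NAMES}


def missing_artifacts(run_dir: Path) -> List[str]:
    """List the artifact names for which no file was found."""
    return [name for name, hits in find_artifacts(run_dir).items() if not hits]
```

Calling `missing_artifacts(Path(".fuse") / run_id)` after a run gives a quick indication of whether the extraction and composition stages produced output.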
KernelAgent Multi-Worker Optimization Pipeline Overview
Every stage writes artifacts to a run directory under .optimize/<run_id>/, including the input Triton kernel, artifacts, individual optimization worker sessions, and the final output/best_kernel.py.
Quickstart
Requirements
- Python 3.8 – 3.12
- Linux or macOS
- GPU Requirements (one of the following):
  - CUDA: NVIDIA GPU with CUDA support
  - XPU: Intel GPU with oneAPI support (Arc, Data Center GPUs, or integrated Xe graphics)
- Triton (installed separately: pip install triton or nightly from source)
- PyTorch (https://pytorch.org/get-started/locally/)
- LLM provider (OpenAI, Anthropic, or a self-hosted relay)
Install
pip install -e .
Platform-Specific PyTorch Installation
Intel XPU (Intel GPUs)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu
Note: Intel XPU support requires:
- Compatible Intel GPU (Arc series, Data Center GPUs, or integrated Xe graphics)
- Linux with appropriate Intel GPU drivers
Verify your XPU installation:
import torch
print(torch.xpu.is_available()) # Should print True
print(torch.xpu.device_count()) # Number of Intel GPUs
(Optional) Install KernelBench for problem examples
git clone https://github.com/ScalingIntelligence/KernelBench.git
Note: By default, the KernelAgent UI searches for KernelBench at the same level as KernelAgent (i.e., ../KernelBench).
Configure
You can export keys directly or use a .env file that the CLIs load automatically.
OPENAI_MODEL=gpt-5 # default model for extraction
NUM_KERNEL_SEEDS=4 # parallel workers per kernel
MAX_REFINEMENT_ROUNDS=10 # retry budget per worker
LOG_LEVEL=INFO # logging level
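To make the meaning of these variables concrete, here is a minimal sketch of reading and validating them. `AgentSettings` and `load_settings` are hypothetical names, and the fallback values simply mirror the example values shown above (an assumption, not necessarily KernelAgent's real defaults).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AgentSettings:
    openai_model: Optional[str]   # None when OPENAI_MODEL is unset
    num_kernel_seeds: int         # parallel workers per kernel
    max_refinement_rounds: int    # retry budget per worker
    log_level: str


def _positive_int(env: Mapping[str, str], name: str, fallback: int) -> int:
    """Parse an integer variable and reject values below 1."""
    value = int(env.get(name, fallback))  # ValueError on non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be >= 1, got {value}")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> AgentSettings:
    """Build settings from `env` (defaults to the process environment)."""
    env = os.environ if env is None else env
    return AgentSettings(
        openai_model=env.get("OPENAI_MODEL") or None,
        # Fallbacks below copy the example values; they are assumptions.
        num_kernel_seeds=_positive_int(env, "NUM_KERNEL_SEEDS", 4),
        max_refinement_rounds=_positive_int(env, "MAX_REFINEMENT_ROUNDS", 10),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Validating early keeps a typo such as `NUM_KERNEL_SEEDS=four` from surfacing only after workers have been launched.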
LLM Providers
KernelAgent currently supports OpenAI and Anthropic out of the box. You can also use a custom OpenAI-compatible endpoint.
These can be configured in .env or via environment variables.
# OpenAI (models like `o4-mini`, `gpt-5`)
OPENAI_API_KEY=sk-...
# Anthropic (default; `claude-sonnet-4-20250514` is used when `OPENAI_MODEL` is unset)
ANTHROPIC_API_KEY=sk-ant-...
# Relay configuration for self-hosted gateways
LLM_RELAY_URL=http://127.0.0.1:11434
LLM_RELAY_TIMEOUT_S=120
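A small preflight check can report which of these variables are set before a run starts. This is only a sketch: the variable names come from the snippet above, but the provider labels and both function names are illustrative, and the check says nothing about how KernelAgent itself picks a provider.

```python
import os
from typing import List, Mapping, Optional

# Variable names come from the configuration snippet; labels are illustrative.
PROVIDER_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "relay": "LLM_RELAY_URL",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the labels whose variable is set to a non-blank value."""
    env = os.environ if env is None else env
    return [label for label, var in PROVIDER_VARS.items() if env.get(var, "").strip()]


def relay_timeout_seconds(
    env: Optional[Mapping[str, str]] = None, fallback: float = 120.0
) -> float:
    """Parse LLM_RELAY_TIMEOUT_S; the fallback copies the example value (an assumption)."""
    env = os.environ if env is None else env
    timeout = float(env.get("LLM_RELAY_TIMEOUT_S", fallback))
    if timeout <= 0:
        raise ValueError(f"LLM_RELAY_TIMEOUT_S must be positive, got {timeout}")
    return timeout
```

An empty result from `configured_providers()` is a cheap signal that no key or relay URL has been exported yet.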
More knobs live in triton_kernel_agent/agent.py and Fuser/config.py.
End-to-End Kernel Generation Workflows
- Auto-route a KernelBench problem: static analysis picks between the direct KernelAgent path and the full Fuser pipeline, with automatic fallback if the first attempt fails:

      # --no-router-cache: avoid caching or using cached results
      # --verify: ensure the final composition test runs
      python -m Fuser.auto_agent \
        --problem /abs/path/to/KernelBench/level1/19_ReLU.py \
        --no-router-cache \
        --verify

  --no-router-cache can be enabled to avoid utilizing any cached router results and prevent writing to the cache.
- Manually run the pipeline (extract → dispatch → compose) when you want explicit control over models or concurrency:

      python -m Fuser.pipeline \
        --problem /abs/path/to/problem.py \
        --extract-model gpt-5 \
        --dispatch-model o4-mini \
        --dispatch-jobs auto \
        --compose-model o4-mini \
        --workers 4 \
        --max-iters 5 \
        --verify

      # For Intel XPU
      python -m Fuser.pipeline \
        --problem /abs/path/to/problem.py \
        --target-platform xpu \
        --extract-model gpt-5 \
        --dispatch-model o4-mini \
        --dispatch-jobs auto \
        --compose-model o4-mini \
        --workers 4 \
        --max-iters 5 \
        --verify

  --dispatch-jobs auto matches the number of discovered subgraphs; artifacts are placed under .fuse/<run_id>/.
- Direct KernelAgent run: bypass Fuser and provide a plain-language problem description or a KernelBench snippet:

      from triton_kernel_agent import TritonKernelAgent

      agent = TritonKernelAgent(num_workers=4, max_rounds=8, model_name="gpt-5")
      result = agent.generate_kernel(
          problem_description="Implement ReLU over a contiguous 1D tensor of length 1024"
      )

      if result["success"]:
          print("Kernel path:", result["kernel_path"])
          print("Session directory:", result["session_dir"])
      else:
          print("Failure:", result["message"])

- UIs: interactive runs with Gradio frontends:
  - Triton KernelAgent UI: kernel-agent or python scripts/triton_ui.py
  - Fuser orchestration UI: fuser-ui or python scripts/fuser_ui
  - Full pipeline UI: pipeline-ui or python scripts/pipeline_ui
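When the manual pipeline is launched from a script rather than a shell, the command can be assembled programmatically. The sketch below only builds the argument list using the flags shown in this section. `build_pipeline_command` and `format_command` are hypothetical helpers, and the default values copy the example invocation rather than any documented defaults.

```python
import shlex
import sys
from typing import List, Optional


def build_pipeline_command(
    problem: str,
    *,
    target_platform: Optional[str] = None,  # e.g. "xpu" for Intel GPUs
    extract_model: str = "gpt-5",
    dispatch_model: str = "o4-mini",
    compose_model: str = "o4-mini",
    dispatch_jobs: str = "auto",
    workers: int = 4,
    max_iters: int = 5,
    verify: bool = True,
) -> List[str]:
    """Assemble the argv for `python -m Fuser.pipeline` without running it."""
    cmd = [sys.executable, "-m", "Fuser.pipeline", "--problem", problem]
    if target_platform is not None:
        cmd += ["--target-platform", target_platform]
    cmd += [
        "--extract-model", extract_model,
        "--dispatch-model", dispatch_model,
        "--dispatch-jobs", dispatch_jobs,
        "--compose-model", compose_model,
        "--workers", str(workers),
        "--max-iters", str(max_iters),
    ]
    if verify:
        cmd.append("--verify")
    return cmd


def format_command(cmd: List[str]) -> str:
    """Render the argv as a shell-quoted string for logging."""
    return shlex.join(cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list instead of a single string avoids shell-quoting problems with paths that contain spaces.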
Component Details
- AutoRouter (Fuser/auto_agent.py): parses the problem’s AST and looks for attention blocks, transposed convolutions, control flow, and long op chains. It caches decisions under .fuse/router_cache.json and can fall back to the other path if the first attempt fails. Use --no-router-cache to ignore the existing cache and skip caching new routes. Use --ignore-router-config to ignore router-provided tuning and rely on CLI args.
- Fuser Orchestrator (Fuser/orchestrator.py): rewrites the PyTorch module into fusable modules, executes them for validation, and packages a tarball of the fused code. Run IDs and directories are managed via Fuser/paths.py.
- Subgraph Extractor (Fuser/subgraph_extractor.py): prompts the LLM to emit a JSON array describing each unique subgraph, including ops, shapes, dtypes, and parameter tensors. Entries are deduplicated by shape signature so the dispatcher can reuse kernels.
- Dispatcher (Fuser/dispatch_kernel_agent.py): converts each JSON item into a precise Triton generation spec, then spins up TritonKernelAgent processes in parallel. Each worker writes its own session directory with the candidate kernel, test harness, and verification logs.
- TritonKernelAgent (triton_kernel_agent/): manages a pool of verification workers (worker.py, manager.py). Each worker iteratively asks an LLM for improvements, executes unit tests under sandboxed subprocesses (Fuser/runner.py), and enforces strict bans on PyTorch fallbacks. A run succeeds only when the test prints PASS (or the sentinel string) and exits with status 0.
- Composer (Fuser/compose_end_to_end.py): stitches the verified kernels back into a single Triton program. The composed file contains one or more @triton.jit kernels plus a kernel_function(...) wrapper and a self-test that replays the original PyTorch problem. With --verify, the test is executed immediately and must succeed.
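The AutoRouter's static analysis can be pictured with a small AST walk. The sketch below is not the code in Fuser/auto_agent.py: the name-matching heuristics, the `long_chain_threshold` value, and the rule that any detected feature selects the full pipeline are all assumptions made for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemFeatures:
    has_control_flow: bool
    has_transposed_conv: bool
    has_attention: bool
    call_count: int  # rough proxy for the length of the op chain


def _callee_name(func: ast.expr) -> str:
    """Return the last component of a call target, e.g. `relu` for `torch.relu`."""
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def extract_features(source: str) -> ProblemFeatures:
    """Walk the module AST and record coarse structural features."""
    control_flow = transposed = attention = False
    calls = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While)):
            control_flow = True
        elif isinstance(node, ast.Call):
            calls += 1
            name = _callee_name(node.func).lower()
            transposed = transposed or "convtranspose" in name or "conv_transpose" in name
            attention = attention or "attention" in name
    return ProblemFeatures(control_flow, transposed, attention, calls)


def choose_route(features: ProblemFeatures, long_chain_threshold: int = 8) -> str:
    """Pick 'fuser' for structurally complex problems, else 'kernelagent' (assumed rule)."""
    is_complex = (
        features.has_control_flow
        or features.has_transposed_conv
        or features.has_attention
        or features.call_count > long_chain_threshold
    )
    return "fuser" if is_complex else "kernelagent"
```

Because the decision depends only on the source text, it is cheap to compute and easy to cache, which fits the router cache described above.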
End-to-End Kernel Optimization Workflows
KernelAgent includes a hardware-guided optimization pipeline that iteratively improves a verified Triton kernel's performance using GPU profiling feedback.
- Profile — NCU collects 28 hardware metrics (compute utilization, memory bandwidth, cache hit rates, occupancy, stall breakdowns)
- Roofline Analysis — Classifies the kernel as memory-bound, compute-bound, or underutilized based on SOL (speed-of-light) percentages
- Bottleneck Diagnosis — An LLM analyzes the NCU metrics + kernel code to identify root causes and recommend specific fixes
- Optimization — An LLM generates an optimized kernel applying the recommended fixes
- Verification — The optimized kernel is tested for numerical correctness against PyTorch reference
- Benchmarking — CUDA event timing measures the new kernel, tracking best-so-far with divergence-based revert
The loop runs for up to N rounds, with early termination when the kernel reaches roofline (≥95% SOL) or when performance converges.
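The control flow of this loop can be sketched without any GPU dependency by passing the profiling, verification, and generation steps in as callables. Everything below is illustrative: the function names, the 40% "underutilized" cutoff, the 1% minimum gain, the patience of two rounds, and the 1.5x divergence factor are assumptions. Only the 95% SOL stopping threshold comes from the text above.

```python
from typing import Callable, Tuple, TypeVar

K = TypeVar("K")  # opaque kernel handle (source string, file path, ...)


def classify(compute_sol: float, memory_sol: float, underutilized_below: float = 40.0) -> str:
    """Label a kernel from its speed-of-light percentages (cutoff is hypothetical)."""
    if max(compute_sol, memory_sol) < underutilized_below:
        return "underutilized"
    return "memory-bound" if memory_sol >= compute_sol else "compute-bound"


def optimize(
    kernel: K,
    propose: Callable[[K, int], K],               # LLM step: (base kernel, round) -> candidate
    verify: Callable[[K], bool],                  # numerical correctness check
    measure: Callable[[K], Tuple[float, float]],  # -> (time in ms, peak SOL %)
    max_rounds: int,
    roofline_sol: float = 95.0,
    min_gain: float = 0.01,
    patience: int = 2,
    divergence: float = 1.5,
) -> Tuple[K, float, str]:
    """Return (best kernel, best time in ms, stop reason)."""
    best = current = kernel
    best_ms, sol = measure(kernel)
    if sol >= roofline_sol:
        return best, best_ms, "roofline"
    stale = 0  # consecutive rounds without a meaningful improvement
    for round_idx in range(max_rounds):
        candidate = propose(current, round_idx)
        if verify(candidate):
            ms, sol = measure(candidate)
            if ms < best_ms * (1.0 - min_gain):
                best, best_ms, stale = candidate, ms, 0
            else:
                stale += 1
            # Divergence-based revert: restart from the best kernel when the
            # candidate is much slower than the best seen so far.
            current = best if ms > best_ms * divergence else candidate
            if sol >= roofline_sol:
                return best, best_ms, "roofline"
        else:
            stale += 1  # incorrect candidates are discarded
        if stale >= patience:
            return best, best_ms, "converged"
    return best, best_ms, "max_rounds"
```

Keeping the best-so-far kernel separate from the current base is what allows a bad round to be undone without losing earlier gains.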
Usage
Gradio UI
python scripts/optimization_ui.py --port 8085
Programmatic API
Optimize a kernel using beam search — parallel exploration with top-N kernels and M bottleneck directions:
cd examples && python run_opt_manager.py \
--kernel-dir optimize_01_matvec/ \
--strategy beam_search \
--max-rounds 5
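For readers unfamiliar with the strategy, here is a generic beam search sketch in which each of the top-N kernels is expanded along M directions per round. It is sequential and uses placeholder callables, so it illustrates the search pattern only and is not the implementation behind `--strategy beam_search`.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

K = TypeVar("K")  # opaque kernel handle
D = TypeVar("D")  # bottleneck direction, e.g. a label such as "memory"


def beam_search(
    seed: K,
    directions: Sequence[D],
    expand: Callable[[K, D], K],   # produce a new kernel from (kernel, direction)
    score: Callable[[K], float],   # lower is better, e.g. runtime in ms
    beam_width: int,
    rounds: int,
) -> Tuple[float, K]:
    """Return (best score, best kernel) after `rounds` expansion rounds."""
    beam: List[Tuple[float, K]] = [(score(seed), seed)]
    for _ in range(rounds):
        # Parents stay in the pool so the best score never regresses.
        candidates = list(beam)
        for _, kernel in beam:
            for direction in directions:
                child = expand(kernel, direction)
                candidates.append((score(child), child))
        # Sort on the score only; kernels themselves need not be comparable.
        candidates.sort(key=lambda pair: pair[0])
        beam = candidates[:beam_width]
    return beam[0]
```

With a beam width of N and M directions, each round evaluates at most N × M new candidates, which is the work a parallel implementation would spread across workers.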
Key Components
| Component | Location | Role |
|---|---|---|
| OptimizationOrchestrator | triton_kernel_agent/opt_worker_component/orchestrator/ | Main optimization loop |
| KernelProfiler | triton_kernel_agent/opt_worker_component/profiling/ | NCU hardware profiling |
| BottleneckAnalyzer | triton_kernel_agent/opt_worker_component/prescribing/ | LLM-based bottleneck diagnosis |
| RooflineAnalyzer | kernel_perf_agent/kernel_opt/roofline/ | SOL classification and early stopping |
| Benchmark | triton_kernel_agent/opt_worker_component/benchmarking/ | CUDA event timing |
Optimization Artifacts
.optimize/workers/<worker_id>/<run_id>/artifacts
kernel_round_0.py
