KernelAgent — Multi‑Agent GPU Kernel Synthesis and Optimization

KernelAgent turns PyTorch programs into verified Triton kernels and optimizes their performance. It was designed around KernelBench workloads and combines:

Static problem analysis to decide whether to run a lightweight path or a full pipeline
LLM‑assisted refactoring that isolates fusable subgraphs
Parallel Triton kernel generation with strict runtime verification
End‑to‑end composition that rebuilds the original forward pass using only the synthesized kernels
Hardware‑guided optimization pipeline that iteratively improves performance

GPU Kernel Synthesis Blog post: PyTorch KernelFalcon

GPU Kernel Optimization Blog post: PyTorch KernelAgent

Kernel Generation Pipeline Overview

Every stage writes artifacts to a run directory under .fuse/<run_id>/, including the fused PyTorch code, subgraphs.json, individual KernelAgent sessions, and the final compose_out/composed_kernel.py.
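As a minimal sketch for inspecting a finished run, the helper below looks for two of the artifacts named above. The function name `find_fuse_artifacts` is hypothetical. Because the exact location of subgraphs.json inside the run directory is not stated here, the helper searches recursively instead of assuming a fixed path.

```python
from pathlib import Path


def find_fuse_artifacts(run_dir):
    """Locate key Fuser artifacts under a run directory (hypothetical helper)."""
    run_dir = Path(run_dir)
    found = {}
    # Only the file names are taken from the README; their depth is not assumed.
    for name in ("subgraphs.json", "composed_kernel.py"):
        matches = sorted(run_dir.rglob(name))
        found[name] = matches[0] if matches else None
    return found
```

A value of `None` means the corresponding stage has not written its output yet, or the run stopped before reaching it.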

KernelAgent Multi-Worker Optimization Pipeline Overview

Every stage writes artifacts to a run directory under .optimize/<run_id>/, including the input Triton kernel, artifacts, individual optimization worker sessions, and the final output/best_kernel.py.

Quickstart

Requirements

Python 3.8 – 3.12
Linux or macOS
GPU Requirements (one of the following):
- CUDA: NVIDIA GPU with CUDA support
- XPU: Intel GPU with oneAPI support (Arc, Data Center GPUs, or integrated Xe graphics)
Triton (installed separately: pip install triton or nightly from source)
PyTorch (https://pytorch.org/get-started/locally/)
LLM provider (OpenAI, Anthropic, or a self-hosted relay)
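The requirements above can be checked with a short script before installing. This is a sketch, not part of KernelAgent: `check_environment` is a hypothetical name, and it only reports whether the packages are importable and which GPU backend PyTorch can see.

```python
import importlib.util
import sys


def check_environment():
    """Report which prerequisites are present (a sketch, not an official check)."""
    report = {
        # Supported interpreter range listed in the requirements.
        "python_ok": (3, 8) <= sys.version_info[:2] <= (3, 12),
        "torch": importlib.util.find_spec("torch") is not None,
        "triton": importlib.util.find_spec("triton") is not None,
        "backend": None,
    }
    if report["torch"]:
        import torch

        if torch.cuda.is_available():
            report["backend"] = "cuda"
        # Older PyTorch builds may not expose torch.xpu at all.
        elif hasattr(torch, "xpu") and torch.xpu.is_available():
            report["backend"] = "xpu"
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```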

Install

pip install -e .

Platform-Specific PyTorch Installation

Intel XPU (Intel GPUs)

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/xpu

Note: Intel XPU support requires:

Compatible Intel GPU (Arc series, Data Center GPUs, or integrated Xe graphics)
Linux with appropriate Intel GPU drivers

Verify your XPU installation:

import torch
print(torch.xpu.is_available())  # Should print True
print(torch.xpu.device_count())  # Number of Intel GPUs

(Optional) Install KernelBench for problem examples

git clone https://github.com/ScalingIntelligence/KernelBench.git

Note: By default, the KernelAgent UI looks for KernelBench in a directory next to KernelAgent (i.e. ../KernelBench).

Configure

You can export keys directly or use an .env file that the CLIs load automatically.

OPENAI_MODEL=gpt-5            # default model for extraction
NUM_KERNEL_SEEDS=4            # parallel workers per kernel
MAX_REFINEMENT_ROUNDS=10      # retry budget per worker
LOG_LEVEL=INFO                # logging level
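To illustrate how settings like these are read, here is a minimal sketch of a .env parser. It is not the loader the CLIs use; `parse_env_text` is a hypothetical helper, and it assumes that everything after a `#` is a comment, so it would truncate a value that itself contains `#`.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        # Drop full-line and inline comments, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip()
    return values


sample = """
OPENAI_MODEL=gpt-5            # default model for extraction
NUM_KERNEL_SEEDS=4            # parallel workers per kernel
MAX_REFINEMENT_ROUNDS=10      # retry budget per worker
LOG_LEVEL=INFO                # logging level
"""
settings = parse_env_text(sample)
num_seeds = int(settings["NUM_KERNEL_SEEDS"])  # numeric knobs arrive as strings
```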

LLM Providers

KernelAgent currently supports OpenAI and Anthropic out-of-the-box. You can also use a custom OpenAI endpoint. These can be configured in .env or via environment variables.

# OpenAI (models like `o4-mini`, `gpt-5`)
OPENAI_API_KEY=sk-...

# Anthropic (default; `claude-sonnet-4-20250514` is used when `OPENAI_MODEL` is unset)
ANTHROPIC_API_KEY=sk-ant-...

# Relay configuration for self-hosted gateways
LLM_RELAY_URL=http://127.0.0.1:11434
LLM_RELAY_TIMEOUT_S=120

More knobs live in triton_kernel_agent/agent.py and Fuser/config.py.
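The sketch below shows one way to check credentials before a run. The default model name comes from the comment above; the rest is an assumption: the function names are hypothetical, a `claude` prefix is treated as the only signal for Anthropic, and relay setups (which may not need either key) are ignored.

```python
import os

# Stated default when OPENAI_MODEL is unset.
DEFAULT_MODEL = "claude-sonnet-4-20250514"


def resolve_model(env=None):
    """Return the configured model name, falling back to the default."""
    env = os.environ if env is None else env
    return env.get("OPENAI_MODEL") or DEFAULT_MODEL


def required_key_name(model):
    # Assumption: a "claude" prefix means Anthropic; anything else is OpenAI-style.
    return "ANTHROPIC_API_KEY" if model.startswith("claude") else "OPENAI_API_KEY"


def missing_credentials(env=None):
    """List the API key variables that the resolved model needs but lacks."""
    env = os.environ if env is None else env
    key = required_key_name(resolve_model(env))
    return [] if env.get(key) else [key]
```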

End-to-End Kernel Generation Workflows

Auto-route a KernelBench problem — static analysis picks between the direct KernelAgent path and the full Fuser pipeline, with automatic fallback if the first attempt fails:
```
# --no-router-cache: neither read nor write cached router results
# --verify: run the final composition test
python -m Fuser.auto_agent \
  --problem /abs/path/to/KernelBench/level1/19_ReLU.py \
  --no-router-cache \
  --verify
```
Pass --no-router-cache to skip any cached router results and to avoid writing new ones to the cache.

Manually run the pipeline (extract → dispatch → compose) when you want explicit control over models or concurrency:

python -m Fuser.pipeline \
  --problem /abs/path/to/problem.py \
  --extract-model gpt-5 \
  --dispatch-model o4-mini \
  --dispatch-jobs auto \
  --compose-model o4-mini \
  --workers 4 \
  --max-iters 5 \
  --verify

# For Intel XPU
python -m Fuser.pipeline \
  --problem /abs/path/to/problem.py \
  --target-platform xpu \
  --extract-model gpt-5 \
  --dispatch-model o4-mini \
  --dispatch-jobs auto \
  --compose-model o4-mini \
  --workers 4 \
  --max-iters 5 \
  --verify

--dispatch-jobs auto sets the number of dispatch jobs to the number of discovered subgraphs; artifacts are placed under .fuse/<run_id>/.
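When the pipeline is launched from a script rather than a shell, the same invocation can be assembled as an argument list. This is a sketch that uses only the flags shown above; `build_pipeline_command` is a hypothetical helper, not part of Fuser.

```python
import shlex
import sys


def build_pipeline_command(problem, target_platform=None, workers=4,
                           max_iters=5, verify=True):
    """Assemble the Fuser.pipeline command line from the flags shown above."""
    cmd = [sys.executable, "-m", "Fuser.pipeline", "--problem", str(problem)]
    if target_platform:
        # Only added for non-default platforms such as Intel XPU.
        cmd += ["--target-platform", target_platform]
    cmd += [
        "--extract-model", "gpt-5",
        "--dispatch-model", "o4-mini",
        "--dispatch-jobs", "auto",
        "--compose-model", "o4-mini",
        "--workers", str(workers),
        "--max-iters", str(max_iters),
    ]
    if verify:
        cmd.append("--verify")
    return cmd


cmd = build_pipeline_command("/abs/path/to/problem.py", target_platform="xpu")
print(shlex.join(cmd))
# To launch it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.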

Direct KernelAgent run — bypass Fuser and provide a plain language problem description or a KernelBench snippet:

from triton_kernel_agent import TritonKernelAgent

agent = TritonKernelAgent(num_workers=4, max_rounds=8, model_name="gpt-5")
result = agent.generate_kernel(
    problem_description="Implement ReLU over a contiguous 1D tensor of length 1024"
)

if result["success"]:
    print("Kernel path:", result["kernel_path"])
    print("Session directory:", result["session_dir"])
else:
    print("Failure:", result["message"])

UIs — interactive runs with Gradio frontends:
- Triton KernelAgent UI: kernel-agent or python scripts/triton_ui.py
- Fuser orchestration UI: fuser-ui or python scripts/fuser_ui.py
- Full pipeline UI: pipeline-ui or python scripts/pipeline_ui.py

Component Details

AutoRouter (Fuser/auto_agent.py): parses the problem’s AST and looks for attention blocks, transposed convolutions, control flow, and long op chains. It caches decisions under .fuse/router_cache.json and can fall back to the other path if the first attempt fails. Use --no-router-cache to ignore the existing cache and skip caching new routes. Use --ignore-router-config to ignore router-provided tuning and rely on CLI args.
Fuser Orchestrator (Fuser/orchestrator.py): rewrites the PyTorch module into fusable modules, executes them for validation, and packages a tarball of the fused code. Run IDs and directories are managed via Fuser/paths.py.
Subgraph Extractor (Fuser/subgraph_extractor.py): prompts the LLM to emit a JSON array describing each unique subgraph, including ops, shapes, dtypes, and parameter tensors. Entries are deduplicated by shape signature so the dispatcher can reuse kernels.
Dispatcher (Fuser/dispatch_kernel_agent.py): converts each JSON item into a precise Triton generation spec, then spins up TritonKernelAgent processes in parallel. Each worker writes its own session directory with the candidate kernel, test harness, and verification logs.
TritonKernelAgent (triton_kernel_agent/): manages a pool of verification workers (worker.py, manager.py). Each worker iteratively asks an LLM for improvements, executes unit tests under sandboxed subprocesses (Fuser/runner.py), and enforces strict bans on PyTorch fallbacks. A run succeeds only when the test prints PASS (or the sentinel string) and exits with status 0.
Composer (Fuser/compose_end_to_end.py): stitches the verified kernels back into a single Triton program. The composed file contains one or more @triton.jit kernels plus a kernel_function(...) wrapper and a self-test that replays the original PyTorch problem. With --verify, the test is executed immediately and must succeed.
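The AutoRouter's static analysis can be illustrated with Python's `ast` module. This is a sketch under stated assumptions, not the code in Fuser/auto_agent.py: the name hints, the call-count threshold, and the route labels are all hypothetical stand-ins for the router's real rules.

```python
import ast

# Assumed name hints; the router's actual patterns are not listed in this README.
COMPLEX_CALL_HINTS = ("ConvTranspose", "MultiheadAttention",
                      "scaled_dot_product_attention")


def analyze_source(source, long_chain_threshold=8):
    """Pick a route from simple AST features (illustrative heuristic only)."""
    has_control_flow = False
    hinted_calls = set()
    call_count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.If, ast.For, ast.While)):
            has_control_flow = True
        elif isinstance(node, ast.Call):
            call_count += 1
            func = node.func
            # nn.ConvTranspose2d(...) is an Attribute; relu(...) is a Name.
            name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", "")
            if name.startswith(COMPLEX_CALL_HINTS):
                hinted_calls.add(name)
    use_fuser = (has_control_flow or bool(hinted_calls)
                 or call_count >= long_chain_threshold)
    return {
        "route": "fuser" if use_fuser else "kernelagent",
        "control_flow": has_control_flow,
        "hinted_calls": sorted(hinted_calls),
        "call_count": call_count,
    }


simple = "import torch\ndef forward(x):\n    return torch.relu(x)\n"
complex_src = "import torch.nn as nn\nlayer = nn.ConvTranspose2d(3, 8, 3)\n"
simple_decision = analyze_source(simple)
complex_decision = analyze_source(complex_src)
```

Because the analysis only parses the source and never executes it, it stays cheap enough to run before committing to either path.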

End-to-End Kernel Optimization Workflows

KernelAgent includes a hardware-guided optimization pipeline that iteratively improves a verified Triton kernel's performance using GPU profiling feedback.

Profile — NCU collects 28 hardware metrics (compute utilization, memory bandwidth, cache hit rates, occupancy, stall breakdowns)
Roofline Analysis — Classifies the kernel as memory-bound, compute-bound, or underutilized based on SOL (speed-of-light) percentages
Bottleneck Diagnosis — An LLM analyzes the NCU metrics + kernel code to identify root causes and recommend specific fixes
Optimization — An LLM generates an optimized kernel applying the recommended fixes
Verification — The optimized kernel is tested for numerical correctness against the PyTorch reference
Benchmarking — CUDA event timing measures the new kernel, tracking best-so-far with divergence-based revert

The loop runs for up to N rounds, with early termination when the kernel reaches roofline (≥95% SOL) or when performance converges.
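The termination logic can be sketched as follows. Only the 95% SOL roofline threshold comes from the text above; the 60% underutilization cutoff, the 1% minimum gain, and the two-round patience are assumptions, and the per-round measurements stand in for real profiling and benchmarking. Divergence-based revert is not modeled.

```python
def classify_roofline(compute_sol, memory_sol, roofline_pct=95.0, low_pct=60.0):
    """Classify a kernel from SOL percentages (low_pct is an assumed cutoff)."""
    peak = max(compute_sol, memory_sol)
    if peak >= roofline_pct:
        return "at_roofline"
    if peak < low_pct:
        return "underutilized"
    return "memory_bound" if memory_sol >= compute_sol else "compute_bound"


def run_loop(measurements, max_rounds=5, min_gain=0.01, patience=2):
    """Walk per-round (time_ms, compute_sol, memory_sol) tuples until a stop rule fires.

    Returns (best_time_ms, rounds_used, stop_reason).
    """
    best_ms = float("inf")
    stale_rounds = 0
    rounds_used = 0
    for time_ms, compute_sol, memory_sol in measurements[:max_rounds]:
        rounds_used += 1
        # Count a round as progress only if it beats the best by min_gain.
        if time_ms < best_ms * (1.0 - min_gain):
            best_ms, stale_rounds = time_ms, 0
        else:
            stale_rounds += 1
        if classify_roofline(compute_sol, memory_sol) == "at_roofline":
            return best_ms, rounds_used, "roofline"
        if stale_rounds >= patience:
            return best_ms, rounds_used, "converged"
    return best_ms, rounds_used, "max_rounds"


improving = [(2.0, 30.0, 70.0), (1.5, 40.0, 85.0), (1.2, 45.0, 96.0)]
flat = [(2.0, 30.0, 70.0), (2.0, 30.0, 70.0), (2.0, 30.0, 70.0)]
improving_result = run_loop(improving)
flat_result = run_loop(flat)
```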

Usage

Gradio UI

python scripts/optimization_ui.py --port 8085

Programmatic API

Optimize a kernel using beam search — parallel exploration with top-N kernels and M bottleneck directions:

cd examples && python run_opt_manager.py \
  --kernel-dir optimize_01_matvec/ \
  --strategy beam_search \
  --max-rounds 5
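The beam search strategy can be pictured with a toy model: keep the top-N candidates, expand each along M directions, and re-rank. This sketch is not the code behind run_opt_manager.py. Here a "kernel" is just an integer block size, the two directions are made up, and `measure` is a synthetic latency function in place of real benchmarking.

```python
def beam_search(seed, directions, mutate, measure, beam_width=2, max_rounds=5):
    """Return the (score, kernel) pair with the lowest score found."""
    beam = [(measure(seed), seed)]
    for _ in range(max_rounds):
        # Keep current survivors so a round can never make the best result worse.
        candidates = list(beam)
        for _, kernel in beam:
            for direction in directions:
                child = mutate(kernel, direction)
                candidates.append((measure(child), child))
        candidates.sort(key=lambda pair: pair[0])
        beam = candidates[:beam_width]  # duplicates are not filtered in this sketch
    return beam[0]


def toy_mutate(block_size, direction):
    """Hypothetical directions: double or halve the block size."""
    return block_size * 2 if direction == "double" else max(1, block_size // 2)


def toy_measure(block_size):
    """Synthetic latency with its minimum at a block size of 128."""
    return abs(block_size - 128) + 1


best_score, best_kernel = beam_search(16, ("double", "halve"), toy_mutate, toy_measure)
```

In a real run each expansion would cost an LLM call plus verification and benchmarking, which is why the beam width and the number of directions are kept small.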

Key Components

| Component | Location | Role |
|---|---|---|
| OptimizationOrchestrator | triton_kernel_agent/opt_worker_component/orchestrator/ | Main optimization loop |
| KernelProfiler | triton_kernel_agent/opt_worker_component/profiling/ | NCU hardware profiling |
| BottleneckAnalyzer | triton_kernel_agent/opt_worker_component/prescribing/ | LLM-based bottleneck diagnosis |
| RooflineAnalyzer | kernel_perf_agent/kernel_opt/roofline/ | SOL classification and early stopping |
| Benchmark | triton_kernel_agent/opt_worker_component/benchmarking/ | CUDA event timing |

Optimization Artifacts

.optimize/workers/<worker_id>/<run_id>/artifacts
  kernel_round_0.py
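To walk through per-round kernels in order, the files can be sorted by their round number instead of by name. Only kernel_round_0.py appears in the listing above; that later rounds follow the same kernel_round_<n>.py pattern is an assumption, and `list_round_kernels` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed naming pattern, generalized from kernel_round_0.py.
ROUND_FILE = re.compile(r"kernel_round_(\d+)\.py$")


def list_round_kernels(artifacts_dir):
    """Return (round_number, path) pairs sorted numerically by round."""
    rounds = []
    for path in Path(artifacts_dir).iterdir():
        match = ROUND_FILE.match(path.name)
        if match:
            rounds.append((int(match.group(1)), path))
    # Numeric sort keeps round 10 after round 2, unlike a plain name sort.
    return sorted(rounds)
```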
