Helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
Events
- April 7, 2026: Helion General Availability Launch, Helion 1.0: A High-Level DSL for Performance-Portable Kernels @ PyTorch Conference Europe 2026, Paris, France
- April 7, 2026: Meetup, Meet the Developers of Helion @ PyTorch Conference Europe 2026, Paris, France
- June 15, 2026: Helion Tutorial, Writing Performance-Portable Kernels Simplified with Helion @ PLDI 2026, Boulder, CO
About
📚 View Documentation 📚 | 🎥 Watch Talk 🎥 | 🚀 Try In Colab 🚀 | Try In AMD DevCloud
Helion is a Python-embedded domain-specific language (DSL) for authoring machine learning kernels, designed to compile down to Triton, a performant backend for programming GPUs and other devices. Helion aims to raise the level of abstraction compared to Triton, making it easier to write correct and efficient kernels while enabling more automation in the autotuning process.
The name Helion refers to the nucleus of a helium-3 atom, just as Triton refers to the nucleus of a tritium (hydrogen-3) atom.
Helion can be viewed either as PyTorch with tiles or as a higher-level Triton. Compared to Triton, Helion reduces manual coding effort by relying on autotuning: it spends more time up front (roughly 10 minutes in the example below) evaluating hundreds of potential Triton implementations generated from a single Helion kernel. This larger search space also makes kernels more performance-portable across different hardware. Helion automates and autotunes over:
- Tensor Indexing:
  - Automatically calculates strides and indices.
  - Autotunes choices among various indexing methods (pointers, block pointers, TensorDescriptors).
  - Supports per-operation indexing strategies for fine-grained memory access control of loads and stores.
- Masking:
  - Most masking is implicit in Helion, and is optimized away when not needed.
- Grid Sizes and PID Calculations:
  - Automatically determines grid sizes.
  - Autotunes multiple mappings from Program IDs (PIDs) to data tiles.
- Implicit Search Space Definition:
  - Eliminates the need to manually define search configurations.
  - Automatically generates configuration flags and exploration spaces.
- Kernel Arguments Management:
  - Automates the handling of kernel arguments, including tensor sizes and strides.
  - Lifts global variables and (nested) closures into kernel arguments, allowing better templating.
- Looping Reductions:
  - Can automatically convert large reductions into looped implementations.
- Automated Optimizations:
  - PID swizzling for improved L2 cache reuse.
  - Loop reordering.
  - Persistent kernel strategies.
  - Warp specialization choices, unrolling, and more.
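To make one of these concrete: "PID swizzling for improved L2 cache reuse" can be pictured with a small pure-Python model of the grouped tile ordering popularized by the Triton matmul tutorial. This is only an illustration, not Helion's internal code; Helion picks among such PID-to-tile mappings automatically during autotuning.

```python
# Grouped PID swizzling (illustrative sketch, not Helion internals).
# Consecutive program IDs are mapped to tiles within a narrow band of
# rows, so tiles that reuse the same rows of the first input run close
# together in time and hit in L2.

def swizzle_pid(pid, num_pid_m, num_pid_n, group_m):
    """Map a linear program ID to a (pid_m, pid_n) tile coordinate."""
    num_pid_in_group = group_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_m
    # The last group may be ragged if num_pid_m is not divisible by group_m.
    group_size = min(num_pid_m - first_pid_m, group_m)
    pid_m = first_pid_m + (pid % num_pid_in_group) % group_size
    pid_n = (pid % num_pid_in_group) // group_size
    return pid_m, pid_n

# With a 4x4 tile grid and group_m=2, consecutive PIDs walk down a
# 2-row band before moving to the next column of tiles:
order = [swizzle_pid(pid, 4, 4, 2) for pid in range(16)]
```

Whether this mapping (and with what group size) beats a flat row-major launch depends on the hardware's cache, which is exactly why Helion leaves it to the autotuner.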
Example
A minimal matrix multiplication kernel in Helion looks like this:
```python
import torch, helion, helion.language as hl

@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    k, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc
    return out
```
The code outside the `for` loops is standard PyTorch code executed on the CPU. It is typically used for tasks like allocating output tensors and performing shape computations.

The code inside the `for` loops is compiled into a Triton kernel, resulting in a single GPU kernel. A single Helion kernel is always compiled to exactly one GPU kernel.

The `hl.tile` function subdivides the iteration space (in this case `m` by `n`) into tiles. These tiles are executed in parallel on the GPU. Tiling details, such as dimensionality (1D vs 2D), tile sizes, and loop ordering, are automatically determined by Helion's autotuner. Alternatively, these details can be explicitly specified using the `config=` argument of `helion.kernel`.
- The outer `for` loop is mapped onto the grid of the generated kernel. The grid size is determined automatically based on the chosen tile size.
- The inner `for` loop translates into a loop within the generated kernel, and its tile size is also determined automatically.
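The relationship between tile size, grid size, and the clamped final tile can be modeled in a few lines of plain Python. The helper name below is hypothetical, not Helion API; `hl.tile` hides all of this behind the autotuner.

```python
import math

def tile_ranges(size, block):
    """Cover [0, size) in blocks; the final tile is clamped when the
    size is not a multiple of the block (the masking Helion makes
    implicit)."""
    return [(start, min(start + block, size)) for start in range(0, size, block)]

# A 2048-wide dimension split into tiles of 64 yields 32 programs
# along that axis of the launch grid:
grid = math.ceil(2048 / 64)
tiles = tile_ranges(2048, 64)
```

For a ragged size like 100 with block 64, the last tile covers only the remaining 36 elements, which is where implicit masking kicks in.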
Within a Helion kernel, standard PyTorch operators (like `torch.addmm`) are automatically mapped to Triton operations using TorchInductor. Thus, familiarity with PyTorch means you already know most of Helion. Helion supports a wide range of operations, including pointwise ops (add, sigmoid, etc.), reductions (sum, softmax, etc.), views, and matrix multiplication. Arbitrary function calls within a Helion kernel are supported, but must be traceable with `make_fx`.
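As a readability aid, the loop structure of the matmul kernel above can be mimicked in plain Python (lists instead of tensors, no Helion or GPU involved). The outer tile loops correspond to the launch grid; the inner tile loop is the reduction loop inside the kernel.

```python
# Pure-Python reference for what the tiled matmul computes.
# This is only an illustration of the loop nesting; in Helion the
# same structure compiles to a single GPU kernel.

def matmul_tiled(x, y, block=2):
    m, k = len(x), len(x[0])
    k2, n = len(y), len(y[0])
    assert k == k2, "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    # Outer tile loops -> the kernel's launch grid (one program per tile).
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            # Inner tile loop -> the reduction loop inside the kernel.
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, m)):
                    for j in range(j0, min(j0 + block, n)):
                        for kk in range(k0, min(k0 + block, k)):
                            out[i][j] += x[i][kk] * y[kk][j]
    return out

result = matmul_tiled([[1, 2], [3, 4], [5, 6]],
                      [[7, 8, 9], [10, 11, 12]])
```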
Autotuning
The above example can be executed with:
```python
out = matmul(torch.randn([2048, 2048], device="cuda"),
             torch.randn([2048, 2048], device="cuda"))
```
When a kernel runs for the first time, Helion initiates autotuning. A typical autotuning session produces output similar to:
```text
[0s] Starting DifferentialEvolutionSearch with population=40, generations=20, crossover_rate=0.8
[20s] Initial population: failed=4 min=0.0266 mid=0.1577 max=1.2390 best=Config(block_sizes=[64, 32, 64], loop_orders=[[1, 0]], l2_groupings=[8], range_unroll_factors=[3, 1], range_warp_specializes=[True, False], range_num_stages=[1, 0], range_multi_buffers=[True, True], range_flattens=[None, False], num_warps=4, num_stages=7, indexing='block_ptr', pid_type='persistent_blocked')
[51s] Generation 2: replaced=17 min=0.0266 mid=0.0573 max=0.1331 best=Config(block_sizes=[64, 32, 64], loop_orders=[[1, 0]], l2_groupings=[8], range_unroll_factors=[3, 1], range_warp_specializes=[True, False], range_num_stages=[1, 0], range_multi_buffers=[True, True], range_flattens=[None, False], num_warps=4, num_stages=7, indexing='block_ptr', pid_type='persistent_blocked')
[88s] Generation 3: replaced=18 min=0.0225 mid=0.0389 max=0.1085 best=Config(block_sizes=[64, 64, 16], loop_orders=[[0, 1]], l2_groupings=[4], range_unroll_factors=[0, 1], range_warp_specializes=[None, None], range_num_stages=[0, 0], range_multi_buffers=[None, False], range_flattens=[None, None], num_warps=4, num_stages=6, indexing='pointer', pid_type='flat')
...
[586s] Generation 19: replaced=3 min=0.0184 mid=0.0225 max=0.0287 best=Config(block_sizes=[64, 64, 64], loop_orders=[[0, 1]], l2_groupings=[4], range_unroll_factors=[0, 1], range_warp_specializes=[None, False], range_num_stages=[0, 3], range_multi_buffers=[None, False], range_flattens=[None, None], num_warps=8, num_stages=6, indexing='block_ptr', pid_type='flat')
[586s] Autotuning complete in 586.6s after searching 1520 configs.
```
Because autotuning can be time-consuming (around 10 minutes in the above example), you may want to manually specify the best configuration found from autotuning to avoid repeated tuning:
```python
@helion.kernel(config=helion.Config(
    block_sizes=[64, 64, 64],
    loop_orders=[[0, 1]],
    l2_groupings=[4],
    range_unroll_factors=[0, 1],
    range_warp_specializes=[None, False],
    range_num_stages=[0, 3],
    range_multi_buffers=[None, False],
    range_flattens=[None, None],
    num_warps=8,
    num_stages=6,
    indexing='block_ptr',
    pid_type='flat'
))
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    ...
```
This explicit configuration skips autotuning on subsequent runs.
You can also specify multiple configurations, prompting Helion to perform a more lightweight autotuning process:
```python
@helion.kernel(configs=[
    helion.Config(...),
    helion.Config(...),
])
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    ...
```
In this case, Helion evaluates the provided configurations and selects the fastest one.
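The selection step can be pictured as "time each candidate, keep the best". The sketch below uses hypothetical helper names, not Helion's API; Helion performs an equivalent selection internally when `configs=[...]` is passed.

```python
import timeit

def pick_fastest(configs, run_with_config, repeats=3):
    """Time each config a few times and return the one with the best
    (minimum) observed run. Illustrative only -- real autotuners also
    handle compilation failures, warmup, and noise."""
    timings = {}
    for cfg in configs:
        timings[cfg] = min(timeit.repeat(lambda: run_with_config(cfg),
                                         number=10, repeat=repeats))
    return min(timings, key=timings.get)

# Toy stand-in: a "config" here is just a workload size, so the
# smaller one is obviously faster.
best = pick_fastest([200_000, 100], lambda n: sum(range(n)))
```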
Additionally, Helion provides programmatic APIs to manage autotuning and configurations directly from your code.
For production deployment, we recommend using ahead-of-time tuned configurations rather than relying on runtime autotuning. The autotuning process can be time-consuming and resource-intensive, making it unsuitable for production environments where predictable performance and startup times are critical.
Static shapes and autotuning keys
By default, Helion uses static shapes (`static_shapes=True`). This means each unique input shape/stride signature is treated as its own specialization and is autotuned separately. This typically yields the best performance, but may increase autotuning time when many distinct shapes are encountered.
If you want to reduce autotuning time by sharing configurations between different shapes, set `static_shapes=False`. In this mode, a tuned configuration is shared across input shapes, trading some per-shape performance for less tuning time.
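The trade-off can be pictured as a choice of cache key for tuned configurations. The sketch below is a hypothetical illustration of the idea, not Helion's internals.

```python
# Illustrative model of shape specialization (hypothetical, not Helion
# code). With static shapes, the cache key includes the concrete sizes,
# so each shape gets its own tuned config; without, one config is
# shared across shapes of the same rank.

def cache_key(shape, static_shapes=True):
    return tuple(shape) if static_shapes else len(shape)

shapes = [(2048, 2048), (1024, 1024), (512, 2048)]

specialized = {}  # static_shapes=True: one entry (and one tuning run) per shape
for shape in shapes:
    specialized.setdefault(cache_key(shape, static_shapes=True),
                           f"tuned-for-{shape}")

shared = {}  # static_shapes=False: one entry reused by every shape
for shape in shapes:
    shared.setdefault(cache_key(shape, static_shapes=False),
                      "one-shared-config")
```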