Cutlass
CUDA Templates and Python DSLs for High-Performance Linear Algebra

Overview
CUTLASS 4.4.2 - March 2026
CUTLASS is a collection of abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement. CUTLASS decomposes these "moving parts" into reusable, modular software components and abstractions.
Primitives for different levels of a conceptual parallelization hierarchy can be specialized and tuned via custom tiling sizes, data types, and other algorithmic policy. The resulting flexibility simplifies their use as building blocks within custom kernels and applications.
CUTLASS has been providing CUDA C++ template abstractions for high-performance linear algebra since 2017. These abstractions provide extensive support for a wide range of computations, including mixed-precision computations; specialized data-movement (async copy) and multiply-accumulate abstractions for FP64, FP32, TF32, FP16, BF16, and FP32 emulation via tensor core instructions; 8-bit floating point types (e5m2 and e4m3); block-scaled data types (NVIDIA NVFP4 and the OCP standard MXFP4, MXFP6, and MXFP8); narrow integer types (4- and 8-bit signed and unsigned integers); and binary 1-bit data types (where architectures allow native support of such data types), across NVIDIA's Volta, Turing, Ampere, Ada, Hopper, and Blackwell architectures.
To this rich ecosystem of C++ based kernel programming abstractions, CUTLASS 4 adds CUTLASS DSLs. These are Python native interfaces for writing high-performance CUDA kernels based on core CUTLASS and CuTe concepts without any performance compromises. This allows for a much smoother learning curve, orders of magnitude faster compile times, native integration with DL frameworks without writing glue code, and much more intuitive metaprogramming that does not require deep C++ expertise.
Overall we envision CUTLASS DSLs as a family of domain-specific languages (DSLs). With the release of 4.0, we are releasing the first of these in CuTe DSL. This is a low level programming model that is fully consistent with CuTe C++ abstractions — exposing core concepts such as layouts, tensors, hardware atoms, and full control over the hardware thread and data hierarchy.
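The central CuTe concept mentioned above is the layout: a (shape, stride) pair that maps a logical coordinate to a linear memory offset. As a rough pure-Python illustration of that idea only (this is not the CuTe DSL API itself), the mapping is the inner product of the coordinate with the strides:

```python
# Pure-Python sketch of CuTe's layout concept: a (shape, stride) pair
# maps a logical coordinate to a linear offset via coord . stride.
# Illustration only; not the actual CuTe DSL API.

def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: sum(c * s)."""
    return sum(c * s for c, s in zip(coord, stride))

# A 4x8 column-major layout: shape (4, 8), stride (1, 4).
shape, stride = (4, 8), (1, 4)

# Element (row=2, col=3) lives at offset 2*1 + 3*4 = 14.
print(layout_offset((2, 3), stride))  # 14

# The same shape with row-major strides (8, 1) puts it at 2*8 + 3*1 = 19.
print(layout_offset((2, 3), (8, 1)))  # 19
```

Composing and nesting such (shape, stride) pairs is what lets CuTe describe tilings and thread/data partitionings uniformly.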
CuTe DSL demonstrates optimal matrix multiply and other linear algebra operations targeting the programmable, high-throughput Tensor Cores implemented by NVIDIA's Ampere, Hopper, and Blackwell architectures.
We believe it will become an indispensable tool for students, researchers, and performance engineers alike — flattening the learning curve of GPU programming, rapidly prototyping kernel designs, and bringing optimized solutions into production.
CuTe DSL is currently in public beta and will graduate out of beta by end of summer 2025.
To get started quickly, please refer to the Quick Start Guide.
What's New in CUTLASS 4.4
CuTe DSL
New features
- CuTe DSL now supports CUDA toolkit 13.1!
- Set up with cutlass/python/CuTeDSL/setup.sh --cu13
- Refer to https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html for more details
- GB300 is now supported in CuTe DSL with CTK 13.1
- Refer to the SM103 batched 3xFP4 blockscaled GEMM kernel for an example
- cute.experimental: introduces a higher-level, composable layer on top of existing CuTe DSL APIs (not a separate abstraction), which can be mixed with existing CuTe DSL building blocks.
- Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments.
- Automatic TMA descriptor generation and update insertion.
- Automatic vectorization and predication for SIMT copies.
- New pipeline abstraction with convenience wrappers
- New Partition ops to simplify partitioning logic.
- Device-side TMA descriptor allocation, initialization, and management
- These examples can be found here https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/experimental
- Ahead of Time (AoT) compilation is now available!
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/cute/export for example usage
- JAX support: you can now use CuTe DSL together with JAX
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/jax for example usage
- Introduced versioning support in DSL:
- cutlass.version for a string representation of DSL version
- cutlass.CUDA_VERSION for a version class reporting the CUDA version used by the DSL
- Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
- Grouped GEMM example now supports device-only problem shapes.
- Grid carve-out is allowed without problem shapes being available on the host.
- TMA + LdMatrix features for loading and unpacking narrow-width types (refer to mixed_input_fmha_decode.py for example usage).
- It is now possible to customize epilogue fusion for persistent dense GEMM through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. A PyTorch evaluator is also provided to compare the results.
- CuTe DSL now supports Python 3.14 for both x86_64 and aarch64
- Runtime Pointer/Tensor/FakeTensor now supports cache_key, providing a stable, hashable representation that simplifies and improves compiled function caching.
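The cache_key feature above is meant to give compiled-function caches a stable, hashable description of a runtime argument. The intended usage pattern is roughly the one sketched below in pure Python (a generic illustration; ExpensiveCompiler and get_compiled are hypothetical names, not the DSL's API):

```python
# Sketch of compiled-function caching keyed by a stable, hashable key,
# the pattern that cache_key on Runtime Pointer/Tensor/FakeTensor
# enables. Generic illustration; ExpensiveCompiler is hypothetical.

class ExpensiveCompiler:
    """Stand-in for a JIT compile step we want to run once per key."""
    def __init__(self):
        self.compile_count = 0

    def compile(self, key):
        self.compile_count += 1
        return f"kernel-for-{key}"

_cache = {}
compiler = ExpensiveCompiler()

def get_compiled(arg_key):
    # arg_key plays the role of tensor.cache_key: a stable, hashable
    # description of the argument (dtype, shape, strides, ...), so two
    # structurally identical tensors hit the same cache entry.
    if arg_key not in _cache:
        _cache[arg_key] = compiler.compile(arg_key)
    return _cache[arg_key]

k1 = get_compiled(("f16", (128, 64), (64, 1)))
k2 = get_compiled(("f16", (128, 64), (64, 1)))  # cache hit, no recompile
print(compiler.compile_count)  # 1
print(k1 is k2)  # True
```

The point of a stable key is that caching no longer depends on object identity: any two arguments with the same structural description reuse the same compiled function.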
More examples of authoring peak-performance kernels
- SM103 batched 3xFP4 blockscaled GEMM kernel
- Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
- A new acc_scale grouped mixed input GEMM kernel variant is introduced to deliver better performance for decoding cases.
- All mixed_input_gemm examples are moved into a separate folder, mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py under the same folder.
Bug fixes and improvements
- Fixed an issue where both branches of an if were executed
- Fixed cute.printf with f-strings
- Fixed an indexing issue with scalar tensors
- Fixed small K reference check error for cta_tile_n = 256 case with overlapping accumulator optimization in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed a segfault issue with tvm-ffi on aarch64
- Fixed Hopper FMHA causal attention performance regression on CUDA toolkit 13.1 by optimizing mbarrier synchronization to avoid unnecessary convergence barriers.
- Fixed a kernel loading race condition when multiple GPUs are present in the same process in JAX.
API changes
- Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
- Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
- LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling init, to avoid ambiguity in data transposition.
- LdMatrix16x16x8bOp copy traits updated to be faithful to PTX without permutations. Permuted variant is renamed to LdMatrix16x8x8bOp.
- The grouped GEMM example takes the argument --host_problem_shape_available. If the argument is provided, the grid is carved out based upon the host problem shapes; otherwise, the maximum possible number of SMs is launched.
- hardware_info.get_max_active_cluster now supports passing in a specific stream to query. This is useful for green-context-based SM partitioning.
- group_bulk_copy_modes in the async bulk copy example is now deprecated; use group_modes directly instead.
- nvvm wrappers deprecate nvvm enum arguments; use str instead.
- cute.arch.calc_packed_f32x2_op changed its default from ftz enabled to ftz disabled
- In CuTe DSL with CTK 13.1, the following APIs in cutlass.cute.arch now require a string literal instead of an enum as argument:
- fence_proxy
- fence_view_async_tmem_op
- calc_packed_f32x2_op
- warp_redux_sync
- atomic_add
- atomic_and
- atomic_or
- atomic_xor
- atomic_max
- atomic_min
- atomic_exch
- atomic_cas
- store
- load
Use 'Advanced control file' for mixed input gemm examples for better performance.
- The advanced control file is an experimental feature of the CUDA compiler. The control file contains internal compiler settings tuned for specific kernels with a specific version of the CUDA toolkit, to generate better GPU kernel code. More details and documentation on how to create these control files will be provided in a future CUDA toolkit release. Note: an advanced compiler control file is not expected to work for kernels it was not tuned for. There is no compatibility guarantee, and the control file will not work with a different version of the CUDA toolkit.
CUTLASS C++
- Add example 93 for Blackwell low latency generation phase GQA kernel.
- Flash Decoding with cluster reduction.
- For kernel design details, please check the README.
- Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
- Add Hopper SM90 State Space
