CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
CUTLASS Academy
What is CUTLASS?
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is a collection of CUDA C++ templates and abstractions for implementing high-performance matrix-multiplication and related computations at all levels and scales within CUDA. CUTLASS provides:
- Threadblock-level abstractions for matrix multiply-accumulate operations
- Warp-level primitives for matrix multiply-accumulate operations
- Epilogue components for various activation functions and tensor operations
- Utilities for efficiently loading and storing tensors in memory
CUTLASS is designed to deliver high performance for deep learning and HPC applications, with a focus on matrix multiplication operations that are fundamental to neural networks.
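The hierarchy described above (tiles of the output computed by a mainloop, followed by an epilogue) can be sketched in plain host-side C++. This is only an illustration of the decomposition, not CUTLASS code: the tile sizes, the `ScaleAddRelu` functor, and the `tiled_gemm` function are hypothetical names chosen for this example, and the loops run serially on the CPU where a real kernel would map tiles to threadblocks and warps.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile sizes, chosen only for illustration.
constexpr std::size_t kTileM = 4;
constexpr std::size_t kTileN = 4;
constexpr std::size_t kTileK = 2;

// Hypothetical epilogue functor: applied to each accumulator before the store.
// Computes max(0, alpha * acc + beta * c), i.e. a scaled sum followed by ReLU.
struct ScaleAddRelu {
    float alpha;
    float beta;
    float operator()(float acc, float c) const {
        return std::max(0.0f, alpha * acc + beta * c);
    }
};

// D = epilogue(A * B, C) with all matrices stored row-major.
// A is M x K, B is K x N, C and D are M x N.
template <typename Epilogue>
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              const std::vector<float>& C,
                              std::size_t M, std::size_t N, std::size_t K,
                              Epilogue epilogue) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    std::vector<float> D(M * N, 0.0f);

    // Each (m0, n0) pair plays the role of one threadblock-level output tile.
    for (std::size_t m0 = 0; m0 < M; m0 += kTileM) {
        for (std::size_t n0 = 0; n0 < N; n0 += kTileN) {
            const std::size_t m_end = std::min(m0 + kTileM, M);
            const std::size_t n_end = std::min(n0 + kTileN, N);
            float acc[kTileM][kTileN] = {};  // per-tile accumulators

            // Mainloop: walk the K dimension one tile at a time.
            for (std::size_t k0 = 0; k0 < K; k0 += kTileK) {
                const std::size_t k_end = std::min(k0 + kTileK, K);
                for (std::size_t m = m0; m < m_end; ++m)
                    for (std::size_t n = n0; n < n_end; ++n)
                        for (std::size_t k = k0; k < k_end; ++k)
                            acc[m - m0][n - n0] += A[m * K + k] * B[k * N + n];
            }

            // Epilogue: combine accumulators with C and write the tile to D.
            for (std::size_t m = m0; m < m_end; ++m)
                for (std::size_t n = n0; n < n_end; ++n)
                    D[m * N + n] = epilogue(acc[m - m0][n - n0], C[m * N + n]);
        }
    }
    return D;
}
```

Calling `tiled_gemm(A, B, C, M, N, K, ScaleAddRelu{1.0f, 1.0f})` fuses the activation into the store step, which is the reason an epilogue is kept as a separate, swappable component.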
What is CUTE?
CUTE (CUDA Template Library for Tensors), usually written CuTe, is a modern C++ library that ships as a core component of CUTLASS and provides a flexible, composable way to describe and manipulate tensors of threads and data. CUTE introduces:
- A unified tensor abstraction that works across different hardware levels
- Powerful layout mapping capabilities for tensors
- Composable building blocks for tensor algorithms
- A more intuitive programming model for complex tensor operations
CUTE was introduced in CUTLASS 3.0 and represents a significant evolution in NVIDIA's approach to tensor computing.
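The layout idea can be illustrated without any GPU code. The sketch below assumes a simplified model in which a layout is a shape plus a stride per mode and acts as a function from a logical coordinate to a memory offset. `Layout2D`, `row_major`, `col_major`, and `row_sum` are hypothetical names for this example and are not the CUTE API, which supports nested shapes and much more.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a rank-2 layout (not the CUTE API).
// A layout here is a function: logical (row, col) -> linear offset.
struct Layout2D {
    std::array<std::size_t, 2> shape;   // (rows, cols)
    std::array<std::size_t, 2> stride;  // elements skipped per step in each mode

    // The offset is the inner product of the coordinate with the strides.
    std::size_t operator()(std::size_t row, std::size_t col) const {
        return row * stride[0] + col * stride[1];
    }
};

// Row-major: stepping one column moves 1 element, one row moves `cols` elements.
inline Layout2D row_major(std::size_t rows, std::size_t cols) {
    return Layout2D{{rows, cols}, {cols, 1}};
}

// Column-major: stepping one row moves 1 element, one column moves `rows` elements.
inline Layout2D col_major(std::size_t rows, std::size_t cols) {
    return Layout2D{{rows, cols}, {1, rows}};
}

// A layout-agnostic algorithm: sums one logical row, whatever the storage order.
inline float row_sum(const float* data, const Layout2D& layout, std::size_t row) {
    assert(row < layout.shape[0]);
    float sum = 0.0f;
    for (std::size_t col = 0; col < layout.shape[1]; ++col) {
        sum += data[layout(row, col)];
    }
    return sum;
}
```

Because `row_sum` only talks to the layout, the same loop works for row-major and column-major storage. Separating the algorithm from the data arrangement in this way is what makes layout-based building blocks composable.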
How do CUTLASS, CUTE, and CUDA relate?
- CUDA is the base programming model and platform for NVIDIA GPUs. It provides the fundamental parallel computing architecture and programming interface.
- CUTLASS is a library built on top of CUDA that provides optimized implementations of matrix operations.
- CUTE is the layout and tensor abstraction layer inside CUTLASS. The CUTLASS 3.x APIs are built on top of it, and it can also be used directly to simplify tensor programming in custom kernels.
Key Differences
| Feature | CUDA | CUTLASS | CUTE |
|---------|------|---------|------|
| Level of Abstraction | Low-level GPU programming | Matrix operation templates | High-level tensor abstractions |
| Focus | General GPU computing | Matrix multiplication primitives | Flexible tensor operations |
| Programming Model | Explicit thread/block management | Threadblock/warp abstractions | Layout-focused tensor abstractions |
| Optimization Control | Manual | Template-based | Layout-driven |
Resources
Documentation
CUTLASS Docs
- CUTLASS GitHub Repository
- CUTLASS Wiki Documentation (well-written Markdown pages with examples and images)
- CUTLASS Documentation
CUTE Docs
GTC
- CUTLASS: CUDA Template Library for Dense Linear Algebra at All Levels and Scales (GTC 2018)
- Programming Tensor Cores: Native Volta Tensor Cores with CUTLASS (GTC 2019)
- Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 (GTC 2020)
- Accelerating Convolution with Tensor Cores in CUTLASS (GTC 2021)
- Accelerating Backward Data Gradient by Increasing Tensor Core Utilization in CUTLASS (GTC 2022)
- CUTLASS: Python API, Enhancements, and NVIDIA Hopper (GTC 2022)
- Developing Optimal CUDA Kernels on Hopper Tensor Cores (GTC 2023)
- CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores (GTC 2024)
- GTC 2025: coming soon (CUTLASS in Python is also coming soon)
Articles
PyTorch
NVIDIA
- Implementing High Performance Matrix Multiplication Using CUTLASS v2.8
- CUTLASS: Fast Linear Algebra in CUDA C++
Colfax
- CUTLASS Tutorial: Fast Matrix-Multiplication with WGMMA on NVIDIA® Hopper™ GPUs
- Tutorial: Matrix Transpose in CUTLASS
- CUTLASS Tutorial: Persistent Kernels and Stream-K
- CUTLASS Tutorial: Mastering the NVIDIA® Tensor Memory Accelerator (TMA)
- Developing CUDA Kernels for GEMM on NVIDIA Hopper Architecture using CUTLASS
- A note on the algebra of CuTe Layouts
Miscellaneous
- Build and Develop CUTLASS CUDA Kernels (How to create a CUDA Docker container for CUTLASS kernel development)
- learn-cutlass
Videos
- Lecture 15: CUTLASS (GPU MODE)
- CUTLASS: A CUDA C++ Template Library for Accelerating Deep Learning Computations (The Linux Foundation)
- Lecture 36: CUTLASS and Flash Attention 3 (GPU MODE)
- CUTLASS: A Performant, Flexible, and Portable Way to Target Hopper Tensor Cores (GTC 2024)
Repos using CUTLASS/CUTE
