CUTLASS Academy

What is CUTLASS?

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is a collection of CUDA C++ templates and abstractions for implementing high-performance matrix-multiplication and related computations at all levels and scales within CUDA. CUTLASS provides:

Threadblock-level abstractions for matrix multiply-accumulate operations
Warp-level primitives for matrix multiply-accumulate operations
Epilogue components for various activation functions and tensor operations
Utilities for efficiently loading and storing tensors in memory

CUTLASS is designed to deliver high performance for deep learning and HPC applications, with a focus on matrix multiplication operations that are fundamental to neural networks.
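To make the decomposition above concrete, here is a minimal CPU-side sketch in plain Python of how a GEMM can be split into output tiles, a mainloop over K, and an epilogue. It is not CUTLASS code and uses none of its API: `tiled_gemm`, `relu`, and the `tile_m`/`tile_n`/`tile_k` parameters are hypothetical names chosen for illustration, and each output tile only stands in for the work a threadblock would own on a GPU.

```python
# Conceptual sketch only: plain Python, not CUTLASS.
# All names here (tiled_gemm, relu, tile_m, tile_n, tile_k) are hypothetical.

def relu(x):
    """Example epilogue: a ReLU activation applied per output element."""
    return x if x > 0 else 0.0


def tiled_gemm(A, B, tile_m=2, tile_n=2, tile_k=2, epilogue=lambda x: x):
    """Compute epilogue(A @ B) for list-of-lists matrices, one output tile at a time."""
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]

    # Each (m0, n0) output tile plays the role of one threadblock's work.
    for m0 in range(0, M, tile_m):
        for n0 in range(0, N, tile_n):
            rows = min(tile_m, M - m0)  # handle partial tiles at the edges
            cols = min(tile_n, N - n0)
            acc = [[0.0] * cols for _ in range(rows)]  # tile accumulators

            # Mainloop: step along K one tile at a time, multiply-accumulating.
            for k0 in range(0, K, tile_k):
                for i in range(rows):
                    for j in range(cols):
                        for k in range(k0, min(k0 + tile_k, K)):
                            acc[i][j] += A[m0 + i][k] * B[k][n0 + j]

            # Epilogue: transform the accumulators and store them to C.
            for i in range(rows):
                for j in range(cols):
                    C[m0 + i][n0 + j] = epilogue(acc[i][j])
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, -1], [2, 0]]
C_plain = tiled_gemm(A, B)
C_relu = tiled_gemm(A, B, epilogue=relu)
```

The tile sizes change only how the work is partitioned, not the result. On a GPU the same partitioning decides how much data each threadblock and warp reuses from fast memory, which is why these sizes are exposed as tuning parameters.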

What is CUTE?

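CUTE (CUDA Template Library for Tensors, styled CuTe in NVIDIA's documentation) is a modern C++ library that ships as part of CUTLASS and provides a more flexible and composable approach to tensor operations. CUTLASS 3.x is itself built on CUTE, not the other way around. CUTE introduces: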

A unified tensor abstraction that works across different hardware levels
Powerful layout mapping capabilities for tensors
Composable building blocks for tensor algorithms
A more intuitive programming model for complex tensor operations

CUTE was introduced in CUTLASS 3.0 and represents a significant evolution in NVIDIA's approach to tensor computing.
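The layout mapping idea can be illustrated with a small sketch: a layout pairs a shape with strides and maps a logical coordinate to a linear memory offset. The code below is plain Python, not the CUTE API. The `Layout` class and the `row_major`/`col_major` helpers are hypothetical stand-ins that cover only flat shapes, whereas CUTE's real layouts also support nested (hierarchical) shapes and strides.

```python
# Conceptual sketch only: plain Python, not the CUTE API.
# Layout, row_major, and col_major are hypothetical illustration names.

class Layout:
    """Maps a logical coordinate to a linear offset via per-mode strides."""

    def __init__(self, shape, stride):
        if len(shape) != len(stride):
            raise ValueError("shape and stride must have the same rank")
        self.shape = tuple(shape)
        self.stride = tuple(stride)

    def __call__(self, *coord):
        if len(coord) != len(self.shape):
            raise ValueError("coordinate rank does not match layout rank")
        for c, extent in zip(coord, self.shape):
            if not 0 <= c < extent:
                raise IndexError("coordinate out of bounds")
        # offset = sum over modes of coord[i] * stride[i]
        return sum(c * s for c, s in zip(coord, self.stride))


def row_major(m, n):
    return Layout((m, n), (n, 1))  # consecutive columns are adjacent in memory


def col_major(m, n):
    return Layout((m, n), (1, m))  # consecutive rows are adjacent in memory


# The same logical coordinate lands at different offsets under each layout.
row = row_major(3, 4)
col = col_major(3, 4)
offset_row = row(1, 2)
offset_col = col(1, 2)
```

Because an algorithm written against logical coordinates does not need to know which layout is in use, swapping the layout changes the memory access pattern without touching the algorithm. That separation is the design choice the list above calls layout mapping.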

How do CUTLASS, CUTE, and CUDA relate?

CUDA is the base programming model and platform for NVIDIA GPUs. It provides the fundamental parallel computing architecture and programming interface.
CUTLASS is a library built on top of CUDA that provides optimized implementations of matrix operations.
CUTE is the layout and tensor library that ships inside CUTLASS. CUTLASS 3.x builds its kernels on it, and it can also be used directly to simplify tensor programming.

Key Differences

| Feature | CUDA | CUTLASS | CUTE |
|---------|------|---------|------|
| Level of Abstraction | Low-level GPU programming | Matrix operation templates | High-level tensor abstractions |
| Focus | General GPU computing | Matrix multiplication primitives | Flexible tensor operations |
| Programming Model | Explicit thread/block management | Threadblock/warp abstractions | Layout-focused tensor abstractions |
| Optimization Control | Manual | Template-based | Layout-driven |

Resources

Documentation

CUTLASS Docs

CUTLASS GitHub Repository
CUTLASS Wiki Documentation (well-written Markdown pages with examples and images)
CUTLASS Documentation

CUTE Docs

GTC

GTC 2025: coming soon (CUTLASS in Python is coming soon!)

Articles

PyTorch

Deep Dive on CUTLASS Ping-Pong GEMM Kernel

NVIDIA

Colfax

Miscellaneous

Build and Develop CUTLASS CUDA Kernels (How to create a CUDA Docker container for CUTLASS kernel development)
learn-cutlass
