
LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Install / Use

/learn @xlite-dev/LeetCUDA

README

<div align="center"> <p align="center"> <h2>📚 LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners 🐑</h2> <img src='https://github.com/user-attachments/assets/b2578723-b7a7-4d8f-bcd1-5008947b808a' width="360" height="56" > <a href="https://hellogithub.com/repository/98348655a96640ca8ddcbc298edc901d" target="_blank"><img src="https://api.hellogithub.com/v1/widgets/recommend.svg?rid=98348655a96640ca8ddcbc298edc901d&claim_uid=ofSCbzTmdeQk3FD&theme=dark" alt="Featured|HelloGitHub" style="width: 250px; height: 54px;" width="250" height="54" /></a> </p> <div align='center'> <img src=https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg > <img src=https://img.shields.io/badge/Language-CUDA-brightgreen.svg > <img src=https://img.shields.io/github/forks/xlite-dev/LeetCUDA.svg?style=dark > <img src=https://img.shields.io/github/stars/xlite-dev/LeetCUDA.svg?style=dark > <img src=https://img.shields.io/badge/License-GPLv3.0-turquoise.svg > </div> </div>

📚 LeetCUDA covers Tensor/CUDA Cores, TF32/F16/BF16/F8, 📖200+ CUDA Kernels🔥 with PyTorch, 📖100+ LLM/CUDA🔥 blogs, 📖HGEMM⚡️ that achieves 98%~100% of cuBLAS TFLOPS, and 📖flash-attn⚡️ built on Tensor Cores with pure MMA PTX. ♥️ Please consider leaving a ⭐️ Star to support this project, my bro ~ ♥️

<div align="center"> <p align="center"> <a href="#contribute">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 🎉🎉</a> <br> <a href=https://github.com/xlite-dev/LeetCUDA/graphs/contributors > <img src=https://opencollective.com/leetcuda/contributors.svg height=40px > </a> </p> </div>

©️Citations🎉🎉

@misc{LeetCUDA@2025,
  title={LeetCUDA: A Modern CUDA Learn Notes with PyTorch for Beginners},
  url={https://github.com/xlite-dev/LeetCUDA.git},
  note={Open-source software available at https://github.com/xlite-dev/LeetCUDA.git},
  author={DefTruth and Many Others},
  year={2025}
}

📖 News 🔥🔥

<div id="news"></div>
  • [2026/03] Cache-DiT 🎉v1.3.0 is released; the major updates include: Ring Attention w/ batched P2P, USP (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥USP + TP), and reduced VAE-P communication overhead.

arch

<div align='center'> <img height="320px" alt="image" src="https://github.com/user-attachments/assets/ed30185b-2e11-4293-832f-43e9003d6ad9" /> </div>
  • [2024/12]: ⚡️HGEMM is released! HGEMM written from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak🎉 performance.

📖 Contents

<div id="contents"></div>

📖 HGEMM Benchmark 🎉🎉

<div id="HGEMM-bench"></div>

Currently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, the HGEMM (WMMA/MMA/CuTe) kernels in this repo (blue🔵) can achieve 98%~100% of the performance of cuBLAS's default Tensor Cores algorithm (orange🟠). Please check the toy-hgemm library⚡️⚡️ or the HGEMM⚡️⚡️ repo for more details.

toy-hgemm-library

|📚Feature|📚Feature|📚Feature|📚Feature|
|:---:|:---:|:---:|:---:|
|✔️CUDA/Tensor Cores|✔️Loop over K|✔️Tile Block (BMxBK)|✔️Tile Threads (T 8x8)|
|✔️WMMA (m16n16k16)|✔️MMA (m16n8k16)|✔️Pack LDST (128 bits)|✔️SMEM Padding|
|✔️Copy Async|✔️Tile MMAs|✔️Tile Warps|✔️Multi Stages (2~4)|
|✔️Register Double Buffers|✔️Block Swizzle|✔️Warp Swizzle|✔️SMEM Swizzle (CuTe/MMA)|
|✔️Collective Store (Shfl)|✔️Layout NN|✔️Layout TN|✔️SGEMM FP32/TF32|
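To make the feature list above concrete, here is a minimal, hedged sketch of the starting point such kernels build on: a naive WMMA (m16n16k16) HGEMM in which one warp computes one 16x16 output tile. It is not the repo's optimized multi-stage kernel, and the kernel name is illustrative; it assumes row-major NN layout and dimensions that are multiples of 16.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Naive WMMA HGEMM sketch: C = A @ B, all row-major (NN layout).
// One warp computes one 16x16 tile of C. Assumes M, N, K % 16 == 0.
__global__ void hgemm_wmma_naive(const half* A, const half* B, half* C,
                                 int M, int N, int K) {
  // Map each warp to a 16x16 tile of C.
  int warpM = blockIdx.y * blockDim.y + threadIdx.y;       // tile row index
  int warpN = (blockIdx.x * blockDim.x + threadIdx.x) / 32; // tile col index

  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
  wmma::fill_fragment(c_frag, __float2half(0.0f));

  // Loop over K in 16-wide steps: c_frag += A_tile * B_tile.
  for (int k = 0; k < K; k += 16) {
    wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
    wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
  }
  wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                          wmma::mem_row_major);
}
```

The listed features (tiling, copy async, multi stages, swizzling, collective store) are successive optimizations layered on top of this basic warp-to-tile mapping to hide memory latency and avoid bank conflicts.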

📖 FA2-MMA Benchmark 🎉🎉

<div id="fa-mma-bench"></div>

I have also implemented FlashAttention-2 using pure MMA PTX instructions, with support for Multi-Stages, Tile MMA, Tile Warp, Shared KV SMEM, Fully Shared QKV SMEM, Prefetch Q s2r, Prefetch K/V g2s, QKV Fine-grained Tiling, Collective Store and more. Please refer to flash-attn⚡️⚡️ for details.

flash-attn-mma

|📚Feature|📚Feature|📚Feature|📚Feature|
|:---:|:---:|:---:|:---:|
|✔️Tensor Cores|✔️Loop over N/D|✔️Tile Block (Br, Bc)|✔️MMA (m16n8k16)|
|✔️Pack LDST (128 bits)|✔️SMEM Swizzle/Padding|✔️Copy Async|✔️Tile MMAs|
|✔️Tile Warps|✔️Multi Stages (1/2)|✔️Collective Store (Shfl)|✔️Split KV/Q|
|✔️Shared QKV SMEM|✔️Prefetch Q s2r|✔️Prefetch KV g2s|✔️QKV Fine-grained Tiling|
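The MMA (m16n8k16) building block in the table is a single Tensor Core instruction issued via inline PTX. As a hedged sketch (the repo's actual wrappers and register layouts may differ), one f16-accumulate HMMA tile looks roughly like this: A takes 4 x 32-bit registers, B takes 2, and the f16 C/D accumulators take 2 each.

```cuda
#include <cstdint>

// One m16n8k16 Tensor Core MMA: D = A * B + C, all operands f16,
// issued via inline PTX. Register counts follow the PTX ISA layout:
// A fragment = 4 regs, B fragment = 2 regs, C/D (f16) = 2 regs each.
__device__ __forceinline__ void hmma_m16n8k16_f16(
    uint32_t* RD,        // out: 2 regs, 16x8 f16 accumulator
    const uint32_t* RA,  // in:  4 regs, 16x16 f16 A fragment (row)
    const uint32_t* RB,  // in:  2 regs, 16x8  f16 B fragment (col)
    const uint32_t* RC)  // in:  2 regs, f16 accumulator input
{
  asm volatile(
      "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
      "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
      : "=r"(RD[0]), "=r"(RD[1])
      : "r"(RA[0]), "r"(RA[1]), "r"(RA[2]), "r"(RA[3]),
        "r"(RB[0]), "r"(RB[1]),
        "r"(RC[0]), "r"(RC[1]));
}
```

The Tile MMA / Tile Warp features then amount to choosing how many of these 16x8 tiles each warp iterates over, and how the warps of a block are laid out across the (Br, Bc) attention tile.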

Currently, for small-scale attention (B<=4, H<=48, SeqLen<=8192, D<=64) it can run faster than FA2/SDPA on some devices. For example, on an NVIDIA RTX 3080 Laptop, the 📚 Split Q + Fully Shared QKV SMEM method can achieve 55 TFLOPS (D=64), almost ~1.5x 🎉 faster than FA2. On an NVIDIA L20, the 🤖ffpa-attn method can achieve 104 TFLOPS (D=512), almost ~1.8x 🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention a performance gap remains. Stay tuned for updates ~ (MMA Acc F16/F32, softmax Acc F32 vs FA2 MMA/softmax Acc F32, 👇Benchmark)

|Algorithm|(B,H,N,D)|RTX 3080 Laptop|L20|RTX 4090|
|:---:|:---:|:---:|:---:|:---:|
|FlashAttention-2|(1,8,8192,64)|37 TFLOPS|100 TFLOPS|145 TFLOPS|
|share-qkv+stage2|(1,8,8192,64)|55 TFLOPS|99 TFLOPS|221 TFLOPS|
|FlashAttention-2|(1,48,8192,64)|37 TFLOPS|109 TFLOPS|163 TFLOPS|
|share-qkv+stage2|(1,48,8192,64)|48 TFLOPS|107 TFLOPS|224 TFLOPS|
|SDPA (EFFICIENT ATTENTION)|(1,48,8192,512)|16 TFLOPS|58 TFLOPS|85 TFLOPS|
|🤖ffpa-attn|(1,48,8192,512)|39 TFLOPS|104 TFLOPS|200 TFLOPS|
|Precision Errors vs FA2/SDPA| / |max: < ~1e-3|min: ~0.0|mean: < ~1e-5|

Both the Split KV and Split Q implementations are provided in flash-attn⚡️⚡️ for performance comparison. The Split KV method, which splits all of Q, K and V across MMAs (Warps), is slower than the Split Q method, which splits only Q across MMAs (Warps) while keeping K and V accessible to all MMAs (Warps).

  • 📚 Split KV (Basic, FlashAttention-1)
<div id="mma-split-kv"></div>
// Split QKV across MMA(Warps) using naive matmul MMA&Warp tiling policy.
// case: The layout of 8 MMA(2x4)  [after] kWarpTileSeqLenQxkWarpTileSeqLenK(2x2) -> 32x2,32x2=64x64:
// |  [64,64]  |    warp_KV 0    |    warp_KV 1    |    warp_KV 2    |    warp_KV 3    |
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
// | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_kv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q (Faster, FlashAttention-2)
<div id="mma-split-q"></div>
// Split Q across MMA(Warps) and keep access KV for all MMA(Warps),
// in order to reduce the comm between warps via smem and warp shuffle.
// case: MMA = m16n8k16, Br=16x4=64, Bc=8x8=64, layout: 4 warps
// |   64x64   |      warp_KV 0       |
// | warp_QP 0 | MMA 0 ... MMA 0 (x8) |
// | warp_QP 1 | MMA 1 ... MMA 1 (x8) |
// | warp_QP 2 | MMA 2 ... MMA 2 (x8) |
// | warp_QP 3 | MMA 3 ... MMA 3 (x8) |
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Shared KV SMEM (1/2 SRAM vs FA2)
<div id="mma-share-kv"></div>
// K and V share the same shared memory, improving block occupancy.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_kv_kernel(half* Q, half* K, half* V, half* O, ...);
  • 📚 Split Q + Fully Shared QKV SMEM (1/4 SRAM vs FA2)
<div id="mma-share-qkv"></div>
// Q, K and V fully share the same shared memory, with Q prefetched s2r,
// improving block occupancy and reducing Q SMEM IO-access.
__global__ void // Q, K, V, O -> [B, H, N, D]
flash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q, half* K, half* V, half* O, ...);
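The Split Q scheme above implies a particular launch geometry: one thread block per Q tile, per head, per batch, with one warp per 16-row MMA slice of the tile. The sketch below illustrates that geometry only; the kernel name, signature and the Br/warp-count values are hypothetical, not the repo's exact API.

```cuda
#include <cuda_fp16.h>

// Illustrative stand-in for a Split-Q attention kernel; the real kernels
// above take additional tiling parameters elided here.
__global__ void flash_attn_split_q_demo(const half* Q, const half* K,
                                        const half* V, half* O,
                                        int B, int H, int N, int D) {
  // ... attention tile computation elided ...
}

// Hypothetical host-side launcher showing the Split-Q grid/block mapping:
// grid.x walks the Q tiles of one sequence, grid.y the heads, grid.z the batch.
void launch_flash_attn_split_q(const half* Q, const half* K, const half* V,
                               half* O, int B, int H, int N, int D) {
  constexpr int Br = 64;               // Q rows per block (4 warps x m16)
  dim3 grid((N + Br - 1) / Br, H, B);  // one block per (Q tile, head, batch)
  dim3 block(4 * 32);                  // 4 warps, one per 16-row slice of Br
  flash_attn_split_q_demo<<<grid, block>>>(Q, K, V, O, B, H, N, D);
}
```

Because every warp in the block iterates over the full K/V sequence for its own slice of Q, no inter-warp reduction over the softmax statistics is needed, which is exactly the communication saving the Split Q comment above describes.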
