Lectures
Material for gpu-mode lectures
Supplementary Material for Lectures
The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)
Lecture 1: Profiling and Integrating CUDA kernels in PyTorch
- Speaker: Mark Saroufim
- Notebook and slides in lecture_001 folder
Lecture 2: Recap Ch. 1-3 from the PMPP book
- Speaker: Andreas Koepf
- Slides: The PowerPoint file lecture_002/cuda_mode_lecture2.pptx is in this repository (path relative to the root directory). Also available as a Google Docs presentation.
Lecture 3: Getting Started With CUDA
- Speaker: Jeremy Howard
- Notebook: See the lecture_003 folder, or run the Colab version
Lecture 4: Intro to Compute and Memory Architecture
- Speaker: Thomas Viehmann
- Notebook and slides in the lecture_004 folder.
Lecture 5: Going Further with CUDA for Python Programmers
- Speaker: Jeremy Howard
- Notebook in the lecture_005 folder.
Lecture 6: Optimizing PyTorch Optimizers
Lecture 7: Advanced Quantization
- Speaker: Charles Hernandez
- Slides
Lecture 8: CUDA Performance Checklist
- Speaker: Mark Saroufim
- Code in the lecture_008 folder
- Slides
Lecture 9: Reductions
- Speaker: Mark Saroufim
- Code in the lecture_009 folder
- Slides
Lecture 10: Build a Prod Ready CUDA Library
- Speaker: Oscar Amoros Huguet
- Slides
Lecture 11: Sparsity
Lecture 12: Flash Attention
- Speaker: Thomas Viehmann
- Code in the lecture_012 folder
Lecture 13: Ring Attention
- Speaker: Andreas Koepf
- Slides
Lecture 14: Practitioner's Guide to Triton
Lecture 15: CUTLASS
- Speaker: Eric Auld
Lecture 16: Hands-On Profiling
- Speaker: Taylor Robbie
Bonus Lecture: CUDA C++ llm.cpp
- Speaker: Jake Hemstad & Georgii Evtushenko
- Slides
Lecture 17: GPU Collective Communication (NCCL)
- Speaker: Dan Johnson
- Code in the lecture_017 folder
Lecture 18: Fused Kernels
- Speaker: Kapil Sharma
- Code in the lecture_018 folder
Lecture 19: Data Processing on GPUs
- Speaker: Devavret Makkar
Lecture 20: Scan Algorithm
- Speaker: Izzat El Haj
- Slides
Lecture 21: Scan Algorithm Part 2
- Speaker: Izzat El Haj
- Slides
Lecture 22: Hacker's Guide to Speculative Decoding in VLLM
- Speaker: Cade Daniel
- Slides
Lecture 23: Tensor Cores
- Speaker: Vijay Thakkar & Pradeep Ramani
- Slides
Lecture 24: Scan at the Speed of Light
- Speaker: Jake Hemstad & Georgii Evtushenko
Lecture 25: Speaking Composable Kernel
- Speaker: Haocong Wang
- Slides
Lecture 26: SYCL MODE (Intel GPU)
- Speaker: Patric Zhao
- Slides
Lecture 27: gpu.cpp
- Speaker: Austin Huang
- Slides
Lecture 28: Liger Kernel
Lecture 29: Triton Internals
- Speaker: Kapil Sharma
- Code/presentation in the lecture_029 folder
Lecture 30: Quantized training
- Speaker: Thien Tran
- Code/presentation in the lecture_030 folder
Lecture 31: Beginners Guide to Metal Kernels
- Speaker: Nikita Shulga
- Code/presentation in the lecture_031 folder
Lecture 32: Unsloth - LLM Systems Engineering
- Speaker: Daniel Han
- Slides
Lecture 33: BitBLAS
- Speaker: Wang Lei
- Code/presentation in the lecture_033 folder
Lecture 34: Low Bit Triton Kernels
- Speaker: Hicham Badri
- Slides
Lecture 35: SGLang Performance Optimization
- Speaker: Yineng Zhang
- Slides
Lecture 36: CUTLASS and Flash Attention 3
Lecture 37: Introduction to SASS & GPU Microarchitecture
- Speaker: Arun Demeure
- Slides
Lecture 38: Lowbit kernels for ARM CPU
Lecture 39: TorchTitan
- Speaker: Mark Saroufim and Tianyu Liu
Lecture 40: Flash Infer
- Speaker: Zihao Ye
Lecture 41: CUDA Docs for Humans
- Speaker: Charles Frye
- Slides
Lecture 42: Mosaic GPU
- Speaker: Adam Paszke
Lecture 43:
- Speaker: Erik Schultheis
- Slides
Lecture 57: CuTE
- Speaker: Cris Cecka
- Slides
Lecture 67: NCCL & NVSHMEM
Lecture 69: Quartet 4 bit training
- Speakers: Roberto Castro and Andrei Panferov
- Code: https://github.com/IST-DASLab/Quartet and https://github.com/isT-DASLab/qutlass
- Paper
Lecture 70: Fault tolerant communication collectives
- Speaker: mike64_t
- Slides
Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use
Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models
- Speaker: Guangxuan Xiao
- Slides
Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention
- Speaker: Songlin Yang
- Slides
Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens
- Speaker 1: William Brandon
- Speaker 2:
