Lectures

Material for gpu-mode lectures

Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Speaker: Mark Saroufim
Notebook and slides in lecture_001 folder

Lecture 2: Recap Ch. 1-3 from the PMPP book

Speaker: Andreas Koepf
Slides: The PowerPoint file lecture_002/cuda_mode_lecture2.pptx is in this repository; also available as a Google Docs presentation.

Lecture 3: Getting Started With CUDA

Speaker: Jeremy Howard
Notebook: See the lecture_003 folder, or run the Colab version

Lecture 4: Intro to Compute and Memory Architecture

Speaker: Thomas Viehmann
Notebook and slides in the lecture_004 folder.

Lecture 5: Going Further with CUDA for Python Programmers

Speaker: Jeremy Howard
Notebook in the lecture_005 folder.

Lecture 6: Optimizing PyTorch Optimizers

Speaker: Jane Xu
Slides

Lecture 7: Advanced Quantization

Speaker: Charles Hernandez
Slides

Lecture 8: CUDA Performance Checklist

Speaker: Mark Saroufim
Code in the lecture_008 folder
Slides

Lecture 9: Reductions

Speaker: Mark Saroufim
Code in the lecture_009 folder
Slides

Lecture 10: Build a Prod Ready CUDA Library

Speaker: Oscar Amoros Huguet
Slides

Lecture 11: Sparsity

Speaker: Jesse Cai
Slides

Lecture 12: Flash Attention

Speaker: Thomas Viehmann
Code in the lecture_012 folder

Lecture 13: Ring Attention

Speaker: Andreas Koepf
Slides

Lecture 14: Practitioner's Guide to Triton

Date: 2024-04-13, Speaker: Umer Adil
Notebook

Lecture 15: CUTLASS

Speaker: Eric Auld

Lecture 16: Hands-On Profiling

Speaker: Taylor Robbie

Bonus Lecture: CUDA C++ llm.cpp

Speakers: Jake Hemstad & Georgii Evtushenko
Slides

Lecture 17: GPU Collective Communication (NCCL)

Speaker: Dan Johnson
Code in the lecture_017 folder

Lecture 18: Fused Kernels

Speaker: Kapil Sharma
Code in the lecture_018 folder

Lecture 19: Data Processing on GPUs

Speaker: Devavret Makkar

Lecture 20: Scan Algorithm

Speaker: Izzat El Haj
Slides

Lecture 21: Scan Algorithm Part 2

Speaker: Izzat El Haj
Slides

Lecture 22: Hacker's Guide to Speculative Decoding in VLLM

Speaker: Cade Daniel
Slides

Lecture 23: Tensor Cores

Speakers: Vijay Thakkar & Pradeep Ramani
Slides

Lecture 24: Scan at the Speed of Light

Speakers: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

Speaker: Haocong Wang
Slides

Lecture 26: SYCL MODE (Intel GPU)

Speaker: Patric Zhao
Slides

Lecture 27: gpu.cpp

Speaker: Austin Huang
Slides

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Speaker: Kapil Sharma
Code/presentation in the lecture_029 folder

Lecture 30: Quantized training

Speaker: Thien Tran
Code/presentation in the lecture_030 folder

Lecture 31: Beginners Guide to Metal Kernels

Speaker: Nikita Shulga
Code/presentation in the lecture_031 folder

Lecture 32: Unsloth - LLM Systems Engineering

Speaker: Daniel Han
Slides

Lecture 33: BitBLAS

Speaker: Wang Lei
Code/presentation in the lecture_033 folder

Lecture 34: Low Bit Triton Kernels

Speaker: Hicham Badri
Slides

Lecture 35: SGLang Performance Optimization

Speaker: Yineng Zhang
Slides

Lecture 36: CUTLASS and Flash Attention 3

Speaker: Jay Shah
Slides

Lecture 37: Introduction to SASS & GPU Microarchitecture

Speaker: Arun Demeure
Slides

Lecture 38: Lowbit kernels for ARM CPU

Speaker: Scott Roy
Slides

Lecture 39: TorchTitan

Speakers: Mark Saroufim and Tianyu Liu

Lecture 40: Flash Infer

Speaker: Zihao Ye

Lecture 41: CUDA Docs for Humans

Speaker: Charles Frye
Slides

Lecture 42: Mosaic GPU

Speaker: Adam Paszke

Lecture 43:

Speaker: Erik Schultheis
Slides

Lecture 57: CuTE

Speaker: Cris Cecka
Slides

Lecture 67: NCCL & NVSHMEM

Speaker: Jeff Hammond
Slides
Code

Lecture 69: Quartet 4 bit training

Speakers: Roberto Castro and Andrei Panferov
Code: https://github.com/IST-DASLab/Quartet and https://github.com/isT-DASLab/qutlass
Paper

Lecture 70: Fault tolerant communication collectives

Speaker: mike64_t
Slides

Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use

Speaker: Sewon Min
Slides

Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models

Speaker: Guangxuan Xiao
Slides

Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention

Speaker: Songlin Yang
Slides

Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens

Speaker 1: William Brandon
- Slides 1
Speaker 2:
