# 100Days: GPU Kernels
## Project Progress and Tasks
Bro in CUDA 📗: https://github.com/a-hamdi/cuda
Mentor 🚀: https://github.com/hkproj | https://github.com/hkproj/100-days-of-gpu
### Mandatory and Optional Tasks
| Day | Task Description |
|-----|------------------|
| D15 | Mandatory FA2-Forward: Implement the forward pass for FA2 (e.g., as a custom neural network layer). |
| D20 | Mandatory FA2-Backwards: Implement the backward pass for FA2 (gradient computation). |
| D20 | Optional Fused Chunked CE Loss + Backwards: Fused implementation of chunked cross-entropy loss with backward pass. Can use Liger Kernel as a reference implementation. |
### Project Progress by Day
| Day | Files & Summaries |
|-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| day1 | printAdd.cu: Print global indices for 1D vector (index calculation).<br>addition.cu: GPU vector addition; basics of memory allocation/host-device transfer. |
| day2 | function.cu: Use a `__device__` function in a kernel; per-thread calculations. |
| day3 | addMatrix.cu: 2D matrix addition; map row/column indices to threads.<br>anotherMatrix.cu: Transform matrices with custom function; 2D index operations. |
| day4 | layerNorm.cu: Layer normalization using shared memory; mean/variance computation. |
| day5 | vectorSumTricks.cu: Parallel vector sum via reduction; shared memory optimizations. |
| day6 | SMBlocks.cu: Retrieve SM ID per thread via inline PTX.<br>SoftMax.cu: Shared-memory softmax; split exponent/normalization steps.<br>TransposeMatrix.cu: Matrix transpose via index swapping.<br>ImportingToPython/rollcall.cu: Python-CUDA integration.<br>AdditionKernel/additionKernel.cu: Modify PyTorch tensors in CUDA. |
| day7 | naive.cu: Naive matrix multiplication.<br>matmul.cu: Tiled matmul with shared memory.<br>conv1d.cu: 1D convolution with shared memory.<br>pythontest.py: Validate custom convolution against PyTorch. |
| day8 | pmpbook/chapter3matvecmul.cu: Matrix-vector multiplication.<br>pmpbook/chapter3ex.cu: Benchmarks different matrix add kernels.<br>pmpbook/deviceinfo.cu: Prints device properties.<br>pmpbook/color2gray.cu: Convert RGB to grayscale.<br>pmpbook/vecaddition.cu: Another vector addition example.<br>pmpbook/imageblur.cu: Simple image blur.<br>selfAttention/selfAttention.cu: Self-attention kernel with online softmax. |
| day9 | flashAttentionFromTut.cu: Minimal Flash Attention kernel with shared memory tiling.<br>bind.cpp: Torch C++ extension bindings for Flash Attention.<br>test.py: Tests the minimal Flash Attention kernel against a manual softmax-based attention for comparison. |
| day10 | ppmbook/matrixmul.cu: Matrix multiplication using CUDA.<br>setup.py: Torch extension build script for CUDA code (FlashAttention).<br>FlashAttention.cu: Example Flash Attention CUDA kernel.<br>FlashAttention.cpp: Torch bindings for the Flash Attention kernel.<br>test.py: Manual vs. CUDA-based attention test.<br>linking/test.py: Builds simple CUDA kernel for testing linking.<br>linking/simpleKernel.cpp: Torch extension binding for a simple CUDA kernel.<br>linking/simpleKernel.cu: Simple CUDA kernel that increments a tensor. |
| day11 | FlashTestPytorch/: Custom Flash Attention in PyTorch, tests and benchmarks.<br>testbackward.py: Gradient comparison between custom CUDA kernels and PyTorch. |
| day12 | softMax.cu: Additional softmax kernel with shared memory optimization.<br>NN/kernels.cu: Tiled kernel implementation and layer initialization.<br>tileMatrix.cu: Demonstrates tile-based matrix operations. |
| day13 | RMS.cu: RMS kernel (V1) with naive sum-of-squares approach.<br>RMSBetter.cu: RMS kernel (V2) using warp-reduce, float4 vectorized loads, and other optimizations (the warp-reduce pattern is sketched after this table).<br>binding.cpp: Torch bindings for RMS kernels.<br>test.py: Tests and benchmarks RMS kernels vs PyTorch. |
| day14 | FA2/flash.cu & kernels.cu: Second iteration of Flash Attention featuring partial forward/backward logic. <br>helper.cuh: Utility functions and warp-reduce helpers. <br>conv.cu: Basic 2D convolution with shared memory. |
| day15 | Attention.cu: Single-headed attention kernel vs. CPU reference. <br>dotproduct.cu: Batched/tiled dot-product kernel for vectors or matrices. <br>SMM.cu: Sparse matrix multiplication in CSR format. |
| day16 | attentionbwkd.cu: Extends attention with gradient computation; forward & backward passes. |
| day17 | cublas1.cu, cublas2.cu, cublas3.cu: Various cuBLAS examples for dot products, axpy, max/min, and other BLAS operations. |
| day18 | wrap.cu: Warp-based reduction and max-finding with inline PTX.<br>atomic1.cu, atomic2.cu: Implement and test custom atomic increment operations. |
| day19 | cublasMM.cu: Matrix multiplication with cuBLAS plus a simple self-attention example. |
| day20 | rope.cu: RoPE (rotary position embedding) kernel and its PyTorch extension.<br>test_rope.py: Benchmarks for the RoPE kernel. |
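
Several entries above lean on the same building block: a block-level sum reduction built from warp shuffles and shared memory (day5's vectorSumTricks.cu, day13's RMSBetter.cu, day18's wrap.cu). Here is a minimal sketch of that pattern, written for this README rather than taken from any of the repo's files:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reduce within one warp: each step halves the number of lanes still
// contributing, so 5 steps cover all 32 lanes.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float warpSums[32];               // one slot per warp
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? in[idx] : 0.0f;

    val = warpReduceSum(val);                    // stage 1: intra-warp
    if (threadIdx.x % 32 == 0)
        warpSums[threadIdx.x / 32] = val;        // warp leaders stash partials
    __syncthreads();

    // Stage 2: the first warp reduces the per-warp partials.
    if (threadIdx.x < 32) {
        int numWarps = (blockDim.x + 31) / 32;
        val = (threadIdx.x < numWarps) ? warpSums[threadIdx.x] : 0.0f;
        val = warpReduceSum(val);
        if (threadIdx.x == 0) atomicAdd(out, val);  // combine across blocks
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;
    blockSum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %f (expected %d)\n", *out, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

The two-stage shape (shuffle within each warp, then one warp combining the per-warp partials) is the same whether the payload is a plain sum, the max-finding of day18, or the sum of squares used by the RMS kernels.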
### How to load into PyTorch

- (optional) Create a template kernel.
- Create a forward function (e.g. `kernelForward`) where you set up the grid and the other launch parameters.
- Create a `.cpp` file that includes the kernel's header.
- Write a wrapper so that you can pass PyTorch tensors to the kernel.
- Use `PYBIND11_MODULE` to expose the wrapper as a torch extension.
- In a `.py` file, call `torch.utils.cpp_extension.load()` to load the source files; it compiles them on the fly, as sketched below.
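
A minimal end-to-end sketch of these steps, in the spirit of day10's linking/ example but with hypothetical file names (`addone.cu`, `addone.cpp`) and a made-up `add_one` op; it illustrates the flow, it is not code from this repo:

```cpp
// ---- addone.cu (hypothetical file) ----
#include <torch/extension.h>

// (optional) template-style kernel: add 1.0f to every element
__global__ void addOneKernel(float *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0f;
}

// forward launcher: set up the grid and launch the kernel
void addOneForward(torch::Tensor t) {
    const int n = t.numel();
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    addOneKernel<<<blocks, threads>>>(t.data_ptr<float>(), n);
}

// ---- addone.cpp (hypothetical file) ----
#include <torch/extension.h>

void addOneForward(torch::Tensor t);  // defined in addone.cu

// wrapper: accepts/returns tensors and does basic sanity checks
torch::Tensor add_one(torch::Tensor t) {
    TORCH_CHECK(t.is_cuda(), "tensor must live on the GPU");
    TORCH_CHECK(t.dtype() == torch::kFloat32, "tensor must be float32");
    addOneForward(t);
    return t;
}

// PYBIND11_MODULE turns the wrapper into a torch extension
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("add_one", &add_one, "Add one to every element (CUDA)");
}
```

On the Python side, `load()` compiles both files on first use and caches the build:

```python
# test.py
import torch
from torch.utils.cpp_extension import load

ext = load(name="addone", sources=["addone.cpp", "addone.cu"])

x = torch.zeros(8, device="cuda")
print(ext.add_one(x))  # tensor of ones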
