XGeMM
Accelerated General (FP32) Matrix Multiplication from scratch in CUDA
xGeMM
Accelerated General (FP32) Matrix Multiplication. Tested on NVIDIA RTX 3090 using Ubuntu 24.04.1 LTS with nvidia-driver-550 and CUDA 12.4.
Watch the YouTube video (click the image below)
Dependencies
- Eigen 3.4.0 (put it in lib)
Running Benchmarks
1. Eigen (CPU) matrix multiplication:
Compile: make 00a_benchmark_cpu.out
Execute: ./00a_benchmark_cpu.out
2. cuBLAS (GPU) matrix multiplication:
Compile: make 00b_benchmark_cuBLAS.out
Execute: ./00b_benchmark_cuBLAS.out
3. Naive (GPU) matrix multiplication:
Compile: make 01_benchmark_naive.out
Execute: ./01_benchmark_naive.out
4. Coalesced (GPU) matrix multiplication:
Compile: make 02_benchmark_coalesced.out
Execute: ./02_benchmark_coalesced.out
5. Tiled (GPU) matrix multiplication:
Compile: make 03_benchmark_tiled.out
Execute: ./03_benchmark_tiled.out
6. 1D thread coarsening (GPU) matrix multiplication:
Compile: make 04_benchmark_coarse_1d.out
Execute: ./04_benchmark_coarse_1d.out
7. 2D thread coarsening (GPU) matrix multiplication:
Compile: make 05_benchmark_coarse_2d.out
Execute: ./05_benchmark_coarse_2d.out
8. Vectorized memory accesses (GPU) matrix multiplication:
Compile: make 06_benchmark_coarse_2d_vec.out
Execute: ./06_benchmark_coarse_2d_vec.out
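To run the whole sequence in one go, the steps above could be wrapped in a small shell helper. This is only a sketch: the names `run_cmd`, `run_all_benchmarks`, and the `DRY_RUN` variable are hypothetical and not part of the repository. It assumes it is sourced from the repository root, next to the Makefile, and it uses only the make targets listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): builds and runs each
# benchmark in the order listed in this README.
# Assumption: used from the repository root, where the Makefile lives.

BENCHMARKS="00a_benchmark_cpu
00b_benchmark_cuBLAS
01_benchmark_naive
02_benchmark_coalesced
03_benchmark_tiled
04_benchmark_coarse_1d
05_benchmark_coarse_2d
06_benchmark_coarse_2d_vec"

# Run a command, or only print it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build, then execute, every benchmark; stop at the first failure.
run_all_benchmarks() {
  for name in $BENCHMARKS; do
    run_cmd make "${name}.out" || return 1
    run_cmd "./${name}.out" || return 1
  done
}
```

After sourcing the file, `DRY_RUN=1 run_all_benchmarks` prints the make and execute commands without running them, and `run_all_benchmarks` on its own builds and runs each benchmark in turn.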
