Matmul

100% MFU or bust.

Install / Use

/learn @jafioti/Matmul

This is an experiment repo for writing fast matrix multiplication kernels on Metal.
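The kernels themselves are written for Metal. As a language-neutral reference for what they compute, here is a minimal CPU sketch in Python: a naive triple loop and a blocked (tiled) version of the same product. This is not the repo's code. The function names, the flat row-major layout, and the `tile` parameter are assumptions made for illustration, and the tiling here only mirrors the loop structure that tiled GPU kernels use, not threadgroup memory or registers.

```python
def matmul_naive(a, b, m, n, k):
    """C = A @ B. A is m x k, B is k x n, all stored as flat row-major lists."""
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i * k + p] * b[p * n + j]
            c[i * n + j] = acc
    return c


def matmul_tiled(a, b, m, n, k, tile=4):
    """Same product, computed block by block (hypothetical tile size)."""
    c = [0.0] * (m * n)
    for i0 in range(0, m, tile):          # block of output rows
        for j0 in range(0, n, tile):      # block of output columns
            for p0 in range(0, k, tile):  # block along the shared dimension
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i * k + p]  # reused across the inner j loop
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += a_ip * b[p * n + j]
    return c
```

A reference like this is mainly useful for checking a GPU kernel's output on small inputs; it says nothing about GPU performance.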

Current timings for a 4096x4096x4096 matmul on an M1 Pro (lower is better):

Naive: 6866 ms
Warp Coalesced: 635 ms
SMEM Tiled: 398 ms
1D Register Tiled: 240 ms
2D Register Tiled: 171 ms
SIMD: 41 ms
2D SIMD: 69 ms
SIMD Prefetch: 54 ms
MLX: 48 ms
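To relate these timings to the "100% MFU" goal, the sketch below converts milliseconds into achieved TFLOP/s and a fraction of peak. It counts 2·M·N·K floating-point operations per matmul (one multiply and one add per term). The peak value is an assumption: 5.2 TFLOP/s is Apple's published FP32 figure for the 16-core-GPU M1 Pro, but the README does not state the GPU configuration or data type used, so adjust `ASSUMED_PEAK_TFLOPS` to match your hardware.

```python
# Timings copied from the list above, in milliseconds.
TIMINGS_MS = {
    "Naive": 6866,
    "Warp Coalesced": 635,
    "SMEM Tiled": 398,
    "1D Register Tiled": 240,
    "2D Register Tiled": 171,
    "SIMD": 41,
    "2D SIMD": 69,
    "SIMD Prefetch": 54,
    "MLX": 48,
}

# ASSUMPTION: peak FP32 throughput of the GPU; not stated in the README.
ASSUMED_PEAK_TFLOPS = 5.2


def matmul_flops(m, n, k):
    """Floating-point operations in an m x n x k matmul (multiply + add)."""
    return 2 * m * n * k


def achieved_tflops(flops, ms):
    """Throughput in TFLOP/s for `flops` operations completed in `ms` milliseconds."""
    return flops / (ms / 1000) / 1e12


def utilization(ms, peak_tflops=ASSUMED_PEAK_TFLOPS, size=4096):
    """Fraction of the assumed peak reached by a size^3 matmul taking `ms` milliseconds."""
    return achieved_tflops(matmul_flops(size, size, size), ms) / peak_tflops


if __name__ == "__main__":
    for name, ms in TIMINGS_MS.items():
        tflops = achieved_tflops(matmul_flops(4096, 4096, 4096), ms)
        print(f"{name:<18} {ms:>5} ms  {tflops:6.2f} TFLOP/s  {utilization(ms):6.1%}")
```

Under that assumed peak, the 41 ms SIMD kernel works out to roughly 3.35 TFLOP/s, or about 64% of peak.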

The result of this exercise is a deep understanding of how these kernels work, which feeds into writing good compilers for Luminal.

GitHub Stars: 11
Category: Development
Updated: 2mo ago
Forks: 1

Languages

Cuda

Security Score

70/100

Audited on Jan 12, 2026

No findings