Matmul
100% MFU or bust.
This is an experiment repo for writing fast matrix multiplication kernels on Metal.
Current timings for a 4096x4096x4096 matmul on an M1 Pro (lower is better):
Naive: 6866 ms
Warp Coalesced: 635 ms
SMEM Tiled: 398 ms
1D Register Tiled: 240 ms
2D Register Tiled: 171 ms
SIMD: 41 ms
2D SIMD: 69 ms
SIMD Prefetch: 54 ms
MLX: 48 ms
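To relate these timings to the "100% MFU" goal, the sketch below converts each one into achieved throughput. It assumes the usual count of 2*M*N*K floating-point operations for a matmul. The peak figure is a placeholder and not something this README states: roughly 5.2 TFLOP/s is the commonly quoted FP32 number for the 16-core M1 Pro GPU, so substitute the right value for your chip and precision. The function names are illustrative only.

```python
# Timings in milliseconds, copied from the list above.
TIMINGS_MS = {
    "Naive": 6866,
    "Warp Coalesced": 635,
    "SMEM Tiled": 398,
    "1D Register Tiled": 240,
    "2D Register Tiled": 171,
    "SIMD": 41,
    "2D SIMD": 69,
    "SIMD Prefetch": 54,
    "MLX": 48,
}

M = N = K = 4096

# ASSUMPTION: placeholder peak throughput, not taken from this README.
ASSUMED_PEAK_TFLOPS = 5.2


def matmul_flops(m: int, n: int, k: int) -> int:
    """One multiply and one add per term of each inner product."""
    return 2 * m * n * k


def achieved_tflops(time_ms: float, m: int = M, n: int = N, k: int = K) -> float:
    """Throughput in TFLOP/s for a matmul that took time_ms milliseconds."""
    return matmul_flops(m, n, k) / (time_ms / 1e3) / 1e12


def mfu(time_ms: float, peak_tflops: float = ASSUMED_PEAK_TFLOPS) -> float:
    """Fraction of the assumed peak that the kernel reached."""
    return achieved_tflops(time_ms) / peak_tflops


if __name__ == "__main__":
    for name, ms in TIMINGS_MS.items():
        print(f"{name:>18}: {achieved_tflops(ms):6.3f} TFLOP/s  MFU {mfu(ms):6.1%}")
```

Under these assumptions the 41 ms SIMD kernel works out to about 3.35 TFLOP/s, ahead of the 48 ms MLX baseline at about 2.86 TFLOP/s. The MFU column is only as reliable as the peak value you plug in.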
The goal of this exercise is a deep understanding of these kernels, which feeds into writing good compilers for Luminal.
