QuickRunCUDA
This is the microbenchmarking framework I used to build the project that won the SemiAnalysis GPU Hackathon ("Optimizing NVIDIA Blackwell’s Split L2"): https://semianalysis.com/2025-hackathon-eol/
The finished & polished project code is available here: https://github.com/ademeure/QuickRunCUDA/blob/main/tests/side_aware.cu
Example command to run the L2 Side Aware reduction that calculates the FP32 absmax of an input array (on H100/GH200/GB200):
make
./QuickRunCUDA -i -p -t 1024 -A 1000000000 -0 1000000000 -T 100 -P 4.0 -U GB/s tests/side_aware.cu
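For reference, here is a minimal sketch of a plain FP32 absmax reduction with no L2 side awareness at all. It is not the code in tests/side_aware.cu and it does not use the QuickRunCUDA harness: the kernel name `absmax_baseline`, the launch configuration, and the test data are all made up for illustration. It only shows the kind of reduction the command above benchmarks.

```cuda
#include <cmath>
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical baseline kernel (an assumption, not taken from side_aware.cu):
// grid-stride FP32 absmax. Assumes blockDim.x is a multiple of 32.
__global__ void absmax_baseline(const float* in, size_t n, unsigned int* out_bits) {
    float local = 0.0f;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local = fmaxf(local, fabsf(in[i]));

    // Reduce within each warp using shuffles; lane 0 ends up with the warp max.
    for (int offset = 16; offset > 0; offset >>= 1)
        local = fmaxf(local, __shfl_down_sync(0xffffffffu, local, offset));

    // Non-negative IEEE-754 floats sort in the same order as their bit patterns,
    // so an unsigned integer atomicMax yields the float maximum.
    if ((threadIdx.x & 31) == 0)
        atomicMax(out_bits, __float_as_uint(local));
}

int main() {
    const size_t n = (size_t)1 << 24;
    std::vector<float> h(n);
    for (size_t i = 0; i < n; ++i) h[i] = 3.0f * sinf((float)i);
    h[n / 2] = -42.0f;  // planted element with the largest magnitude

    float* d_in = nullptr;
    unsigned int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(unsigned int));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(unsigned int));  // bit pattern of 0.0f

    absmax_baseline<<<256, 1024>>>(d_in, n, d_out);

    unsigned int bits = 0;
    cudaMemcpy(&bits, d_out, sizeof(bits), cudaMemcpyDeviceToHost);  // implicit sync
    float result;
    memcpy(&result, &bits, sizeof(result));
    printf("absmax = %f\n", result);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Every thread block here reads whatever addresses the grid-stride loop gives it, regardless of which L2 side holds the data. The side-aware version differs in exactly that assignment of addresses to SMs.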
You can uncomment "FORCE_RANDOM_SIDE" to disable the optimization while keeping some of its overhead. Comparing the two runs shows that the optimization doesn't significantly improve performance, but it reduces power consumption by up to ~9% on GH200 with random data ('-r' flag)!
It is possible to extend this to any elementwise operation or memcpy, but it requires very complicated manual memory management to make it work on both the input and output sides simultaneously. So it can't really be done as part of this kind of microbenchmarking framework. It might be possible to do it in PyTorch using a custom allocator and mempool but I'm not 100% sure at this point.
Let me know if you have any questions about the L2 Side Aware project or the QuickRunCUDA framework in general!
