Results for "gemm-optimization"

31 skills found · Page 1 of 2

tpoisonooo / How To Optimize Gemm

713

row-major matmul optimization

universal

arm64 · armv7 · cuda · +5

Updated 1d ago

flame / Blislab

561

BLISlab: A Sandbox for Optimizing GEMM

universal

blis · code-optimization · gemm · +1

Updated 4d ago

leimao / CUDA GEMM Optimization

264

CUDA Matrix Multiplication Optimization

universal

cuda

Updated 4d ago

psmarter / CUDA Practice

CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.

universal

cuda · cuda-kernels · cutlass · +12

Updated 10h ago

strin / Gemm Android

Tutorial on optimizing GEMM performance on Android

universal

Updated 6mo ago

hyln9 / GCNGEMM

Optimized half precision gemm assembly kernels (deprecated due to ROCm)

zed

deep-learning · gpu-computing · hobby-project · +2

Updated 2y ago

Qwesh157 / Conv Op Optimization

This project covers convolution operator optimization on GPU, including GEMM-based (implicit GEMM) convolution.

universal

convolution · cuda

Updated 17d ago

iVishalr / GEMM

Fast matrix multiplication implementation in the C programming language. The algorithm is similar to what NumPy uses to compute dot products.

universal

c · gemm · gemm-optimization · +1

Updated 3mo ago

gty111 / GEMM MMA

Optimize GEMM with Tensor Cores step by step

universal

Updated 20d ago

gpusgobrr / Explore Gemm

Exploring how optimizations for GEMMs work

universal

Updated 7d ago

mlsyscourse / Assignment Tirx Gemm

Blackwell GEMM Kernel Optimization

universal

Updated 3d ago

mz24cn / Gemm Optimization

This repository targets performance optimization of the OpenCL GEMM function. It compares the sgemm performance of several libraries (clBLAS, clBLAST, MIOpenGemm, Intel MKL (CPU), and cuBLAS (CUDA)) across different matrix sizes, vendors' hardware, and operating systems. Prebuilt x86_64 binaries for MSVC, MinGW, and Linux (CentOS) are provided and work out of the box.

universal

blas · clblas · clblast · +8

Updated 17d ago

Avafly / Optimize Gemm

My gemm optimization on RPi (ARM) achieved a 170x performance boost, showing speeds faster than Eigen and close to OpenBLAS.

universal

blas · c · cpp · +6

Updated 3mo ago

loveSunning / FastCuda

FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.

universal

cudac · flash-attention · hgemm · +7

Updated 19d ago

zartbot / Tensorcore Gemm

TensorCore GEMM Optimization

universal

Updated 5mo ago

Faraz9877 / H100 GEMM

High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.

zed

Updated 3d ago

fengyuentau / How To Optimize Gemm Opencl

Step-by-step GEMM optimization tutorial on OpenCL GPU platforms

universal

Updated 17d ago

nadavrot / Bistra

Bistra is a domain-specific language designed to generate high-performance kernels (such as GEMMs, convolutions, etc). The program is designed to allow powerful compiler optimizations and code generation that are not possible in C. The tool can auto-tune GEMM kernels to around 90% of peak performance (on X86/AVX2) within seconds.

universal

compiler · jit · llvm

Updated 6mo ago

fileaccent / Sgemm Exercise

HIP GEMM optimization

universal

Updated 11mo ago

xziya / Gemm Opt

Manually optimize the GEMM (GEneral Matrix Multiply) operation. There is a long way to go.

universal

cpp · cpu · gemm · +1

Updated 2y ago