26 skills found
ROCm / Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
arozanov / Turboquant Mlx: TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% of FP16 speed.
Libraries-Openly-Fused / FusedKernelLibrary: We aim to redefine the portability, performance, programmability, and maintainability of data-parallel libraries by using standard C++ features instead of creating new compilers.
HipGraph / FusedMM: Implementation of the FusedMM method from the IPDPS 2021 paper "FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks"
lszxb / Bf16 Huffman Infer: Fused BF16 Huffman GEMV inference kernel
BGU-CS-VIL / Sdtw Cuda Torch: GPU-accelerated Soft Dynamic Time Warping (SoftDTW) for PyTorch. Differentiable loss function with ~98% memory savings via fused CUDA kernels, arbitrary sequence lengths, and log-space numerical stability.
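For context on what SoftDTW computes, here is a minimal unfused reference of the recurrence in plain Python (squared-point-distance cost; this is an illustrative sketch, not the repo's fused CUDA implementation). The subtract-the-minimum trick in `softmin` is the log-space stabilization the entry refers to:

```python
import math

def softmin(values, gamma):
    # Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma))).
    # Subtracting the hard minimum first keeps the exponentials in range
    # (log-space numerical stability).
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def soft_dtw(x, y, gamma=0.1):
    # x, y: 1-D sequences (lists of floats); cost = squared distance.
    n, m = len(x), len(y)
    INF = float("inf")
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Soft-relaxed DTW recurrence over the three alignment moves.
            R[i][j] = cost + softmin(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[n][m]
```

As gamma approaches 0, `softmin` approaches the hard minimum and `soft_dtw` recovers the classic DTW alignment cost; larger gamma gives a smoother, fully differentiable loss.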
kozistr / Candle Moe: Fused MoE kernel in the Candle backend
PingoLH / CatConv2d: Concat+Conv2d fused all-in-one CUDA kernel extension for PyTorch
fattorib / Fusedswiglu: Fused SwiGLU Triton kernels
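As background for what a fused SwiGLU kernel computes, here is a minimal unfused reference in plain Python (the weight names `W_gate`/`W_up` are illustrative, not taken from the repo). A fused Triton kernel evaluates both projections, the activation, and the elementwise product in one pass, avoiding intermediate global-memory writes:

```python
import math

def silu(v):
    # SiLU (swish) activation: v * sigmoid(v).
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # Row-major matrix-vector product: (W @ x) as a list.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_up):
    # SwiGLU(x) = SiLU(W_gate @ x) * (W_up @ x), elementwise.
    return [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
```

The unfused version materializes two intermediate vectors; the fusion win comes from keeping them in registers/shared memory inside a single kernel launch.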
dtunai / Tri RMSNorm: Efficient fused kernel for RMS normalization; includes forward and backward passes and PyTorch compatibility.
thebasedcapital / Ane Infer: Apple Neural Engine (ANE) LLM inference engine — reverse-engineered private APIs, Metal GPU shaders, hybrid ANE+GPU+CPU on Apple Silicon. 32 tok/s matching llama.cpp, 3.6 TFLOPS fused ANE mega-kernels.
WithNucleusAI / MHC Triton: Manifold-Constrained Hyper-Connections with fused Triton kernels for efficient training
tomatillos / Loopfuse: Fused Triton kernel generator
yianan261 / Multi GPU TRAINING OPTIMIZATION: Optimizes multi-GPU parallelism for machine learning training using fused gradient buffers, NCCL AllReduce, and CUDA C kernel-level optimizations, including memory coalescing, shared-memory tiling, loop unrolling, and stream-based communication overlap.
varjoranta / Turboquant Vllm: TurboQuant+ KV cache compression for vLLM. 3.8x smaller KV cache, same conversation quality. Fused CUDA kernels with automatic PyTorch fallback.
shixun404 / TurboFNO: The first fully fused FFT–GEMM–iFFT GPU kernel.
RegularJoe-CEO / Geodesic Attention Engine GAE: Minimum-energy path through transformer attention. Fused Waller Kernel reduces HBM round-trips from 12 to 2. O(N) memory complexity, 23-37% Tok/J improvement, bit-exact determinism. No approximation, no sparsity - just the shortest path.
Lulzx / Tiny Kernel: Fused Metal kernels for LLM inference. Zig + Metal. No Python. No PyTorch.
chinmaydk99 / INT8 Triton Kernels: Triton kernels for quantisation, fused dequant, and GEMM operations for efficient inference
Argonaut790 / Fused Turboquant: Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.