26 skills found
ROCm / Iris: AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
arozanov / Turboquant Mlx: TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% of FP16 speed.
Libraries-Openly-Fused / FusedKernelLibrary: We aim to redefine the portability, performance, programmability, and maintainability of data-parallel libraries by using standard C++ features instead of creating new compilers.
HipGraph / FusedMM: Implementation of the FusedMM method from the IPDPS 2021 paper "FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks"
lszxb / Bf16 Huffman Infer: Fused BF16 Huffman GEMV inference kernel
BGU-CS-VIL / Sdtw Cuda Torch: GPU-accelerated Soft Dynamic Time Warping (SoftDTW) for PyTorch. Differentiable loss function with ~98% memory savings via fused CUDA kernels, arbitrary sequence lengths, and log-space numerical stability.
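For context on what SoftDTW computes, here is a minimal unfused reference of the recurrence in plain Python (squared-point-distance cost; this is an illustrative sketch, not the repo's fused CUDA implementation). The subtract-the-minimum trick in `softmin` is the log-space stabilization the entry refers to:

```python
import math

def softmin(values, gamma):
    # Differentiable soft minimum: -gamma * log(sum(exp(-v / gamma))).
    # Subtracting the hard minimum first keeps the exponentials in range
    # (log-space numerical stability).
    m = min(values)
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))

def soft_dtw(x, y, gamma=0.1):
    # x, y: 1-D sequences (lists of floats); cost = squared distance.
    n, m = len(x), len(y)
    INF = float("inf")
    R = [[INF] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Soft-relaxed DTW recurrence over the three alignment moves.
            R[i][j] = cost + softmin(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[n][m]
```

As gamma approaches 0, `softmin` approaches the hard minimum and `soft_dtw` recovers the classic DTW alignment cost; larger gamma gives a smoother, fully differentiable loss.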
kozistr / Candle Moe: Fused MoE kernel in the Candle backend
PingoLH / CatConv2d: Concat+Conv2d fused all-in-one CUDA kernel extension for PyTorch
fattorib / Fusedswiglu: Fused SwiGLU Triton kernels
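As background for what a fused SwiGLU kernel computes, here is a minimal unfused reference in plain Python (the weight names `W_gate`/`W_up` are illustrative, not taken from the repo). A fused Triton kernel evaluates both projections, the activation, and the elementwise product in one pass, avoiding intermediate global-memory writes:

```python
import math

def silu(v):
    # SiLU (swish) activation: v * sigmoid(v).
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    # Row-major matrix-vector product: (W @ x) as a list.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W_gate, W_up):
    # SwiGLU(x) = SiLU(W_gate @ x) * (W_up @ x), elementwise.
    return [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
```

The unfused version materializes two intermediate vectors; the fusion win comes from keeping them in registers/shared memory inside a single kernel launch.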
dtunai / Tri RMSNorm: Efficient fused kernel for RMS normalization; includes forward and backward passes and PyTorch compatibility.
thebasedcapital / Ane Infer: Apple Neural Engine (ANE) LLM inference engine — reverse-engineered private APIs, Metal GPU shaders, hybrid ANE+GPU+CPU on Apple Silicon. 32 tok/s matching llama.cpp, 3.6 TFLOPS fused ANE mega-kernels.
WithNucleusAI / MHC Triton: Manifold-Constrained Hyper-Connections with fused Triton kernels for efficient training
tomatillos / Loopfuse: Fused Triton kernel generator
yianan261 / Multi GPU TRAINING OPTIMIZATION: Optimizes multi-GPU parallelism for machine learning training using fused gradient buffers, NCCL AllReduce, and CUDA C kernel-level optimizations, including memory coalescing, shared-memory tiling, loop unrolling, and stream-based communication overlap.
varjoranta / Turboquant Vllm: TurboQuant+ KV cache compression for vLLM. 3.8x smaller KV cache, same conversation quality. Fused CUDA kernels with automatic PyTorch fallback.
shixun404 / TurboFNO: The first fully fused FFT–GEMM–iFFT GPU kernel.
RegularJoe-CEO / Geodesic Attention Engine GAE: Minimum-energy path through transformer attention. Fused Waller Kernel reduces HBM round-trips from 12 to 2. O(N) memory complexity, 23-37% Tok/J improvement, bit-exact determinism. No approximation, no sparsity - just the shortest path.
Lulzx / Tiny Kernel: Fused Metal kernels for LLM inference. Zig + Metal. No Python. No PyTorch.
chinmaydk99 / INT8 Triton Kernels: Triton kernels for quantisation, fused dequant, and GEMM operations for efficient inference
Argonaut790 / Fused Turboquant: Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.