62 skills found · Page 1 of 3
NVIDIA / Cutlass: CUDA Templates and Python DSLs for High-Performance Linear Algebra
bytedance / Flux: A fast communication-overlapping library for tensor/expert parallelism on GPUs.
NVlabs / Vibetensor: Our first fully AI generated deep learning system
66RING / Tiny Flash Attention: flash attention tutorial written in python, triton, cuda, cutlass
coderonion / Awesome Cuda And Hpc: 🚀🚀🚀 This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX and High Performance Computing (HPC) projects.
DD-DuDa / Cute Learning: Examples of CUDA implementations by Cutlass CuTe
ColfaxResearch / Cutlass Kernels: No description available
MekkCyber / CutlassAcademy: A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
gbprod / Cutlass.nvim: Plugin that adds a 'cut' operation separate from 'delete'
ArthurinRUC / Cutlass Notes: From Minimal GEMM to Everything
IST-DASLab / Qutlass: QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
zach-adams / Cutlass Wp Theme: Cutlass is a WordPress Starter Theme that incorporates the power of Laravel's Blade to make theme development even quicker and easier than before - http://cutlasswp.com
tgale96 / Grouped Gemm: PyTorch bindings for CUTLASS grouped GEMM.
leimao / CUTLASS Examples: CUTLASS and CuTe Examples
tlc-pack / Cutlass FpA IntB Gemm: A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
weishengying / Cutlass Flash Atten Fp8: Implements FP8 flash attention on the Ada architecture using the cutlass repository
psmarter / CUDA Practice: CUDA programming practice project. Hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
flashinfer-ai / Cutlass Viz: No description available
weishengying / Tiny Flash Attention: A simplified flash-attention implementation using cutlass, intended for teaching
andrewarrow / Cutlass: swiss army knife for generating fcpxml files