xlite-dev / LeetCUDA - 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
NVIDIA / Cuda Tile - CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA Tensor Core units.
Bruce-Lee-LY / Cuda Hgemm - Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
HazyResearch / Flash Fft Conv - FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
daddydrac / Deprecated NVIDIA GPU Tensor Core Accelerator PyTorch OpenCV - Computer vision container that includes Jupyter notebooks with built-in code hinting, Anaconda, CUDA 11.8, the TensorRT inference accelerator for Tensor Cores, CuPy (a GPU drop-in replacement for NumPy), PyTorch, PyTorch Geometric for graph neural networks, TF2, TensorBoard, and OpenCV for accelerated workloads on NVIDIA Tensor Cores and GPUs.
xlite-dev / HGEMM - ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs to achieve peak⚡️ performance.
wzsh / Wmma Tensorcore Sample - Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)
enp1s0 / OzIMMU - FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme
emer / Etable - Data table structure in Go, now developed at https://github.com/cogentcore/core/tree/main/tensor
sunlex0717 / DissectingTensorCores - No description available
wmmae / Wmma Extension - An extension library for the WMMA API (Tensor Core API)
ParCIS / Magicube - Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores.
psmarter / CUDA Practice - CUDA programming practice project: hands-on CUDA kernels and performance optimization, covering GEMM, FlashAttention, Tensor Cores, CUTLASS, quantization, KV cache, NCCL, and profiling.
Bruce-Lee-LY / Cuda Hgemv - Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
stillwater-sc / RISC V TensorCore - Transactional Verilog design and Verilator testbench for a RISC-V TensorCore Vector co-processor for reproducible linear algebra
UDC-GAC / Venom - A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
YukeWang96 / TC GNN ATC23 - Artifact for USENIX ATC'23: TC-GNN: Bridging Sparse GNN Computation and Dense Tensor Cores on GPUs.
JuliaMath / TensorCore.jl - Lightweight package for sharing tensor-algebra definitions
daddydrac / Computer Vision Container - This container is no longer supported and has been deprecated in favor of https://github.com/joehoeller/NVIDIA-GPU-Tensor-Core-Accelerator-PyTorch-OpenCV
WvanWoerden / G6K GPU Tensor - Lattice sieving using GPU Tensor Cores based on the General Sieve Kernel (G6K)