73 skills found · Page 1 of 3
vllm-project / vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list)
NVIDIA / TensorRT-LLM: TensorRT-LLM provides an easy-to-use Python API to define large language models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also contains components for building Python and C++ runtimes that orchestrate inference execution in a performant way.
nobodywho-ooo / NobodyWho: An inference engine that lets you run LLMs locally and efficiently on any device.
LeanModels / DFloat11: [NeurIPS '25] Lossless Compression of LLMs and DiTs for Efficient GPU Inference
mit-han-lab / DuoAttention: [ICLR 2025] Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
NLPOptimize / FlashTokenizer: An efficient and optimized tokenizer engine for LLM inference serving
hao-ai-lab / Consistency LLM: [ICML 2024] CLLMs: Consistency Large Language Models
NVIDIA / Star Attention: Efficient LLM Inference over Long Sequences
mit-han-lab / Quest: [ICML 2024] Query-Aware Sparsity for Efficient Long-Context LLM Inference
intel / Neural Speed: A library for efficient LLM inference via low-bit quantization
usyd-fsalab / FP6-LLM: Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
inferflow / Inferflow: An efficient and highly configurable inference engine for large language models (LLMs).
AlibabaResearch / Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
opengear-project / GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (see the toy quantization sketch after this list)
z-lab / ParoQuant: [ICLR 2026] Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
hao-ai-lab / JacobiForcing: Jacobi Forcing, Fast and Accurate Diffusion-style Decoding
SqueezeBits / QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
VectorInstitute / Vector Inference: Efficient LLM inference on Slurm clusters.
PiotrNawrot / Nano Sparse Attention: A minimal implementation of recent sparse-attention patterns for efficient LLM inference (see the toy mask sketch after this list).
jkanalakis / Deep Recall: An enterprise-grade memory framework for LLMs featuring GPU-optimized inference, vector storage, and automated scaling. It enables hyper-personalized responses through efficient context retrieval and integration.
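
For context on what a serving engine such as vLLM looks like in practice, here is a minimal offline-generation sketch using vLLM's public Python API. The checkpoint name is just a small placeholder; any Hugging Face-compatible model id works.

```python
# Minimal offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model; loads onto available GPUs
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The key to efficient LLM serving is"]
outputs = llm.generate(prompts, params)  # batched, continuous-batching under the hood

for out in outputs:
    print(out.outputs[0].text)
```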
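
Several entries above (Neural Speed, FP6-LLM, GEAR, ParoQuant, QUICK) revolve around low-bit quantization. The sketch below shows only the basic mechanic: uniform per-token quantization of a KV-cache tensor to a few bits. It is not GEAR's actual recipe (GEAR layers low-rank and outlier-correction terms on top of quantization), and all function names here are hypothetical.

```python
# Toy uniform per-token quantization of a KV-cache tensor (illustrative only).
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 4):
    """Quantize each token's KV vector to `bits`-bit unsigned integers."""
    qmax = 2 ** bits - 1
    lo = kv.amin(dim=-1, keepdim=True)                     # per-token min
    hi = kv.amax(dim=-1, keepdim=True)                     # per-token max
    scale = (hi - lo).clamp(min=1e-8) / qmax               # per-token step size
    q = ((kv - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.float() * scale + lo

kv = torch.randn(2, 16, 64)            # (batch, seq_len, head_dim)
q, scale, lo = quantize_kv(kv, bits=4)
err = (kv - dequantize_kv(q, scale, lo)).abs().max()
print(f"max reconstruction error: {err:.4f}")
```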
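
Likewise, the streaming-attention idea behind entries such as DuoAttention, Star Attention, and Nano Sparse Attention can be pictured with a toy mask: each query attends to a few initial "sink" tokens plus a local window of recent tokens. This is an illustrative mask only; the actual projects differ in detail and fuse the sparsity into their attention kernels rather than materializing a dense mask.

```python
# Toy streaming-attention mask: sink tokens + a local causal window.
import torch

def streaming_mask(seq_len: int, sinks: int = 4, window: int = 8) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = j <= i                          # no attending to the future
    local = (i - j) < window                 # recent-window keys
    sink = j < sinks                         # always-visible initial keys
    return causal & (local | sink)

mask = streaming_mask(12)
print(mask.int())  # 1 = attend, 0 = masked out
```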