73 skills found · Page 1 of 3
vllm-project / vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (see the usage sketch after this list)
NVIDIA / TensorRT-LLM: TensorRT-LLM provides an easy-to-use Python API to define large language models (LLMs) and supports state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also contains components for building Python and C++ runtimes that orchestrate inference execution in a performant way.
nobodywho-ooo / NobodyWho: An inference engine that lets you run LLMs locally and efficiently on any device.
LeanModels / DFloat11: [NeurIPS '25] Lossless Compression of LLMs and DiTs for Efficient GPU Inference
mit-han-lab / DuoAttention: [ICLR 2025] Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
NLPOptimize / FlashTokenizer: An efficient and optimized tokenizer engine for LLM inference serving
hao-ai-lab / Consistency LLM: [ICML 2024] CLLMs: Consistency Large Language Models
NVIDIA / Star Attention: Efficient LLM Inference over Long Sequences
mit-han-lab / Quest: [ICML 2024] Query-Aware Sparsity for Efficient Long-Context LLM Inference
intel / Neural Speed: A library for efficient LLM inference via low-bit quantization
usyd-fsalab / FP6-LLM: Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
inferflow / Inferflow: An efficient and highly configurable inference engine for large language models (LLMs).
AlibabaResearch / Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
opengear-project / GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM (see the toy quantization sketch after this list)
z-lab / ParoQuant: [ICLR 2026] Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
hao-ai-lab / JacobiForcing: Jacobi Forcing, Fast and Accurate Diffusion-style Decoding
SqueezeBits / QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
VectorInstitute / Vector Inference: Efficient LLM inference on Slurm clusters.
PiotrNawrot / Nano Sparse Attention: A minimal implementation of recent sparse-attention patterns for efficient LLM inference (see the toy mask sketch after this list).
jkanalakis / Deep Recall: An enterprise-grade memory framework for LLMs featuring GPU-optimized inference, vector storage, and automated scaling. It enables hyper-personalized responses through efficient context retrieval and integration.
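
For context on what a serving engine such as vLLM looks like in practice, here is a minimal offline-generation sketch using vLLM's public Python API. The checkpoint name is just a small placeholder; any Hugging Face-compatible model id works.

```python
# Minimal offline inference with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model; loads onto available GPUs
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The key to efficient LLM serving is"]
outputs = llm.generate(prompts, params)  # batched, continuous-batching under the hood

for out in outputs:
    print(out.outputs[0].text)
```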
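
Several entries above (Neural Speed, FP6-LLM, GEAR, ParoQuant, QUICK) revolve around low-bit quantization. The sketch below shows only the basic mechanic: uniform per-token quantization of a KV-cache tensor to a few bits. It is not GEAR's actual recipe (GEAR layers low-rank and outlier-correction terms on top of quantization), and all function names here are hypothetical.

```python
# Toy uniform per-token quantization of a KV-cache tensor (illustrative only).
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 4):
    """Quantize each token's KV vector to `bits`-bit unsigned integers."""
    qmax = 2 ** bits - 1
    lo = kv.amin(dim=-1, keepdim=True)                     # per-token min
    hi = kv.amax(dim=-1, keepdim=True)                     # per-token max
    scale = (hi - lo).clamp(min=1e-8) / qmax               # per-token step size
    q = ((kv - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.float() * scale + lo

kv = torch.randn(2, 16, 64)            # (batch, seq_len, head_dim)
q, scale, lo = quantize_kv(kv, bits=4)
err = (kv - dequantize_kv(q, scale, lo)).abs().max()
print(f"max reconstruction error: {err:.4f}")
```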
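
Likewise, the streaming-attention idea behind entries such as DuoAttention, Star Attention, and Nano Sparse Attention can be pictured with a toy mask: each query attends to a few initial "sink" tokens plus a local window of recent tokens. This is an illustrative mask only; the actual projects differ in detail and fuse the sparsity into their attention kernels rather than materializing a dense mask.

```python
# Toy streaming-attention mask: sink tokens + a local causal window.
import torch

def streaming_mask(seq_len: int, sinks: int = 4, window: int = 8) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = j <= i                          # no attending to the future
    local = (i - j) < window                 # recent-window keys
    sink = j < sinks                         # always-visible initial keys
    return causal & (local | sink)

mask = streaming_mask(12)
print(mask.int())  # 1 = attend, 0 = masked out
```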