Turboquant

TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration

Generate Convert Improve

Install / Use

/learn @0xSero/Turboquant

About this skill

Quality Score

0/100

README

TurboQuant: KV Cache Compression for LLM Inference

Implementation of TurboQuant KV cache compression (ICLR 2026, arXiv:2504.19874) with vLLM integration. Tested on dense and MoE architectures across RTX 3090 and RTX 5090 GPUs.

Benchmark Results

RTX 5090 (32GB) -- Qwen3.5-27B-AWQ (dense, 4-bit weights, TP=1)

Setup: Single RTX 5090, vLLM 0.18.0, gpu_memory_utilization=0.90, 16 full-attention layers out of 64 total (rest are linear-attention).

| Metric | Baseline (bf16 KV) | TurboQuant (3b key / 2b val) | |--------|-------------------|------------------------------| | Prefill tok/s (30k ctx) | 1,804 | 1,907 (+5.7%) | | Decode tok/s (30k ctx) | 1.264 | 1.303 (+3.1%) | | KV cache freed | -- | 30.0 GB (across 4 GPUs) | | Max token capacity | 457,072 | 914,144 (2.0x) | | Peak activation memory | 644.6 MB | 599.2 MB (-7.0%) |

8x RTX 3090 (24GB each) -- Qwen3.5-35B-A3B MoE (pruned, 205 experts, TP=8)

Setup: 8x RTX 3090, vLLM 0.18.0, gpu_memory_utilization=0.92, AMD EPYC 7443P 24-Core, 504GB RAM. Model has 10 full-attention layers + 30 linear-attention layers (40 total). TQ compresses only the 10 full-attention layers.

Throughput & Latency (Baseline, bf16 KV)

| Context | Prefill tok/s | Decode tok/s | TTFT (s) | Needles Found | |--------:|--------------:|-------------:|---------:|--------------:| | 1,000 | 7,127 | 129.7 | 0.14 | 4/5 | | 4,000 | 8,887 | 131.5 | 0.45 | 4/5 | | 8,000 | 9,684 | 131.1 | 0.83 | 4/5 | | 16,000 | 9,933 | 133.0 | 1.61 | 4/5 | | 32,000 | 9,761 | 116.7 | 3.28 | 4/5 | | 64,000 | 8,843 | 122.6 | 7.24 | 4/5 | | 100,000 | 8,479 | 106.8 | 11.79 | 4/5 | | 131,000 | 8,238 | 98.3 | 15.90 | 4/5 |

Prefill saturates around 10k tok/s, degrades gently to 8.2k at 131k context.
Decode drops from 133 to 98 tok/s at 131k (KV readback cost from full-attention layers).
TTFT scales linearly with context length (purely compute-bound).
Needles 4/5 found consistently at ALL context lengths -- the model reformats one answer.

VRAM Breakdown (per GPU at 131k context)

| Component | Size | |-----------|-----:| | Total VRAM | 24,576 MB | | Reserved (0.92 util) | 22,610 MB | | Model weights | ~6,750 MB | | KV cache pool | 9,035 MB | | -- full_attention (10 layers) | 3,614 MB | | -- linear_attention (30 layers) | 5,421 MB | | CUDA overhead + graphs | ~6,825 MB |

Baseline vs TurboQuant KV Cache

| Context | Baseline KV/GPU | TQ KV/GPU | Savings/GPU | Savings % | |--------:|----------------:|----------:|------------:|----------:| | 8,000 | 55.7 MB | 38.5 MB | 17.2 MB | 30.9% | | 32,000 | 191.5 MB | 132.3 MB | 59.3 MB | 30.9% | | 64,000 | 374.3 MB | 258.5 MB | 115.8 MB | 30.9% | | 100,000 | 578.1 MB | 399.2 MB | 178.8 MB | 30.9% | | 131,000 | 755.7 MB | 521.9 MB | 233.8 MB | 30.9% |

Savings are 30.9% of total KV because TQ only compresses the 10 full-attention layers (40% of KV).
The 30 linear-attention layers (60% of KV) are not compressible by TQ.
On a pure dense transformer, savings would be 77% (4.4x compression).

Context Extension

| | Tokens | Multiplier | |---|-------:|:----------:| | Baseline capacity | 1,411,680 | 1.0x | | With TQ | 2,043,808 | 1.45x |

Alternatively, freed VRAM supports 3 additional concurrent 131k-context requests.

Coherence & Quality

| Test | Result | |------|--------| | Single needle (512-131k tokens) | PASS at all lengths | | 5-needle at near-max context | 5/5 retrieved | | 3-needle multi-fact coherence | 3/3 retrieved | | Golden ratio completion (all lengths) | PASS, perplexity 1.05-1.35 | | Math reasoning at max context | Coherent (model math error from pruning, not context) |

TQ Quantization Quality (head_dim=256, measured on GPU)

| Component | cos_sim | Notes | |-----------|--------:|-------| | TQ key compression (3-bit) | 1.000000 | Near-lossless | | TQ key compression (4-bit) | 1.000000 | Near-lossless | | Value quantization (2-bit) | 0.940 | Bottleneck for quality | | Value quantization (4-bit) | 0.997 | Recommended for quality-sensitive use | | Combined (3b key + 2b val) | 0.940 | Value quant dominates degradation |

GPU Utilization During Inference

| Context | Peak VRAM/GPU | GPU Util | CPU % | Power | |--------:|--------------:|---------:|------:|------:| | 1,000 | 22,284 MB | 0% idle | 0.2% | 132W | | 32,000 | 22,286 MB | 57% peak | 0.4% | 142W | | 131,000 | 22,306 MB | 0% idle | 0.4% | 130W |

VRAM is essentially flat -- KV cache at 131k is only 190 MB/GPU (0.8% of VRAM).
No CPU offloading. No KV offloading. Everything fits in VRAM.
GPU interconnect is PCIe (no NVLink) -- NODE topology between all GPUs.

Paper Validation (Theorems 1-3)

9 tests validating the paper's theoretical claims:

| Claim | Verdict | Details | |-------|---------|---------| | MSE distortion bounds (Thm 1) | PASS | Within bounds for unit-norm vectors | | Codebook MSE matches Table 1 | PASS | Lloyd-Max codebook is faithful | | Unbiasedness (Thm 2) | PASS | Relative bias < 0.1% | | Distortion 1/4^b scaling (Thm 3) | PASS | 2-bit=0.70x, 3-bit=0.82x, 4-bit=0.97x of bound | | Recall@8 (3-bit, N=4096) | 0.55 | Paper threshold met (>=0.40) | | Rank correlation (N=2048) | PASS | Spearman rho > 0.85 | | Needle retrieval | PASS | Works at all SNR levels | | Compression ratio | 4.41x | At head_dim=256 on full-attention layers |

Adversarial Audit

Honest assessment of claims (see audit_claims.py for full data):

| Claim | Verdict | |-------|---------| | "5.1x compression" | Misleading -- doesn't count Pi/S matrices or ring buffer. Honest: ~4.6x at 4k tokens, ~5x at 32k+ | | "Needle-in-haystack passes" | True but trivial -- query=key test is too easy. Real LLM queries are not copies of keys | | "Recall@8 >= 0.40" | Low bar -- 3-bit recall@1 is only 38%. BUT dominant attention tokens are always preserved | | "Hybrid decode saves memory" | Storage yes, compute no -- dequantizes all history to float32 per decode step | | "Distortion follows 1/4^b" | True -- initial audit was wrong (unnormalized vectors). Unit-norm: within bound | | "30k TQ is faster" | Within noise -- N=1 run, total wall time TQ is actually slower | | "200k context works" | Unverified -- didn't crash, but output quality never checked | | "2x context on dense model" | True -- measured 30 GB freed on Qwen3.5-27B with 4x RTX 3090 |

How It Works

TurboQuant compresses KV cache entries using:

Random orthogonal rotation to spread information across dimensions
Lloyd-Max optimal scalar quantization (b-1 bits) on Beta-distributed rotated values
QJL projection for residual sign bits (1 bit per dimension)
Group quantization for values (2-bit or 4-bit, per-group scales and zeros)
Bit-packing: 4 values per byte (2-bit) or 2 per byte (4-bit)

The combined estimator is unbiased: E[estimated inner product] = true inner product.

Architecture

turboquant/
  codebook.py          # Lloyd-Max optimal scalar quantizer for Beta distribution
  codebooks/           # Pre-generated codebook files (d=128/256, bits 2/3/4)
  rotation.py          # Random orthogonal rotation + QJL projection matrices
  quantizer.py         # TurboQuantMSE + TurboQuantProd (Algorithms 1 & 2)
  kv_cache.py          # KV cache manager with value bit-packing
  capture.py           # Modular KV capture hooks for attention layers
  store.py             # Compressed KV store (quantize + append + flat cache)
  score.py             # Attention scoring from compressed keys
  integration/vllm.py  # vLLM adapter (monkey-patch, free_kv_cache, hybrid decode)
  triton_kernels.py    # 3 fused Triton kernels for decode attention
  vllm_attn_backend.py # Thin shim delegating to integration/vllm.py

validate_paper.py      # 9 tests validating Theorems 1-3
audit_claims.py        # Adversarial audit of all claims
test_modular.py        # 19 modular architecture tests
test_turboquant.py     # 7 core quantizer tests
proof.py               # A/B benchmark (baseline vs TQ)

# Profiling scripts (8x RTX 3090 MoE validation)
validate_moe.py        # Baseline measurements via vLLM API
validate_moe_phase2.py # TQ quality on real GPU with head_dim=256
validate_moe_phase3.py # Logprobs, multi-needle, reasoning at max context
profile_100k.py        # Full profiling at 1k-131k context
profile_large.py       # Large context (64k-131k) with file-based payloads
baseline_vs_tq.py      # VRAM comparison: baseline bf16 vs TQ compressed
baseline_vs_tq_v2.py   # Block-level measurement during inference

Usage

pip install -e .

# Run paper validation (CPU, no GPU needed)
python validate_paper.py

# Run adversarial audit
python audit_claims.py

# Run modular tests
python -m pytest test_modular.py -v

# Run proof benchmark (requires 4x RTX 3090 + Qwen3.5-27B-AWQ)
CUDA_VISIBLE_DEVICES=0,1,4,6 python proof.py

Test Results

All 35 tests pass:

test_modular.py: 19/19 (modular architecture)
test_turboquant.py: 7/7 (core quantizer)
validate_paper.py: 9/9 (paper theorem validation)

Limitations

Prefill still uses paged cache: KV cache is allocated at engine init and used during prefill. TQ frees it after. True zero-allocation requires deeper vLLM integration.
Only full-attention layers: Linear-attention/Mamba layers are not compressed.
Value quantization is the bottleneck: 2-bit values cause cos_sim=0.94 degradation. Use 4-bit values (cos_sim=0.997) for quality-sensitive workloads.
Hybrid decode dequantizes all history: During compute, all compressed tokens are expanded to float32. The paper's fused Triton kernels exist but the hybrid path doesn't use them yet.
MoE models benefit less: Models with linear-attention layers (Qwen3.5 MoE, Mamba hybrids) have incompressible state that limits TQ's overall impact.

Environment

Tested on:

vLLM 0.18.0, PyTorch 2.10, CUDA 12

Related Skills

node-connect

339.3k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

83.9k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

339.3k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

commit-push-pr

83.9k

Commit, push, and open a PR