Rotorquant
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44x fewer params. Drop-in llama.cpp integration.
Install / Use
/learn @scrya-com/RotorquantREADME
RotorQuant: KV Cache Compression for LLMs
Drop-in KV cache quantization that bypasses the butterfly network using block-diagonal rotations. Beats Google's TurboQuant on every axis: better PPL, 28% faster decode, 5x faster prefill, 44x fewer parameters.
"Replace the d×d random orthogonal matrix with Clifford rotors... exploiting algebraic sparsity" — RotorQuant paper, March 2026
Headline Results
Llama 3.1 8B Instruct Q4_K_M — Symmetric 3-bit K+V Compression (RTX 5090)
| Config (K/V) | Decode tok/s | Prefill tok/s | PPL (wiki-2) | vs FP16 | Compression | |---|---:|---:|---:|---|---| | f16 / f16 | 140 | 6,156 | 6.63 | baseline | 1x | | iso3 / iso3 | 118 | 3,397 | 6.91 | +4.2% | 10.3x | | planar3 / planar3 | 119 | 3,822 | 7.05 | +6.3% | 10.3x | | turbo3 / turbo3 | 93 | 722 | 7.07 | +6.6% | 10.3x | | planar3 / turbo3 | 127 | — | 6.68 | +0.8% | 10.3x | | planar3 / f16 | 134 | — | ~6.63 | ~0% | 5.1x |
vs TurboQuant (same 10.3x compression):
- PPL: iso3 6.91 vs turbo3 7.07 — better quality
- Decode: 119 tok/s vs 93 tok/s — 28% faster
- Prefill: 3,822 tok/s vs 722 tok/s — 5.3x faster
- Parameters: 128 vs 16,384 — 44x fewer (per paper Table 1)
Why Faster?
The butterfly bypass from the RotorQuant paper: TurboQuant applies a d×d Walsh-Hadamard Transform (butterfly network with log₂(d) stages across all 128 dimensions). PlanarQuant/IsoQuant apply independent 2D/4D rotations per pair/quartet — O(d) instead of O(d log d), fully parallelizable, no inter-element dependencies. The deferred K-cache (F16 during prefill) eliminates rotation overhead entirely during prompt processing.
Architecture Evolution
The original RotorQuant paper proposed Clifford algebra Cl(3,0) rotors — the rotor sandwich product RxR̃ with only 4 non-zero multivector components. The insight: you don't need a full-rank d×d transform to decorrelate KV cache vectors; small orthogonal blocks suffice because real attention vectors live on low-rank manifolds.
This led to three progressively simpler implementations. PlanarQuant (2D Givens) and IsoQuant (4D quaternion) were developed by @ParaMind2025, building on the block-diagonal rotation idea:
| Method | Rotation | Group Size | FMAs (d=128) | Params | Status | |---|---|---:|---:|---:|---| | RotorQuant | Cl(3,0) rotor sandwich | 3 | ~2,400 | 372 | Research (Triton) | | IsoQuant | Quaternion 4D | 4 | 512 | 128 | Production (llama.cpp) | | PlanarQuant | Givens 2D | 2 | 256 | 128 | Production (llama.cpp) | | TurboQuant | WHT butterfly | 128 | 16,384 | 16,384 | Production (llama.cpp) |
Each step traded algebraic richness for speed. The PPL results show the simpler rotations work better — confirming the paper's claim that block-diagonal rotation preserves the directional structure of KV cache vectors more effectively than global WHT scrambling.
Commit History
llama.cpp fork (feature/planarquant-kv-cache)
20efe75 2026-04-01 19:50 Add symmetric planar4/iso4: V dequant, template instances, FA dispatch
326f7fb 2026-04-01 14:41 Add inverse rotation V dequant for planar4/iso4
6e5a4aa 2026-04-01 14:24 Fix symmetric V=planar3/iso3: add inverse rotation to V dequant
a730624 2026-04-01 11:53 planar3/turbo3: 5x total compression, PPL 10.19 (vs Tom's 3.5x at 10.14)
b83a09f 2026-04-01 10:46 All 8 K/V configs working: real Givens/quaternion rotation for planar4/iso4
985fd96 2026-04-01 10:24 Fix planar3/q8_0 asymmetric: add F16+Q8_0 VEC template for deferred prefill
b719b2e 2026-04-01 10:07 Fix FA dispatch: static constants, V=f16 check, asymmetric support
79da661 2026-04-01 09:30 Add asymmetric FA kernels: q8_0 K + iso3/planar3 V (and reverse)
e7bde1f 2026-04-01 09:15 Guard deferred conversion behind GGML_USE_CUDA
9d4ece5 2026-04-01 08:32 COMPRESSION WORKS: 5.1x K-cache + 200 tok/s decode on CUDA
a75b16f 2026-04-01 07:51 Add CUDA flash attention dequantize for planar3/iso3/planar4/iso4
1ed0453 2026-04-01 06:53 Add CUDA set_rows kernels for planar3/iso3/planar4/iso4
0971ed5 2026-03-31 22:44 Fix ggml context size for double-buffer
25f896f 2026-03-31 22:37 Double-buffer deferred quantization with CUDA conversion kernels
rotorquant repo (main)
61154ae 2026-04-01 14:41 Update README: symmetric 3-bit PPL results beat TurboQuant
6ce8c03 2026-03-31 22:39 Add Llama 3.1 8B benchmarks: 239 tok/s decode, PPL 8.44
6637e30 2026-03-31 22:07 Update README with RTX 5090 llama.cpp CUDA benchmarks
ec98f4b 2026-03-31 21:12 Add post-prefill PPL benchmarks: IsoQuant 4-bit 9.03, PlanarQuant 3-bit 10.12
0c98c28 2026-03-31 21:04 Restore RotorQuant trivector centroids, add CUDA PPL to README
b9d3f1a 2026-03-31 20:16 Add IsoQuant + PlanarQuant backends to PPL benchmark
Quick Start
llama.cpp (recommended — fastest)
git clone https://github.com/johndpope/llama-cpp-turboquant.git
cd llama-cpp-turboquant && git checkout feature/planarquant-kv-cache
# CUDA
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# Metal (Apple Silicon)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
# Symmetric 3-bit (best quality per bit)
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
--cache-type-k iso3 --cache-type-v iso3 --host 0.0.0.0 --port 8080
# K-only (zero PPL loss, 5x compression)
./build/bin/llama-server -m model.gguf --jinja -ngl 99 \
--cache-type-k planar3 --cache-type-v f16 --host 0.0.0.0 --port 8080
# Benchmark
./build/bin/llama-bench -m model.gguf -ngl 99 -ctk planar3 -ctv planar3 -p 512 -n 128
# Perplexity
pip install datasets
python3 -c "from datasets import load_dataset; open('/tmp/wiki.txt','w').write('\n'.join(load_dataset('wikitext','wikitext-2-raw-v1',split='test')['text']))"
./build/bin/llama-perplexity -m model.gguf -f /tmp/wiki.txt -ngl 99 -c 2048 \
--cache-type-k iso3 --cache-type-v iso3
Cache types: planar3, iso3, planar4, iso4 (ours) + turbo3, turbo4 (TheTom's WHT)
Python/Triton (research)
pip install -e . && pip install triton
from turboquant import IsoQuantMSE, PlanarQuantMSE
# IsoQuant: best 4-bit quality (PPL 9.03)
iq = IsoQuantMSE(d=128, bits=4, mode='fast', device='cuda')
x_hat, indices = iq(x)
# PlanarQuant: best 3-bit quality (PPL 10.12)
pq = PlanarQuantMSE(d=128, bits=3, device='cuda')
x_hat, indices = pq(x)
How It Works
Rotation decorrelates KV cache vectors before scalar quantization:
- Normalize → store norms separately
- Rotate via block transform (breaks coordinate correlations)
- Quantize each coordinate to Lloyd-Max centroids
- Inverse rotate to reconstruct
| | Block | FMAs (d=128) | Params | Quality | |---|-------|-------------|--------|---------| | TurboQuant | Dense d×d WHT | 16,384 | 16,384 | baseline | | IsoQuant | 4D quaternion | 512 | 128 | better | | PlanarQuant | 2D Givens | 256 | 128 | better |
Deferred quantization: K-cache allocates as FP16 during prefill (zero error compounding). Decode tokens get quantized on insertion. This gives 3x better PPL than roundtrip quantization — and in llama.cpp, the F16 prefill makes decode faster than FP16 baseline (no dequant overhead in flash attention).
Why inverse rotation matters for V cache: The V dequant must apply the inverse of the forward rotation (inverse Givens or inverse quaternion). TurboQuant's WHT doesn't need explicit inverse because of the self-canceling properties of Hadamard transforms in attention weighted sums. Our fix (6e5a4aa) added this — PPL went from 15,369 to 7.05.
VRAM Savings (3-bit symmetric, 10.3x compression)
| Context | FP16 KV | Compressed | Saved | |---------|---------|------------|-------| | 8K | 288 MB | 28 MB | 260 MB | | 32K | 1,152 MB | 112 MB | 1.04 GB | | 128K | 4,608 MB | 447 MB | 4.16 GB |
Needle-in-Haystack passes at 8K, 32K, and 65K context.
Additional Benchmarks
Qwen2.5-3B — K-only Decode Speed
| Hardware | Cache K | Decode tok/s | Prefill tok/s | PPL | |----------|---------|-------------|---------------|-----| | RTX 5090 | planar3 | 367 | 23,600 | 9.98 | | RTX 5090 | FP16 | 356 | 20,800 | 10.03 | | M4 Mac Mini | planar3 | 48.3 | 554 | 9.98 | | M4 Mac Mini | FP16 | 47.4 | 518 | 9.98 |
Perplexity — Python/Triton (Qwen2.5-3B, wikitext-2, post-prefill)
| Method | 3-bit PPL | 4-bit PPL | vs FP16 (7.59) | |--------|-----------|-----------|----------------| | IsoQuant | 12.35 | 9.03 | +19% | | PlanarQuant | 10.12 | 9.56 | +33% / +26% | | RotorQuant | 12.22 | 10.03 | +61% / +32% |
python -m turboquant.benchmark_google_parity # PPL (post-prefill)
python -m turboquant.benchmark_perplexity --bits 3 4 # PPL (roundtrip)
python -m turboquant.benchmark_triton # Triton kernel speed
python -m turboquant.poc_high_context --backend planar # High-context generation
Acknowledgments
ParaMind2025 — PlanarQuant (2D Givens rotation) and IsoQuant (4D quaternion rotation) were designed by ParaMind2025. Their insight that simple block-diagonal rotations could match full-rank transforms for KV cache decorrelation made the llama.cpp integration practical.
References
- RotorQuant paper — Clifford algebra vector quantization for KV cache compression
- TurboQuant (ICLR 2026) — Google's KV cache compression
Related Skills
node-connect
344.1kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
96.8kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
344.1kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
344.1kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
