# TurboQuant

Near-optimal vector quantization from Google's ICLR 2026 paper: 95%+ recall, 5-8x compression, zero preprocessing, pure-Python FAISS replacement.
**The only TurboQuant implementation built for vector search: a FAISS-compatible vector quantization library.**

More than 180 repositories implement Google's TurboQuant for KV-cache compression. This is the only one built for vector similarity search.
A pure Python implementation of the TurboQuant algorithm (Zandieh et al., ICLR 2026) for FAISS-compatible vector quantization. Compress embedding vectors by 5-8x with 95%+ recall, zero preprocessing, and no GPU required. Includes both brute-force (TurboQuantIndex) and sub-linear ANN search (IVFTurboQuantIndex).
## Why TurboQuant?

| Feature | FAISS PQ | ScaNN | TurboQuant |
|---------|----------|-------|------------|
| Preprocessing | K-means (minutes) | Tree building (minutes) | None (instant) ¹ |
| Recall@10 | ~60% | ~85% | 95.3% |
| Compression | 8x | 4x | 5-8x |
| Dependencies | C++/CUDA | C++/TensorFlow | Pure Python/NumPy |
| Theory guarantee | None | None | 2.7x Shannon limit |
| Training data needed | Yes | Yes | No (data-oblivious) ¹ |
| Query complexity | O(N) brute-force | O(log N) | O(sqrt(N)) with IVF |
¹ TurboQuantIndex: no preprocessing, no training data. IVFTurboQuantIndex: requires K-means training on representative data.
## Quick Start

### Installation

```bash
pip install turboquant
```

Or from source:

```bash
git clone https://github.com/Firmamento-Technologies/TurboQuant.git
cd TurboQuant
pip install -e .
```
### Basic Usage

```python
import numpy as np

from turboquant import TurboQuantIndex

# Create an index (drop-in FAISS replacement)
index = TurboQuantIndex(dimension=384, num_bits=6)

# Add vectors (auto-normalized if needed)
vectors = np.random.randn(10000, 384).astype(np.float32)
index.add(vectors)

# Search
query = np.random.randn(1, 384).astype(np.float32)
similarities, indices = index.search(query, k=10)

# Save / load
index.save("my_index")
loaded = TurboQuantIndex.load("my_index")
```
### With IVF for Large Datasets

```python
import numpy as np

from turboquant import IVFTurboQuantIndex

# Create an IVF index for sub-linear search
index = IVFTurboQuantIndex(
    dimension=384,
    num_bits=6,
    nlist=100,   # number of partitions
    nprobe=10,   # partitions to search per query
)

# Train the partitioning (K-means on a representative sample)
training_data = np.random.randn(50000, 384).astype(np.float32)
index.train(training_data)

# Add vectors
index.add(training_data)

# Search (probes only nprobe partitions, not all N vectors)
query = np.random.randn(1, 384).astype(np.float32)
similarities, indices = index.search(query, k=10)
```
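The O(sqrt(N)) query cost comes from partition sizing: with nlist ≈ sqrt(N) partitions, each probe scans roughly sqrt(N) vectors. That sizing rule is a common IVF heuristic borrowed from FAISS practice, not something this library prescribes, and `suggest_ivf_params` below is a hypothetical helper for illustration:

```python
import math

def suggest_ivf_params(n_vectors: int, max_probes: int = 10):
    """Heuristic IVF sizing: nlist ~ sqrt(N) balances partition count
    against partition size, giving ~O(sqrt(N)) work per query."""
    nlist = max(1, int(math.sqrt(n_vectors)))
    nprobe = min(max_probes, nlist)
    # Approximate number of candidate vectors scanned per query
    scanned = nprobe * (n_vectors // nlist)
    return nlist, nprobe, scanned

nlist, nprobe, scanned = suggest_ivf_params(1_000_000)
print(nlist, nprobe, scanned)  # 1000 partitions, 10 probed, ~10,000 vectors scanned
```

Larger `nprobe` trades query speed for recall; `nlist=100, nprobe=10` in the example above follows the same shape at smaller scale.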
### With Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

from turboquant import TurboQuantIndex

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode documents
docs = ["First document", "Second document", ...]
embeddings = model.encode(docs)

# Build a compressed index
index = TurboQuantIndex(dimension=384, num_bits=6)
index.add(embeddings)

# Semantic search
query_emb = model.encode(["search query"])
similarities, doc_indices = index.search(query_emb, k=10)
```
### Low-Level Quantizer API

```python
import numpy as np

from turboquant import TurboQuantMSE

# MSE-optimal quantizer (Algorithm 1 from the paper)
tq = TurboQuantMSE(d=384, num_bits=4)

# Some unit-norm vectors to quantize
vectors = np.random.randn(1000, 384).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

codes = tq.quantize(vectors)   # (N, 384) float32 -> (N, 384) uint8
recon = tq.dequantize(codes)   # (N, 384) uint8 -> (N, 384) float32

# Check reconstruction quality
mse = np.mean(np.sum((vectors - recon) ** 2, axis=1))
cos_sim = np.mean(np.sum(vectors * recon, axis=1))
print(f"MSE: {mse:.6f}, Cosine Similarity: {cos_sim:.4f}")
```
## Demo

<p align="center">
  <img src="docs/turboquant_demo.gif" alt="TurboQuant Algorithm Animation" width="90%">
</p>

8-second walkthrough: input vectors → random rotation → Lloyd-Max quantization → reconstruction quality. Full MP4
## Performance

<p align="center">
  <img src="docs/recall_comparison.png" alt="TurboQuant vs Product Quantization Recall" width="80%">
</p>
<p align="center">
  <img src="docs/compression_tradeoff.png" alt="Quality vs Compression Trade-off" width="70%">
</p>

## How It Works

<p align="center">
  <img src="docs/algorithm_flow.png" alt="TurboQuant Algorithm Flow" width="90%">
</p>

TurboQuant implements a mathematically elegant compression scheme from Google Research (arXiv:2504.19874):
```text
Algorithm: TurboQuant_mse (Algorithm 1) + Lloyd-Max

Input vector x (d dimensions, unit norm)
        |
        v
[Random Rotation] ------------ Orthogonal matrix via QR decomposition
        |                      Transforms x so coordinates follow a Beta distribution
        v
[Lloyd-Max Quantization] ----- Optimal scalar quantizer per coordinate
        |                      Pre-computed centroids for the Beta PDF
        v
Compressed: b bits per coordinate (vs 32 bits original)
```
Key insight: After random rotation, each coordinate of a unit vector independently follows a known Beta distribution (converging to Gaussian for large d). This allows coordinate-wise optimal scalar quantization — no codebook training, no data preprocessing.
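The pipeline above can be sketched in plain NumPy. This is an illustrative toy, not the library's implementation: it substitutes a uniform quantizer grid for the true Lloyd-Max centroids matched to the Beta PDF, so its distortion is worse than TurboQuant's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 64, 4

# Random orthogonal rotation via QR decomposition (data-oblivious)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# A unit-norm input vector
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# After rotation, each coordinate is approximately N(0, 1/d)
y = Q @ x

# Scalar-quantize every coordinate to b bits. TurboQuant proper uses
# Lloyd-Max centroids for the Beta PDF; a uniform grid spanning
# +/- 3 standard deviations is a crude stand-in.
levels = 2 ** b
lo, hi = -3 / np.sqrt(d), 3 / np.sqrt(d)
step = (hi - lo) / levels
codes = np.clip(((y - lo) / step).astype(np.int64), 0, levels - 1)

# Dequantize to cell midpoints and undo the rotation
y_hat = lo + (codes + 0.5) * step
x_hat = Q.T @ y_hat

print("cosine similarity after 4-bit round-trip:",
      round(float(x @ x_hat / np.linalg.norm(x_hat)), 4))
```

Because the rotation is data-independent, nothing here requires training data: the same `Q` and centroid grid work for any unit vector, which is exactly what makes the scheme data-oblivious.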
## Theoretical Guarantees

From the paper (Theorems 1-3):

| Bits | MSE Distortion | Shannon Lower Bound | Ratio |
|------|---------------:|--------------------:|------:|
| 1 | 0.363 | 0.250 | 1.45x |
| 2 | 0.117 | 0.063 | 1.87x |
| 3 | 0.034 | 0.016 | 2.20x |
| 4 | 0.009 | 0.004 | 2.41x |
TurboQuant operates within 2.7x of the information-theoretic limit — provably near-optimal.
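The lower-bound column appears to follow the rate-distortion limit D = 2^(-2b) for a unit vector (my reading of the table's values, not a statement from the paper). The ratios can be reproduced, up to the table's rounding, from the distortion figures:

```python
# MSE distortion figures from the table above
mse = {1: 0.363, 2: 0.117, 3: 0.034, 4: 0.009}

for b, d_mse in mse.items():
    bound = 2.0 ** (-2 * b)  # assumed Shannon rate-distortion lower bound, 2^(-2b)
    print(f"b={b}: MSE={d_mse:.3f}, bound={bound:.4f}, ratio={d_mse / bound:.2f}x")
```

For b=3 and b=4 the exact-bound ratios come out slightly below the table's 2.20x and 2.41x, consistent with the MSE column being rounded to three decimals.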
Why TurboQuant Outperforms Existing Methods
### Indexing Speed (Table 2 from paper, 4-bit, 100K vectors)

| Dimension | Product Quantization | RaBitQ | TurboQuant | Speedup vs PQ |
|----------:|---------------------:|-------:|-----------:|--------------:|
| 200 (GloVe) | 37.04s | 597.25s | 0.0007s | 52,914x |
| 1,536 (OpenAI3) | 239.75s | 2,267.59s | 0.0013s | 184,423x |
| 3,072 (OpenAI3) | 494.42s | 3,957.19s | 0.0021s | 235,438x |
TurboQuant is 50,000–235,000x faster at indexing because it requires zero data-dependent preprocessing — no k-means, no codebook training, no grid optimization.
### Recall (Figure 5 from paper)
The paper demonstrates TurboQuant achieves higher recall than both PQ and RaBitQ at the same compression level, across all three datasets (GloVe d=200, DBpedia d=1536, DBpedia d=3072).
### Why This Matters

| Property | FAISS PQ | RaBitQ (SIGMOD 2024) | TurboQuant (ICLR 2026) |
|----------|----------|----------------------|------------------------|
| Recall | Baseline | Better than PQ | Best |
| Indexing time | Seconds-minutes | Minutes | Microseconds |
| Training data needed | Yes (k-means) | Yes (grid optimization) | No (data-oblivious) |
| Theoretical guarantee | None (can fail on some datasets) | Error bound | 2.72x Shannon limit |
| Online/streaming support | No (retrain on new data) | No | Brute-force: yes / IVF: after training |
| Query complexity | O(N) | O(log N) | O(sqrt(N)) with IVF |
PQ is known to "fail disastrously on some real-world datasets" (RaBitQ paper, SIGMOD 2024) because it lacks theoretical error bounds. TurboQuant's data-oblivious design guarantees consistent quality regardless of data distribution.
## Our Benchmark Results

Tested on all-MiniLM-L6-v2 embeddings (d=384):

| Bits | Recall@10 | Cosine Sim | Compression | Memory (10K vectors) |
|------|----------:|-----------:|------------:|---------------------:|
| 2 | 59.2% | 0.882 | 16.0x | 0.94 MB |
| 3 | 77.6% | 0.965 | 10.7x | 1.40 MB |
| 4 | 86.2% | 0.990 | 8.0x | 1.88 MB |
| 5 | 92.6% | 0.997 | 6.4x | 2.34 MB |
| 6 | 95.3% | 0.998 | 5.3x | 2.81 MB |
Baseline: float32 brute-force = 100% recall, 15.0 MB for 10K vectors
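The memory column tracks the raw code size, N·d·b/8 bytes; the table's figures deviate by a few percent, presumably from index metadata and unit conventions. A back-of-envelope check, with `index_memory_mb` being a hypothetical helper:

```python
def index_memory_mb(n: int, d: int, bits: int) -> float:
    """Raw code size for n vectors of dimension d at `bits` bits per coordinate."""
    return n * d * bits / 8 / 1e6  # decimal megabytes

n, d = 10_000, 384
for bits in (2, 3, 4, 5, 6):
    print(f"{bits}-bit: {32 / bits:.1f}x compression, "
          f"{index_memory_mb(n, d, bits):.2f} MB")

# float32 baseline (4 bytes per coordinate)
print(f"float32: {index_memory_mb(n, d, 32):.2f} MB")
```

The compression ratios (32/bits) reproduce the table exactly; the raw 6-bit code size comes to 2.88 MB against the table's 2.81 MB.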
### Recommended Configuration

| Use Case | Bits | Recall | Compression |
|----------|------|-------:|------------:|
| Maximum compression (IoT, mobile) | 3 | 77.6% | 10.7x |
| Balanced (default) | 4 | 86.2% | 8.0x |
| High accuracy (RAG, search) | 6 | 95.3% | 5.3x |
| Near-lossless | 8 | 99.5% | 4.0x |
### Query Performance

Measured single-threaded (pure Python/NumPy):

| Index Type | N | Dimension | Query Latency (100 queries) | QPS | Recall@10 |
|------------|--:|----------:|----------------------------:|----:|----------:|
| TurboQuantIndex | 5K | 384 | 25ms | 3,991 | 96.0% |
| TurboQuantIndex | 10K | 384 | 107ms | 939 | 95.6% |
**Note:** This is a pure Python/NumPy implementation; C++ libraries (FAISS, ScaNN) are significantly faster at scale. TurboQuant's advantages are zero preprocessing, high recall, and minimal dependencies.
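One plausible reading of why pure NumPy still reaches thousands of QPS (an assumption about the implementation, not a description of it): brute-force inner-product search collapses into a single matrix-vector product plus top-k selection, all inside optimized BLAS:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 10_000, 384, 10

# Stand-ins for dequantized database vectors and a unit-norm query
db = rng.standard_normal((n, d)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = rng.standard_normal(d).astype(np.float32)
q /= np.linalg.norm(q)

# Brute-force cosine search: one O(N*d) matmul, then top-k via argpartition
scores = db @ q
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]  # order the k hits by score
print(top_k, scores[top_k])
```

`argpartition` keeps the selection step at O(N) rather than the O(N log N) of a full sort, which matters once N grows past the matmul cost.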
### Recall on Different Distributions

Evaluated against exact float32 brute-force ground truth (N=5,000, d=384):

| Distribution | 4-bit Recall@10 | 6-bit Recall@10 | 8-bit Recall@10 |
|--------------|----------------:|----------------:|----------------:|
| Isotropic (best case) | 84.9% | 95.5% | 98.3% |
| Clustered (20 clusters) | 85.8% | 95.9% | 98.0% |
| Anisotropic (rank-50) | 83.9% | 93.9% | 97.7% |
*Isotropic data matches TurboQuant's modeling assumptions; clustered and anisotropic embeddings lose at most ~2 points of recall.*
