TurboQuant

Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement

Install / Use

/learn @Firmamento-Technologies/TurboQuant

README

<p align="center"> <img src="docs/hero_banner.png" alt="TurboQuant — Near-Optimal Vector Quantization for AI" width="100%"> </p>

TurboQuant

The only TurboQuant implementation for vector search — FAISS-compatible vector quantization library

180+ repos have implemented Google's TurboQuant for KV cache compression. This is the only one built for vector similarity search.

A pure Python implementation of the TurboQuant algorithm (Zandieh et al., ICLR 2026) for FAISS-compatible vector quantization. Compress embedding vectors by 5-8x with 95%+ recall, zero preprocessing, and no GPU required. Includes both brute-force (TurboQuantIndex) and sub-linear ANN search (IVFTurboQuantIndex).

Why TurboQuant?

| Feature | FAISS PQ | ScaNN | TurboQuant |
|---------|----------|-------|------------|
| Preprocessing | K-means (minutes) | Tree building (minutes) | None (instant) ¹ |
| Recall@10 | ~60% | ~85% | 95.3% |
| Compression | 8x | 4x | 5-8x |
| Dependencies | C++/CUDA | C++/TensorFlow | Pure Python/NumPy |
| Theory guarantee | None | None | 2.7x Shannon limit |
| Training data needed | Yes | Yes | No (data-oblivious) ¹ |
| Query complexity | O(N) brute-force | O(log N) | O(sqrt(N)) with IVF |

¹ TurboQuantIndex: no preprocessing, no training data. IVFTurboQuantIndex: requires K-means training on representative data.

Quick Start

Installation

pip install turboquant

Or from source:

git clone https://github.com/Firmamento-Technologies/TurboQuant.git
cd TurboQuant
pip install -e .

Basic Usage

from turboquant import TurboQuantIndex
import numpy as np

# Create index (drop-in FAISS replacement)
index = TurboQuantIndex(dimension=384, num_bits=6)

# Add vectors (auto-normalizes if needed)
vectors = np.random.randn(10000, 384).astype(np.float32)
index.add(vectors)

# Search
query = np.random.randn(1, 384).astype(np.float32)
similarities, indices = index.search(query, k=10)

# Save / Load
index.save("my_index")
loaded = TurboQuantIndex.load("my_index")

With IVF for Large Datasets

from turboquant import IVFTurboQuantIndex
import numpy as np

# Create IVF index for sub-linear search
index = IVFTurboQuantIndex(
    dimension=384, num_bits=6,
    nlist=100,    # number of partitions
    nprobe=10,    # partitions to search per query
)

# Train partitioning (K-means on a sample)
training_data = np.random.randn(50000, 384).astype(np.float32)
index.train(training_data)

# Add vectors
index.add(training_data)

# Search (probes only nprobe partitions, not all N vectors)
query = np.random.randn(1, 384).astype(np.float32)
similarities, indices = index.search(query, k=10)
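
How should `nlist` and `nprobe` be sized? A common community heuristic (borrowed from FAISS practice; this README does not prescribe it) is nlist ≈ sqrt(N), which balances the coarse scan over partition centroids against the fine scan inside probed partitions and yields the O(sqrt(N)) query cost noted above. The helper below is a hypothetical sketch of that rule of thumb:

```python
import math

def suggest_ivf_params(n_vectors: int) -> tuple[int, int]:
    """Hypothetical helper: sqrt-N partitions, probe ~10% of them."""
    nlist = max(1, int(math.sqrt(n_vectors)))
    nprobe = max(1, nlist // 10)
    return nlist, nprobe

print(suggest_ivf_params(50_000))     # (223, 22)
print(suggest_ivf_params(1_000_000))  # (1000, 100)
```

Raising `nprobe` trades latency for recall; at `nprobe == nlist` the search degenerates to scanning every partition.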

With Sentence Transformers

from sentence_transformers import SentenceTransformer
from turboquant import TurboQuantIndex

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode documents
docs = ["First document", "Second document", ...]
embeddings = model.encode(docs)

# Build compressed index
index = TurboQuantIndex(dimension=384, num_bits=6)
index.add(embeddings)

# Semantic search
query_emb = model.encode(["search query"])
similarities, doc_indices = index.search(query_emb, k=10)

Low-Level Quantizer API

from turboquant import TurboQuantMSE
import numpy as np

# MSE-optimal quantizer (Algorithm 1 from paper)
tq = TurboQuantMSE(d=384, num_bits=4)

# Quantize unit-norm vectors
vectors = np.random.randn(1000, 384).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
codes = tq.quantize(vectors)      # (N, 384) float32 -> (N, 384) uint8
recon = tq.dequantize(codes)      # (N, 384) uint8 -> (N, 384) float32

# Check quality
mse = np.mean(np.sum((vectors - recon) ** 2, axis=1))
cos_sim = np.mean(np.sum(vectors * recon, axis=1))
print(f"MSE: {mse:.6f}, Cosine Similarity: {cos_sim:.4f}")
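
A useful identity when reading these two metrics: for unit-norm x and reconstruction r, ||x - r||² = 2(1 - x·r), so MSE and cosine similarity carry the same information. A standalone NumPy check (no turboquant required; the noisy `r` below is just a stand-in for a reconstruction):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 64))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # unit-norm inputs
r = x + 0.05 * rng.standard_normal(x.shape)     # stand-in "reconstruction"
r /= np.linalg.norm(r, axis=1, keepdims=True)   # also unit-norm

mse = np.mean(np.sum((x - r) ** 2, axis=1))
cos = np.mean(np.sum(x * r, axis=1))
print(np.allclose(mse, 2 * (1 - cos)))          # True: same error, two views
```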

Demo

<p align="center"> <img src="docs/turboquant_demo.gif" alt="TurboQuant Algorithm Animation" width="90%"> </p>

8-second walkthrough: input vectors → random rotation → Lloyd-Max quantization → reconstruction quality. Full MP4

Performance

<p align="center"> <img src="docs/recall_comparison.png" alt="TurboQuant vs Product Quantization Recall" width="80%"> </p> <p align="center"> <img src="docs/compression_tradeoff.png" alt="Quality vs Compression Trade-off" width="70%"> </p>

How It Works

<p align="center"> <img src="docs/algorithm_flow.png" alt="TurboQuant Algorithm Flow" width="90%"> </p>

TurboQuant implements a mathematically elegant compression scheme from Google Research (arXiv:2504.19874):

Algorithm: TurboQuant_mse (Algorithm 1) + Lloyd-Max

Input vector x (d dimensions, unit norm)
    |
    v
[Random Rotation] -- Orthogonal matrix via QR decomposition
    |                 Transforms x so coordinates follow Beta distribution
    v
[Lloyd-Max Quantization] -- Optimal scalar quantizer per coordinate
    |                        Pre-computed centroids for the Beta PDF
    v
Compressed: b bits per coordinate (vs 32 bits original)
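
The Lloyd-Max stage can be illustrated with a small standalone sketch. This is my own illustration, running Lloyd's algorithm on Gaussian samples; the library instead uses pre-computed centroids for the exact Beta PDF:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = np.sort(rng.standard_normal(100_000))  # approximates one coordinate's PDF

b = 2                      # bits per coordinate
k = 2 ** b                 # number of quantization levels
# Initialize centroids at evenly spaced quantiles
centroids = np.quantile(samples, (np.arange(k) + 0.5) / k)

for _ in range(50):
    # Decision boundaries are midpoints between adjacent centroids
    edges = (centroids[:-1] + centroids[1:]) / 2
    labels = np.searchsorted(edges, samples)
    # Update each centroid to the conditional mean of its cell
    centroids = np.array([samples[labels == j].mean() for j in range(k)])

edges = (centroids[:-1] + centroids[1:]) / 2
labels = np.searchsorted(edges, samples)
mse = float(np.mean((samples - centroids[labels]) ** 2))
print(np.round(centroids, 3))   # levels near +/-0.453 and +/-1.510
print(round(mse, 4))            # distortion ~0.117 (cf. the 2-bit row of the table below)
```

The resulting per-coordinate distortion lines up with the 2-bit MSE figure in the Theoretical Guarantees table.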

Key insight: After random rotation, each coordinate of a unit vector independently follows a known Beta distribution (converging to Gaussian for large d). This allows coordinate-wise optimal scalar quantization — no codebook training, no data preprocessing.
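
The effect of the rotation can be sketched in a few lines of standalone NumPy, assuming a Haar-random rotation built via QR as the flow above describes (no turboquant needed). Even a worst-case input with all its energy in one coordinate comes out looking Gaussian:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 384

# Haar-random orthogonal matrix via QR of a Gaussian matrix
Q, R = np.linalg.qr(rng.standard_normal((d, d)))
Q *= np.sign(np.diag(R))   # sign fix so Q is uniformly (Haar) distributed

# Worst case for naive scalar quantization: a one-hot unit vector
x = np.zeros(d)
x[0] = 1.0
y = Q @ x                  # rotated vector, still unit norm

print(round(float(y.std() * np.sqrt(d)), 2))  # ~1.0: coordinates look like N(0, 1/d)
print(float(np.abs(y).max()))                 # largest |coordinate| is well below 1.0
```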

Theoretical Guarantees

From the paper (Theorems 1-3):

| Bits | MSE Distortion | Shannon Lower Bound | Ratio |
|------|----------------|--------------------:|------:|
| 1 | 0.363 | 0.250 | 1.45x |
| 2 | 0.117 | 0.063 | 1.87x |
| 3 | 0.034 | 0.016 | 2.20x |
| 4 | 0.009 | 0.004 | 2.41x |

TurboQuant operates within 2.7x of the information-theoretic limit — provably near-optimal.
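
The bound column is consistent with the Gaussian rate-distortion function D(b) = 2^(-2b) = 4^(-b) per unit vector; this is my reading of the table, not something the README states. Ratios recomputed from the rounded MSE entries match the table at 1-2 bits and drift slightly at 3-4 bits, where the printed ratios were presumably computed before rounding:

```python
# Assumed interpretation: Shannon lower bound D(b) = 4 ** -b per unit vector.
table = [(1, 0.363), (2, 0.117), (3, 0.034), (4, 0.009)]  # (bits, MSE) rows above
for bits, mse in table:
    bound = 4.0 ** -bits
    print(bits, round(bound, 4), f"{mse / bound:.2f}x")
```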

Why TurboQuant Outperforms Existing Methods

Results from the original paper (Section 4.4) show TurboQuant consistently outperforms both Product Quantization (used by FAISS, Pinecone, Elasticsearch) and RaBitQ (SIGMOD 2024 state-of-the-art) across all tested datasets.

Indexing Speed (Table 2 from paper, 4-bit, 100K vectors)

| Dimension | Product Quantization | RaBitQ | TurboQuant | Speedup vs PQ |
|----------:|---------------------:|-------:|-----------:|--------------:|
| 200 (GloVe) | 37.04s | 597.25s | 0.0007s | 52,914x |
| 1,536 (OpenAI3) | 239.75s | 2,267.59s | 0.0013s | 184,423x |
| 3,072 (OpenAI3) | 494.42s | 3,957.19s | 0.0021s | 235,438x |

TurboQuant is 50,000–235,000x faster at indexing because it requires zero data-dependent preprocessing — no k-means, no codebook training, no grid optimization.

Recall (Figure 5 from paper)

The paper demonstrates TurboQuant achieves higher recall than both PQ and RaBitQ at the same compression level, across all three datasets (GloVe d=200, DBpedia d=1536, DBpedia d=3072).

Why This Matters

| Property | FAISS PQ | RaBitQ (SIGMOD 2024) | TurboQuant (ICLR 2026) |
|----------|----------|----------------------|------------------------|
| Recall | Baseline | Better than PQ | Best |
| Indexing time | Seconds–minutes | Minutes | Microseconds |
| Training data needed | Yes (k-means) | Yes (grid optimization) | No (data-oblivious) |
| Theoretical guarantee | None (can fail on some datasets) | Error bound | 2.72x Shannon limit |
| Online/streaming support | No (retrain on new data) | No | Brute-force: yes / IVF: after training |
| Query complexity | O(N) | O(log N) | O(sqrt(N)) with IVF |

PQ is known to "fail disastrously on some real-world datasets" (RaBitQ paper, SIGMOD 2024) because it lacks theoretical error bounds. TurboQuant's data-oblivious design guarantees consistent quality regardless of data distribution.

Our Benchmark Results

Tested on all-MiniLM-L6-v2 embeddings (d=384):

| Bits | Recall@10 | Cosine Sim | Compression | Memory (10K vectors) |
|------|-----------|------------|-------------|----------------------|
| 2 | 59.2% | 0.882 | 16.0x | 0.94 MB |
| 3 | 77.6% | 0.965 | 10.7x | 1.40 MB |
| 4 | 86.2% | 0.990 | 8.0x | 1.88 MB |
| 5 | 92.6% | 0.997 | 6.4x | 2.34 MB |
| 6 | 95.3% | 0.998 | 5.3x | 2.81 MB |

Baseline: float32 brute-force = 100% recall, 15.0 MB for 10K vectors
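
The Compression column is plain bit-width arithmetic: b-bit codes replace 32-bit floats, so the raw code-size ratio is 32/b. A sketch of that arithmetic (the helper names are mine; the reported MB figures additionally reflect index overhead and measurement convention, so they do not equal the raw code size exactly):

```python
def compression_ratio(num_bits: int) -> float:
    """Raw code-size ratio vs float32: 32 bits per value -> num_bits bits."""
    return 32 / num_bits

def raw_code_mb(n: int, d: int, num_bits: int) -> float:
    """Size of the codes alone, in decimal MB."""
    return n * d * num_bits / 8 / 1e6

print(round(compression_ratio(6), 1))         # 5.3 (matches the 6-bit row)
print(round(raw_code_mb(10_000, 384, 6), 2))  # 2.88 raw; the table reports 2.81 MB
```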

Recommended Configuration

| Use Case | Bits | Recall | Compression |
|----------|------|--------|-------------|
| Maximum compression (IoT, mobile) | 3 | 77.6% | 10.7x |
| Balanced (default) | 4 | 86.2% | 8.0x |
| High accuracy (RAG, search) | 6 | 95.3% | 5.3x |
| Near-lossless | 8 | 99.5% | 4.0x |

Query Performance

Measured single-threaded (pure Python/NumPy):

| Index Type | N | Dimension | Query Latency (100 queries) | QPS | Recall@10 |
|------------|--:|----------:|----------------------------:|----:|----------:|
| TurboQuantIndex | 5K | 384 | 25ms | 3,991 | 96.0% |
| TurboQuantIndex | 10K | 384 | 107ms | 939 | 95.6% |

Note: Pure Python/NumPy implementation. C++ libraries (FAISS, ScaNN) are significantly faster at scale. TurboQuant's advantages are zero-preprocessing, high recall, and minimal dependencies.

Recall on Different Distributions

Evaluated against exact float32 brute-force ground truth (N=5,000, d=384):

| Distribution | 4-bit Recall@10 | 6-bit Recall@10 | 8-bit Recall@10 |
|--------------|----------------:|----------------:|----------------:|
| Isotropic (best case) | 84.9% | 95.5% | 98.3% |
| Clustered (20 clusters) | 85.8% | 95.9% | 98.0% |
| Anisotropic (rank-50) | 83.9% | 93.9% | 97.7% |

*Isotropic data matches TurboQuant's modeling assumption of rotation-invariant coordinates, hence "best case".*

No findings