# PyNear
Fast, exact and approximate K-nearest-neighbour search for Python
## Table of Contents
- Installation
- Quick start
- Migrating from scikit-learn
- Features
- Demos
- Benchmarks
- Real-World Benchmark — SIFT1M Binary
- Development
<video src="https://github.com/user-attachments/assets/4e07e82d-f3a0-4d47-9d34-a5ea4f752e14" autoplay muted loop width="100%"></video>
PyNear is a Python library with a C++ core for exact or approximate (fast) KNN search over metric
spaces. It is built around Vantage Point Trees, metric
trees that scale well to higher dimensionalities where kd-trees degrade, and
uses SIMD intrinsics (AVX2 on x86-64, portable fallbacks on arm64/Apple
Silicon) to accelerate the hot distance-computation paths.
Already using scikit-learn's KNN? PyNear ships drop-in adapter classes that
implement the same fit / predict / score / kneighbors API —
migrate in one line.
## Why PyNear?
| | PyNear | Faiss | Annoy | scikit-learn |
|---|---|---|---|---|
| Exact results | ✅ VPTree always | ✅ flat index | ❌ approximate | ✅ |
| Approximate (fast, tunable) | ✅ IVFFlatL2Index | ✅ IVF | ✅ | ❌ |
| Metric agnostic | ✅ L2, L1, L∞, Hamming | L2 / inner product | L2 / cosine / Hamming | L2 / others |
| Low-dim sweet spot | ✅ | ❌ | ❌ | ❌ |
| High-dim (512-D – 1024-D) | ✅ IVFFlatL2Index | ✅ | ✅ | ❌ |
| Binary / Hamming exact | ✅ hardware popcount | ✅ | ✅ | ❌ |
| Binary / Hamming approx | ✅ MIH + IVFFlat | ⚠️ slow build | ❌ | ❌ |
| Threshold / range search | ✅ BKTree | ❌ | ❌ | ❌ |
| Pickle serialization | ✅ | ❌ | ✅ | ✅ |
| No extra native deps | ✅ NumPy only | ❌ compiled lib + optional GPU | ❌ | ❌ |
| scikit-learn compatible API | ✅ drop-in adapters | ❌ | ❌ | — |
PyNear covers the full spectrum: use VPTree indices when you need guaranteed exact answers (2-D to ~256-D), IVFFlatL2Index for fast approximate float search on high-dimensional data (512-D to 1024-D), or MIHBinaryIndex / IVFFlatBinaryIndex for approximate Hamming search on binary descriptors — achieving up to 257× speedup over exact binary brute-force at N=1M, d=512 with 100% Recall@10.
## For the Layman
K-Nearest Neighbours (KNN) is simply the idea of finding the k most similar items to a given query in a collection.
Think of it like asking: "given this song I like, what are the 5 most similar songs in my library?" The algorithm measures the "distance" between items (how different they are) and returns the closest ones.
The two key parameters are:
- k — how many neighbours to return (e.g. the 5 most similar)
- distance metric — how "similarity" is measured (e.g. Euclidean, Manhattan, Hamming)
Everything else — VP-Trees, SIMD, approximate search — is just engineering to make that search fast at scale.
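Stripped of all that engineering, KNN search is just the following. This is a brute-force sketch in plain NumPy for illustration only (it is not PyNear's API, and it is exactly the O(N) scan per query that the indices below avoid):

```python
import numpy as np

def knn_brute_force(data, query, k, metric="euclidean"):
    """Return indices and distances of the k points in `data`
    closest to `query`, nearest first."""
    if metric == "euclidean":
        dists = np.linalg.norm(data - query, axis=1)
    elif metric == "manhattan":
        dists = np.abs(data - query).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    nearest = np.argsort(dists)[:k]  # k smallest distances
    return nearest, dists[nearest]

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.5]])
idx, d = knn_brute_force(points, np.array([0.4, 0.4]), k=2)
# idx -> [3, 0]: the two points nearest to (0.4, 0.4)
```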
### Main applications of KNN search
- Image retrieval — finding visually similar images by searching nearest neighbours in an embedding space (e.g. face recognition, reverse image search).
- Recommendation systems — suggesting similar items (products, songs, articles) by finding the closest user or item embeddings.
- Anomaly detection — flagging data points whose nearest neighbours are unusually distant as potential outliers or fraud cases.
- Semantic search — retrieving documents or passages whose dense vector representations are closest to a query embedding (e.g. RAG pipelines).
- Broad-phase collision detection — quickly finding candidate object pairs that might be colliding by looking up the nearest neighbours of each object's bounding volume, before running the expensive narrow-phase test.
- Soft body / cloth simulation — finding the nearest mesh vertices or particles to resolve contact constraints and self-collision.
- Particle systems (SPH, fluid sim) — each particle needs to know its neighbours within a radius to compute pressure and density forces.
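The simulation use cases above all reduce to a fixed-radius neighbour query. As a point of reference, here is the O(N²) brute-force baseline that spatial indices such as VP-trees accelerate (illustrative NumPy, not PyNear code):

```python
import numpy as np

def radius_neighbours(positions, i, radius):
    """Indices of all particles within `radius` of particle i
    (excluding i itself). Brute force: O(N) per particle."""
    d = np.linalg.norm(positions - positions[i], axis=1)
    mask = d <= radius
    mask[i] = False  # a particle is not its own neighbour
    return np.flatnonzero(mask)

pos = np.array([[0.0, 0.0], [0.2, 0.0], [0.9, 0.9], [0.0, 0.25]])
print(radius_neighbours(pos, 0, radius=0.3))  # -> [1 3]
```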
## Installation

```bash
pip install pynear
```
Requires Python 3.8+ and NumPy ≥ 1.21.2. Pre-built wheels are available for Linux, macOS (x86-64 and Apple Silicon), and Windows — no compiler needed.
## Quick start

```python
import numpy as np
import pynear

# Build index from 100 000 vectors of dimension 32
data = np.random.rand(100_000, 32).astype(np.float32)
index = pynear.VPTreeL2Index()
index.set(data)

# KNN search — returns (indices, distances) per query, sorted nearest-first
queries = np.random.rand(10, 32).astype(np.float32)
indices, distances = index.searchKNN(queries, k=5)

# 1-NN shortcut (slightly faster than searchKNN with k=1)
nn_indices, nn_distances = index.search1NN(queries)
```
For all index types and advanced usage see docs/README.md.
### Approximate binary search (image descriptors)
For large-scale image retrieval with binary descriptors (ORB, BRIEF, AKAZE), PyNear provides two approximate Hamming-distance indices that are orders of magnitude faster than exact brute-force:
```python
import numpy as np
import pynear

# 1M × 512-bit descriptors (64 bytes each)
db = np.random.randint(0, 256, size=(1_000_000, 64), dtype=np.uint8)

# ── Multi-Index Hashing ───────────────────────────────────────────────────────
# Best for d=512 (m=8 sub-tables of 64 bits).
# 257× faster than brute-force at N=1M; 100% Recall@10 for near-duplicates.
mih = pynear.MIHBinaryIndex(m=8)  # m=4 for d=128/256, m=8 for d=512
mih.set(db)
queries = np.random.randint(0, 256, size=(100, 64), dtype=np.uint8)
indices, distances = mih.searchKNN(queries, k=10, radius=8)
# radius: any true neighbour within Hamming distance ≤ radius is guaranteed
# to be found (pigeonhole principle). Increase for higher recall on noisier data.

# ── IVF Flat Binary ───────────────────────────────────────────────────────────
# Predictable cost: scans nprobe clusters per query.
# Good when the query radius is unknown or data is non-uniform.
ivf = pynear.IVFFlatBinaryIndex(nlist=512, nprobe=16)
ivf.set(db)
indices, distances = ivf.searchKNN(queries, k=10)
ivf.set_nprobe(32)  # increase nprobe at runtime to trade speed for recall
```
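The pigeonhole guarantee behind the radius parameter can be checked directly: if a 512-bit code is split into m = 8 disjoint 64-bit chunks, any descriptor within total Hamming distance radius must agree with the query to within radius // m bits in at least one chunk (otherwise the per-chunk distances would sum to more than radius). A self-contained NumPy illustration, independent of PyNear:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming(a, b):
    """Hamming distance between two uint8 descriptor arrays."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

m, radius = 8, 8                                # 512 bits -> 8 chunks of 64 bits
base = rng.integers(0, 256, size=64, dtype=np.uint8)

# Corrupt exactly `radius` distinct bits to create a near-duplicate.
bits = np.unpackbits(base)
flip = rng.choice(bits.size, size=radius, replace=False)
bits[flip] ^= 1
noisy = np.packbits(bits)
assert hamming(base, noisy) == radius

# Pigeonhole: 8 flipped bits spread over 8 chunks means at least one
# 64-bit chunk differs in at most radius // m = 1 bit.
chunk_dists = [hamming(base[i * 8:(i + 1) * 8], noisy[i * 8:(i + 1) * 8])
               for i in range(m)]
assert min(chunk_dists) <= radius // m
```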
Choosing between MIH and IVFFlat:
| | MIHBinaryIndex | IVFFlatBinaryIndex |
|---|---|---|
| Best for | Near-duplicate retrieval (small Hamming radius) | General approximate Hamming KNN |
| d=512, N=1M query time | 0.037 ms | 1.95 ms |
| Recall guarantee | Exact for distance ≤ radius (pigeonhole) | Probabilistic (depends on nprobe) |
| Recall control | radius parameter | nprobe parameter |
| Recommended m | d/8 bytes (e.g. m=8 for 512-bit) | — |
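Both indices approximate the exact brute-force Hamming scan. For reference, that baseline is just a per-byte popcount over the XOR of query and database rows; the table-lookup version below is illustrative NumPy, not how PyNear's hardware-popcount path is implemented:

```python
import numpy as np

# Per-byte popcount lookup table: POPCOUNT[b] = number of set bits in byte b.
POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)

def hamming_batch(db, q):
    """Hamming distance from query row `q` (uint8) to every row of `db`."""
    return POPCOUNT[np.bitwise_xor(db, q)].sum(axis=1)

db = np.array([[0b00000000], [0b11111111], [0b00001111]], dtype=np.uint8)
q = np.array([0b00000000], dtype=np.uint8)
print(hamming_batch(db, q))  # -> [0 8 4]
```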
## Migrating from scikit-learn
PyNear provides adapter classes that implement the same interface as
sklearn.neighbors.NearestNeighbors, KNeighborsClassifier, and
KNeighborsRegressor. Changing the import is all that is required in most
cases:
```python
# Before
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# After — identical API, backed by a VP-Tree
from pynear.sklearn_adapter import PyNearKNeighborsClassifier
clf = PyNearKNeighborsClassifier(n_neighbors=5, metric='euclidean')
```
All three adapters follow the standard scikit-learn workflow:
```python
from pynear.sklearn_adapter import (
    PyNearNearestNeighbors,
    PyNearKNeighborsClassifier,
    PyNearKNeighborsRegressor,
)

# Unsupervised neighbour lookup
nn = PyNearNearestNeighbors(n_neighbors=5, metric='euclidean')
nn.fit(X_train)
distances, indices = nn.kneighbors(X_query)

# Classification
clf = PyNearKNeighborsClassifier(n_neighbors=5, weights='distance')
clf.fit(X_train, y_train)
clf.predict(X_test)        # class labels
clf.predict_proba(X_test)  # per-class probabilities
clf.score(X_test, y_test)  # accuracy

# Regression
reg = PyNearKNeighborsRegressor(n_neighbors=5, weights='uniform')
reg.fit(X_train, y_train)
reg.predict(X_test)        # predicted values
reg.score(X_test, y_test)  # R²
```
Supported metrics: `euclidean` / `l2`, `manhattan` / `l1`, `chebyshev` / `linf`, `hamming`
Supported weights: `uniform`, `distance` (inverse-distance-weighted)
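How the two weights options behave can be sketched in a few lines of NumPy. This is a conceptual illustration of uniform versus inverse-distance voting, not PyNear's internal implementation:

```python
import numpy as np

def weighted_vote(neighbor_labels, neighbor_dists, weights="uniform"):
    """Predict a class from k neighbour labels: equal votes for
    'uniform', votes proportional to 1/distance for 'distance'."""
    if weights == "uniform":
        w = np.ones_like(neighbor_dists)
    else:  # 'distance'
        w = 1.0 / np.maximum(neighbor_dists, 1e-12)  # guard exact matches
    classes = np.unique(neighbor_labels)
    scores = [w[neighbor_labels == c].sum() for c in classes]
    return int(classes[int(np.argmax(scores))])

labels = np.array([0, 0, 1])
dists = np.array([2.0, 2.0, 0.1])
print(weighted_vote(labels, dists, "uniform"))   # -> 0 (simple majority)
print(weighted_vote(labels, dists, "distance"))  # -> 1 (closest point dominates)
```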
Note: Input arrays are cast to `float32` (or `uint8` for Hamming) before indexing. scikit-learn uses `float64` internally, so very small numerical differences may appear at the precision boundary, but nearest-neighbour results are identical for all practical datasets.
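The effect of the cast is easy to quantify on synthetic data (illustrative NumPy; exact error magnitudes depend on the data):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((1000, 32))  # float64, as scikit-learn stores it
q = rng.random(32)

d64 = np.linalg.norm(X - q, axis=1)
d32 = np.linalg.norm(X.astype(np.float32) - q.astype(np.float32), axis=1)

# Distances agree to roughly float32 precision (~1e-7 relative)...
assert np.allclose(d64, d32, rtol=1e-5)

# ...which is far smaller than typical gaps between neighbour distances,
# so the nearest neighbour is the same under both precisions.
assert np.argmin(d64) == np.argmin(d32)
```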
## Features

### Available indices
Exact indices — always return the true k nearest neighbours:
| Index | Distance | Data type | Notes |
|---|---|---|---|
| VPTreeL2Index | L2 (Euclidean) | float32 | SIMD-accelerated |
| VPTreeL1Index | L1 (Manhattan) | float32 | SIMD-accelerated |
| VPTreeChebyshevIndex | L∞ (Chebyshev) | float32 | SIMD-accelerated |
| VPTreeBinaryIndex | Hamming | uint8 | Hardware popcount |
| BKTreeBinaryIndex | Hamming | uint8 | Threshold / range search |
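The three float metrics in the table above differ only in how per-coordinate differences are aggregated; for reference:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 0.0, 3.0], dtype=np.float32)
diff = np.abs(a - b)  # [3, 2, 0]

l2 = float(np.sqrt((diff ** 2).sum()))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
l1 = float(diff.sum())                  # Manhattan: 3 + 2 + 0 = 5
linf = float(diff.max())                # Chebyshev: max(3, 2, 0) = 3
```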
Approximate indices — trade a small recall budget for large speed gains; tunable via n_probe / radius:
| Index | Distance | Data type | Notes |
|---|---|---|---|
| IVFFlatL2Index | L2 (Euclidean) | `flo
