# PyNear
Fast, exact and approximate K-nearest-neighbour search for Python
## Table of Contents
- Installation
- Quick start
- Migrating from scikit-learn
- Features
- Demos
- Benchmarks
- Real-World Benchmark — SIFT1M Binary
- Development
<video src="https://github.com/user-attachments/assets/4e07e82d-f3a0-4d47-9d34-a5ea4f752e14" autoplay muted loop width="100%"></video>
PyNear is a Python library with a C++ core for exact or approximate (fast) KNN search over metric
spaces. It is built around Vantage Point Trees, metric
trees that scale well to higher dimensionalities where kd-trees degrade, and
uses SIMD intrinsics (AVX2 on x86-64, portable fallbacks on arm64/Apple
Silicon) to accelerate the hot distance-computation paths.
Already using scikit-learn's KNN? PyNear ships drop-in adapter classes that
implement the same fit / predict / score / kneighbors API —
migrate in one line.
## Why PyNear?
| | PyNear | Faiss | Annoy | scikit-learn |
|---|---|---|---|---|
| Exact results | ✅ VPTree always | ✅ flat index | ❌ approximate | ✅ |
| Approximate (fast, tunable) | ✅ IVFFlatL2Index | ✅ IVF | ✅ | ❌ |
| Metric agnostic | ✅ L2, L1, L∞, Hamming | L2 / inner product | L2 / cosine / Hamming | L2 / others |
| Low-dim sweet spot | ✅ | ❌ | ❌ | ❌ |
| High-dim (512-D – 1024-D) | ✅ IVFFlatL2Index | ✅ | ✅ | ❌ |
| Binary / Hamming exact | ✅ hardware popcount | ✅ | ✅ | ❌ |
| Binary / Hamming approx | ✅ MIH + IVFFlat | ⚠️ slow build | ❌ | ❌ |
| Threshold / range search | ✅ BKTree | ❌ | ❌ | ❌ |
| Pickle serialization | ✅ | ❌ | ✅ | ✅ |
| No extra native deps | ✅ NumPy only | ❌ compiled lib + optional GPU | ❌ | ❌ |
| scikit-learn compatible API | ✅ drop-in adapters | ❌ | ❌ | — |
PyNear covers the full spectrum: use VPTree indices when you need guaranteed exact answers (2-D to ~256-D), IVFFlatL2Index for fast approximate float search on high-dimensional data (512-D to 1024-D), or MIHBinaryIndex / IVFFlatBinaryIndex for approximate Hamming search on binary descriptors — achieving up to 257× speedup over exact binary brute-force at N=1M, d=512 with 100% Recall@10.
## For the Layman
K-Nearest Neighbours (KNN) is simply the idea of finding the k most similar items to a given query in a collection.
Think of it like asking: "given this song I like, what are the 5 most similar songs in my library?" The algorithm measures the "distance" between items (how different they are) and returns the closest ones.
The two key parameters are:
- k — how many neighbours to return (e.g. the 5 most similar)
- distance metric — how "similarity" is measured (e.g. Euclidean, Manhattan, Hamming)
Everything else — VP-Trees, SIMD, approximate search — is just engineering to make that search fast at scale.
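Stripped of all that engineering, KNN search is just the following. This is a brute-force sketch in plain NumPy for illustration only (it is not PyNear's API, and it is exactly the O(N) scan per query that the indices below avoid):

```python
import numpy as np

def knn_brute_force(data, query, k, metric="euclidean"):
    """Return indices and distances of the k points in `data`
    closest to `query`, nearest first."""
    if metric == "euclidean":
        dists = np.linalg.norm(data - query, axis=1)
    elif metric == "manhattan":
        dists = np.abs(data - query).sum(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    nearest = np.argsort(dists)[:k]  # k smallest distances
    return nearest, dists[nearest]

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.5]])
idx, d = knn_brute_force(points, np.array([0.4, 0.4]), k=2)
# idx -> [3, 0]: the two points nearest to (0.4, 0.4)
```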
### Main applications of KNN search
- Image retrieval — finding visually similar images by searching nearest neighbours in an embedding space (e.g. face recognition, reverse image search).
- Recommendation systems — suggesting similar items (products, songs, articles) by finding the closest user or item embeddings.
- Anomaly detection — flagging data points whose nearest neighbours are unusually distant as potential outliers or fraud cases.
- Semantic search — retrieving documents or passages whose dense vector representations are closest to a query embedding (e.g. RAG pipelines).
- Broad-phase collision detection — quickly finding candidate object pairs that might be colliding by looking up the nearest neighbours of each object's bounding volume, before running the expensive narrow-phase test.
- Soft body / cloth simulation — finding the nearest mesh vertices or particles to resolve contact constraints and self-collision.
- Particle systems (SPH, fluid sim) — each particle needs to know its neighbours within a radius to compute pressure and density forces.
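The simulation use cases above all reduce to a fixed-radius neighbour query. As a point of reference, here is the O(N²) brute-force baseline that spatial indices such as VP-trees accelerate (illustrative NumPy, not PyNear code):

```python
import numpy as np

def radius_neighbours(positions, i, radius):
    """Indices of all particles within `radius` of particle i
    (excluding i itself). Brute force: O(N) per particle."""
    d = np.linalg.norm(positions - positions[i], axis=1)
    mask = d <= radius
    mask[i] = False  # a particle is not its own neighbour
    return np.flatnonzero(mask)

pos = np.array([[0.0, 0.0], [0.2, 0.0], [0.9, 0.9], [0.0, 0.25]])
print(radius_neighbours(pos, 0, radius=0.3))  # -> [1 3]
```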
## Installation

```bash
pip install pynear
```
Requires Python 3.8+ and NumPy ≥ 1.21.2. Pre-built wheels are available for Linux, macOS (x86-64 and Apple Silicon), and Windows — no compiler needed.
## Quick start

```python
import numpy as np
import pynear

# Build index from 100 000 vectors of dimension 32
data = np.random.rand(100_000, 32).astype(np.float32)
index = pynear.VPTreeL2Index()
index.set(data)

# KNN search — returns (indices, distances) per query, sorted nearest-first
queries = np.random.rand(10, 32).astype(np.float32)
indices, distances = index.searchKNN(queries, k=5)

# 1-NN shortcut (slightly faster than searchKNN with k=1)
nn_indices, nn_distances = index.search1NN(queries)
```
For all index types and advanced usage see docs/README.md.
### Approximate binary search (image descriptors)
For large-scale image retrieval with binary descriptors (ORB, BRIEF, AKAZE), PyNear provides two approximate Hamming-distance indices that are orders of magnitude faster than exact brute-force:
```python
import numpy as np
import pynear

# 1M × 512-bit descriptors (64 bytes each)
db = np.random.randint(0, 256, size=(1_000_000, 64), dtype=np.uint8)

# ── Multi-Index Hashing ───────────────────────────────────────────────────────
# Best for d=512 (m=8 sub-tables of 64 bits).
# 257× faster than brute-force at N=1M; 100% Recall@10 for near-duplicates.
mih = pynear.MIHBinaryIndex(m=8)  # m=4 for d=128/256, m=8 for d=512
mih.set(db)
queries = np.random.randint(0, 256, size=(100, 64), dtype=np.uint8)
indices, distances = mih.searchKNN(queries, k=10, radius=8)
# radius: any true neighbour within Hamming distance ≤ radius is guaranteed
# to be found (pigeonhole principle). Increase for higher recall on noisier data.

# ── IVF Flat Binary ───────────────────────────────────────────────────────────
# Predictable cost: scans nprobe clusters per query.
# Good when the query radius is unknown or data is non-uniform.
ivf = pynear.IVFFlatBinaryIndex(nlist=512, nprobe=16)
ivf.set(db)
indices, distances = ivf.searchKNN(queries, k=10)
ivf.set_nprobe(32)  # increase nprobe at runtime to trade speed for recall
```
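The pigeonhole guarantee behind the radius parameter can be checked directly: if a 512-bit code is split into m = 8 disjoint 64-bit chunks, any descriptor within total Hamming distance radius must agree with the query to within radius // m bits in at least one chunk (otherwise the per-chunk distances would sum to more than radius). A self-contained NumPy illustration, independent of PyNear:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming(a, b):
    """Hamming distance between two uint8 descriptor arrays."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

m, radius = 8, 8                                # 512 bits -> 8 chunks of 64 bits
base = rng.integers(0, 256, size=64, dtype=np.uint8)

# Corrupt exactly `radius` distinct bits to create a near-duplicate.
bits = np.unpackbits(base)
flip = rng.choice(bits.size, size=radius, replace=False)
bits[flip] ^= 1
noisy = np.packbits(bits)
assert hamming(base, noisy) == radius

# Pigeonhole: 8 flipped bits spread over 8 chunks means at least one
# 64-bit chunk differs in at most radius // m = 1 bit.
chunk_dists = [hamming(base[i * 8:(i + 1) * 8], noisy[i * 8:(i + 1) * 8])
               for i in range(m)]
assert min(chunk_dists) <= radius // m
```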
Choosing between MIH and IVFFlat:
| | MIHBinaryIndex | IVFFlatBinaryIndex |
|---|---|---|
| Best for | Near-duplicate retrieval (small Hamming radius) | General approximate Hamming KNN |
| d=512, N=1M query time | 0.037 ms | 1.95 ms |
| Recall guarantee | Exact for distance ≤ radius (pigeonhole) | Probabilistic (depends on nprobe) |
| Recall control | radius parameter | nprobe parameter |
| Recommended m | d/8 bytes (e.g. m=8 for 512-bit) | — |
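Both indices approximate the exact brute-force Hamming scan. For reference, that baseline is just a per-byte popcount over the XOR of query and database rows; the table-lookup version below is illustrative NumPy, not how PyNear's hardware-popcount path is implemented:

```python
import numpy as np

# Per-byte popcount lookup table: POPCOUNT[b] = number of set bits in byte b.
POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint8)

def hamming_batch(db, q):
    """Hamming distance from query row `q` (uint8) to every row of `db`."""
    return POPCOUNT[np.bitwise_xor(db, q)].sum(axis=1)

db = np.array([[0b00000000], [0b11111111], [0b00001111]], dtype=np.uint8)
q = np.array([0b00000000], dtype=np.uint8)
print(hamming_batch(db, q))  # -> [0 8 4]
```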
## Migrating from scikit-learn
PyNear provides adapter classes that implement the same interface as
sklearn.neighbors.NearestNeighbors, KNeighborsClassifier, and
KNeighborsRegressor. Changing the import is all that is required in most
cases:
```python
# Before
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# After — identical API, backed by a VP-Tree
from pynear.sklearn_adapter import PyNearKNeighborsClassifier
clf = PyNearKNeighborsClassifier(n_neighbors=5, metric='euclidean')
```
All three adapters follow the standard scikit-learn workflow:
```python
from pynear.sklearn_adapter import (
    PyNearNearestNeighbors,
    PyNearKNeighborsClassifier,
    PyNearKNeighborsRegressor,
)

# Unsupervised neighbour lookup
nn = PyNearNearestNeighbors(n_neighbors=5, metric='euclidean')
nn.fit(X_train)
distances, indices = nn.kneighbors(X_query)

# Classification
clf = PyNearKNeighborsClassifier(n_neighbors=5, weights='distance')
clf.fit(X_train, y_train)
clf.predict(X_test)        # class labels
clf.predict_proba(X_test)  # per-class probabilities
clf.score(X_test, y_test)  # accuracy

# Regression
reg = PyNearKNeighborsRegressor(n_neighbors=5, weights='uniform')
reg.fit(X_train, y_train)
reg.predict(X_test)        # predicted values
reg.score(X_test, y_test)  # R²
```
Supported metrics: `euclidean` / `l2`, `manhattan` / `l1`, `chebyshev` / `linf`, `hamming`
Supported weights: `uniform`, `distance` (inverse-distance-weighted)
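How the two weights options behave can be sketched in a few lines of NumPy. This is a conceptual illustration of uniform versus inverse-distance voting, not PyNear's internal implementation:

```python
import numpy as np

def weighted_vote(neighbor_labels, neighbor_dists, weights="uniform"):
    """Predict a class from k neighbour labels: equal votes for
    'uniform', votes proportional to 1/distance for 'distance'."""
    if weights == "uniform":
        w = np.ones_like(neighbor_dists)
    else:  # 'distance'
        w = 1.0 / np.maximum(neighbor_dists, 1e-12)  # guard exact matches
    classes = np.unique(neighbor_labels)
    scores = [w[neighbor_labels == c].sum() for c in classes]
    return int(classes[int(np.argmax(scores))])

labels = np.array([0, 0, 1])
dists = np.array([2.0, 2.0, 0.1])
print(weighted_vote(labels, dists, "uniform"))   # -> 0 (simple majority)
print(weighted_vote(labels, dists, "distance"))  # -> 1 (closest point dominates)
```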
Note: Input arrays are cast to `float32` (or `uint8` for Hamming) before indexing. scikit-learn uses `float64` internally, so very small numerical differences may appear at the precision boundary, but nearest-neighbour results are identical for all practical datasets.
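The effect of the cast is easy to quantify on synthetic data (illustrative NumPy; exact error magnitudes depend on the data):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((1000, 32))  # float64, as scikit-learn stores it
q = rng.random(32)

d64 = np.linalg.norm(X - q, axis=1)
d32 = np.linalg.norm(X.astype(np.float32) - q.astype(np.float32), axis=1)

# Distances agree to roughly float32 precision (~1e-7 relative)...
assert np.allclose(d64, d32, rtol=1e-5)

# ...which is far smaller than typical gaps between neighbour distances,
# so the nearest neighbour is the same under both precisions.
assert np.argmin(d64) == np.argmin(d32)
```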
## Features

### Available indices
Exact indices — always return the true k nearest neighbours:
| Index | Distance | Data type | Notes |
|---|---|---|---|
| VPTreeL2Index | L2 (Euclidean) | float32 | SIMD-accelerated |
| VPTreeL1Index | L1 (Manhattan) | float32 | SIMD-accelerated |
| VPTreeChebyshevIndex | L∞ (Chebyshev) | float32 | SIMD-accelerated |
| VPTreeBinaryIndex | Hamming | uint8 | Hardware popcount |
| BKTreeBinaryIndex | Hamming | uint8 | Threshold / range search |
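The three float metrics in the table above differ only in how per-coordinate differences are aggregated; for reference:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0], dtype=np.float32)
b = np.array([4.0, 0.0, 3.0], dtype=np.float32)
diff = np.abs(a - b)  # [3, 2, 0]

l2 = float(np.sqrt((diff ** 2).sum()))  # Euclidean: sqrt(9 + 4 + 0) ≈ 3.606
l1 = float(diff.sum())                  # Manhattan: 3 + 2 + 0 = 5
linf = float(diff.max())                # Chebyshev: max(3, 2, 0) = 3
```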
Approximate indices — trade a small recall budget for large speed gains; tunable via n_probe / radius:
| Index | Distance | Data type | Notes |
|---|---|---|---|
| IVFFlatL2Index | L2 (Euclidean) | `flo
