Comet

Cover

A high-performance hybrid vector store written in Go. Comet brings together multiple indexing strategies and search modalities into a unified, hackable package. Hybrid retrieval with reciprocal rank fusion, autocut, pre-filtering, semantic search, full-text search, and multi-KNN searches, and multi-query operations — all in pure Go.

Understand search internals from the inside out. Built for hackers, not hyperscalers. Tiny enough to fit in your head. Decent enough to blow it.

Choose from:

Flat (exact), HNSW (graph), IVF (clustering), PQ (quantization), or IVFPQ (hybrid) storage backends
Full-Text Search: BM25 ranking algorithm with tokenization and normalization
Metadata Filtering: Fast filtering using Roaring Bitmaps and Bit-Sliced Indexes
Ranking Programmability: Reciprocal Rank Fusion, Fixed size result sets, Threshold based result sets, Dynamic result sets etc.
Hybrid Search: Unified interface combining vector, text, and metadata search

Overview
Features
Installation
Quick Start
Architecture
Core Concepts
API Reference
Examples
Configuration
API Details
Use Cases
Contributing
License

Overview

Everything you need to understand how vector databases actually work—and build one yourself.

What's inside:

5 Vector Storage Types: Flat, HNSW, IVF, PQ, IVFPQ
3 Distance Metrics: L2, L2 Squared, Cosine
Full-Text Search: BM25 ranking with Unicode tokenization
Metadata Filtering: Roaring bitmaps + Bit-Sliced Indexes
Hybrid Search: Combine vector + text + metadata with Reciprocal Rank Fusion
Advanced Search: Multi-KNN queries, multi-query operations, autocut result truncation
Production Features: Thread-safe, serialization, soft deletes, configurable parameters

Everything you need to understand how vector databases actually work—and build one yourself.

Features

Vector Storage

Flat: Brute-force exact search (100% recall baseline)
HNSW: Hierarchical navigable small world graphs (95-99% recall, O(log n) search)
IVF: Inverted file index with k-means clustering (85-95% recall, 10-20x speedup)
PQ: Product quantization for compression (85-95% recall, 10-500x memory reduction)
IVFPQ: IVF + PQ combined (85-95% recall, 100x speedup + 500x compression)

Search Modalities

Vector Search: L2, L2 Squared, and Cosine distance metrics
Full-Text Search: BM25 ranking with Unicode-aware tokenization
Metadata Filtering: Boolean queries on structured attributes
Hybrid Search: Combine all three with configurable fusion strategies

Fusion Strategies

Weighted Sum: Linear combination with configurable weights
Reciprocal Rank Fusion (RRF): Scale-independent rank-based fusion
Max/Min Score: Simple score aggregation

Data Structures (The Good Stuff)

HNSW Graphs: Multi-layer skip lists for approximate nearest neighbor search
Roaring Bitmaps: Compressed bitmaps for metadata filtering (array, bitmap, run-length encoding)
Bit-Sliced Index (BSI): Efficient numeric range queries without full scans
Product Quantization Codebooks: Learned k-means centroids for vector compression
Inverted Indexes: Token-to-document mappings for full-text search

Other Capabilities

Quantization: Full precision, half precision, int8 precision
Soft Deletes: Fast deletion with lazy cleanup
Serialization: Persist and reload indexes
Thread-Safe: Concurrent read/write operations
Autocut: Automatic result truncation based on score gaps

Installation

go get github.com/wizenheimer/comet

Quick Start

package main

import (
    "fmt"
    "log"

    "github.com/wizenheimer/comet"
)

func main() {
    // Create a vector store (384-dimensional embeddings with cosine distance)
    index, err := comet.NewFlatIndex(384, comet.Cosine)
    if err != nil {
        log.Fatal(err)
    }

    // Add vectors
    vec1 := make([]float32, 384)
    // ... populate vec1 with your embedding ...
    node := comet.NewVectorNode(vec1)
    index.Add(*node)

    // Search for similar vectors
    query := make([]float32, 384)
    // ... populate query vector ...
    results, err := index.NewSearch().
        WithQuery(query).
        WithK(10).
        Execute()

    if err != nil {
        log.Fatal(err)
    }

    // Process results
    for i, result := range results {
        fmt.Printf("%d. ID=%d, Score=%.4f\n", i+1, result.GetId(), result.GetScore())
    }
}

Output:

1. ID=123, Score=0.0234
2. ID=456, Score=0.0567
3. ID=789, Score=0.0823
...

Architecture

System Architecture

Comet is organized into three main search engines that can work independently or together:

Application Layer

┌─────────────────────────────────────────────────────────────┐
│                    Your Application                          │
│            (Using Comet as a Go Library)                     │
└──────────────────────┬──────────────────────────────────────┘
                       │
         ┌─────────────┼─────────────┐
         │             │             │
         ▼             ▼             ▼
    Vector         Text         Metadata

Search Engine Layer

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Vector    │    │    Text     │    │  Metadata   │
│   Search    │    │   Search    │    │  Filtering  │
│   Engine    │    │   Engine    │    │   Engine    │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       │ Semantic         │ Keywords         │ Filters
       │ Similarity       │ + Relevance      │ + Boolean Logic
       ▼                  ▼                  ▼

Index Storage Layer

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ HNSW / IVF  │    │ BM25 Index  │    │  Roaring    │
│ / PQ / Flat │    │ (Inverted)  │    │  Bitmaps    │
└─────────────┘    └─────────────┘    └─────────────┘
   Graph/Trees      Token→DocIDs       Compressed Sets

Hybrid Coordinator

                 All Three Engines
                       │
                       ▼
              ┌─────────────────┐
              │  Hybrid Search   │
              │  Coordinator     │
              │  (Score Fusion)  │
              └─────────────────┘
                       │
                       ▼
              Combined Results

Component Details

Component A: Vector Storage Engine

Manages vector storage and similarity search across multiple index types.

Common Interface:

┌────────────────────────────────────┐
│  VectorIndex Interface             │
│                                    │
│  ├─ Train(vectors)                 │
│  ├─ Add(vector)                    │
│  ├─ Remove(vector)                 │
│  └─ NewSearch()                    │
└────────────────────────────────────┘

Available Implementations:

FlatIndex          → Brute force, 100% recall
HNSWIndex          → Graph-based, O(log n)
IVFIndex           → Clustering, 10-20x faster
PQIndex            → Quantization, 10-500x compression
IVFPQIndex         → Hybrid, best of IVF + PQ

Responsibilities:

Vector preprocessing (normalization for cosine distance)
Distance calculations (Euclidean, L2², Cosine)
K-nearest neighbor search
Serialization/deserialization
Soft delete management with flush mechanism

Performance Characteristics:

Flat: O(n×d) search, 100% recall
HNSW: O(M×ef×log n) search, 95-99% recall
IVF: O(nProbes×n/k×d) search, 85-95% recall

Component B: Text Search Engine

Full-text search using BM25 ranking algorithm.

Inverted Index:

┌────────────────────────────────────┐
│  term → RoaringBitmap(docIDs)      │
│                                    │
│  "machine"  →  {1, 5, 12, 45}      │
│  "learning" →  {1, 3, 12, 20}      │
│  "neural"   →  {3, 20, 45}         │
└────────────────────────────────────┘

Term Frequencies:

┌────────────────────────────────────┐
│  term → {docID: count}             │
│                                    │
│  "machine" → {1: 3, 5: 1, 12: 2}   │
└────────────────────────────────────┘

Document Stats:

┌────────────────────────────────────┐
│  docID → (length, token_count)     │
│                                    │
│  1  →  (250 chars, 45 tokens)      │
│  5  →  (180 chars, 32 tokens)      │
└────────────────────────────────────┘

Responsibilities:

Text tokenization (UAX#29 word segmentation)
Unicode normalization (NFKC)
Inverted index maintenance
BM25 score calculation
Top-K retrieval with heap

Performance Characteristics:

Add: O(m) where m = tokens
Search: O(q×d_avg) where q = query terms, d_avg = avg docs per term
Memory: Compressed inverted index, no original text stored

Component C: Metadata Filter Engine

Fast filtering using compressed bitmaps.

Categorical Fields (Roaring Bitmaps):

┌────────────────────────────────────┐
│  field:value → Bitmap(docIDs)      │
│                                    │
│  "category:electronics" → {1,5,12} │
│  "category:books"       → {2,8,15} │
│  "in_stock:true"        → {1,2,5}  │
└────────────────────────────────────┘

Numeric Fields (Bit-Sliced Index):

┌────────────────────────────────────┐
│  field → BSI (range queries)       │
│                                    │
│  "price"  → [0-1000, 1000-5000]    │
│  "rating" → [0-5 scale]            │
└────────────────────────────────────┘

Document Universe:

┌────────────────────────────────────┐
│  allDocs → Bitmap(all IDs)         │
│                                    │
│  Used for NOT o

Comet

Install / Use

README

Comet

Table of Contents

Overview

Features

Vector Storage

Search Modalities

Fusion Strategies

Data Structures (The Good Stuff)

Other Capabilities

Installation

Quick Start

Architecture

System Architecture

Application Layer

Search Engine Layer

Index Storage Layer

Hybrid Coordinator

Component Details

Component A: Vector Storage Engine

Component B: Text Search Engine

Component C: Metadata Filter Engine