Infidex
The high-performance .NET search engine based on pattern recognition.
Infidex is a search engine based on pattern recognition. Learning from your data, Infidex automatically extracts features like frequency and rarity and embeds them into a multi-dimensional hypersphere for intelligent matching. This enables fuzzy querying with unparalleled typo tolerance, without any manual tuning. Zero dependencies, blazingly fast, built for developers who need search that just works.
✨ Features
- Blazingly Fast - Index thousands of documents per second, search in milliseconds
- Intelligent Matching - Finds what you're looking for even with typos and variations
- Per-Term Coverage - Ranks documents by how many query terms they match (more terms = higher rank)
- Rich Filtering - SQL-like query language (Infiscript) for complex filters
- Faceted Search - Build dynamic filters and aggregations
- Smart Ranking - Lexicographic (coverage, quality) scoring for principled relevance
- Multi-Field Search - Search across multiple fields with configurable weights
- Incremental Indexing - Add or update documents without rebuilding the entire index
- Fully Thread-Safe - Multiple concurrent readers; writers block readers and other writers
- Production Ready - Comprehensive test coverage, clean API, zero dependencies
- Easy Integration - Embeds directly into your .NET application
Quick Start
Install via NuGet:
dotnet add package Infidex
Basic Search
using Infidex;
using Infidex.Core;
// Create search engine
var engine = SearchEngine.CreateDefault();
// Index documents
var documents = new[]
{
new Document(1L, "The quick brown fox jumps over the lazy dog"),
new Document(2L, "A journey of a thousand miles begins with a single step"),
new Document(3L, "To be or not to be that is the question")
};
engine.IndexDocuments(documents);
// Search with typos - still finds matches!
var results = engine.Search("quik fox", maxResults: 10);
foreach (var result in results.Records)
{
Console.WriteLine($"Doc {result.DocumentId}: Score {result.Score}");
}
Multi-Field Search
using Infidex.Api;
// Define fields with weights
var matrix = new DocumentFields();
matrix.AddField("title", "The Matrix", Weight.High);
matrix.AddField("description", "A computer hacker learns about the true nature of reality", Weight.Low);
var inception = new DocumentFields();
inception.AddField("title", "Inception", Weight.High);
inception.AddField("description", "A thief who steals corporate secrets through dream-sharing", Weight.Low);
var movies = new[]
{
new Document(1L, matrix),
new Document(2L, inception)
};
engine.IndexDocuments(movies);
Infiscript
Infiscript is a simple filtering language used to write intuitive filters that compile to optimized bytecode:
using Infidex.Api;
// Simple comparison
var filter = Filter.Parse("genre = 'Sci-Fi'");
// Boolean logic
filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
// Complex expressions with grouping
filter = Filter.Parse("(genre = 'Fantasy' AND year >= 2000) OR (genre = 'Horror' AND year >= 1980)");
// String operations
filter = Filter.Parse("title CONTAINS 'matrix'");
filter = Filter.Parse("title STARTS WITH 'The'");
filter = Filter.Parse("description LIKE '%dream%'");
// Range checks
filter = Filter.Parse("year BETWEEN 2000 AND 2020");
filter = Filter.Parse("rating >= 8.0");
// List membership
filter = Filter.Parse("genre IN ('Sci-Fi', 'Fantasy', 'Adventure')");
// Null checks
filter = Filter.Parse("director IS NOT NULL");
// Regex matching
filter = Filter.Parse("email MATCHES '^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$'");
// Ternary expressions (conditional logic)
filter = Filter.Parse("age >= 18 ? 'adult' : 'minor'");
filter = Filter.Parse("score >= 90 ? 'A' : score >= 80 ? 'B' : 'C'");
// Use filters in queries
var query = new Query("matrix", maxResults: 20)
{
Filter = Filter.Parse("year >= 2000 AND rating > 7.0")
};
var results = engine.Search(query);
Infiscript Operators
Full EBNF specification is available here.
Comparison: =, !=, <, <=, >, >=
Boolean: AND (or &&), OR (or ||), NOT (or !)
String: CONTAINS, STARTS WITH, ENDS WITH, LIKE (% wildcard)
Special: IN, BETWEEN, IS NULL, IS NOT NULL, MATCHES (regex)
Conditional: ? : (ternary operator)
All operators are case-insensitive. Use parentheses for grouping.
Bytecode Compilation
Filters compile to portable bytecode for performance and serialization:
// Compile once, use many times
var filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
var bytecode = filter.CompileToBytes();
// Save to disk
File.WriteAllBytes("filter.bin", bytecode);
// Load and use later
var loaded = Filter.FromBytecode(File.ReadAllBytes("filter.bin"));
var query = new Query("space") { CompiledFilterBytecode = bytecode };
Faceted Search & Aggregations
Build dynamic filters and navigate your data:
var query = new Query("science fiction", maxResults: 50)
{
EnableFacets = true
};
var results = engine.Search(query);
// Get facet counts
if (results.Facets != null)
{
foreach (var (fieldName, values) in results.Facets)
{
Console.WriteLine($"\n{fieldName}:");
foreach (var (value, count) in values)
{
Console.WriteLine($" {value}: {count} documents");
}
}
}
// Output:
// genre:
// Sci-Fi: 15 documents
// Fantasy: 8 documents
// Action: 5 documents
//
// year:
// 2020: 10 documents
// 2019: 8 documents
// 2018: 10 documents
Document Boosting
Increase relevance scores for specific documents:
// Boost recent movies
var recentBoost = new Boost(
Filter.Parse("year >= 2020"),
BoostStrength.Large // +20 to score
);
// Boost highly-rated content
var ratingBoost = new Boost(
Filter.Parse("rating >= 8.0"),
BoostStrength.Medium // +10 to score
);
var query = new Query("action movie", maxResults: 20)
{
EnableBoost = true,
Boosts = new[] { recentBoost, ratingBoost }
};
var results = engine.Search(query);
Boost strengths: Small (+5), Medium (+10), Large (+20), Extreme (+40)
Sorting
Sort results by any field (here, `fields` is the DocumentFields instance used at indexing time):
// Sort by year (descending)
var query = new Query("thriller", maxResults: 20)
{
SortBy = fields.GetField("year"),
SortAscending = false
};
// Sort by rating (ascending)
var query = new Query("comedy", maxResults: 20)
{
SortBy = fields.GetField("rating"),
SortAscending = true
};
How It Works
Infidex uses a lexicographic ranking model where:
- Precedence is driven by structural and positional properties (coverage, phrase runs, anchor positions, etc.).
- Semantic score is refined using corpus-derived weights (inverse document frequency over character n‑grams), without any per-dataset manual tuning.
Concretely, each query term $q_i$ is assigned a weight
$$ I_i \approx \log_2\frac{N}{\mathrm{df}_i} $$
where $N$ is the number of documents and $\mathrm{df}_i$ is the document frequency of the term’s character n‑grams. Rarer terms get higher weights and therefore contribute more strongly to coverage and fusion decisions.
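The weight formula above can be sketched in a few lines of C#. This is purely illustrative (the corpus size and document frequencies are made-up numbers, not values pulled from an Infidex index):

```csharp
using System;

// Minimal sketch of the IDF-style term weight I_i ≈ log2(N / df_i)
// described above. Numbers are hypothetical, not Infidex API output.
public static class IdfWeightSketch
{
    public static double TermWeight(int totalDocs, int docFreq) =>
        Math.Log2((double)totalDocs / docFreq);

    public static void Main()
    {
        int n = 1000; // documents in the corpus
        Console.WriteLine(TermWeight(n, 10));  // rare term: ~6.64
        Console.WriteLine(TermWeight(n, 500)); // common term: 1
    }
}
```

A term appearing in 10 of 1000 documents gets a weight of roughly 6.6, while one appearing in half the corpus gets only 1, which is why rare terms dominate coverage and fusion decisions.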
Three-Stage Search Pipeline
Stage 1: BM25+ Candidate Generation
- Tokenizes text into character n-grams (2-grams + 3-grams)
- Builds inverted index with document frequencies
- BM25+ scoring backbone with L2-normalized term weights:
$$\text{BM25+}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \left( \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} + \delta \right)$$
Formally, let $V$ be the set of all indexed terms over alphabet $\Sigma$. Infidex builds a deterministic finite-state transducer
$$ T = (Q, \Sigma, \delta, q_0, F, \mu) $$
such that for each $t \in V$ there is a unique path from $q_0$ to some $q \in F$ labeled by $t$, and $\mu(t) \in \mathbb{N}$ is a term identifier.
Prefix and suffix queries are then evaluated as:
$$ \mathrm{Pref}(p) = {\mu(t) \mid t \in V,\ t \text{ has prefix } p} $$
$$ \mathrm{Suff}(s) = {\mu(t) \mid t \in V,\ t \text{ has suffix } s} $$
with time complexity $O(|p| + |\mathrm{Pref}(p)|)$ and $O(|s| + |\mathrm{Suff}(s)|)$, respectively.
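The BM25+ scoring formula from Stage 1 can be sketched directly from its definition. This is a hedged illustration, not Infidex's internal scoring code; the parameter values $k_1 = 1.2$, $b = 0.75$, $\delta = 1.0$ are the common literature defaults, assumed here for demonstration:

```csharp
using System;
using System.Collections.Generic;

// Illustrative implementation of the BM25+ formula shown above,
// not Infidex's actual Stage 1 code. k1, b, delta use common defaults.
public static class Bm25PlusSketch
{
    const double K1 = 1.2, B = 0.75, Delta = 1.0;

    public static double Score(
        IReadOnlyDictionary<string, int> termFreqs, // f(t, d) per term
        double docLen, double avgDocLen,
        IEnumerable<string> query,
        Func<string, double> idf)                   // IDF(t)
    {
        double score = 0;
        foreach (var t in query)
        {
            termFreqs.TryGetValue(t, out int f);
            // Length-normalized denominator: f + k1 * (1 - b + b * |d| / avgdl)
            double norm = f + K1 * (1 - B + B * docLen / avgDocLen);
            score += idf(t) * (f * (K1 + 1) / norm + Delta);
        }
        return score;
    }
}
```

Note the `+ Delta` term: unlike plain BM25, BM25+ gives every matching term a floor contribution, so long documents are not unfairly penalized to zero.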
Stage 2: Lexical Coverage Analysis
- Applied to top-K candidates from Stage 1
- Tracks per-term coverage for each query word using 5 algorithms:
- Exact whole-word matching
- Fuzzy word matching (Damerau–Levenshtein with an edit radius adapted from a binomial typo model)
- Joined/split word detection
- Prefix/suffix matching (prefixes weighted higher than suffixes)
- LCS (Longest Common Subsequence) fallback when no word-level match exists
- For each query term $q_i$, computes per-term coverage:
$$c_i = \min\left(1, \frac{m_i}{|q_i|}\right)$$
where $m_i$ is the number of matched characters for term $i$.
- Derives coordination coverage across all $n$ query terms:
$$C_{\text{coord}} = \frac{1}{n} \sum_{i=1}^{n} c_i$$
- Extracts structural features: phrase runs, anchor token positions, lexical perfection
On top of raw per-term coverage, Infidex tracks how much information mass from the query is actually matched. For each query term $q_i$, we compute a coverage score $c_i \in [0,1]$ and an information weight $I_i$ as above:
$$ C_{\text{info}} = \frac{\sum_{i=1}^{n} c_i \, I_i}{\sum_{i=1}^{n} I_i} $$