Infidex
The high-performance .NET search engine based on pattern recognition.
Infidex is a search engine based on pattern recognition. Learning from your data, Infidex automatically extracts features like frequency and rarity and embeds them into a multi-dimensional hypersphere for intelligent matching. This enables fuzzy querying with unparalleled typo tolerance, without any manual tuning. Zero dependencies, blazingly fast, built for developers who need search that just works.
✨ Features
- Blazingly Fast - Index thousands of documents per second, search in milliseconds
- Intelligent Matching - Finds what you're looking for even with typos and variations
- Per-Term Coverage - Ranks documents by how many query terms they match (more terms = higher rank)
- Rich Filtering - SQL-like query language (Infiscript) for complex filters
- Faceted Search - Build dynamic filters and aggregations
- Smart Ranking - Lexicographic (coverage, quality) scoring for principled relevance
- Multi-Field Search - Search across multiple fields with configurable weights
- Incremental Indexing - Add or update documents without rebuilding the entire index
- Fully Thread-Safe - Multiple concurrent readers; writers block readers and other writers
- Production Ready - Comprehensive test coverage, clean API, zero dependencies
- Easy Integration - Embeds directly into your .NET application
Quick Start
Install via NuGet:
dotnet add package Infidex
Basic Search
using Infidex;
using Infidex.Core;
// Create search engine
var engine = SearchEngine.CreateDefault();
// Index documents
var documents = new[]
{
new Document(1L, "The quick brown fox jumps over the lazy dog"),
new Document(2L, "A journey of a thousand miles begins with a single step"),
new Document(3L, "To be or not to be that is the question")
};
engine.IndexDocuments(documents);
// Search with typos - still finds matches!
var results = engine.Search("quik fox", maxResults: 10);
foreach (var result in results.Records)
{
Console.WriteLine($"Doc {result.DocumentId}: Score {result.Score}");
}
Multi-Field Search
using Infidex.Api;
// Define fields with weights
var matrix = new DocumentFields();
matrix.AddField("title", "The Matrix", Weight.High);
matrix.AddField("description", "A computer hacker learns about the true nature of reality", Weight.Low);
var inception = new DocumentFields();
inception.AddField("title", "Inception", Weight.High);
inception.AddField("description", "A thief who steals corporate secrets through dream-sharing", Weight.Low);
var movies = new[]
{
new Document(1L, matrix),
new Document(2L, inception)
};
engine.IndexDocuments(movies);
Infiscript
Infiscript is a simple filtering language used to write intuitive filters that compile to optimized bytecode:
using Infidex.Api;
// Simple comparison
var filter = Filter.Parse("genre = 'Sci-Fi'");
// Boolean logic
filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
// Complex expressions with grouping
filter = Filter.Parse("(genre = 'Fantasy' AND year >= 2000) OR (genre = 'Horror' AND year >= 1980)");
// String operations
filter = Filter.Parse("title CONTAINS 'matrix'");
filter = Filter.Parse("title STARTS WITH 'The'");
filter = Filter.Parse("description LIKE '%dream%'");
// Range checks
filter = Filter.Parse("year BETWEEN 2000 AND 2020");
filter = Filter.Parse("rating >= 8.0");
// List membership
filter = Filter.Parse("genre IN ('Sci-Fi', 'Fantasy', 'Adventure')");
// Null checks
filter = Filter.Parse("director IS NOT NULL");
// Regex matching
filter = Filter.Parse("email MATCHES '^[\\w\\.-]+@[\\w\\.-]+\\.\\w+$'");
// Ternary expressions (conditional logic)
filter = Filter.Parse("age >= 18 ? 'adult' : 'minor'");
filter = Filter.Parse("score >= 90 ? 'A' : score >= 80 ? 'B' : 'C'");
// Use filters in queries
var query = new Query("matrix", maxResults: 20)
{
Filter = Filter.Parse("year >= 2000 AND rating > 7.0")
};
var results = engine.Search(query);
Infiscript Operators
Full EBNF specification is available here.
Comparison: =, !=, <, <=, >, >=
Boolean: AND (or &&), OR (or ||), NOT (or !)
String: CONTAINS, STARTS WITH, ENDS WITH, LIKE (% wildcard)
Special: IN, BETWEEN, IS NULL, IS NOT NULL, MATCHES (regex)
Conditional: ? : (ternary operator)
All operators are case-insensitive. Use parentheses for grouping.
Bytecode Compilation
Filters compile to portable bytecode for performance and serialization:
// Compile once, use many times
var filter = Filter.Parse("genre = 'Sci-Fi' AND year >= 2000");
var bytecode = filter.CompileToBytes();
// Save to disk
File.WriteAllBytes("filter.bin", bytecode);
// Load and use later
var loaded = Filter.FromBytecode(File.ReadAllBytes("filter.bin"));
var query = new Query("space") { CompiledFilterBytecode = bytecode };
Faceted Search & Aggregations
Build dynamic filters and navigate your data:
var query = new Query("science fiction", maxResults: 50)
{
EnableFacets = true
};
var results = engine.Search(query);
// Get facet counts
if (results.Facets != null)
{
foreach (var (fieldName, values) in results.Facets)
{
Console.WriteLine($"\n{fieldName}:");
foreach (var (value, count) in values)
{
Console.WriteLine($" {value}: {count} documents");
}
}
}
// Output:
// genre:
// Sci-Fi: 15 documents
// Fantasy: 8 documents
// Action: 5 documents
//
// year:
// 2020: 10 documents
// 2019: 8 documents
// 2018: 10 documents
Document Boosting
Increase relevance scores for specific documents:
// Boost recent movies
var recentBoost = new Boost(
Filter.Parse("year >= 2020"),
BoostStrength.Large // +20 to score
);
// Boost highly-rated content
var ratingBoost = new Boost(
Filter.Parse("rating >= 8.0"),
BoostStrength.Medium // +10 to score
);
var query = new Query("action movie", maxResults: 20)
{
EnableBoost = true,
Boosts = new[] { recentBoost, ratingBoost }
};
var results = engine.Search(query);
Boost strengths: Small (+5), Medium (+10), Large (+20), Extreme (+40)
Sorting
Sort results by any field (here, `fields` is the DocumentFields instance used at indexing time):
// Sort by year (descending)
var query = new Query("thriller", maxResults: 20)
{
SortBy = fields.GetField("year"),
SortAscending = false
};
// Sort by rating (ascending)
var query = new Query("comedy", maxResults: 20)
{
SortBy = fields.GetField("rating"),
SortAscending = true
};
How It Works
Infidex uses a lexicographic ranking model where:
- Precedence is driven by structural and positional properties (coverage, phrase runs, anchor positions, etc.).
- Semantic score is refined using corpus-derived weights (inverse document frequency over character n‑grams), without any per-dataset manual tuning.
Concretely, each query term $q_i$ is assigned a weight
$$ I_i \approx \log_2\frac{N}{\mathrm{df}_i} $$
where $N$ is the number of documents and $\mathrm{df}_i$ is the document frequency of the term’s character n‑grams. Rarer terms get higher weights and therefore contribute more strongly to coverage and fusion decisions.
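The weight formula above can be sketched in a few lines of C#. This is purely illustrative (the corpus size and document frequencies are made-up numbers, not values pulled from an Infidex index):

```csharp
using System;

// Minimal sketch of the IDF-style term weight I_i ≈ log2(N / df_i)
// described above. Numbers are hypothetical, not Infidex API output.
public static class IdfWeightSketch
{
    public static double TermWeight(int totalDocs, int docFreq) =>
        Math.Log2((double)totalDocs / docFreq);

    public static void Main()
    {
        int n = 1000; // documents in the corpus
        Console.WriteLine(TermWeight(n, 10));  // rare term: ~6.64
        Console.WriteLine(TermWeight(n, 500)); // common term: 1
    }
}
```

A term appearing in 10 of 1000 documents gets a weight of roughly 6.6, while one appearing in half the corpus gets only 1, which is why rare terms dominate coverage and fusion decisions.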
Three-Stage Search Pipeline
Stage 1: BM25+ Candidate Generation
- Tokenizes text into character n-grams (2-grams + 3-grams)
- Builds inverted index with document frequencies
- BM25+ scoring backbone with L2-normalized term weights:
$$\text{BM25+}(q, d) = \sum_{t \in q} \text{IDF}(t) \cdot \left( \frac{f(t,d) \cdot (k_1 + 1)}{f(t,d) + k_1 \cdot (1 - b + b \cdot \frac{|d|}{\text{avgdl}})} + \delta \right)$$
Formally, let $V$ be the set of all indexed terms over alphabet $\Sigma$. Infidex builds a deterministic finite-state transducer
$$ T = (Q, \Sigma, \delta, q_0, F, \mu) $$
such that for each $t \in V$ there is a unique path from $q_0$ to some $q \in F$ labeled by $t$, and $\mu(t) \in \mathbb{N}$ is a term identifier.
Prefix and suffix queries are then evaluated as:
$$ \mathrm{Pref}(p) = {\mu(t) \mid t \in V,\ t \text{ has prefix } p} $$
$$ \mathrm{Suff}(s) = {\mu(t) \mid t \in V,\ t \text{ has suffix } s} $$
with time complexity $O(|p| + |\mathrm{Pref}(p)|)$ and $O(|s| + |\mathrm{Suff}(s)|)$, respectively.
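The BM25+ scoring formula from Stage 1 can be sketched directly from its definition. This is a hedged illustration, not Infidex's internal scoring code; the parameter values $k_1 = 1.2$, $b = 0.75$, $\delta = 1.0$ are the common literature defaults, assumed here for demonstration:

```csharp
using System;
using System.Collections.Generic;

// Illustrative implementation of the BM25+ formula shown above,
// not Infidex's actual Stage 1 code. k1, b, delta use common defaults.
public static class Bm25PlusSketch
{
    const double K1 = 1.2, B = 0.75, Delta = 1.0;

    public static double Score(
        IReadOnlyDictionary<string, int> termFreqs, // f(t, d) per term
        double docLen, double avgDocLen,
        IEnumerable<string> query,
        Func<string, double> idf)                   // IDF(t)
    {
        double score = 0;
        foreach (var t in query)
        {
            termFreqs.TryGetValue(t, out int f);
            // Length-normalized denominator: f + k1 * (1 - b + b * |d| / avgdl)
            double norm = f + K1 * (1 - B + B * docLen / avgDocLen);
            score += idf(t) * (f * (K1 + 1) / norm + Delta);
        }
        return score;
    }
}
```

Note the `+ Delta` term: unlike plain BM25, BM25+ gives every matching term a floor contribution, so long documents are not unfairly penalized to zero.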
Stage 2: Lexical Coverage Analysis
- Applied to top-K candidates from Stage 1
- Tracks per-term coverage for each query word using 5 algorithms:
- Exact whole-word matching
- Fuzzy word matching (Damerau–Levenshtein with an edit radius adapted from a binomial typo model)
- Joined/split word detection
- Prefix/suffix matching (prefixes weighted higher than suffixes)
- LCS (Longest Common Subsequence) fallback when no word-level match exists
- For each query term $q_i$, computes per-term coverage:
$$c_i = \min\left(1, \frac{m_i}{|q_i|}\right)$$
where $m_i$ is the number of matched characters for term $i$.
- Derives coordination coverage across all $n$ query terms:
$$C_{\text{coord}} = \frac{1}{n} \sum_{i=1}^{n} c_i$$
- Extracts structural features: phrase runs, anchor token positions, lexical perfection
On top of raw per-term coverage, Infidex tracks how much information mass from the query is actually matched. For each query term $q_i$, we compute a coverage score $c_i \in [0,1]$ and an information weight $I_i$ as above:
$$ C_{\text{info}} = \frac{\sum_{i=1}^{n} c_i \, I_i}{\sum_{i=1}^{n} I_i} $$