FerrisSearch
A distributed search and SQL analytics engine in Rust with Raft consensus, hybrid BM25+vector search, and a search-aware query planner — powered by Tantivy, DataFusion, and USearch
<p align="center"> <strong>A Rust-native distributed search and search-aware analytics engine with Raft consensus, hybrid vector search, OpenSearch-compatible APIs, and SQL over matched docs — powered by <a href="https://github.com/quickwit-oss/tantivy">Tantivy</a></strong> </p>

<p align="center"> <a href="#getting-started">Getting Started</a> · <a href="#api-reference">API Reference</a> · <a href="#how-hybrid-sql-works">Architecture</a> · <a href="#benchmarks">Benchmarks</a> · <a href="#replication">Replication</a> · <a href="#testing">Testing</a> · <a href="#roadmap">Roadmap</a> </p>

FerrisSearch is a high-performance, Rust-native distributed search engine with OpenSearch-compatible REST APIs, hybrid vector retrieval, and a search-aware SQL layer for querying matched documents as a dataset. It is built for teams that want the familiar OpenSearch interface with the performance and safety of Rust, without giving up structured analytics over search results.
⚡ Performance: 2M documents — ingestion at 10,402 docs/sec, search at 142.4 queries/sec (p50 = 24.8ms), zero errors — see benchmarks
Highlights
- OpenSearch-compatible REST API — drop-in `PUT /{index}`, `POST /_doc`, `GET /_search` endpoints
- Raft consensus — cluster state managed by openraft; quorum-based leader election, linearizable writes, automatic failover, persistent log storage via redb
- Vector search — k-NN approximate nearest neighbor search via USearch (HNSW algorithm); hybrid full-text + vector queries
- Search-aware SQL — SQL over matched docs with pushdown-aware planning, local fast-field execution when possible, planner metadata, and grouped analytics over the matched result set
- Distributed clustering — multi-node clusters with shard-based data distribution
- Synchronous replication — primary-replica replication over gRPC; writes are acknowledged only after all in-sync replicas confirm
- Scatter-gather search — queries fan out across shards; results are merged and returned
- Crash recovery — binary write-ahead log (WAL) with configurable durability (`request` fsync-per-write or `async` timer-based); sequence-number checkpointing and translog-based replica recovery
- Zero external dependencies — no JVM, no ZooKeeper, just a single binary
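The synchronous-replication rule above (acknowledge only after every in-sync replica confirms) is easy to state in code. This is a minimal illustrative sketch, not FerrisSearch's actual implementation:

```rust
use std::collections::HashSet;

/// Returns true only when every in-sync replica has confirmed the write.
/// Note this is stricter than a quorum rule: a single lagging in-sync
/// replica blocks the acknowledgement.
fn can_acknowledge(in_sync: &[&str], confirmed: &[&str]) -> bool {
    let confirmed: HashSet<&str> = confirmed.iter().copied().collect();
    in_sync.iter().all(|replica| confirmed.contains(replica))
}

fn main() {
    // Both in-sync replicas confirmed: the write is acknowledged.
    assert!(can_acknowledge(&["r1", "r2"], &["r2", "r1"]));
    // r2 has not confirmed yet: the client keeps waiting.
    assert!(!can_acknowledge(&["r1", "r2"], &["r1"]));
    println!("acknowledged only after all in-sync replicas confirm");
}
```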
Why FerrisSearch Is Different
FerrisSearch does more than expose SQL on top of a search API. The current direction is a true hybrid execution model:
- Tantivy executes search-native work first: full-text matching, scoring, and pushdown-friendly structured filters
- Fast fields stay in the hot path: when the query is eligible and shards are local, structured columns are read directly from Tantivy fast fields instead of materializing `_source`
- SQL runs on matched docs, not the whole index: Arrow and DataFusion operate on the narrowed result set or merged partial states
- Planner metadata is visible: responses show `execution_mode` and a `planner` block so you can see what was pushed down vs. what stayed residual
That makes FerrisSearch useful for workflows like:
- relevance debugging with `score` and structured filters in one query
- grouped analytics over matched docs
- internal dashboards over live search results
- interactive search + analysis without moving logic into client code
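The first workflow above can be sketched in a single statement. This is illustrative only: `text_match(...)` and the `score` column are described in this README, but the exact SQL dialect accepted by `POST /{index}/_sql` is an assumption, and the `movies` columns mirror the mapping example from the API reference below.

```sql
-- Relevance debugging: score plus structured filters in one query.
-- Hypothetical, against the 'movies' mapping shown later in this README.
SELECT title, year, rating, score
FROM movies
WHERE text_match(title, 'space opera')
  AND year >= 2000
  AND rating > 7.5
ORDER BY score DESC
LIMIT 10;
```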
How Hybrid SQL Works
```mermaid
flowchart LR
  A[SQL Query] --> B[Hybrid Planner]
  B --> C[Tantivy Search-Native Work]
  B --> D[Residual SQL Work]
  C --> E[Matched docs plus score]
  C --> F[Fast-field reads or shard-local partials]
  E --> G[Arrow RecordBatch]
  F --> G
  D --> H[DataFusion]
  G --> H
  H --> I[Rows plus planner metadata]
```
The intended execution order is:

1. Tantivy handles `text_match(...)`, scoring, and pushdown-friendly structured filters.
2. Eligible structured columns are read directly from fast fields.
3. Arrow batches represent matched docs or merged partial states.
4. DataFusion executes only the remaining relational work.
5. The API returns rows plus `execution_mode` and `planner` metadata.
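The planner metadata can be pictured with a response sketch. Only `execution_mode` and the `planner` block are promised by this README; every other field name and value below is an illustrative assumption, not the actual wire format.

```json
{
  "rows": [
    {"genre": "scifi", "n": 42, "avg_rating": 8.1}
  ],
  "execution_mode": "fast_field_local",
  "planner": {
    "pushed_down": ["text_match(title, 'space')", "year >= 2000"],
    "residual": ["GROUP BY genre", "avg(rating)"]
  }
}
```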
Getting Started
Prerequisites
- Rust (2024 edition)
- Protobuf compiler (`protoc`)
Single node
```shell
cargo run
```
Docker
```shell
docker build -t ferrissearch .
docker run -p 9200:9200 -p 9300:9300 ferrissearch

curl http://localhost:9200/
# {"name": "ferrissearch-node", "version": "0.1.0", "engine": "tantivy"}
```
Multi-node cluster
```shell
# Terminal 1
./dev_cluster.sh 1   # HTTP 9200 · Transport 9300 · Raft ID 1
# Terminal 2
./dev_cluster.sh 2   # HTTP 9201 · Transport 9301 · Raft ID 2
# Terminal 3
./dev_cluster.sh 3   # HTTP 9202 · Transport 9302 · Raft ID 3
```
Configuration
Configure via config/ferrissearch.yml or FERRISSEARCH_* environment variables:
| Option | Default | Description |
|--------|---------|-------------|
| node_name | node-1 | Node identifier |
| cluster_name | ferrissearch | Cluster name |
| http_port | 9200 | REST API port |
| transport_port | 9300 | gRPC transport port |
| data_dir | ./data | Data storage directory |
| seed_hosts | ["127.0.0.1:9300"] | Seed nodes for discovery |
| raft_node_id | 1 | Unique Raft consensus node ID |
| translog_durability | request | Translog fsync mode: request (per-write) or async (timer) |
| translog_sync_interval_ms | (unset) | Background fsync interval when durability is async (default: 5000) |
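The table above can be written out as a `config/ferrissearch.yml`. The keys mirror the option names in the table; the flat (non-nested) YAML layout is an assumption.

```yaml
# config/ferrissearch.yml — defaults from the table above,
# with async translog durability shown as an example
node_name: node-1
cluster_name: ferrissearch
http_port: 9200
transport_port: 9300
data_dir: ./data
seed_hosts: ["127.0.0.1:9300"]
raft_node_id: 1
translog_durability: async
translog_sync_interval_ms: 5000
```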
API Reference
Indices
```shell
# Create an index
curl -X PUT 'http://localhost:9200/my-index' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}'

# Create an index with field mappings
curl -X PUT 'http://localhost:9200/movies' \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
      "properties": {
        "title": {"type": "text"},
        "genre": {"type": "keyword"},
        "year": {"type": "integer"},
        "rating": {"type": "float"},
        "embedding": {"type": "knn_vector", "dimension": 3}
      }
    }
  }'

# Delete an index
curl -X DELETE 'http://localhost:9200/my-index'

# Get index settings
curl 'http://localhost:9200/my-index/_settings'

# Update dynamic settings (refresh_interval, number_of_replicas)
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "2s", "number_of_replicas": 2}}'
```
Supported field types: text (analyzed), keyword (exact match), integer, float, boolean, knn_vector.
Unmapped fields are indexed into a catch-all "body" field for backward compatibility.
Documents
```shell
# Index a document (auto-generated ID)
curl -X POST 'http://localhost:9200/my-index/_doc' \
  -H 'Content-Type: application/json' \
  -d '{"title": "Hello World", "tags": "rust search"}'

# Index a document with explicit ID
curl -X PUT 'http://localhost:9200/my-index/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{"title": "Hello World", "year": 2024}'

# Get a document
curl 'http://localhost:9200/my-index/_doc/{id}'

# Delete a document
curl -X DELETE 'http://localhost:9200/my-index/_doc/{id}'

# Partial update a document (merge fields)
curl -X POST 'http://localhost:9200/my-index/_update/1' \
  -H 'Content-Type: application/json' \
  -d '{"doc": {"rating": 9.5, "genre": "scifi"}}'

# Bulk index
curl -X POST 'http://localhost:9200/my-index/_bulk' \
  -H 'Content-Type: application/json' \
  -d '[
    {"_doc_id": "doc-1", "_source": {"name": "Alice"}},
    {"_doc_id": "doc-2", "_source": {"name": "Bob"}}
  ]'
```
Search
```shell
# Match all
curl 'http://localhost:9200/my-index/_search'

# Query string with pagination
curl 'http://localhost:9200/my-index/_search?q=rust&from=0&size=10'

# Count all documents (fast — uses metadata, no search)
curl 'http://localhost:9200/my-index/_count'

# Count matching documents
curl -X POST 'http://localhost:9200/my-index/_count' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"brand": "Apple"}}}'

# DSL: match query
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"title": "search engine"}}}'

# DSL: bool query (must + must_not)
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "must": [{"match": {"title": "rust"}}],
        "must_not": [{"match": {"title": "web"}}]
      }
    }
  }'

# DSL: bool query (should = OR)
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "should": [
          {"match": {"title": "rust"}},
          {"match": {"title": "python"}}
        ]
      }
    },
    "from": 0,
    "size": 5
  }'

# Fuzzy query (typo-tolerant search)
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"fuzzy": {"title": {"value": "rsut", "fuzziness": 2}}}}'
```
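The highlights advertise k-NN vector queries, but no vector example appears in this reference. Against the `movies` mapping above (`embedding` with `dimension: 3`), a request body in the OpenSearch k-NN DSL style might look like the following — the exact query shape FerrisSearch accepts is an assumption, not confirmed by this README:

```json
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.1, 0.2, 0.3],
        "k": 5
      }
    }
  }
}
```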
Search-Aware SQL
`POST /{index}/_sql` runs a SQL query over the matched document set. Tantivy still handles text matching, relevance scoring, and pushed-down structured filters; Arrow and DataFusion handle the residual SQL-style projection, ordering, grouping, and aggregation after search-aware planning.
Current behavior:

- `text_match(field, 'query')` is pushed into Tantivy
- simple `=`, `>`, `>=`, `<`, `<=` predicates on structured fields are pushed into Tantivy filters
- `score` is exposed as a normal SQL column
- projection, `ORDER BY score`, `avg(field)`
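Putting the behavior above together, a grouped-analytics call might carry a body like the one below. Only the `POST /{index}/_sql` route is documented here, so the request shape (a top-level `query` field holding the SQL string) is an assumption:

```json
{
  "query": "SELECT genre, count(*) AS n FROM movies WHERE text_match(title, 'space') AND year >= 2000 GROUP BY genre ORDER BY n DESC"
}
```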