<p align="center"> <img src="docs/logo.png" alt="FerrisSearch" width="400"> </p>

FerrisSearch

<p align="center"> <strong>A Rust-native distributed search and search-aware analytics engine with Raft consensus, hybrid vector search, OpenSearch-compatible APIs, and SQL over matched docs — powered by <a href="https://github.com/quickwit-oss/tantivy">Tantivy</a></strong> </p> <p align="center"> <a href="#getting-started">Getting Started</a> &middot; <a href="#api-reference">API Reference</a> &middot; <a href="#how-hybrid-sql-works">Architecture</a> &middot; <a href="#benchmarks">Benchmarks</a> &middot; <a href="#replication">Replication</a> &middot; <a href="#testing">Testing</a> &middot; <a href="#roadmap">Roadmap</a> </p>

FerrisSearch is a high-performance, Rust-native distributed search engine with OpenSearch-compatible REST APIs, hybrid vector retrieval, and a search-aware SQL layer for querying matched documents as a dataset. It is built for teams that want the familiar OpenSearch interface with the performance and safety of Rust, without giving up on structured analytics over search results.

⚡ Performance: 2M documents — ingestion at 10,402 docs/sec, search at 142.4 queries/sec (p50 = 24.8ms), zero errors — see benchmarks

Highlights

  • OpenSearch-compatible REST API — drop-in PUT /{index}, POST /_doc, GET /_search endpoints
  • Raft consensus — cluster state managed by openraft; quorum-based leader election, linearizable writes, automatic failover, persistent log storage via redb
  • Vector search — k-NN approximate nearest neighbor search via USearch (HNSW algorithm); hybrid full-text + vector queries
  • Search-aware SQL — SQL over matched docs with pushdown-aware planning, local fast-field execution when possible, planner metadata, and grouped analytics over the matched result set
  • Distributed clustering — multi-node clusters with shard-based data distribution
  • Synchronous replication — primary-replica replication over gRPC; writes acknowledged only after all in-sync replicas confirm
  • Scatter-gather search — queries fan out across shards, results merged and returned
  • Crash recovery — binary write-ahead log (WAL) with configurable durability (request fsync-per-write or async timer-based); sequence number checkpointing and translog-based replica recovery
  • Zero external dependencies — no JVM, no ZooKeeper, just a single binary
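The two translog durability modes from the crash-recovery bullet can be sketched in a few lines. This is a toy model, not FerrisSearch's actual WAL layout: it assumes a simple length-prefixed binary record format, and `MiniWal`/`replay` are illustrative names.

```python
import os
import struct

class MiniWal:
    """Toy write-ahead log with length-prefixed binary records."""

    def __init__(self, path: str, fsync_per_write: bool = True):
        # fsync_per_write=True models "request" durability (fsync on every
        # append); False models "async" durability, where a background timer
        # would call flush() periodically instead.
        self.f = open(path, "ab")
        self.fsync_per_write = fsync_per_write

    def append(self, payload: bytes) -> None:
        self.f.write(struct.pack(">I", len(payload)) + payload)
        if self.fsync_per_write:
            self.flush()  # record is durable before the write is acknowledged

    def flush(self) -> None:
        self.f.flush()
        os.fsync(self.f.fileno())

    def close(self) -> None:
        self.flush()
        self.f.close()

def replay(path: str) -> list[bytes]:
    """Crash recovery: re-read every intact record from the log."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            records.append(f.read(length))
    return records
```

With `request` durability a crash loses no acknowledged writes; with `async`, writes since the last timer flush may be lost — the trade-off `translog_sync_interval_ms` controls.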

Why FerrisSearch Is Different

FerrisSearch does more than expose SQL on top of a search API. The current direction is a true hybrid execution model:

  • Tantivy executes search-native work first: full-text matching, scoring, and pushdown-friendly structured filters
  • Fast fields stay in the hot path: when the query is eligible and shards are local, structured columns are read directly from Tantivy fast fields instead of materializing _source
  • SQL runs on matched docs, not the whole index: Arrow and DataFusion operate on the narrowed result set or merged partial states
  • Planner metadata is visible: responses show execution_mode and a planner block so you can see what was pushed down vs. what stayed residual

That makes FerrisSearch useful for workflows like:

  • relevance debugging with score and structured filters in one query
  • grouped analytics over matched docs
  • internal dashboards over live search results
  • interactive search + analysis without moving logic into client code

How Hybrid SQL Works

flowchart LR
  A[SQL Query] --> B[Hybrid Planner]
  B --> C[Tantivy Search-Native Work]
  B --> D[Residual SQL Work]
  C --> E[Matched docs plus score]
  C --> F[Fast-field reads or shard-local partials]
  E --> G[Arrow RecordBatch]
  F --> G
  D --> H[DataFusion]
  G --> H
  H --> I[Rows plus planner metadata]

The intended execution order is:

  1. Tantivy handles text_match(...), scoring, and pushdown-friendly structured filters.
  2. Eligible structured columns are read directly from fast fields.
  3. Arrow batches represent matched docs or merged partial states.
  4. DataFusion executes only the remaining relational work.
  5. The API returns rows plus execution_mode and planner metadata.
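The split between pushdown and residual work in the steps above can be illustrated with a toy planner. The predicate shapes and the `execution_mode` values here are assumptions for illustration, not FerrisSearch's actual planner output.

```python
# Pushdown-friendly comparison operators, per the "Search-Aware SQL" section.
PUSHDOWN_OPS = {"=", ">", ">=", "<", "<="}

def plan(predicates: list[tuple]) -> dict:
    """Split predicates into search-native (Tantivy) and residual (DataFusion) work.

    Each predicate is a tuple whose first element is either "text_match"
    or an operator name.
    """
    pushed, residual = [], []
    for pred in predicates:
        if pred[0] == "text_match" or pred[0] in PUSHDOWN_OPS:
            pushed.append(pred)    # executed inside the Tantivy search
        else:
            residual.append(pred)  # left for DataFusion after matching
    mode = "pushdown_only" if not residual else "hybrid"
    return {"pushed": pushed, "residual": residual, "execution_mode": mode}
```

A `LIKE` pattern, for example, is not in the pushdown set, so it stays residual while the text match and range filter ride along with the search.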

Getting Started

Prerequisites

  • Rust (2024 edition)
  • Protobuf compiler (protoc)

Single node

cargo run

Docker

docker build -t ferrissearch .
docker run -p 9200:9200 -p 9300:9300 ferrissearch
curl http://localhost:9200/
{"name": "ferrissearch-node", "version": "0.1.0", "engine": "tantivy"}

Multi-node cluster

# Terminal 1
./dev_cluster.sh 1    # HTTP 9200 · Transport 9300 · Raft ID 1

# Terminal 2
./dev_cluster.sh 2    # HTTP 9201 · Transport 9301 · Raft ID 2

# Terminal 3
./dev_cluster.sh 3    # HTTP 9202 · Transport 9302 · Raft ID 3
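Three nodes is the smallest cluster that survives a node failure: openraft, like any Raft implementation, needs a strict majority for leader election and log commit. The arithmetic is simple enough to state directly:

```python
def quorum(nodes: int) -> int:
    """Votes required for Raft leader election and log commit."""
    return nodes // 2 + 1

def tolerated_failures(nodes: int) -> int:
    """Nodes that can be lost while the cluster can still commit writes."""
    return nodes - quorum(nodes)
```

A 3-node cluster commits with 2 acknowledgements and tolerates 1 failure; 5 nodes tolerate 2 at the cost of a larger commit quorum.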

Configuration

Configure via config/ferrissearch.yml or FERRISSEARCH_* environment variables:

| Option | Default | Description |
|--------|---------|-------------|
| node_name | node-1 | Node identifier |
| cluster_name | ferrissearch | Cluster name |
| http_port | 9200 | REST API port |
| transport_port | 9300 | gRPC transport port |
| data_dir | ./data | Data storage directory |
| seed_hosts | ["127.0.0.1:9300"] | Seed nodes for discovery |
| raft_node_id | 1 | Unique Raft consensus node ID |
| translog_durability | request | Translog fsync mode: request (per-write) or async (timer) |
| translog_sync_interval_ms | (unset) | Background fsync interval when durability is async (default: 5000) |

API Reference

Indices

# Create an index
curl -X PUT 'http://localhost:9200/my-index' \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}'

# Create an index with field mappings
curl -X PUT 'http://localhost:9200/movies' \
  -H 'Content-Type: application/json' \
  -d '{
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
      "properties": {
        "title":     {"type": "text"},
        "genre":     {"type": "keyword"},
        "year":      {"type": "integer"},
        "rating":    {"type": "float"},
        "embedding": {"type": "knn_vector", "dimension": 3}
      }
    }
  }'

# Delete an index
curl -X DELETE 'http://localhost:9200/my-index'

# Get index settings
curl 'http://localhost:9200/my-index/_settings'

# Update dynamic settings (refresh_interval, number_of_replicas)
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "2s", "number_of_replicas": 2}}'

Supported field types: text (analyzed), keyword (exact match), integer, float, boolean, knn_vector. Unmapped fields are indexed into a catch-all "body" field for backward compatibility.
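The catch-all behavior in the last sentence amounts to a small routing step before indexing. A sketch, where folding unmapped values into a single space-joined string is an illustrative assumption:

```python
def route_fields(doc: dict, mappings: dict) -> dict:
    """Keep mapped fields as-is; fold unmapped fields into a catch-all "body"."""
    routed, body_parts = {}, []
    for field, value in doc.items():
        if field in mappings:
            routed[field] = value
        else:
            body_parts.append(str(value))
    if body_parts:
        routed["body"] = " ".join(body_parts)
    return routed
```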

Documents

# Index a document (auto-generated ID)
curl -X POST 'http://localhost:9200/my-index/_doc' \
  -H 'Content-Type: application/json' \
  -d '{"title": "Hello World", "tags": "rust search"}'

# Index a document with explicit ID
curl -X PUT 'http://localhost:9200/my-index/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{"title": "Hello World", "year": 2024}'

# Get a document
curl 'http://localhost:9200/my-index/_doc/{id}'

# Delete a document
curl -X DELETE 'http://localhost:9200/my-index/_doc/{id}'

# Partial update a document (merge fields)
curl -X POST 'http://localhost:9200/my-index/_update/1' \
  -H 'Content-Type: application/json' \
  -d '{"doc": {"rating": 9.5, "genre": "scifi"}}'

# Bulk index
curl -X POST 'http://localhost:9200/my-index/_bulk' \
  -H 'Content-Type: application/json' \
  -d '[
    {"_doc_id": "doc-1", "_source": {"name": "Alice"}},
    {"_doc_id": "doc-2", "_source": {"name": "Bob"}}
  ]'
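Note that this bulk body is a plain JSON array of `_doc_id`/`_source` objects, not OpenSearch's newline-delimited action/source pairs. A tiny helper for building it (`bulk_payload` is an illustrative name, not part of any client library):

```python
import json

def bulk_payload(docs: dict) -> str:
    """Serialize {doc_id: source_dict} into the JSON-array bulk body shown above."""
    return json.dumps(
        [{"_doc_id": doc_id, "_source": source} for doc_id, source in docs.items()]
    )
```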

Search

# Match all
curl 'http://localhost:9200/my-index/_search'

# Query string with pagination
curl 'http://localhost:9200/my-index/_search?q=rust&from=0&size=10'

# Count all documents (fast — uses metadata, no search)
curl 'http://localhost:9200/my-index/_count'

# Count matching documents
curl -X POST 'http://localhost:9200/my-index/_count' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"term": {"brand": "Apple"}}}'

# DSL: match query
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"title": "search engine"}}}'

# DSL: bool query (must + must_not)
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "must": [{"match": {"title": "rust"}}],
        "must_not": [{"match": {"title": "web"}}]
      }
    }
  }'

# DSL: bool query (should = OR)
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
      "bool": {
        "should": [
          {"match": {"title": "rust"}},
          {"match": {"title": "python"}}
        ]
      }
    },
    "from": 0,
    "size": 5
  }'
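The bool-query semantics in the two examples above (must = AND, must_not = NOT, should = OR) can be modeled with a toy in-memory evaluator. Whitespace tokenization here is a stand-in for real text analysis, and scoring is ignored:

```python
def matches(doc: dict, query: dict) -> bool:
    """Toy evaluator for match and bool queries over an in-memory document."""
    if "match" in query:
        field, text = next(iter(query["match"].items()))
        terms = str(doc.get(field, "")).lower().split()
        return all(t in terms for t in text.lower().split())
    if "bool" in query:
        clause = query["bool"]
        if not all(matches(doc, q) for q in clause.get("must", [])):
            return False
        if any(matches(doc, q) for q in clause.get("must_not", [])):
            return False
        should = clause.get("should", [])
        if should and not clause.get("must"):
            # with no must clauses, at least one should clause must match
            return any(matches(doc, q) for q in should)
        return True
    return False
```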

# Fuzzy query (typo-tolerant search)
curl -X POST 'http://localhost:9200/my-index/_search' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"fuzzy": {"title": {"value": "rsut", "fuzziness": 2}}}}'
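`fuzziness` bounds the edit distance between the query term and indexed terms. A plain Levenshtein sketch shows why `rsut` matches `rust` at fuzziness 2 (two substitutions; a transposition-aware distance would count it as 1 — which variant applies here is not specified above):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insert/delete/substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]
```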

Search-Aware SQL

POST /{index}/_sql runs a SQL query over the matched document set. Tantivy still handles text matching, relevance scoring, and pushed-down structured filters; Arrow and DataFusion handle the residual SQL-style projection, ordering, grouping, and aggregation after search-aware planning.

Current behavior:

  • text_match(field, 'query') is pushed into Tantivy
  • simple =, >, >=, <, <= predicates on structured fields are pushed into Tantivy filters
  • score is exposed as a normal SQL column
  • projection, ORDER BY score, and aggregations such as avg(field) run as residual work over the matched result set
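Putting these pieces together, a request mixing text matching, a pushdown range filter, the score column, and ordering might look as follows. The `{"query": "..."}` body shape is an assumption for illustration; the SQL itself uses only constructs listed above, against a hypothetical movies index:

```python
import json

# text_match(...) and year >= 2000 are pushdown candidates per the bullets
# above; the projection and LIMIT are relational work over the matched set.
sql = (
    "SELECT title, year, score FROM movies "
    "WHERE text_match(title, 'space opera') AND year >= 2000 "
    "ORDER BY score DESC LIMIT 10"
)
body = json.dumps({"query": sql})
# POST this body to http://localhost:9200/movies/_sql
```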
