<img src="https://raw.githubusercontent.com/Semafind/semadb/main/docs/static/logowithtext.svg" alt="SemaDB" style="height: 200px;"/> No fuss multi-index hybrid vector database / search engine


SemaDB is a multi-index, multi-vector, document-based vector database / search engine. It is designed to offer a clear and easy-to-use JSON RESTful API. The original components of SemaDB were built for a knowledge-management project at Semafind before it was developed into a standalone project. The goal is to provide a simple, modern, and efficient search engine that can be used in a variety of applications.

Looking for a hosted solution? SemaDB Cloud Beta is available on RapidAPI.

Features ⚡

Vector search: leverage the power of vector search to find similar items and build AI applications. SemaDB uses the graph-based Vamana algorithm to perform efficient approximate nearest neighbour search.
Keyword / text search: search for documents based on keywords or phrases, categories, tags etc.
Geo indices: search for documents based on their location either via latitude and longitude or geo hashes.
Multi-vector search: search across multiple vectors of a single document at the same time, each with its own index.
Quantized vector search: use quantizers to change internal vector representations to reduce memory usage.
Hybrid search: combine vector and keyword search to find the most relevant documents in a single search request. Use weights to adjust the importance of each search type.
Filter search: filter search results based on other queries or metadata.
Hybrid, filter, multi-vector, multi-index search: combine all the above search types in a single query. For example, "find me the nearest restaurants (geo index) that are open now (inverted index), have a rating of 4 or more (integer index), and serve a dish similar to this image (vector search) and have a similar description to this text (vector search)".
Simple REST API: a JSON- or MessagePack-based RESTful API for interacting with the database. No need to learn a new query language or install custom clients or libraries.
Real-time: changes are visible immediately and search results are returned in milliseconds, even for large datasets.
Single binary: the entire database is contained in a single binary. No need to install additional dependencies or services.
Multiple deployment modes: standalone, container, or cloud. SemaDB can be deployed in a variety of ways to suit your needs.
Prometheus metrics: monitor the health of SemaDB with metrics such as number of searched points, latency of search, number of requests etc.
Cluster mode: data is distributed across multiple servers and search is offloaded to all participating machines.
Automatic sharding: data is automatically sharded across multiple servers based on a hashing algorithm.
Multi-tenancy: multiple users with different plans can use the same SemaDB instance. Each user can have their own collections and indices.
Legible source code: the goal is to allow anyone to pick a random file and hopefully understand what is going on. Some files have more comments than code.
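
To illustrate what weighting search types in a hybrid query means, here is a minimal sketch of weighted score fusion. It is not SemaDB's actual scoring code: the function name `combine_scores`, the default weights, and the assumption that each search type yields a `{document id: score}` map where higher is better are all hypothetical and only serve the illustration.

```python
def combine_scores(vector_scores, keyword_scores, vector_weight=0.7, keyword_weight=0.3):
    """Merge two {doc_id: score} maps into one weighted ranking, best first.

    Hypothetical helper: assumes higher scores are better and that a document
    missing from one search type contributes 0 for that type.
    """
    doc_ids = set(vector_scores) | set(keyword_scores)
    combined = {
        doc_id: vector_weight * vector_scores.get(doc_id, 0.0)
        + keyword_weight * keyword_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by combined score, highest first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

Raising `vector_weight` relative to `keyword_weight` makes semantic similarity dominate the final ranking, and lowering it favours exact keyword matches.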

Getting Started

SemaDB has an easy-to-use HTTP API. Once a server is running (see Running below), you can create a collection, add points and search them:

import requests
base_url = "https://semadb.p.rapidapi.com"
# Or use an appropriate base_url if you are using a self-hosted instance, e.g.
# base_url = "http://localhost:8081/v2"

headers = {
    "content-type": "application/json",
    "X-RapidAPI-Key": "<SEMADB_API_KEY>",
    "X-RapidAPI-Host": "semadb.p.rapidapi.com",
    # Or if self-hosting
    # "X-User-Id": "<USER_ID>",
    # "X-User-Plan": "BASIC"
}

# Create Collection

payload = {
    "id": "mycollection",
    "indexSchema": {
        "vector": {
            "type": "vectorVamana",
            "vectorVamana": {
                "vectorSize": 384, # e.g. Sentence transformers give embeddings of size 384
                "distanceMetric": "cosine",
                "searchSize": 75, # How exhaustive the search should be?
                "degreeBound": 64, # How dense the graph should be?
                "alpha": 1.2, # How much longer edges should be preferred?
            }
        }
    }
}

response = requests.post(base_url + "/collections", json=payload, headers=headers)

print(response.json())
# {"message": "collection created"}

# Insert Points

payload = {
  "points": [
    {
      "vector": [...],
      "myfield": "..."
    },
  ]
}

response = requests.post(base_url+"/collections/mycollection/points", json=payload, headers=headers)

# Or use MessagePack for a faster, more compact encoding (recommended for inserting points).
# A separate headers dict keeps the JSON content type intact for the later requests.
import msgpack
msgpack_headers = {**headers, "content-type": "application/msgpack"}
response = requests.post(base_url+"/collections/mycollection/points", data=msgpack.dumps(payload), headers=msgpack_headers)

print(response.json())

# Search

payload = {
    "query": {
        "property": "vector",
        "vectorVamana": {
            "vector": [...], # Vector to search
            "operator": "near",
            "searchSize": 75,
            "limit": 3, # top 3 results
        }
    },
    # Restrict what is returned
    "select": ["myfield"],
    "limit": 3, # overall search limit if hybrid search etc.
}

response = requests.post(base_url+"/collections/mycollection/points/search", json=payload, headers=headers)

print(response.json())
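
When searching repeatedly, it may help to wrap the search payload shown above in a small helper that also checks the query vector length before a request is sent. The following is only a sketch: `build_search_payload` is a hypothetical name, not part of SemaDB, and the default `vector_size` of 384 simply mirrors the example collection created above.

```python
def build_search_payload(vector, vector_size=384, limit=3, search_size=75, select=None):
    """Build a vectorVamana search payload in the shape used in the example above.

    Hypothetical helper: raises ValueError if the query vector does not match
    the vectorSize the collection was created with.
    """
    if len(vector) != vector_size:
        raise ValueError(f"expected a vector of size {vector_size}, got {len(vector)}")
    payload = {
        "query": {
            "property": "vector",
            "vectorVamana": {
                "vector": list(vector),
                "operator": "near",
                "searchSize": search_size,
                "limit": limit,
            },
        },
        "limit": limit,
    }
    if select is not None:
        # Restrict which fields are returned
        payload["select"] = list(select)
    return payload
```

The result can be passed as `json=build_search_payload(my_vector, select=["myfield"])` to the same `requests.post` search call as above, where `my_vector` stands for your own query embedding.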

For further instructions, please refer to the getting started guide and the full documentation.

Running

To get started from source, please follow the instructions to install Go. That is the only dependency required to run SemaDB. We try to keep SemaDB as self-contained as possible and up-to-date with the latest Go releases.

SemaDB reads all of its configuration from a YAML file; some examples are provided in the config folder. You can run a single server using:

SEMADB_CONFIG=./config/singleServer.yaml go run ./

If you are using VS Code as your editor, there are pre-made tasks that do the same thing and can also launch a local cluster in debug mode.

After you have a server running, you can use the samples file to see some example requests that can be made to the server. To make the most of it, install the REST Client extension which will allow you to make requests directly in the editor and show the results.
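
Before sending requests from a script, it can be useful to wait until the server accepts connections. The sketch below polls a TCP port using only the Python standard library. The function name `wait_for_port` is hypothetical, and the default port 8081 is an assumption taken from the self-hosted base URL and the container port mapping in this README; adjust it to match your configuration file.

```python
import socket
import time


def wait_for_port(host="localhost", port=8081, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A successful connection only shows that something is listening on the port, not that SemaDB has finished starting up, so treat it as a coarse readiness check.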

Docker & Podman

You can run the latest version of SemaDB using the following repository container image:

docker run -it --rm -v ./config:/config -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data -p 8081:8081 ghcr.io/semafind/semadb:main
# If using podman
podman run -it --rm -v ./config:/config:Z -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data:Z -p 8081:8081 ghcr.io/semafind/semadb:main

which will run the main branch. There are also tagged versions for specific releases. See the container registry of the repository for stable and production-ready versions.

You can locally build and run the container image using:

docker build -t semadb ./
docker run -it --rm -v ./config:/config -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data -p 8081:8081 semadb
# If using podman
podman build -t semadb ./
# The :Z argument relabels to access: see https://github.com/containers/podman/issues/3683
podman run -it --rm -v ./config:/config:Z -e SEMADB_CONFIG=/config/singleServer.yaml -v ./data:/data:Z -p 8081:8081 semadb

Data Persistence: SemaDB stores data in a directory on disk which is specified in the configuration file as rootDir. By default, the data directory is ./data and the semadb executable is located at / giving /data as the mount point in the container.

Please note that when using docker, the hostname and whitelisting of IPs may need to be adjusted depending on the network configuration of docker. Leaving hostname as a blank string and setting whitelisting to '*' opens up SemaDB to every connection as done in the singleServer.yaml configuration.

Contributing

Contributions are welcome! Please read the contributing guide file for more information. The contributing guide also contains information about the architecture of SemaDB and how to get started with development.

Search Algorithm 🔍

SemaDB's core vector search algorithm is based on the following excellent research papers:

Jayaram Subramanya, Suhas, et al. "Diskann: Fast accurate billion-
