CocoIndex
Ultra performant data transformation framework for AI, with its core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.
⭐ Drop a star to help us grow!
<div align="center"> <!-- Keep these links. Translations will automatically update with the README. -->Deutsch | English | Español | français | 日本語 | 한국어 | Português | Русский | 中文
</div> </br> <p align="center"> <img src="https://cocoindex.io/images/transformation.svg" alt="CocoIndex Transformation"> </p> </br>CocoIndex makes it effortless to transform data with AI and keep source data and targets in sync. Whether you're building a vector index, creating knowledge graphs for context engineering, or performing any custom data transformation, CocoIndex goes beyond SQL.
</br> <p align="center"> <img alt="CocoIndex Features" src="https://cocoindex.io/images/venn2.svg" /> </p> </br>
Exceptional velocity
Just declare transformations as dataflow in ~100 lines of Python:
# import
data['content'] = flow_builder.add_source(...)
# transform
data['out'] = data['content'] \
    .transform(...) \
    .transform(...)
# collect data
collector.collect(...)
# export to db, vector db, graph db ...
collector.export(...)
CocoIndex follows the dataflow programming model. Each transformation creates a new field solely based on input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.
In particular, developers don't explicitly mutate data by creating, updating, and deleting records; they only define transformations/formulas over a set of source data.
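The same idea can be sketched in plain Python (the helper below is hypothetical, not the CocoIndex API): each step derives a new field purely from existing fields, so every intermediate value stays observable and nothing is mutated in place.

```python
# Minimal sketch of the dataflow idea: each step derives a new
# field from existing fields; the input row is never mutated.
# `add_derived_field` is a hypothetical helper, not a CocoIndex API.

def add_derived_field(row: dict, name: str, fn, *inputs: str) -> dict:
    """Return a new row with `name` computed purely from `inputs`."""
    new_row = dict(row)  # the original row is left untouched
    new_row[name] = fn(*(row[i] for i in inputs))
    return new_row

doc = {"content": "Hello CocoIndex"}
doc2 = add_derived_field(doc, "upper", str.upper, "content")
doc3 = add_derived_field(doc2, "length", len, "upper")

# Every stage before/after a transformation is still observable:
print(doc)   # {'content': 'Hello CocoIndex'}
print(doc3)  # {'content': 'Hello CocoIndex', 'upper': 'HELLO COCOINDEX', 'length': 15}
```

Because each field is a pure function of its inputs, lineage falls out naturally: any output field can be traced back to the exact inputs it was derived from.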
Plug-and-Play Building Blocks
Native built-ins for different sources, targets, and transformations. Standardized interfaces make switching between components a one-line code change - as easy as assembling building blocks.
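The one-line switch can be illustrated with stand-in target classes behind a shared interface (`PostgresTarget` and `QdrantTarget` here are hypothetical sketches, not the real `cocoindex.targets` classes):

```python
# Sketch: targets share one interface, so switching is a one-line change.
# PostgresTarget / QdrantTarget are hypothetical stand-ins, not the real
# cocoindex.targets implementations.
from typing import Protocol


class Target(Protocol):
    def export(self, rows: list[dict]) -> str: ...


class PostgresTarget:
    def export(self, rows: list[dict]) -> str:
        return f"wrote {len(rows)} rows to Postgres"


class QdrantTarget:
    def export(self, rows: list[dict]) -> str:
        return f"wrote {len(rows)} points to Qdrant"


rows = [{"id": 1}, {"id": 2}]
target: Target = PostgresTarget()  # swap to QdrantTarget() and nothing else changes
print(target.export(rows))         # wrote 2 rows to Postgres
```

The rest of the pipeline only depends on the shared interface, so swapping the concrete component touches exactly one line.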
<p align="center"> <img src="https://cocoindex.io/images/components.svg" alt="CocoIndex Features"> </p>
Data Freshness
CocoIndex keeps source data and targets in sync effortlessly.
<p align="center"> <img src="https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6" alt="Incremental Processing" width="700"> </p>
It has out-of-the-box support for incremental indexing:
- Minimal recomputation on source or logic changes.
- (Re-)processing only the necessary portions; reusing cached results whenever possible.
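The effect can be sketched with a simple content-hash cache (an illustration of the idea only, not CocoIndex's actual engine): unchanged inputs are served from cache, so only new or changed sources are recomputed.

```python
import hashlib

# Sketch of incremental processing: recompute a document's output only
# when its content hash changes; otherwise reuse the cached result.
# This illustrates the idea, not CocoIndex's real implementation.
cache: dict[str, tuple[str, str]] = {}  # filename -> (content_hash, result)
recomputed: list[str] = []              # which files were actually reprocessed


def process(filename: str, content: str) -> str:
    digest = hashlib.sha256(content.encode()).hexdigest()
    cached = cache.get(filename)
    if cached and cached[0] == digest:
        return cached[1]               # source unchanged: reuse cached output
    recomputed.append(filename)        # source new/changed: recompute
    result = content.upper()           # stand-in for a real transformation
    cache[filename] = (digest, result)
    return result


process("a.md", "alpha")
process("b.md", "beta")
process("a.md", "alpha")    # unchanged -> cache hit, no recomputation
process("b.md", "beta v2")  # changed -> recomputed
print(recomputed)  # ['a.md', 'b.md', 'b.md']
```

Only `b.md` is reprocessed on the second pass, because its content changed; `a.md` is served from cache.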
Quick Start
If you're new to CocoIndex, we recommend checking out the Quick Start Guide.
Setup
- Install the CocoIndex Python library:
pip install -U cocoindex
- Install Postgres if you don't have one. CocoIndex uses it for incremental processing.
- (Optional) Install the Claude Code skill for an enhanced development experience. Run these commands in Claude Code:
/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex
Define data flow
Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
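To make `chunk_size` and `chunk_overlap` concrete, here is a naive fixed-window splitter (an illustration only; `SplitRecursively` is structure-aware and works differently):

```python
# Naive fixed-window chunking to illustrate what chunk_size / chunk_overlap
# mean. SplitRecursively is smarter (it respects document structure such as
# markdown sections); this sketch only shows the two parameters.
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = naive_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk shares `chunk_overlap` characters with its neighbor, so a sentence cut at a chunk boundary still appears intact in at least one chunk.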
The flow above defines an index pipeline like this:
<p align="center"> <img width="400" alt="Data Flow" src="https://github.com/user-attachments/assets/2ea7be6d-3d94-42b1-b2bd-22515577e463" /> </p>
🚀 Examples and demo
| Example | Description |
|---------|-------------|
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDF and index text embeddings for semantic search |
| PDF Elements Embedding | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
| Manuals LLM Extraction | Extract structured information from a manual using LLM |
| Amazon S3 Embedding | Index text documents from Amazon S3 |
| Azure Blob Storage Embedding | Index text documents from Azure Blob Storage |
| Google Drive Text Embedding | Index text documents from Google Drive |
| Meeting Notes to Knowledge Graph | Extract structured meeting info from Google Drive and build a knowledge graph |
| Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph |
| Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search |
| Embeddings to LanceDB | Index documents in a LanceDB collection for semantic search |
| FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup |
| Product Recommendation | Build real-time product recommendations with LLM and graph database |
| Image Search with Vision API | Generate detailed captions for images with a vision model, embed them, and enable live-updating semantic search via FastAPI, served on a React frontend |
| Face Recognition | Recognize faces in images and build an embedding index |