CocoIndex
Ultra performant data transformation framework for AI, with its core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.
⭐ Drop a star to help us grow!
<div align="center"> <!-- Keep these links. Translations will automatically update with the README. -->Deutsch | English | Español | français | 日本語 | 한국어 | Português | Русский | 中文
</div> </br> <p align="center"> <img src="https://cocoindex.io/images/transformation.svg" alt="CocoIndex Transformation"> </p> </br>CocoIndex makes it effortless to transform data with AI and keep source data and targets in sync. Whether you're building a vector index, creating knowledge graphs for context engineering, or performing any custom data transformation, CocoIndex goes beyond SQL.
</br> <p align="center"> <img alt="CocoIndex Features" src="https://cocoindex.io/images/venn2.svg" /> </p> </br>
Exceptional velocity
Just declare transformations as dataflow in ~100 lines of Python:
# import
data['content'] = flow_builder.add_source(...)
# transform
data['out'] = data['content'] \
    .transform(...) \
    .transform(...)
# collect data
collector.collect(...)
# export to db, vector db, graph db ...
collector.export(...)
CocoIndex follows the dataflow programming model. Each transformation creates a new field solely based on input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.
In particular, developers don't explicitly mutate data by creating, updating, and deleting records; they only define transformations/formulas over a set of source data.
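The same idea can be sketched in plain Python (the helper below is hypothetical, not the CocoIndex API): each step derives a new field purely from existing fields, so every intermediate value stays observable and nothing is mutated in place.

```python
# Minimal sketch of the dataflow idea: each step derives a new
# field from existing fields; the input row is never mutated.
# `add_derived_field` is a hypothetical helper, not a CocoIndex API.

def add_derived_field(row: dict, name: str, fn, *inputs: str) -> dict:
    """Return a new row with `name` computed purely from `inputs`."""
    new_row = dict(row)  # the original row is left untouched
    new_row[name] = fn(*(row[i] for i in inputs))
    return new_row

doc = {"content": "Hello CocoIndex"}
doc2 = add_derived_field(doc, "upper", str.upper, "content")
doc3 = add_derived_field(doc2, "length", len, "upper")

# Every stage before/after a transformation is still observable:
print(doc)   # {'content': 'Hello CocoIndex'}
print(doc3)  # {'content': 'Hello CocoIndex', 'upper': 'HELLO COCOINDEX', 'length': 15}
```

Because each field is a pure function of its inputs, lineage falls out naturally: any output field can be traced back to the exact inputs it was derived from.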
Plug-and-Play Building Blocks
Native built-ins for different sources, targets, and transformations. Standardized interfaces make switching between components a one-line code change - as easy as assembling building blocks.
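The one-line switch can be illustrated with stand-in target classes behind a shared interface (`PostgresTarget` and `QdrantTarget` here are hypothetical sketches, not the real `cocoindex.targets` classes):

```python
# Sketch: targets share one interface, so switching is a one-line change.
# PostgresTarget / QdrantTarget are hypothetical stand-ins, not the real
# cocoindex.targets implementations.
from typing import Protocol


class Target(Protocol):
    def export(self, rows: list[dict]) -> str: ...


class PostgresTarget:
    def export(self, rows: list[dict]) -> str:
        return f"wrote {len(rows)} rows to Postgres"


class QdrantTarget:
    def export(self, rows: list[dict]) -> str:
        return f"wrote {len(rows)} points to Qdrant"


rows = [{"id": 1}, {"id": 2}]
target: Target = PostgresTarget()  # swap to QdrantTarget() and nothing else changes
print(target.export(rows))         # wrote 2 rows to Postgres
```

The rest of the pipeline only depends on the shared interface, so swapping the concrete component touches exactly one line.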
<p align="center"> <img src="https://cocoindex.io/images/components.svg" alt="CocoIndex Features"> </p>
Data Freshness
CocoIndex keeps source data and targets in sync effortlessly.
<p align="center"> <img src="https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6" alt="Incremental Processing" width="700"> </p>
It has out-of-the-box support for incremental indexing:
- Minimal recomputation on source or logic changes.
- (Re-)processing only the necessary portions; reusing cached results whenever possible.
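The effect can be sketched with a simple content-hash cache (an illustration of the idea only, not CocoIndex's actual engine): unchanged inputs are served from cache, so only new or changed sources are recomputed.

```python
import hashlib

# Sketch of incremental processing: recompute a document's output only
# when its content hash changes; otherwise reuse the cached result.
# This illustrates the idea, not CocoIndex's real implementation.
cache: dict[str, tuple[str, str]] = {}  # filename -> (content_hash, result)
recomputed: list[str] = []              # which files were actually reprocessed


def process(filename: str, content: str) -> str:
    digest = hashlib.sha256(content.encode()).hexdigest()
    cached = cache.get(filename)
    if cached and cached[0] == digest:
        return cached[1]               # source unchanged: reuse cached output
    recomputed.append(filename)        # source new/changed: recompute
    result = content.upper()           # stand-in for a real transformation
    cache[filename] = (digest, result)
    return result


process("a.md", "alpha")
process("b.md", "beta")
process("a.md", "alpha")    # unchanged -> cache hit, no recomputation
process("b.md", "beta v2")  # changed -> recomputed
print(recomputed)  # ['a.md', 'b.md', 'b.md']
```

Only `b.md` is reprocessed on the second pass, because its content changed; `a.md` is served from cache.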
Quick Start
If you're new to CocoIndex, we recommend checking out the Quick Start Guide.
Setup
- Install the CocoIndex Python library:
pip install -U cocoindex
- Install Postgres if you don't have one. CocoIndex uses it for incremental processing.
- (Optional) Install the Claude Code skill for an enhanced development experience. Run these commands in Claude Code:
/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex
Define data flow
Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:
@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
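To make `chunk_size` and `chunk_overlap` concrete, here is a naive fixed-window splitter (an illustration only; `SplitRecursively` is structure-aware and works differently):

```python
# Naive fixed-window chunking to illustrate what chunk_size / chunk_overlap
# mean. SplitRecursively is smarter (it respects document structure such as
# markdown sections); this sketch only shows the two parameters.
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = naive_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Each chunk shares `chunk_overlap` characters with its neighbor, so a sentence cut at a chunk boundary still appears intact in at least one chunk.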
The flow above defines an index pipeline like this:
<p align="center"> <img width="400" alt="Data Flow" src="https://github.com/user-attachments/assets/2ea7be6d-3d94-42b1-b2bd-22515577e463" /> </p>
🚀 Examples and demo
| Example | Description |
|---------|-------------|
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDF and index text embeddings for semantic search |
| PDF Elements Embedding | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
| Manuals LLM Extraction | Extract structured information from a manual using LLM |
| Amazon S3 Embedding | Index text documents from Amazon S3 |
| Azure Blob Storage Embedding | Index text documents from Azure Blob Storage |
| Google Drive Text Embedding | Index text documents from Google Drive |
| Meeting Notes to Knowledge Graph | Extract structured meeting info from Google Drive and build a knowledge graph |
| Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph |
| Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search |
| Embeddings to LanceDB | Index documents in a LanceDB collection for semantic search |
| FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup |
| Product Recommendation | Build real-time product recommendations with LLM and graph database |
| Image Search with Vision API | Generate detailed captions for images with a vision model, embed them, and enable live-updating semantic search via FastAPI, served on a React frontend |
| Face Recognition | Recognize faces in images and build an embedding index |