VectorETL
Build super simple end-to-end data & ETL pipelines for your vector databases and Generative AI applications
<a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-blue.svg?style=for-the-badge" /></a>
<a href="https://pypi.org/project/vector-etl/"><img src="https://badgen.net/badge/Open%20Source%20%3F/Yes%21/blue?icon=github" /></a>
VectorETL by Context Data is a modular framework designed to help Data & AI engineers process data for their AI applications in just a few minutes!
VectorETL streamlines the process of converting diverse data sources into vector embeddings and storing them in various vector databases. It supports multiple data sources (databases, cloud storage, and local files), various embedding models (including OpenAI, Cohere, and Google Gemini), and several vector database targets (like Pinecone, Qdrant, and Weaviate).
This pipeline aims to simplify the creation and management of vector search systems, enabling developers and data scientists to easily build and scale applications that require semantic search, recommendation systems, or other vector-based operations.
Features
- Modular architecture with support for multiple data sources, embedding models, and vector databases
- Batch processing for efficient handling of large datasets
- Configurable chunking and overlapping for text data
- Easy integration of new data sources, embedding models, and vector databases
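The chunking-with-overlap behavior listed above can be sketched as follows. This is a simplified character-based illustration of the idea, not VectorETL's actual implementation:

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With chunk_size=4 and chunk_overlap=2, each chunk repeats the
# last two characters of the previous one, preserving local context.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

Overlap trades some storage for continuity: a sentence cut at a chunk boundary still appears whole in at least one chunk, which tends to improve retrieval quality.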
Documentation

Table of Contents
- Installation
- Usage
- Project Overview
- Configuration
- Source Configuration
- Using Unstructured to process source files
- Embedding Configuration
- Target Configuration
- Contributing
- Examples
- Documentation
1. Installation
```bash
pip install --upgrade vector-etl
```

or

```bash
pip install git+https://github.com/ContextData/VectorETL.git
```
2. Usage
This section provides instructions on how to use the ETL framework for vector databases. We'll cover running the pipeline, validating configurations, and some common usage examples.
Option 1: Import VectorETL into your Python application (using a YAML configuration file)
Assume you have a configuration file similar to the one below:
```yaml
source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "customer_data"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM customers WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: ${OPENAI_API_KEY}
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: ${PINECONE_API_KEY}
  index_name: "customer-embeddings"
  dimension: 1536
  metric: "cosine"

embed_columns:
  - "customer_name"
  - "customer_description"
  - "purchase_history"
```
You can then load the configuration in your Python project and run it from there:
```python
from vector_etl import create_flow

flow = create_flow()
flow.load_yaml('/path/to/your/config.yaml')
flow.execute()
```
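The `${OPENAI_API_KEY}`-style placeholders in the YAML are typically resolved from environment variables at load time. A minimal sketch of that substitution pattern using Python's standard library (an illustration of the convention, not VectorETL's internal code):

```python
import os
import string

def expand_env(value: str) -> str:
    """Substitute ${VAR} placeholders from the environment.

    Raises KeyError if a referenced variable is unset, which
    surfaces missing credentials early instead of at request time.
    """
    return string.Template(value).substitute(os.environ)

os.environ["OPENAI_API_KEY"] = "sk-demo"  # demo value only
print(expand_env("api_key=${OPENAI_API_KEY}"))
```

Keeping secrets in environment variables (rather than in the YAML file itself) means the configuration can be committed to version control safely.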
Option 2: Running from the command line using a configuration file
Using the same YAML configuration file from Option 1 above, you can run the process directly from your command line without importing it into a Python application.
To run the ETL framework, use the following command:
```bash
vector-etl -c /path/to/your/config.yaml
```
Option 3: Import VectorETL into your Python application (using Python dictionaries)
```python
import os

from vector_etl import create_flow

source = {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "port": "5432",
    "database_name": "test",
    "username": "user",
    "password": "password",
    "query": "select * from test",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0,
}

embedding = {
    "embedding_model": "OpenAI",
    "api_key": os.environ["OPENAI_API_KEY"],
    "model_name": "text-embedding-ada-002",
}

target = {
    "target_database": "Pinecone",
    "pinecone_api_key": os.environ["PINECONE_API_KEY"],
    "index_name": "my-pinecone-index",
    "dimension": 1536,
}

embed_columns = ["customer_name", "customer_description", "purchase_history"]

flow = create_flow()
flow.set_source(source)
flow.set_embedding(embedding)
flow.set_target(target)
flow.set_embed_columns(embed_columns)

# Execute the flow
flow.execute()
```
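When building the configuration dictionaries in code, it can help to check for missing keys before executing the flow. The helper below is hypothetical: the required keys simply mirror the examples in this README, not an official VectorETL schema:

```python
# Hypothetical pre-flight check; key names are taken from the
# example configs in this README, not from an official schema.
REQUIRED_KEYS = {
    "source": {"source_data_type"},
    "embedding": {"embedding_model", "api_key"},
    "target": {"target_database"},
}

def missing_keys(section_name: str, section: dict) -> set[str]:
    """Return the required keys absent from a config section."""
    return REQUIRED_KEYS[section_name] - section.keys()

print(missing_keys("embedding", {"embedding_model": "OpenAI"}))
```

Failing fast on an incomplete config is cheaper than discovering the problem mid-pipeline after extraction has already started.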
Common Usage Examples
Here are some examples of how to use the ETL framework for different scenarios:
1. Processing Data from a PostgreSQL Database to Pinecone
```bash
vector-etl -c config/postgres_to_pinecone.yaml
```
Where postgres_to_pinecone.yaml might look like:
```yaml
source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "customer_data"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM customers WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: ${OPENAI_API_KEY}
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: ${PINECONE_API_KEY}
  index_name: "customer-embeddings"
  dimension: 1536
  metric: "cosine"

embed_columns:
  - "customer_name"
  - "customer_description"
  - "purchase_history"
```
2. Processing CSV Files from S3 to Qdrant
```bash
vector-etl -c config/s3_to_qdrant.yaml
```
Where s3_to_qdrant.yaml might look like:
```yaml
source:
  source_data_type: "Amazon S3"
  bucket_name: "my-data-bucket"
  prefix: "customer_data/"
  file_type: "csv"
  aws_access_key_id: ${AWS_ACCESS_KEY_ID}
  aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  chunk_size: 1000
  chunk_overlap: 200

embedding:
  embedding_model: "Cohere"
  api_key: ${COHERE_API_KEY}
  model_name: "embed-english-v2.0"

target:
  target_database: "Qdrant"
  qdrant_url: "https://your-qdrant-cluster-url.qdrant.io"
  qdrant_api_key: ${QDRANT_API_KEY}
  collection_name: "customer_embeddings"

embed_columns: []
```
3. Project Overview
The VectorETL (Extract, Transform, Load) framework is a powerful and flexible tool designed to streamline the process of extracting data from various sources, transforming it into vector embeddings, and loading these embeddings into a range of vector databases.
It's built with modularity, scalability, and ease of use in mind, making it an ideal solution for organizations looking to leverage the power of vector search in their data infrastructure.
Key Aspects:
- Versatile Data Extraction: The framework supports a wide array of data sources, including traditional databases, cloud storage solutions (like Amazon S3 and Google Cloud Storage), and popular SaaS platforms (such as Stripe and Zendesk). This versatility allows you to consolidate data from multiple sources into a unified vector database.
- Advanced Text Processing: For textual data, the framework implements sophisticated chunking and overlapping techniques. This ensures that the semantic context of the text is preserved when creating vector embeddings, leading to more accurate search results.
- State-of-the-Art Embedding Models: The system integrates with leading embedding models, including OpenAI, Cohere, Google Gemini, and Azure OpenAI. This allows you to choose the embedding model that best fits your specific use case and quality requirements.
- Multiple Vector Database Support: Whether you're using Pinecone, Qdrant, Weaviate, SingleStore, Supabase, or LanceDB, this framework has you covered. It's designed to seamlessly interface with these popular vector databases, allowing you to choose the one that best suits your needs.
- Configurable and Extensible: The entire framework is highly configurable through YAML or JSON configuration files. Moreover, its modular architecture makes it easy to extend with new data sources, embedding models, or vector databases as your needs evolve.
This ETL framework is ideal for organizations looking to implement or upgrade their vector search capabilities.
By automating the process of extracting data, creating vector embeddings, and storing them in a vector database, this framework significantly reduces the time and complexity involved in setting up a vector search system. It allows data scientists and engineers to focus on deriving insights and building applications, rather than worrying about the intricacies of data processing and vector storage.
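The `batch_size` setting seen in the example configs controls how many rows are processed per round trip. The grouping itself can be sketched with a plain generator; this is a generic illustration of the pattern, not VectorETL's internal code:

```python
from typing import Iterable, Iterator

def batched(rows: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of at most batch_size items from rows.

    Streaming batches keeps memory use bounded even when the
    source query returns far more rows than fit in RAM.
    """
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch

print(list(batched(range(5), 2)))
```

Larger batches mean fewer embedding-API and database round trips, at the cost of more memory per batch and coarser retry granularity on failure.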
4. Configuration
The ETL framework uses a configuration file to specify the details of the source, embedding model, target database, and other parameters. You can use either YAML or JSON format for the configuration file.
Configuration File Structure
The configuration file is divided into four main sections:
- source: Specifies the data source details
- embedding: Defines the embedding model to be used
- target: Outlines the target vector database
- embed_columns: Defines the columns that need to be embedded (mainly for structured data sources)
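Since JSON is also supported, the same four-section structure can be expressed as a JSON file. The values below are placeholders mirroring the YAML examples in this README:

```json
{
  "source": {
    "source_data_type": "database",
    "db_type": "postgres"
  },
  "embedding": {
    "embedding_model": "OpenAI",
    "api_key": "${OPENAI_API_KEY}"
  },
  "target": {
    "target_database": "Pinecone",
    "index_name": "customer-embeddings"
  },
  "embed_columns": ["customer_name", "customer_description"]
}
```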
Example Configurations
Importing VectorETL into your python application
from vector_etl import create_flow
source = {
"source_data_type": "database",
"db_type": "postgr
