# Text2VectorSQL

Official implementation of *Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries*.
## 💡 Overview
This project proposes Text2VectorSQL, a new task that aims to build a unified natural language interface for querying both structured and unstructured data.
<table class="center"> <tr> <td width=100% style="border: none"><img src="Figures/fig1.png" style="width:100%"></td> </tr> <tr> <td width="100%" style="border: none; text-align: center; word-wrap: break-word"> Illustration of the Text2VectorSQL task, with scenarios below showing how integrating SQL queries with vector search unlocks semantic filtering, multi-modal matching and retrieval acceleration. These capabilities are indispensable for universal natural language interfaces. </td> </tr> </table>

Traditional Text2SQL systems have made significant progress in accessing structured data such as tables, but they cannot handle semantic or multi-modal queries. Meanwhile, vector search has become the standard for querying unstructured data (such as text and images), but integrating it with SQL (known as VectorSQL) still relies on manual, error-prone query construction and lacks standardized evaluation methods.
Text2VectorSQL aims to bridge this fundamental gap by providing a comprehensive foundational ecosystem:
<table class="center"> <tr> <td width=100% style="border: none"><img src="Figures/fig2.png" style="width:100%"></td> </tr> <tr> <td width="100%" style="border: none; text-align: center; word-wrap: break-word"> The Text2VectorSQL Ecosystem. The core component is the VectorSQLGen pipeline, a large-scale, automated data synthesis engine that produces high-quality training samples. The synthesized data is used to train our family of UniVectorSQL models. Concurrently, a curated subset of the data undergoes a more rigorous human-review process to create VectorSQLBench, our gold-standard evaluation benchmark with a suite of novel, fine-grained metrics. </td> </tr> </table>

## 🚀 Project Architecture & Core Modules
This repository contains a complete ecosystem, divided into four core modules:
- **Data Synthesis** (`Data_Synthesizer`): A scalable pipeline for automatically synthesizing "quadruplet" training data (database, natural language question, VectorSQL query, chain-of-thought) starting from public base tables.
- **Execution Engine** (`Execution_Engine`): A backend engine responsible for parsing and executing VectorSQL queries. It handles the special `lembed(model, text)` function by calling the Embedding Service, and translates the query into a native query compatible with the target database (SQLite, PostgreSQL, ClickHouse).
- **Embedding Service** (`Embedding_Service`): A high-performance API service based on FastAPI that provides on-demand text and image vectorization for the Execution Engine. It supports multiple models and multiple GPUs, and caches models automatically.
- **Evaluation Framework** (`Evaluation_Framework`): A framework for comprehensively evaluating Text2VectorSQL models. It provides an accurate assessment by executing both the model-generated SQL and the gold SQL, then comparing their execution results (rather than the SQL strings).
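To make the Execution Engine's role concrete, here is a minimal sketch of the translation idea: replace each `lembed(model, text)` call with an inline vector literal obtained from an embedding function. The regex, the `fetch_embedding` callback, and the array-literal output format are simplifications for illustration, not the engine's actual implementation.

```python
import re

def translate_vectorsql(sql: str, fetch_embedding) -> str:
    """Replace each lembed(model, text) call with an inline vector literal.

    `fetch_embedding(model, text)` is a stand-in for the HTTP call to the
    Embedding Service; here it only needs to return a list of floats.
    """
    pattern = re.compile(r"lembed\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)")

    def substitute(match):
        model, text = match.group(1), match.group(2)
        vector = fetch_embedding(model, text)
        # ClickHouse accepts plain array literals such as [0.1, 0.2, 0.3].
        return "[" + ", ".join(str(x) for x in vector) + "]"

    return pattern.sub(substitute, sql)

# Toy embedder standing in for the real Embedding Service.
def fake_embed(model, text):
    return [0.1, 0.2, 0.3]

query = ("SELECT Name FROM musical ORDER BY "
         "L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;")
translated = translate_vectorsql(query, fake_embed)
print(translated)
```

After translation the query contains only native syntax, so any ClickHouse client can execute it unchanged.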
## 🔧 Installation
1. Clone this repository:

   ```bash
   git clone https://github.com/OpenDCAI/Text2VectorSQL.git --depth 1
   cd Text2VectorSQL
   ```

2. Install the dependencies for the module you need. Each module (`Data_Synthesizer`, `Execution_Engine`, `Embedding_Service`, `Evaluation_Framework`) has its own `requirements.txt` file in its directory. For example, to install the dependencies for the Execution Engine:

   ```bash
   cd Execution_Engine
   pip install -r requirements.txt
   ```
## ⚡ Quick Start
Use the project's toolchain in whichever of the following scenarios matches your goal:
### Scenario 1: Run the Embedding Service (Prerequisite for All Execution)

The `Execution_Engine` depends on this service to fetch vectors.
1. **Configure the service**: Go to the `Embedding_Service/` directory, create a `config.yaml` file, and specify the models you want to use.

   ```yaml
   # Example config.yaml
   server:
     host: "0.0.0.0"
     port: 8000
   models:
     - name: "all-MiniLM-L6-v2"
       hf_model_path: "sentence-transformers/all-MiniLM-L6-v2"
       local_model_path: "./models/all-MiniLM-L6-v2"
       trust_remote_code: true
     # ... other models
   ```

2. **Start the service**:

   ```bash
   cd Embedding_Service/
   bash run.sh
   ```

   The service will run on `http://0.0.0.0:8000`. Models are downloaded automatically on the first run.
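The request schema of the `/embed` endpoint is not documented in this README, so the payload below is only a plausible guess at its shape (field names `model` and `texts` are assumptions), shown here to illustrate how a client might talk to the service:

```python
import json

# Assumed request shape for the /embed endpoint -- the actual field names
# depend on the Embedding_Service implementation and may differ.
payload = {
    "model": "all-MiniLM-L6-v2",
    "texts": ["opera", "symphony"],
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the service from this scenario to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/embed",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# vectors = json.loads(urllib.request.urlopen(req).read())
```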
### Scenario 2: Execute a VectorSQL Query

Ensure the Embedding Service from Scenario 1 is running. The `Execution_Engine` can then be used as a command-line tool:
1. **Configure the engine**: Go to the `Execution_Engine/` directory, create `engine_config.yaml`, and specify the Embedding Service address and the database connection information.

   ```yaml
   embedding_service:
     url: "http://127.0.0.1:8000/embed"  # Corresponds to the service in Scenario 1
   database_connections:
     clickhouse:
       host: "localhost"
       port: 8123
       # ...
   timeouts:
     sql_execution: 60
   ```

2. **Run the query**:

   ```bash
   cd Execution_Engine/
   python execution_engine.py \
     --sql "SELECT Name FROM musical m ORDER BY L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;" \
     --db-type "clickhouse" \
     --db-identifier "musical" \
     --config "engine_config.yaml"
   ```
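The `ORDER BY L2Distance(...)` clause ranks rows by the Euclidean distance between each stored embedding and the query embedding, smallest first. A minimal Python illustration of that ranking, using made-up 3-dimensional vectors in place of real model outputs:

```python
import math

# Toy 3-dimensional embeddings standing in for real column embeddings.
rows = {
    "Carmen": [0.9, 0.1, 0.0],
    "Cats":   [0.1, 0.8, 0.2],
    "Tosca":  [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of 'opera'

# Equivalent of: ORDER BY L2Distance(Category_embedding, <query_vec>)
ranked = sorted(rows, key=lambda name: math.dist(rows[name], query_vec))
print(ranked)  # ['Carmen', 'Tosca', 'Cats']
```

The operatic titles end up closest to the query vector, which is exactly the "semantic filtering" behavior the task targets.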
### Scenario 3: Synthesize a New Text2VectorSQL Dataset

Use the `Data_Synthesizer` module.
1. **Configure the pipeline**: Go to the `Data_Synthesizer/` directory and copy `pipeline/config.yaml.example` to `config.yaml`. Fill in your LLM API key, base URL, etc., in `config.yaml`.

2. **Select a dataset**: Edit `pipeline/general_pipeline.py` and modify the `DATASET_BACKEND` and `DATASET_TO_LOAD` variables at the top.

3. **Run the pipeline**:

   ```bash
   cd Data_Synthesizer/
   python pipeline/general_pipeline.py
   ```

   The final synthesized dataset will be saved to the `result_path` configured in `config.yaml`.
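Each synthesized sample is a "quadruplet" of database, question, VectorSQL query, and chain-of-thought. The sketch below shows one plausible sample; the field names and values are illustrative, not the pipeline's actual output schema:

```python
# Illustrative quadruplet sample -- field names are assumptions,
# the real schema is defined by the Data_Synthesizer output format.
sample = {
    "database": "musical",
    "question": "Which five musicals are most similar to opera?",
    "vectorsql": (
        "SELECT Name FROM musical ORDER BY "
        "L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;"
    ),
    "cot": "The question asks for semantic similarity to 'opera', so a vector "
           "search over Category_embedding with an ORDER BY ... LIMIT 5 is needed.",
}
print(sorted(sample))
```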
### Scenario 4: Evaluate a Text2VectorSQL Model

Use the `Evaluation_Framework` module.
1. **Prepare data**: Ensure you have an evaluation file (`eval_data_file`) containing the model's predicted SQL. If not, run `generate.py` to create it.

2. **Configure evaluation**: Go to the `Evaluation_Framework/` directory and create `evaluation_config.yaml`. Configure the database type (`db_type`), the database file root directory (`base_dir`), the input file (`eval_data_file`), and the evaluation metrics (`metrics`).

3. **Run the evaluation pipeline**:

   ```bash
   cd Evaluation_Framework/
   # Run the full pipeline (SQL execution + result evaluation)
   python run_eval_pipeline.py --all --config evaluation_config.yaml

   # Or run in steps
   # python run_eval_pipeline.py --execute --config evaluation_config.yaml
   # python run_eval_pipeline.py --evaluate --config evaluation_config.yaml
   ```

   The evaluation report (a JSON file) will be saved to the configured output path. You can use `aggregate_results.py` to aggregate multiple reports into a CSV.
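The key idea of the evaluation is to compare execution results rather than SQL strings. A minimal sketch of one reasonable comparison, treating each result set as an order-insensitive multiset of rows (a simplification; the framework's actual metrics are more fine-grained):

```python
from collections import Counter

def results_match(predicted_rows, gold_rows) -> bool:
    """Compare two query result sets as multisets of rows, ignoring row order."""
    to_multiset = lambda rows: Counter(tuple(row) for row in rows)
    return to_multiset(predicted_rows) == to_multiset(gold_rows)

# Same rows in a different order still count as a match.
print(results_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
print(results_match([(1, "a")], [(1, "b")]))                      # False
```

This is why two syntactically different queries can both be judged correct: only their executed outputs are compared.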
## 🧩 Module Details
### 1. Data Synthesis (`Data_Synthesizer`)
This module is the foundation for training powerful Text2VectorSQL models (like UniVectorSQL). It generates high-quality Text2VectorSQL datasets through an automated pipeline.
Core Pipeline:
- **Database Synthesis & Enhancement** (`database_synthesis`): Generates structured databases based on Web tables.
- **Database Vectorization** (`vectorization`): Identifies "semantically-rich" columns (e.g., descriptions), uses Sentence Transformers to generate vector embeddings, and builds new databases that support vector queries.
- **VectorSQL & Question Synthesis** (`synthesis_sql`, `synthesis_nl`): Automatically generates VectorSQL queries of varying complexity and reverse-translates them to produce corresponding natural language questions.
- **Chain-of-Thought Synthesis** (`synthesis_cot`): Generates detailed reasoning steps (chain-of-thought) for each data sample, explaining the derivation from the question to the VectorSQL.
You can run the complete end-to-end synthesis process with a single command using the `pipeline/general_pipeline.py` script.
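One way to picture the "semantically-rich column" selection step in the vectorization stage: a toy heuristic that flags free-text columns by average string length. The real pipeline's criteria are certainly more involved; this is only an illustration of the concept.

```python
def find_semantic_columns(rows: list[dict], min_avg_len: int = 20) -> list[str]:
    """Flag columns whose string values are long enough to look like free text.

    Toy heuristic for illustration only -- not the Data_Synthesizer's actual logic.
    """
    semantic = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, str) for v in values):
            avg_len = sum(len(v) for v in values) / len(values)
            if avg_len >= min_avg_len:
                semantic.append(col)
    return semantic

rows = [
    {"id": "1", "name": "Cats", "description": "A musical composed by Andrew Lloyd Webber."},
    {"id": "2", "name": "Carmen", "description": "An opera by Georges Bizet set in Seville."},
]
print(find_semantic_columns(rows))  # ['description']
```

Short identifier-like columns (`id`, `name`) are skipped, while descriptive columns become candidates for embedding.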
### 2. Execution Engine (`Execution_Engine`)
This is the runtime core of Text2VectorSQL. It acts as a bridge, parsing VectorSQL queries that contain semantic search intent and translating them into native queries that the database can understand.
Core Features:
- **Parses the `lembed` function**: The engine is specifically designed to process queries with the `lembed(model, text)` syntax.
- **Dynamic vectorization**:
  - The engine parses the query and extracts all unique `(model, text)` combinations.
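That extraction step can be sketched with a regular expression. The pattern below assumes single-quoted literal arguments and is only an approximation of the engine's parsing; deduplicating into a set means each `(model, text)` pair is embedded once, even if it appears in several places in the query.

```python
import re

def extract_lembed_calls(sql: str) -> set[tuple[str, str]]:
    """Collect every unique (model, text) pair passed to lembed()."""
    pattern = r"lembed\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)"
    return set(re.findall(pattern, sql))

sql = ("SELECT * FROM docs "
       "ORDER BY L2Distance(emb, lembed('all-MiniLM-L6-v2', 'opera')) "
       "LIMIT 5")
print(extract_lembed_calls(sql))  # {('all-MiniLM-L6-v2', 'opera')}
```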
