# Text2VectorSQL

Official implementation of *Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries*.
## 💡 Overview
This project proposes Text2VectorSQL, a new task that aims to build a unified natural language interface for querying both structured and unstructured data.
<table class="center"> <tr> <td width=100% style="border: none"><img src="Figures/fig1.png" style="width:100%"></td> </tr> <tr> <td width="100%" style="border: none; text-align: center; word-wrap: break-word"> Illustration of the Text2VectorSQL task, with scenarios below showing how integrating SQL queries with vector search unlocks semantic filtering, multi-modal matching and retrieval acceleration. These capabilities are indispensable for universal natural language interfaces. </td> </tr> </table>

Traditional Text2SQL systems have made significant progress in accessing structured data such as tables, but they cannot handle semantic or multi-modal queries. Meanwhile, vector search has become the standard for querying unstructured data (such as text and images), but integrating it with SQL (known as VectorSQL) still relies on manual, error-prone query construction and lacks standardized evaluation methods.
Text2VectorSQL aims to bridge this fundamental gap by providing a comprehensive foundational ecosystem:
<table class="center"> <tr> <td width=100% style="border: none"><img src="Figures/fig2.png" style="width:100%"></td> </tr> <tr> <td width="100%" style="border: none; text-align: center; word-wrap: break-word"> The Text2VectorSQL Ecosystem. The core component is the VectorSQLGen pipeline, a large-scale, automated data synthesis engine that produces high-quality training samples. The synthesized data is used to train our family of UniVectorSQL models. Concurrently, a curated subset of the data undergoes a more rigorous human-review process to create VectorSQLBench, our gold-standard evaluation benchmark with a suite of novel, fine-grained metrics. </td> </tr> </table>

## 🚀 Project Architecture & Core Modules
This repository contains a complete ecosystem, divided into four core modules:
- **Data Synthesis** (`Data_Synthesizer`): A scalable pipeline for automatically synthesizing "quadruplet" training data (database, natural language question, VectorSQL query, chain-of-thought) starting from public base tables.
- **Execution Engine** (`Execution_Engine`): A backend engine responsible for parsing and executing VectorSQL queries. It handles the special `lembed(model, text)` function by calling the Embedding Service, and translates the query into a native query compatible with the target database (SQLite, PostgreSQL, ClickHouse).
- **Embedding Service** (`Embedding_Service`): A high-performance API service based on FastAPI that provides on-demand text and image vectorization for the Execution Engine. It supports multiple models and multiple GPUs, and caches models automatically.
- **Evaluation Framework** (`Evaluation_Framework`): A framework for comprehensively evaluating Text2VectorSQL models. It provides an accurate assessment by executing both the model-generated SQL and the gold SQL, then comparing their execution results (rather than the SQL strings).
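To make the Execution Engine's role concrete, here is a minimal sketch of the translation idea: replace each `lembed(model, text)` call with an inline vector literal obtained from an embedding function. The regex, the `fetch_embedding` callback, and the array-literal output format are simplifications for illustration, not the engine's actual implementation.

```python
import re

def translate_vectorsql(sql: str, fetch_embedding) -> str:
    """Replace each lembed(model, text) call with an inline vector literal.

    `fetch_embedding(model, text)` is a stand-in for the HTTP call to the
    Embedding Service; here it only needs to return a list of floats.
    """
    pattern = re.compile(r"lembed\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)")

    def substitute(match):
        model, text = match.group(1), match.group(2)
        vector = fetch_embedding(model, text)
        # ClickHouse accepts plain array literals such as [0.1, 0.2, 0.3].
        return "[" + ", ".join(str(x) for x in vector) + "]"

    return pattern.sub(substitute, sql)

# Toy embedder standing in for the real Embedding Service.
def fake_embed(model, text):
    return [0.1, 0.2, 0.3]

query = ("SELECT Name FROM musical ORDER BY "
         "L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;")
translated = translate_vectorsql(query, fake_embed)
print(translated)
```

After translation the query contains only native syntax, so any ClickHouse client can execute it unchanged.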
## 🔧 Installation
1. Clone this repository:

   ```bash
   git clone https://github.com/OpenDCAI/Text2VectorSQL.git --depth 1
   cd Text2VectorSQL
   ```

2. Install the dependencies for the module you need. Each module (`Data_Synthesizer`, `Execution_Engine`, `Embedding_Service`, `Evaluation_Framework`) has its own `requirements.txt` file in its directory. For example, to install the dependencies for the Execution Engine:

   ```bash
   cd Execution_Engine
   pip install -r requirements.txt
   ```
## ⚡ Quick Start
Use the project's toolchain in whichever of the following scenarios matches your goal:
### Scenario 1: Run the Embedding Service (Prerequisite for All Execution)

The `Execution_Engine` depends on this service to fetch vectors.
1. **Configure the service**: Go to the `Embedding_Service/` directory, create a `config.yaml` file, and specify the models you want to use.

   ```yaml
   # Example config.yaml
   server:
     host: "0.0.0.0"
     port: 8000
   models:
     - name: "all-MiniLM-L6-v2"
       hf_model_path: "sentence-transformers/all-MiniLM-L6-v2"
       local_model_path: "./models/all-MiniLM-L6-v2"
       trust_remote_code: true
     # ... other models
   ```

2. **Start the service**:

   ```bash
   cd Embedding_Service/
   bash run.sh
   ```

   The service will run on `http://0.0.0.0:8000`. Models are downloaded automatically on the first run.
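The request schema of the `/embed` endpoint is not documented in this README, so the payload below is only a plausible guess at its shape (field names `model` and `texts` are assumptions), shown here to illustrate how a client might talk to the service:

```python
import json

# Assumed request shape for the /embed endpoint -- the actual field names
# depend on the Embedding_Service implementation and may differ.
payload = {
    "model": "all-MiniLM-L6-v2",
    "texts": ["opera", "symphony"],
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the service from this scenario to be running):
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:8000/embed",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# vectors = json.loads(urllib.request.urlopen(req).read())
```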
### Scenario 2: Execute a VectorSQL Query

Ensure the Embedding Service from Scenario 1 is running. The `Execution_Engine` can then be used as a command-line tool:
1. **Configure the engine**: Go to the `Execution_Engine/` directory, create `engine_config.yaml`, and specify the Embedding Service address and the database connection information.

   ```yaml
   embedding_service:
     url: "http://127.0.0.1:8000/embed"  # Corresponds to the service in Scenario 1
   database_connections:
     clickhouse:
       host: "localhost"
       port: 8123
       # ...
   timeouts:
     sql_execution: 60
   ```

2. **Run the query**:

   ```bash
   cd Execution_Engine/
   python execution_engine.py \
     --sql "SELECT Name FROM musical m ORDER BY L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;" \
     --db-type "clickhouse" \
     --db-identifier "musical" \
     --config "engine_config.yaml"
   ```
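The `ORDER BY L2Distance(...)` clause ranks rows by the Euclidean distance between each stored embedding and the query embedding, smallest first. A minimal Python illustration of that ranking, using made-up 3-dimensional vectors in place of real model outputs:

```python
import math

# Toy 3-dimensional embeddings standing in for real column embeddings.
rows = {
    "Carmen": [0.9, 0.1, 0.0],
    "Cats":   [0.1, 0.8, 0.2],
    "Tosca":  [0.8, 0.2, 0.1],
}
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of 'opera'

# Equivalent of: ORDER BY L2Distance(Category_embedding, <query_vec>)
ranked = sorted(rows, key=lambda name: math.dist(rows[name], query_vec))
print(ranked)  # ['Carmen', 'Tosca', 'Cats']
```

The operatic titles end up closest to the query vector, which is exactly the "semantic filtering" behavior the task targets.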
### Scenario 3: Synthesize a New Text2VectorSQL Dataset

Use the `Data_Synthesizer` module.
1. **Configure the pipeline**: Go to the `Data_Synthesizer/` directory and copy `pipeline/config.yaml.example` to `config.yaml`. Fill in your LLM API key, base URL, etc., in `config.yaml`.

2. **Select a dataset**: Edit `pipeline/general_pipeline.py` and modify the `DATASET_BACKEND` and `DATASET_TO_LOAD` variables at the top.

3. **Run the pipeline**:

   ```bash
   cd Data_Synthesizer/
   python pipeline/general_pipeline.py
   ```

   The final synthesized dataset will be saved to the `result_path` configured in `config.yaml`.
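Each synthesized sample is a "quadruplet" of database, question, VectorSQL query, and chain-of-thought. The sketch below shows one plausible sample; the field names and values are illustrative, not the pipeline's actual output schema:

```python
# Illustrative quadruplet sample -- field names are assumptions,
# the real schema is defined by the Data_Synthesizer output format.
sample = {
    "database": "musical",
    "question": "Which five musicals are most similar to opera?",
    "vectorsql": (
        "SELECT Name FROM musical ORDER BY "
        "L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;"
    ),
    "cot": "The question asks for semantic similarity to 'opera', so a vector "
           "search over Category_embedding with an ORDER BY ... LIMIT 5 is needed.",
}
print(sorted(sample))
```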
### Scenario 4: Evaluate a Text2VectorSQL Model

Use the `Evaluation_Framework` module.
1. **Prepare data**: Ensure you have an evaluation file (`eval_data_file`) containing the model's predicted SQL. If not, run `generate.py` to create it.

2. **Configure evaluation**: Go to the `Evaluation_Framework/` directory and create `evaluation_config.yaml`. Configure the database type (`db_type`), the database file root directory (`base_dir`), the input file (`eval_data_file`), and the evaluation metrics (`metrics`).

3. **Run the evaluation pipeline**:

   ```bash
   cd Evaluation_Framework/
   # Run the full pipeline (SQL execution + result evaluation)
   python run_eval_pipeline.py --all --config evaluation_config.yaml

   # Or run in steps
   # python run_eval_pipeline.py --execute --config evaluation_config.yaml
   # python run_eval_pipeline.py --evaluate --config evaluation_config.yaml
   ```

   The evaluation report (a JSON file) will be saved to the configured output path. You can use `aggregate_results.py` to aggregate multiple reports into a CSV.
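The key idea of the evaluation is to compare execution results rather than SQL strings. A minimal sketch of one reasonable comparison, treating each result set as an order-insensitive multiset of rows (a simplification; the framework's actual metrics are more fine-grained):

```python
from collections import Counter

def results_match(predicted_rows, gold_rows) -> bool:
    """Compare two query result sets as multisets of rows, ignoring row order."""
    to_multiset = lambda rows: Counter(tuple(row) for row in rows)
    return to_multiset(predicted_rows) == to_multiset(gold_rows)

# Same rows in a different order still count as a match.
print(results_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
print(results_match([(1, "a")], [(1, "b")]))                      # False
```

This is why two syntactically different queries can both be judged correct: only their executed outputs are compared.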
## 🧩 Module Details
### 1. Data Synthesis (`Data_Synthesizer`)
This module is the foundation for training powerful Text2VectorSQL models (like UniVectorSQL). It generates high-quality Text2VectorSQL datasets through an automated pipeline.
Core Pipeline:
- **Database Synthesis & Enhancement** (`database_synthesis`): Generates structured databases based on Web tables.
- **Database Vectorization** (`vectorization`): Identifies "semantically-rich" columns (e.g., descriptions), uses Sentence Transformers to generate vector embeddings, and builds new databases that support vector queries.
- **VectorSQL & Question Synthesis** (`synthesis_sql`, `synthesis_nl`): Automatically generates VectorSQL queries of varying complexity and reverse-translates them to produce corresponding natural language questions.
- **Chain-of-Thought Synthesis** (`synthesis_cot`): Generates detailed reasoning steps (chain-of-thought) for each data sample, explaining the derivation from the question to the VectorSQL.
You can run the complete end-to-end synthesis process with a single command using the `pipeline/general_pipeline.py` script.
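One way to picture the "semantically-rich column" selection step in the vectorization stage: a toy heuristic that flags free-text columns by average string length. The real pipeline's criteria are certainly more involved; this is only an illustration of the concept.

```python
def find_semantic_columns(rows: list[dict], min_avg_len: int = 20) -> list[str]:
    """Flag columns whose string values are long enough to look like free text.

    Toy heuristic for illustration only -- not the Data_Synthesizer's actual logic.
    """
    semantic = []
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, str) for v in values):
            avg_len = sum(len(v) for v in values) / len(values)
            if avg_len >= min_avg_len:
                semantic.append(col)
    return semantic

rows = [
    {"id": "1", "name": "Cats", "description": "A musical composed by Andrew Lloyd Webber."},
    {"id": "2", "name": "Carmen", "description": "An opera by Georges Bizet set in Seville."},
]
print(find_semantic_columns(rows))  # ['description']
```

Short identifier-like columns (`id`, `name`) are skipped, while descriptive columns become candidates for embedding.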
### 2. Execution Engine (`Execution_Engine`)
This is the runtime core of Text2VectorSQL. It acts as a bridge, parsing VectorSQL queries that contain semantic search intent and translating them into native queries that the database can understand.
Core Features:
- **Parses the `lembed` function**: The engine is specifically designed to process queries with the `lembed(model, text)` syntax.
- **Dynamic vectorization**:
  - The engine parses the query and extracts all unique `(model, text)` combinations.
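That extraction step can be sketched with a regular expression. The pattern below assumes single-quoted literal arguments and is only an approximation of the engine's parsing; deduplicating into a set means each `(model, text)` pair is embedded once, even if it appears in several places in the query.

```python
import re

def extract_lembed_calls(sql: str) -> set[tuple[str, str]]:
    """Collect every unique (model, text) pair passed to lembed()."""
    pattern = r"lembed\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)"
    return set(re.findall(pattern, sql))

sql = ("SELECT * FROM docs "
       "ORDER BY L2Distance(emb, lembed('all-MiniLM-L6-v2', 'opera')) "
       "LIMIT 5")
print(extract_lembed_calls(sql))  # {('all-MiniLM-L6-v2', 'opera')}
```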
