SkillAgentSearch skills...

Lang2sql

A tutorial for setting an SQL code generator with the OpenAI API

Install / Use

/learn @RamiKrispin/Lang2sql
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

lang2sql

This repo provides a step-by-step guide and a template for setting up a natural language to SQL code generator with the OpneAI API.

The tutorial is also available on Medium.

Last update: Jan 7st, 2024

Table of Contents:

Motivation

The rapid development of natural language models, especially Large Language Models (LLMs), has presented numerous possibilities for various fields. One of the most common applications is using LLMs for coding. For instance, OpenAI's chatGPT and Meta's Code LLAMA are LLMs that offer state-of-the-art natural language to code generators. One potential use case is a natural language to SQL code generator, which could assist non-technical professionals with simple data requests and hopefully enable the data teams to focus on more data-intensive tasks. This tutorial focuses on setting up a language for SQL code generator using the OpenAI API.

What can you do with language to SQL code generator application?

One possible application is a chatbot that can respond to user queries with relevant data (Figure 1). The chatbot can be integrated with a Slack channel using a Python application that performs the following steps:

  • Receives the user's question
  • Converts the question into a prompt
  • Sends a GET request to the OpenAI API with the prompt
  • Parses the returned JSON into a SQL query
  • Sends the query to the database
  • Returns the user a CSV file containing the relevant data
<figure> <img src="images/general_arch1.png" width="100%" align="center"/></a> <figcaption> Figure 1 - Language to SQL code generator use case</figcaption> </figure> <br> <br />

In this tutorial, we will build a step-by-step Python application that converts user questions into SQL queries.

Scope and General Architecture

This tutorial provides a step-by-step guide on how to set up a Python application that converts general questions into SQL queries using the OpenAI API. That includes the following functionality:

  • Generalized - the application is not limited to a specific table and can be used on any table
  • For simplicity, the application is limited to a single table (e.g., no joins)
  • Dockerized - develop the application inside a dockerized environment for a simple deployment

Figure 2 below describes the general architecture of a simple language to SQL code generator.

<figure> <img src="images/general_arch2.png" width="100%" align="center"/></a> <figcaption> Figure 2 - Language to SQL code generator general architecture</figcaption> </figure> <br> <br />

The scope and focus of this tutorial is on the green box - building the following functionality:

  • Question to Prompt - transform the question into a prompt format:

    • Pull the table information to create the prompt context
    • Add the question to the prompt
  • API Handler - a function that works with the OpenAI API:

    • Send a GET request with the prompt
    • Parse the answer into an SQL query
  • DB Handler - a function that sends the SQL query to the database and returns the required data

Prerequisites

The main prerequisite for this tutorial is basic knowledge of Python. That includes the following functionality:

  • Setting Python functions and objects
  • Working with tabular data (i.e., pandas, CSV, etc.) and non-structure data format(i.e., JSON, etc.)
  • Working with Python libraries

In addition, basic knowledge of SQL and access to the OpenAI API are required.

Nice to Have

While not necessary, having a basic knowledge of Docker is helpful, as the tutorial was created in a Dockerized environment using VScode's Dev Containers extension. If you don't have experience with Docker or the extension, you can still run the tutorial by creating a virtual environment and installing the required libraries (as described below). Knowledge of Prompt engineering and the OpenAI API is also beneficial.

I created a detailed tutorial about setting a Python dockerized environment with VScode and the Dev Containers extension:

https://github.com/RamiKrispin/vscode-python

Python Libraries

To set up a natural language to SQL code generation, we will use the following Python libraries:

  • pandas - to process data throughout the process
  • duckdb - to simulate the work with the database
  • openai - to work with the OpenAI API
  • time and os - to load CSV files and format fields

This repository contains the necessary settings to launch a Dockerized environment with the tutorial requirements in VScode and the Dev Containers extension. More details are available in the next section.

Alternatively, you can set up a virtual environment and install the tutorial requirements by following the instructions below using the instructions in the Using Virtual Environment section.

Using Docker with VScode

This tutorial was built inside a dockerized environment with VScode and the Dev Containers extension. To run it with VScode, you will need to install the Dev Containers extension and have Docker Desktop (or equivalent) open. The settings of the environment are available under the .devcontainer folder:

.── .devcontainer
    ├── Dockerfile
    ├── Dockerfile.dev
    ├── devcontainer.json
    ├── install_dependencies_core.sh
    ├── install_dependencies_other.sh
    ├── install_quarto.sh
    ├── requirements_core.txt
    ├── requirements_openai.txt
    └── requirements_transformers.txt

The devcontainer.json has the build instructions and the VScode settings for this dockerized environment:

{
    "name": "lang2sql",
    "build": {
        "dockerfile": "Dockerfile",
        "args": {
            "ENV_NAME": "lang2sql",
            "PYTHON_VER": "3.10",
            "METHOD": "openai",
            "QUARTO_VER": "1.3.450"
        },
        "context": "."
    },
    "customizations": {
        "settings": {
            "python.defaultInterpreterPath": "/opt/conda/envs/lang2sql/bin/python",
            "python.selectInterpreter": "/opt/conda/envs/lang2sql/bin/python"
        },
        "vscode": {
            "extensions": [
                "quarto.quarto",
                "ms-azuretools.vscode-docker",
                "ms-python.python",
                "ms-vscode-remote.remote-containers",
                "yzhang.markdown-all-in-one",
                "redhat.vscode-yaml",
                "ms-toolsai.jupyter"
            ]
        }
    },
    "remoteEnv": {
        "OPENAI_KEY": "${localEnv:OPENAI_KEY}"
    }
}

Where the build argument defines the docker build method and sets the arguments for the build. In this case, we set the Python version to 3.10, and the conda virtual environment to ang2sql. The METHOD argument defines the type of environment - either openai to install the requirements libraries for this tutorial using the OpenAI API or transformers to set the environment for HuggingFaces API (which is out of scope for this tutorial).

The remoteEnv argument enables setting environment variables. We will use it to set the OpenAI API key. In this case, I set the variable locally as OPENAI_KEY, and I am loading it using the localEnv argument.

If you want to learn more about setting up a Python development environment with VScode and Docker, check this tutorial.

Using Virtual Environment

If you are not using the tutorial dockerized environment, you can create a local virtual environment from the command line using the script below:

ENV_NAME=openai_api
PYTHON_VER=3.10
conda create -y --name $ENV_NAME python=$PYTHON_VER 

conda activate $ENV_NAME

pip3 install -r ./.devcontainer/requirements_core.txt
pip3 install -r ./.devcontainer/requirements_openai.txt

Note: I used conda and it should work as well with any other virutal environment method.

We use the ENV_NAME and PYTHON_VER variables to set the virtual environment and the Python version, respectively.

To confirm that your environment is properly set, use the conda list to confirm that the required Python libraries are installed. You should expect the below output:

(openai_api) root@0ca5b8000cd5:/workspaces/lang2sql# conda list
# packages in environment at /opt/conda/envs/openai_api:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                      51_gnu  
aiohttp                   3.9.0                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
asttokens                 2.4.1                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
attrs                     23.1.0                   pypi_0    pypi
bzip2                     1.0.8                hfd63f10_2  
ca-certificates           2023.08.22           hd43f75c_0  
certifi                   2023.11.17               pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
comm                      0.2.0                    pypi_0    pypi
contourpy                 1.2.0                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
d
View on GitHub
GitHub Stars246
CategoryData
Updated4mo ago
Forks33

Languages

Jupyter Notebook

Security Score

82/100

Audited on Dec 6, 2025

No findings