Balsa

Balsa is a learned SQL query optimizer. It tailor optimizes your SQL queries to find the best execution plans for your hardware and engine.

Generate Convert Improve

Install / Use

/learn @balsa-project/Balsa

About this skill

Quality Score

0/100

README

Balsa

Balsa is a learned query optimizer. It learns to optimize SQL queries by trial-and-error using deep reinforcement learning and sim-to-real learning.

Notably, Balsa is the first end-to-end learned optimizer that does not rely on learning from an existing expert optimizer's plans, while being able to surpass the performance of expert plans, sometimes by a sizable margin.

For technical details, see the SIGMOD 2022 paper, Balsa: Learning a Query Optimizer Without Expert Demonstrations [bibtex].

Setup

To quickly get started, run the following on one machine which will run both the agent neural network and query execution.

Clone and install Balsa.

<details> <summary>Details</summary>

git clone https://github.com/balsa-project/balsa.git ~/balsa
cd ~/balsa
# Recommended: run inside a Conda environment.
# All commands that follow are run under this conda env.
conda create -n balsa python=3.7 -y
conda activate balsa

pip install -r requirements.txt
pip install -e .
pip install -e pg_executor

</details>

Install Postgres v12.5.

<details> <summary>Details</summary>

This can be done in several ways. For example, installing from source:

cd ~/
wget https://ftp.postgresql.org/pub/source/v12.5/postgresql-12.5.tar.gz
tar xzvf postgresql-12.5.tar.gz
cd postgresql-12.5
./configure --prefix=/data/postgresql-12.5 --without-readline
sudo make -j
sudo make install

echo 'export PATH=/data/postgresql-12.5/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

</details>

Install the pg_hint_plan extension v1.3.7.

<details> <summary>Details</summary>

cd ~/
git clone https://github.com/ossc-db/pg_hint_plan.git -b REL12_1_3_7
cd pg_hint_plan
# Modify Makefile: change line
#   PG_CONFIG = pg_config
# to
#   PG_CONFIG = /data/postgresql-12.5/bin/pg_config
vim Makefile
make
sudo make install

</details>

Load data into Postgres & start it with the correct configuration.

<details> <summary>Details</summary> For example, load the Join Order Benchmark (JOB) tables:

cd ~/
mkdir -p datasets/job && pushd datasets/job
wget -c http://homepages.cwi.nl/~boncz/job/imdb.tgz && tar -xvzf imdb.tgz && popd
# Prepend headers to CSV files
python3 ~/balsa/scripts/prepend_imdb_headers.py

# Create and start the DB
pg_ctl -D ~/imdb initdb

# Copy custom PostgreSQL configuration.
cp ~/balsa/conf/balsa-postgresql.conf ~/imdb/postgresql.conf

# Start the server
pg_ctl -D ~/imdb start -l logfile

# Load data + run analyze (can take several minutes)
cd ~/balsa
bash load-postgres/load_job_postgres.sh ~/datasets/job

Perform basic checks:

psql imdbload
# Check that both primary and foreign key indexes are built:
imdbload=# \d title
[...]

# Check that data count is correct:
imdbload=# select count(*) from title;
  count
---------
2528312
(1 row)

</details>

NOTE: Using one machine is only for quickly trying out Balsa. To cleanly reproduce results, use Cluster mode which automates the above setup on a cloud and separates the training machine from the query execution machines.

Quickstart

First, run the baseline PostgreSQL plans (the expert) on the Join Order Benchmark:

python run.py --run Baseline --local

This will prompt you to log into Weights & Biases to track experiments and visualize metrics easily, which we highly recommend. (You can disable it by prepending the env var WANDB_MODE=disabled or offline).

The first run may take a while due to warming up. After finishing, you can see messages like:

latency_expert/workload (seconds): 156.62 (113 queries)
...
wandb: latency_expert/workload 156.61727

Next, to launch a Balsa experiment:

python run.py --run Balsa_JOBRandSplit --local

The first time this is run, simulation data is collected for all training queures, which will finish in ~5 minutes (cached for future runs):

...
I0323 04:15:48.212239 140382943663744 sim.py:738] Collection done, stats:
I0323 04:15:48.212319 140382943663744 sim.py:742]   num_queries=94 num_collected_queries=77 num_points=516379 latency_s=309.3
I0323 04:16:39.296590 140382943663744 sim.py:666] Saved simulation data (len 516379) to: data/sim-data-88bd801a.pkl

Balsa's simulation-to-reality approach requires first training an agent in simulation, then in real execution. To speed up the simulation phase, we have provided pretrained checkpoints for the simulation agent:

...
I0323 04:18:26.856739 140382943663744 sim.py:985] Loaded pretrained checkpoint: checkpoints/sim-MinCardCost-rand52split-680secs.ckpt

Then, the agent will start the first iteration of real-execution learning: planning all training queries, sending them off for execution, and waiting for these plans to finish. Periodically, test queries are planned and executed for logging.

Handy commands:

To kill the experiment(s): pkill -f run.py
To monitor a machine: dstat -mcdn

Experiment configs

All experiments and their hyperparameters are declared in experiments.py. To run an experiment with the local Postgres execution engine:

# <name> is a config registered in experiments.py.
python run.py --run <name> --local

Main Balsa agent:

| Benchmark | Config | |-----------------------|-----------------------------------| | JOB (Random Split) | Balsa_JOBRandSplit | | JOB Slow (Slow Split) | Balsa_JOBSlowSplit |

Ablation: impact of the simulator (Figure 10):

| Variant | Config | |------------|------------------------------------------------| | Balsa Sim | (main agent) Balsa_JOBRandSplit | | Expert Sim | JOBRandSplit_PostgresSim | | No Sim | JOBRandSplit_NoSim |

NOTE: Running JOBRandSplit_PostgresSim for the first time will be slow (1.1 hours) due to simulation data being collected from EXPLAIN. This data is cached in data/ for future runs.

Ablation: impact of the timeout mechanism (Figure 11):

| Variant | Config | |------------------------|-------------------------------------------------| | Balsa (safe execution) | (main agent) Balsa_JOBRandSplit | | no timeout | JOBRandSplit_NoTimeout |

Ablation: impact of exploration schemes (Figure 12):

| Variant | Config | |--------------------------|---------------------------------------------------| | Balsa (safe exploration) | (main agent) Balsa_JOBRandSplit | | epsilon-greedy | JOBRandSplit_EpsGreedy | | no exploration | JOBRandSplit_NoExplore |

Ablation: impact of training schemes (Figure 13):

| Variant | Config | |-------------------|-------------------------------------------------| | Balsa (on-policy) | (main agent) Balsa_JOBRandSplit | | retrain | JOBRandSplit_RetrainScheme |

Comparision with learning from expert demonstrations (Neo-impl) (Figure 15):

| Variant | Config | |-------------------|-------------------------------------------------| | Balsa | (main agent) Balsa_JOBRandSplit | | Neo-impl | NeoImpl_JOBRandSplit |

Diversified experiences (Figure 16):

| Variant | Config | |-------------------|-------------------------------------------------| | Balsa | (Main agents) JOB Balsa_JOBRandSplit; JOB Slow Balsa_JOBSlowSplit | | Balsa-8x (uses 8 main agents' data) | JOB Balsa_JOBRandSplitReplay; JOB Slow Balsa_JOBSlowSplitReplay |

Generalizing to highly distinct join templates, Ext-JOB (Figure 17):

| Variant | Config | |-------------------|-------------------------------------------------| | Balsa, data collection agent | Balsa_TrainJOB_TestExtJOB | | Balsa-1x (uses 1 base agent's data) | Balsa1x_TrainJOB_TestExtJOB | | Balsa-8x (uses 8 base agents' data) | Balsa8x_TrainJOB_TestExtJOB |

NOTE: When running a Ext-JOB config for the first time, you may see the error Missing nodes in init_experience. This means data/initial_policy_data.pkl contains the expert latencies of all 113 JOB queries for printing (assuming you ran the previous configs first) but lacks the new Ext-JOB queries. To fix, rename the previous .pkl file and rerun the new Ext-JOB config, which will automatically run the expert plans of the new query set to regenerate this file (as well as gathering the simulation data).

To specify a new experiment, subclass an existing config (give the subclass a descriptive name), change the values of some hyperparameters, and r

Related Skills

feishu-drive

346.8k

things-mac

346.8k

Manage Things 3 via the `things` CLI on macOS (add/update projects+todos via URL scheme; read/search/list from the local Things database)

clawhub

346.8k

Use the ClawHub CLI to search, install, update, and publish agent skills from clawhub.com

codebase-memory-mcp

1.2k

High-performance code intelligence MCP server. Indexes codebases into a persistent knowledge graph — average repo in milliseconds. 66 languages, sub-ms queries, 99% fewer tokens. Single static binary, zero dependencies.

balsa-project

View profile

View on GitHub

GitHub Stars147

CategoryData

Updated6d ago

Forks36

balsa-project/balsa

Languages

Python

Security Score

100/100

Audited on Mar 28, 2026

No findings