SkillAgentSearch skills...

Sqlean

🧹 Clean your SQL queries! 🧹

Install / Use

/learn @oliverxchen/Sqlean
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

sqlean

<p align="center"> <a href='https://coveralls.io/github/oliverxchen/sqlean'><img src='https://coveralls.io/repos/github/oliverxchen/sqlean/badge.svg' alt='Coverage Status' /></a> <a href="https://www.python.org/"><img alt="Python versions: 3.7, 3.8, 3.9, 3.10" src="https://img.shields.io/badge/python-3.7%20%7C%203.8%20%7C%203.9%20%7C%203.10%20%7C%203.11b-blue"></a> <a href="https://pypi.org/project/sqlean/"><img alt="PyPI" src="https://img.shields.io/pypi/v/sqlean"></a> <a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a> <a href="https://github.com/oliverxchen/sqlean/blob/main/LICENSE"><img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-A31F34"></a> </p>

sqlean is a Python CLI to clean your SQL queries. It aspires to be the black for SQL queries: an uncompromising query formatter that imposes a strict, strongly-opinionated formatting standard, with an easy workflow to achieve that standard.

Early releases of sqlean will support BigQuery dialect SQL queries with or without dbt syntax. If there's sufficient interest, other SQL dialects can be added.

WARNING: this is very early in sqlean's life and it should not be used in production.

Usage

Installing

Make sure you're in a virtual environment and install sqlean with pip:

pip install sqlean

sqlean command

The sqlean CLI has one argument and several options. The one argument in the sqlean CLI is the TARGET which can be a file or a directory. If the TARGET is a directory, sqlean will be applied to all SQL files within the directory and all sub-directories.

When sqlean is run without any options, SQL files will be changed in place so that the files comply with sqlean's formatting rules. Here's an example of a successful run (with no options):

<img src="https://github.com/oliverxchen/sqlean/raw/main/docs/img/pass_summary.png" width=640 alt="Succesful run of sqlean with summary table."/>

sqlean is thoroughly tested to ensure that original queries have the same abstract syntax tree (AST) as the formatted query. In plain English, that means that sqlean should never break a query. But sqlean is very new so if you don't fully trust sqlean yet, you can use the dry-run option (-d/--dry-run) to display the diff that would happen if sqlean is run for real on the file :

<img src="https://github.com/oliverxchen/sqlean/raw/main/docs/img/diff.png" width=640 alt="Dry-run of sqlean with diff."/>

Because sqlean needs to build an AST, if sqlean doesn't understand an SQL query then it is unparsable. In this case, the example query is unparsable because there's a problem with the query, and the verbose option (-v/--verbose) points to the problem:

<img src="https://github.com/oliverxchen/sqlean/raw/main/docs/img/verbose.png" width=640 alt="Verbose run of sqlean with error message."/>

Brief documentation on sqlean commands can be accessed with:

sqlean --help

Workflow

Once sqlean is production ready, it is ideally used in your continuous integration (CI) workflow so that a pull request that adds or changes SQL queries can only be merged if sqlean runs without errors (i.e. with the final message "All files passed").

When you first introduce sqlean into your workflow, many of your SQL files can be automatically fixed. In an example production query-base of over 1,600 files, nearly 60% of queries could be automatically fixed.

The other SQL files are unparsable. sqlean will need further development to treat these files and you can help by reporting the issue so that the problem can be addressed.

In the mean time though, the unparsable files can be marked so that sqlean knows to ignore them in CI runs. You can do this automatically by running sqlean with the ignore option (-i/--write-ignore), or just by inserting this comment as the first line of the file that you want ignored:

# sqlean ignore

If some of the issues are addressed by a future version of sqlean, you can upgrade sqlean and automatically remove the ignore indicators with the force (-f/--force) option.

Configuring sqlean

CLI options are for options that can change from one run of sqlean to another. Project level configuration will not change from one run to another, and must be set in pyproject.toml. One exception is the TARGET directory or file that can be both a project level setting and also change from one run to another, so it can appear in both pyproject.toml or as a CLI argument.

In pyproject.toml, sqlean options go in a [tool.sqlean] section. The options are:

  • includes: a single path or list of paths to include when sqlean is run. This will be over-ridden by the TARGET argument in the CLI when it is supplied. The default value is None, in which case TARGET must be used.
  • indent_size: an integer that indicates the number of spaces in the indent. The default value is 2.
  • whisper: (NOT YET IMPLEMENTED) a boolean which should be set to true to make all SQL reserved words and function names all lower case. The default value is false, which makes all SQL reserved words and function names all-caps.
  • max_line_length: (NOT YET IMPLEMENTED) an integer value which adds additional line breaks if a line is longer than this value. The default value is -1 to indicate that no maximum will be applied.
  • dialect: (NOT YET IMPLEMENTED) a string to indicate the SQL dialect used in the project. The default value is "BIGQUERY"

Why

In the last few years, the usage of SQL in big data processing has steadily increased with the emergence of computational engines like Google Cloud BigQuery, Amazon Redshift, Snowflake and others with SQL interfaces.

Leveraging on the growth of SQL, dbt has gained wide-spread adoption as a tool to orchestrate the running of SQL queries. While analytics and feature generation pipelines were previously written primarily in programming languages like Python or Scala, they are now mostly dbt code: a combination of SQL and jinja templating.

The tooling around dbt code does not have the maturity as tooling around older languages like Python. We see code formatting and styling as a critical part of tooling around a language: consistently formatted code makes it easier to collaborate within a team by making it faster and easier to read other people's code.

Black has emerged as the most popular code formatter for Python for a couple reasons:

  1. There are few options so there is little to argue about when setting it up for the first time.
  2. If two versions of a piece of code have the same AST, they will have the same black formatted output (modulo minor differences in line spacing).

The strong consistency in formatting means that sometimes the output looks a bit strange or at least different than what you're used to. But over time our brains adjust and it becomes faster to read black formatted code.

sqlean aspires to play the same rule with SQL/dbt queries. By using sqlean, you'll give up control on exactly how your queries are formatted and styled, but you'll gain strong consistency in the formatting. No more nitpicking in code reviews and no more getting used to other team members' query styles.

Current state of sqlean

These are very early, experimental days in the development of sqlean so these numbers will change. On a production code base of 1,615 queries, 58.7% could be parsed and formatted and the other 42.3% could not be parsed. It takes 8.8 seconds for sqlean to attempt to parse and fix these queries. The time increases as more queries can be parsed, so it is expected to take roughly 15 seconds when most or all of the queries can be parsed.

Alternatives

There are other SQL formatting/linting tools out there and some may be more suitable for you.

SQLFluff

SQLFluff is a dialect-flexible and configurable SQL linter. Designed with ELT applications in mind, SQLFluff also works with Jinja templating and dbt. SQLFluff will auto-fix most linting errors, allowing you to focus your time on what matters.

SQLFluff is the main alternative to sqlean with wide community adoption and active contributions (4K+ Github stars and 150+ contributors). SQLFluff has a wide array of rules that can be customized. Time to auto-format queries can vary widely. Running SQLFluff on the production code base of 1,615 queries took too long, so a subset of 617 queries was used for comparison:

| | sqlean | SQLFluff | | :---------------------- | -----: | -------: | | Time to auto-format (s) | 3.8 | 1295.2 | | # files auto-formatted | 388 | 228 | | Version | 0.0.3 | 0.13.2 |

While SQLFluff's performance depends on the starting queries, on this sample sqlean was able to run nearly 350 times faster than SQLFluff and was able to auto-format 70% more files than SQLFluff. SQLFluff was run with the default rules using the command: sqlfluff fix -f --dialect bigquery.

The large difference in run-times is likely due to

Related Skills

View on GitHub
GitHub Stars10
CategoryData
Updated2y ago
Forks0

Languages

Python

Security Score

80/100

Audited on Feb 4, 2024

No findings