Duckdbex

This is an Elixir Library that embeds C++ DuckDB database into your Elixir application.

DuckDB is an in-process SQL database management system designed to support analytical query workloads, also known as Online analytical processing (OLAP). It has no external dependencies, neither for compilation nor during run-time and completely embedded (like Sqlite) within a host (BEAM in our case) process.

To find out where and why you could use DuckDB, please see why DuckDB section of DuckDB docs.

Also, you may find useful the DuckDB documentation.

The library uses amalgamation of the DuckDB sources, which combine all sources into two files duckdb.hpp and duckdb.cpp. This is the standard source distribution of libduckdb. Amalgamation allows to work with CSV and Parquet but it does not yet include DuckDB extensions like JSON, Full Text Search, HTTPFS, SQLite Scanner, Postgres Scanner and Substrait.

All NIF functions implemented as Dirty NIF

Online HexDocs

Installation

If available in Hex, the package can be installed by adding duckdbex to your list of dependencies in mix.exs:

def deps do
  [
    {:duckdbex, "~> 0.3.9"}
  ]
end

Usage

Using the duckdbex is quite simple. You need to open database file, create connection, run a query and fetch the results. livebook

Open database

To open the DuckDB database, you must specify the path to the DuckDB file

# Open an existing database file
{:ok, db} = Duckdbex.open("exis_movies.duckdb")

If specified database file does not exist, a new database file with the given name will be created automatically.

# If the specified file does not exist the new database will be created
{:ok, db} = Duckdbex.open("not_yet_exist_my_new_duckdb_database")

If you do not specify any database file the database will be created in the memory. Note that for an in-memory database no data is persisted to disk (i.e. all data is lost when you exit the app or database|connection object out of scope).

# Just create database in the memory
{:ok, db} = Duckdbex.open()

The DuckDB has a number of different options and could be configured by passing the Duckdbex.Config struct into the Duckdbex.open/1 or Duckdbex.open/2.

{:ok, _db} = Duckdbex.open(%Duckdbex.Config{checkpoint_wal_size: 8388608})

Create Connection

With the DuckDB instance, you can create one or many Connection instances using the Duckdbex.connection/1 function. In fact, the DuckDB connection is the native OS thread (not a lightweight Elixir process). While individual connections are thread-safe, they will be locked during querying. So, it is recommended to use the different DuckDB connections in the different Elixir processes to allow for the best parallel performance.

{:ok, db} = Duckdbex.open()

# Create a connection to the opened database
{:ok, conn} = Duckdbex.connection(db)

Make a Query

To make a query you need call Duckdbex.query/2 passing connection reference and query string. Query parameters can be passed as list into Duckdbex.query/3. The query call will return the reference to the result.

{:ok, db} = Duckdbex.open()
{:ok, conn} = Duckdbex.connection(db)

# Run a query and get the reference to the result
{:ok, result_ref} = Duckdbex.query(conn, "SELECT 1;")

# Run a query with parameters
{:ok, result_ref} = Duckdbex.query(conn, "SELECT 1 WHERE $1 = 1;", [1])

Fetch Result

To get data from result you need pass result reference from the Duckdbex.query/2 or Duckdbex.query/3 call into the Duckdbex.fetch_all/1 or Duckdbex.fetch_chunk/1. The Duckdbex.fetch_all/1 will return all the data at once. To get data chunk by chunk you should call Duckdbex.fetch_chunk/1.

# Run a query and get the reference to the result
{:ok, result_ref} = Duckdbex.query(conn, """
  SELECT userId, movieId, rating FROM ratings WHERE userId = $1;
""", [1])

# Get all the data from the result reference at once
Duckdbex.fetch_all(result_ref)
# => [[userId: 1, movieId: 1, rating: 6], [userId: 1, movieId: 2, rating: 12]]

or by chunks

# Run a query and get the reference to the result
{:ok, result_ref} = Duckdbex.query(conn, "SELECT * FROM ratings;")

# fetch chunk
Duckdbex.fetch_chunk(result_ref)
# => [[userId: 1, movieId: 1, rating: 6], [userId: 1, movieId: 2, rating: 12]...]

# fetch next chunk
Duckdbex.fetch_chunk(result_ref)
# => [<rows>]

...

# the data is over and fetch_chunk returns the empty list
Duckdbex.fetch_chunk(result_ref)
# => []

Closing connection, database and releasing resources

All opened database/connecions/results refs will be closed/released automatically as soon as the ref for an object (db, conn, result_ref) will be thrown away. For example:

Lets open the database:

# Open an existing database file
{:ok, db} = Duckdbex.open("exis_movies.duckdb")
#Reference<0.1076596279.3008626690.232411>

Now the db holds the reference #Reference<0.1076596279.3008626690.232411> to the underlying database object. If you throw away the 'db' ref to the underlying database object the database will be closed automatically. Lets simulate this via 'assignment' to db some stub value

# throw away the ref to the database
db = "forcing closing the database"

Now the db holds the "forcing closing the database" binary and there is no any 'variable' in out code what holds the ref #Reference<0.1076596279.3008626690.232411> to the database. So, technically speaking, ref count to underlying database object is 0. Erlang automatically calls the destructor for database object and it will be closed correctly. So, if, for example, you holds the db ref in GenServer state the db will be closed automatically if GenServer will be terminated/crashed. You don't need to call some function to close database.

But what if you need to close the database/connection/result_ref explicitly, for example, you want to close the database (flush all underlying db buffers to disk) and when archive the db file. To prevent using the strange code like db = "forcing closing the database" there is Duckdbex.release(resource) function to explicitly closing any underlying DuckDB resource:

iex> {:ok, db} = Duckdbex.open("my_database.duckdb", %Duckdbex.Config{})
iex> {:ok, conn} = Duckdbex.connection(db)
iex> {:ok, res} = Duckdbex.query(conn, "SELECT 1 WHERE $1 = 1;", [1])
iex> :ok = Duckdbex.release(res)
iex> :ok = Duckdbex.release(conn)
iex> :ok = Duckdbex.release(db)

Now, all are explicitly closed.

Prepared statement

Prepared statement speeding up queries that will be executed many times with different parameters. Also it allows to avoid string concatenation/SQL injection attacks.

# Prepare statement
{:ok, stmt_ref} = Duckdbex.prepare_statement(conn, "SELECT 1;")

# Execute statement
{:ok, result_ref} = Duckdbex.execute_statement(stmt_ref)

# Fetch result
Duckdbex.fetch_all(result_ref)
# => [[1]]

or with parameters

# Prepare statement
{:ok, stmt_ref} = Duckdbex.prepare_statement(conn, "SELECT * FROM ratings WHERE userId = $1;")

# Execute statement
{:ok, result_ref} = Duckdbex.execute_statement(stmt_ref, [1])
# fetch result ...

# Execute statement
{:ok, result_ref} = Duckdbex.execute_statement(stmt_ref, [42])
# fetch result ...

Importing Data

DuckDB provides several methods that allows you to easily and efficiently insert data to the database.

Insert Statements
CSV Files
Parquet Files
Appender

Insert Statements

This is standard way of inserting data into relational database, but DuckDB is not recommended to use this method if you are inserting more than a few records. See details. To insert bulk data into database, please, use Appender.

{:ok, db} = Duckdbex.open()
{:ok, conn} = Duckdbex.connection(db)
{:ok, _res} = Duckdbex.query(conn, "CREATE TABLE people(id INTEGER, name VARCHAR);")

{:ok, _res} = Duckdbex.query(conn, "INSERT INTO people VALUES (1, 'Mark'), (2, 'Hannes');")

A more detailed description together with syntax diagram can be found here.

CSV Files

DuckDB has an embedded CSV reader that allows to load CSV files directly into database (escaping data transferring from Elixir to NIF, so ERTS doesn't involved in this). Also, DuckDB supports compressed CSV files, e.g. a gzipped file like my_csv_file.csv.gz, etc.

For example we have a test.csv CSV file:

FlightDate|UniqueCarrier|OriginCityName|DestCityName
1988-01-01|AA|New York, NY|Los Angeles, CA
1988-01-02|AA|New York, NY|Los Angeles, CA
1988-01-03|AA|New York, NY|Los Angeles, CA

Let's load this file into database:

{:ok, db} = Duckdbex.open()
{:ok, conn} = Duckdbex.connection(db)

# read a CSV file from disk, auto-infer options
{:ok, res} = Duckdbex.query conn, "SELECT * FROM 'test.csv';"

Duckdbex.fetch_all(res)
# will result in
[
  [{1988, 1, 1}, "AA", "New York, NY", "Los Angeles, CA"],
  [{1988, 1, 2}, "AA", "New York, NY", "Los Angeles, CA"],
  [{1988, 1, 3}, "AA", "New York, NY", "Los Angeles, CA"]
]

or we could export data directly into CS

Duckdbex

Install / Use

README

Duckdbex

Installation

Usage

Open database

Create Connection

Make a Query

Fetch Result

Closing connection, database and releasing resources

Prepared statement

Importing Data

Insert Statements

CSV Files