
IceDB

IceDB is an in-process Parquet merge engine for better data warehousing in S3, using only S3, with MVCC, time travel queries, and atomic operations.

No Spark, no JVM, no data warehouse experience required.

IceDB runs stateless, and stores data in easily readable formats to allow any language or framework to parse the log (jsonl) and read the data (parquet), making it far easier to insert, migrate, query, run, and scale than alternatives.

It's queryable by anything that understands parquet, and runs 54x cheaper than managed solutions such as BigQuery, Snowflake, and Athena.

IceDB tracks table schemas as standard SQL types, supports dynamic schema evolution, and divides your data into tables and partitions.
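Dynamic schema evolution can be pictured as a growing union of per-row columns: a new column appears the first time a row introduces it, and existing columns keep their type. The following is an illustrative sketch of that idea, not IceDB's actual internals; the type names and helper functions are invented:

```python
def infer_type(v):
    # Map Python values to illustrative SQL type names
    if isinstance(v, bool):
        return "BOOLEAN"
    if isinstance(v, int):
        return "BIGINT"
    if isinstance(v, float):
        return "DOUBLE"
    return "VARCHAR"


def evolve(schema: dict, row: dict) -> dict:
    """Add any columns this row introduces; existing columns keep their type."""
    for col, val in row.items():
        schema.setdefault(col, infer_type(val))
    return schema


schema = {}
evolve(schema, {"ts": 1686176939445, "event": "page_load"})
evolve(schema, {"ts": 1686176939666, "event": "page_load", "user_id": "user_a"})
print(schema)  # {'ts': 'BIGINT', 'event': 'VARCHAR', 'user_id': 'VARCHAR'}
```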

IceDB merges parquet files and manages tombstone cleanup to optimize your data storage for faster queries, very similar to what systems like ClickHouse do under the hood, except IceDB does this effectively stateless: all state and storage live in S3. This makes it extremely easy to run and scale. When you need concurrent mutations at the table level, you can introduce coordination through exclusive locks. The IceDB log format also uses the widely understood newline-delimited JSON (jsonl) format, making it trivial to read from any language.
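Because the log is just newline-delimited JSON, consuming it requires nothing beyond a JSON parser. A minimal sketch with invented entries, using the marker field names (p for path, b for bytes, t for timestamp) that appear in the Quick Start's insert output below:

```python
import json

# Two invented log entries in the marker shape IceDB prints on insert:
# "p" = object path, "b" = size in bytes, "t" = created-at timestamp (ms)
log_text = (
    '{"p": "example/_data/u=user_a/d=2023-06-07/aaa.parquet", "b": 693, "t": 1702822195892}\n'
    '{"p": "example/_data/u=user_b/d=2023-02-11/bbb.parquet", "b": 585, "t": 1702822195894}\n'
)

# JSONL: one JSON document per line, parseable with any language's JSON library
markers = [json.loads(line) for line in log_text.splitlines() if line.strip()]
paths = [m["p"] for m in markers]
print(paths)
```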

It retains many of the features of modern OLAP systems (such as ClickHouse’s Materialized Views), adds some new ones, and makes it way easier to build scalable data systems with a focus on true multi-tenancy.

IceDB can replace systems like BigQuery, Athena, and Snowflake, but with clever data design can also replace provisioned solutions such as a ClickHouse cluster, Redshift, and more.

Query engines such as DuckDB, ClickHouse, CHDB, Datafusion, Pandas, or custom parquet readers in any language can easily read IceDB data in hundreds of milliseconds, and even faster when combined with the IceDB S3 Proxy for transparent queries (the client just thinks it's talking to S3), e.g. the ClickHouse S3 function s3('https://icedb-s3-proxy/**/*.parquet') or DuckDB's read_parquet('s3://icedb-s3-proxy/**/*.parquet').


Quick Start

Clone the repo:

git clone https://github.com/danthegoodman1/icedb
docker compose up -d # starts local minio server
pip install git+https://github.com/danthegoodman1/icedb duckdb

Use pip3 instead of pip if you are on macOS or have a dual Python installation.

import duckdb
from icedb.log import S3Client, IceLogIO
from icedb import IceDBv3, CompressionCodec
from datetime import datetime
from time import time

# create an s3 client to talk to minio
s3c = S3Client(s3prefix="example", s3bucket="testbucket", s3region="us-east-1", s3endpoint="http://localhost:9000",
               s3accesskey="user", s3secretkey="password")

example_events = [
    {
        "ts": 1686176939445,
        "event": "page_load",
        "user_id": "user_a",
        "properties": {
            "page_name": "Home"
        }
    }, {
        "ts": 1676126229999,
        "event": "page_load",
        "user_id": "user_b",
        "properties": {
            "page_name": "Home"
        }
    }, {
        "ts": 1686176939666,
        "event": "page_load",
        "user_id": "user_a",
        "properties": {
            "page_name": "Settings"
        }
    }, {
        "ts": 1686176941445,
        "event": "page_load",
        "user_id": "user_a",
        "properties": {
            "page_name": "Home"
        }
    }
]


def part_func(row: dict) -> str:
    """
    Partition by user_id, date
    """
    row_time = datetime.utcfromtimestamp(row['ts'] / 1000)
    part = f"u={row['user_id']}/d={row_time.strftime('%Y-%m-%d')}"
    return part


# Initialize the client
ice = IceDBv3(
    part_func,
    ['event', 'ts'],  # Sort by event, then timestamp of the event within the data part
    "us-east-1",
    "user",
    "password",
    "http://localhost:9000",
    s3c,
    "dan-mbp",
    s3_use_path=True,  # needed for local minio
    compression_codec=CompressionCodec.ZSTD  # Let's force a higher compression level, default is SNAPPY
)

# Insert records
inserted = ice.insert(example_events)
print('inserted', inserted)

# Read the log state
log = IceLogIO("demo-host")
_, file_markers, _, _ = log.read_at_max_time(s3c, round(time() * 1000))
alive_files = list(filter(lambda x: x.tombstone is None, file_markers))

# Setup duckdb for querying local minio
ddb = duckdb.connect(":memory:")
ddb.execute("install httpfs")
ddb.execute("load httpfs")
ddb.execute("SET s3_region='us-east-1'")
ddb.execute("SET s3_access_key_id='user'")
ddb.execute("SET s3_secret_access_key='password'")
ddb.execute("SET s3_endpoint='localhost:9000'")
ddb.execute("SET s3_use_ssl='false'")
ddb.execute("SET s3_url_style='path'")

# Query alive files
query = ("select user_id, count(*), (properties::JSON)->>'page_name' as page "
         "from read_parquet([{}]) "
         "group by user_id, page "
         "order by count(*) desc").format(
    ', '.join(list(map(lambda x: "'s3://" + ice.data_s3c.s3bucket + "/" + x.path + "'", alive_files)))
)
print(ddb.sql(query))
Output:

inserted [{"p": "example/_data/u=user_a/d=2023-06-07/c2bc1eef-b2cd-404a-9ec6-097e27d3130f.parquet", "b": 693, "t": 1702822195892}, {"p": "example/_data/u=user_b/d=2023-02-11/2d8cb9b1-450f-455f-84e0-527b8fb35d5f.parquet", "b": 585, "t": 1702822195894}]
┌─────────┬──────────────┬──────────┐
│ user_id │ count_star() │   page   │
│ varchar │    int64     │ varchar  │
├─────────┼──────────────┼──────────┤
│ user_a  │            2 │ Home     │
│ user_a  │            1 │ Settings │
│ user_b  │            1 │ Home     │
└─────────┴──────────────┴──────────┘
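The alive_files filter above is the heart of MVCC here: a file is visible only while it has no tombstone, and time travel is the same log read with an earlier max_time passed to read_at_max_time. The following is an illustrative, self-contained sketch of that selection logic; the Marker class and the exact visibility rule are invented for illustration, not IceDB's actual internals:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Marker:
    path: str
    created_ms: int            # when the file was written (ms)
    tombstone: Optional[int]   # set when a merge replaced this file


markers = [
    Marker("u=a/d=2023-06-07/old.parquet", 1000, tombstone=3000),    # merged away at t=3000
    Marker("u=a/d=2023-06-07/merged.parquet", 3000, tombstone=None),
]


def alive_at(markers, max_time_ms):
    """Files visible to a reader at a given timestamp (ms)."""
    return [
        m.path for m in markers
        if m.created_ms <= max_time_ms
        and (m.tombstone is None or m.tombstone > max_time_ms)
    ]


print(alive_at(markers, 2000))  # time travel: before the merge, the old file is visible
print(alive_at(markers, 4000))  # "now": only the merged replacement is visible
```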

For more in-depth examples, see the Examples section

How does IceDB work?

Inserts, merges, and tombstone cleanup are powered by Python and DuckDB. IceDB runs stateless with a log in S3, meaning that you only pay for storage and compute during operations, enabling true serverless analytical processing. It does so in an open and easily readable format to allow for any language or framework to parse the icedb log (jsonl) and read the data (parquet).

The IceDB log keeps track of alive data files, as well as the running schema, which is updated on insertion. Query engines such as DuckDB, ClickHouse, CHDB, Datafusion, Pandas, or custom parquet readers in any language can easily read IceDB data in hundreds of milliseconds, especially when combined with the IceDB S3 Proxy.

See more in ARCHITECTURE.md

Examples

See the examples/ directory for many examples like Materialized Views, custom merge queries, schema validation before insert, and more.

Performance test

IceDB can easily insert hundreds of thousands of rows per second per instance, and query engines can query upwards of hundreds of millions of rows per second.

Performance depends on a variety of things such as query engine, network/disk, and how efficiently your data is merged at query time. Query performance depends on how efficiently the query engine can read Parquet. I've reached 1B rows/s with ClickHouse using IceDB.

Running multiple single-threaded insert web servers will be more performant than a single multithreaded web server, due to the overhead of multithreading in Python web servers, as seen in the performance test results.
