chDB
chDB is an in-process SQL OLAP engine powered by ClickHouse.[^1] For more details, see The birth of chDB.
Features
- 🐼 Pandas-compatible DataStore API - Use familiar pandas syntax with ClickHouse performance
- In-process SQL OLAP engine, powered by ClickHouse
- No need to install ClickHouse
- Minimized data copying from C++ to Python via Python `memoryview`
- Input and output support for Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats
- Supports the Python DB-API 2.0
Get Started on Hex
- 📖 <a href="https://app.hex.tech/partnerships/app/chDB-Tutorial-032XsQ4qoKtlXxcw49joav/latest" target="_blank"><b>Getting Started Tutorial</b></a> — set up your first connection
- 🚀 <a href="https://app.hex.tech/signup/clickhouse-30" target="_blank"><b>Extended 30-day Hex Trial</b></a> — full access to ClickHouse integrations
Arch
<div align="center"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/arch-chdb3.png" width="450"> </div>

Installation
Currently, chDB supports Python 3.9+ on macOS and Linux (x86_64 and ARM64).
```bash
pip install chdb
```
<br>
🐼 DataStore: Pandas-Compatible API (Recommended)
DataStore provides a familiar pandas-like API with automatic SQL generation and ClickHouse performance. Write pandas code, get SQL performance - no learning curve required.
Quick Start (30 seconds)
Just change your import - use the pandas API you already know:
```python
import datastore as pd  # That's it! Use the pandas API as usual

# Create a DataFrame - works exactly like pandas
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'LA']
})

# Filter with familiar pandas syntax
result = df[df['age'] > 26]
print(result)
#       name  age city
# 1      Bob   30   LA
# 2  Charlie   35  NYC
# 3    Diana   28   LA

# GroupBy works too
print(df.groupby('city')['age'].mean())
# city
# LA     29.0
# NYC    30.0
```
✨ Zero code changes required. All operations are lazy - they're recorded and compiled into optimized SQL, executed only when results are needed.
Why DataStore?
| Feature | pandas | DataStore |
|---------|--------|-----------|
| API | ✅ Familiar | ✅ Same pandas API |
| Large datasets | ❌ Memory limited | ✅ SQL-optimized |
| Learning curve | ✅ Easy | ✅ None - same syntax |
| Performance | ❌ Single-threaded | ✅ ClickHouse engine |
Architecture
<div align="center"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/datastore_architecture.png" width="700"> </div>

DataStore uses lazy evaluation with dual-engine execution:
- Lazy Operation Chain: Operations are recorded, not executed immediately
- Smart Engine Selection: the QueryPlanner routes each segment to the optimal engine (chDB for SQL, pandas for complex operations)
- Intermediate Caching: Results cached at each step for fast iterative exploration
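The lazy-chain idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the actual DataStore internals: each method call records an operation, and the chain is compiled into a single SQL statement on demand.

```python
# Conceptual sketch of a lazy operation chain compiled to SQL.
# NOT the real DataStore implementation - just the idea.

class LazyFrame:
    def __init__(self, table):
        self.table = table
        self.ops = []  # operations are recorded here, not executed

    def select(self, *cols):
        self.ops.append(("select", cols))
        return self

    def filter(self, cond):
        self.ops.append(("filter", cond))
        return self

    def to_sql(self):
        # Compile the recorded chain into one SQL statement
        cols, where = "*", []
        for op, arg in self.ops:
            if op == "select":
                cols = ", ".join(arg)
            elif op == "filter":
                where.append(arg)
        sql = f"SELECT {cols} FROM {self.table}"
        if where:
            sql += " WHERE " + " AND ".join(where)
        return sql

q = LazyFrame("sales").select("product", "revenue").filter("revenue > 1000")
print(q.to_sql())
# SELECT product, revenue FROM sales WHERE revenue > 1000
```

Nothing runs until `to_sql()` (or, in the real library, a result-producing call like `print` or `head`) forces compilation and execution.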
Working with Files
```python
from datastore import DataStore

# Load any file format
ds = DataStore.from_file("data.parquet")  # or CSV, JSON, ORC...

# Explore your data
print(ds.head())     # Preview first 5 rows
print(ds.shape)      # (rows, columns)
print(ds.columns)    # Column names

# Build queries with method chaining
result = (ds
    .select("product", "revenue", "date")
    .filter(ds.revenue > 1000)
    .sort("revenue", ascending=False)
    .head(10))

print(result)
```
Query Any Data Source
```python
from datastore import DataStore

# S3 (with anonymous access)
ds = DataStore.uri("s3://bucket/data.parquet?nosign=true")

# MySQL
ds = DataStore.uri("mysql://user:pass@localhost:3306/mydb/users")

# PostgreSQL
ds = DataStore.uri("postgresql://user:pass@localhost:5432/mydb/products")

# And more: SQLite, MongoDB, ClickHouse, HDFS, Azure, GCS...
```
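ClickHouse reaches these sources through table functions such as `mysql(...)` and `s3(...)`; a URI is essentially shorthand for one of those calls. The helper below is a hypothetical sketch of that translation (not the actual `DataStore.uri` implementation, which handles many more schemes and options):

```python
# Hypothetical helper: translate a data-source URI into a ClickHouse
# table-function call. Illustrative only.
from urllib.parse import urlparse

def uri_to_table_function(uri: str) -> str:
    u = urlparse(uri)
    if u.scheme == "mysql":
        # ClickHouse: mysql('host:port', 'database', 'table', 'user', 'password')
        db, table = u.path.strip("/").split("/")
        return (f"mysql('{u.hostname}:{u.port}', '{db}', '{table}', "
                f"'{u.username}', '{u.password}')")
    if u.scheme == "s3":
        # ClickHouse: s3('url', 'format')
        path = f"https://{u.hostname}.s3.amazonaws.com{u.path}"
        return f"s3('{path}', 'Parquet')"
    raise ValueError(f"unsupported scheme: {u.scheme}")

print(uri_to_table_function("mysql://user:pass@localhost:3306/mydb/users"))
# mysql('localhost:3306', 'mydb', 'users', 'user', 'pass')
```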
Pandas API Coverage
DataStore implements comprehensive pandas compatibility:
| Category | Coverage |
|----------|----------|
| DataFrame methods | 209 methods |
| Series.str accessor | 56 methods |
| Series.dt accessor | 42+ methods |
| ClickHouse SQL functions | 334 functions |
```python
# All these pandas methods work:
df.drop(columns=['unused'])
df.fillna(0)
df.assign(revenue=lambda x: x['price'] * x['quantity'])
df.sort_values('revenue', ascending=False)
df.groupby('category').agg({'revenue': 'sum', 'quantity': 'mean'})
df.merge(other_df, on='id')
df.pivot_table(values='sales', index='date', columns='product')
# ... and 200+ more
```
String and DateTime Operations
```python
# String operations via .str accessor
ds['name'].str.upper()
ds['email'].str.contains('@gmail')
ds['text'].str.replace('old', 'new')

# DateTime operations via .dt accessor
ds['date'].dt.year
ds['date'].dt.month
ds['timestamp'].dt.hour
```
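Conceptually, each accessor call is translated into the corresponding ClickHouse SQL function. The mapping below is an illustrative sketch with a handful of real ClickHouse function names; the actual translation table in DataStore covers far more methods:

```python
# Illustrative mapping from pandas accessor methods to ClickHouse SQL
# functions (a few examples only; not DataStore's real translation table).
ACCESSOR_TO_SQL = {
    "str.upper":    lambda col: f"upper({col})",
    "str.contains": lambda col, pat: f"position({col}, '{pat}') > 0",
    "dt.year":      lambda col: f"toYear({col})",
    "dt.hour":      lambda col: f"toHour({col})",
}

print(ACCESSOR_TO_SQL["str.upper"]("name"))   # upper(name)
print(ACCESSOR_TO_SQL["dt.year"]("date"))     # toYear(date)
```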
Documentation
- Pandas Compatibility Guide - Full list of supported methods
- Function Reference - 334 ClickHouse SQL functions
- Migration Guide - Step-by-step guide for pandas users
<br>
SQL API
For users who prefer SQL or need advanced ClickHouse features:
Run in command line
```bash
python3 -m chdb SQL [OutputFormat]
python3 -m chdb "SELECT 1,'abc'" Pretty
```
<br>
Data Input
The following methods are available to access on-disk and in-memory data formats:
<details> <summary><h4>🗂️ Connection based API</h4></summary>

```python
import chdb

# Create a connection (in-memory by default)
conn = chdb.connect(":memory:")
# Or use file-based: conn = chdb.connect("test.db")

# Create a cursor
cur = conn.cursor()

# Execute queries
cur.execute("SELECT number, toString(number) as str FROM system.numbers LIMIT 3")

# Fetch data in different ways
print(cur.fetchone())    # Single row: (0, '0')
print(cur.fetchmany(2))  # Multiple rows: ((1, '1'), (2, '2'))

# Get column information
print(cur.column_names())  # ['number', 'str']
print(cur.column_types())  # ['UInt64', 'String']

# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
    print(row)

# Always close resources when done
cur.close()
conn.close()
```
For more details, see examples/connect.py.
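Because the cursor follows the Python DB-API 2.0 pattern, generic DB-API helpers apply to it. Below is a sketch of a helper that yields rows as dicts, demonstrated with the standard library's `sqlite3`; it assumes the cursor exposes the standard `description` attribute (chdb's cursor also offers `column_names()` as shown above):

```python
# Generic DB-API 2.0 helper: yield each row as a dict keyed by column name.
# Demonstrated with sqlite3; any conforming cursor works the same way.
import sqlite3

def dict_rows(cursor):
    cols = [d[0] for d in cursor.description]
    for row in cursor:
        yield dict(zip(cols, row))

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("SELECT 1 AS number, 'one' AS str")
print(list(dict_rows(cur)))
# [{'number': 1, 'str': 'one'}]
conn.close()
```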
</details> <details> <summary><h4>🗂️ Query On File</h4> (Parquet, CSV, JSON, Arrow, ORC and 60+)</summary>

You can execute SQL and return data in the desired format.

```python
import chdb

res = chdb.query('select version()', 'Pretty'); print(res)
```
Work with Parquet or CSV
```python
# See more data type formats in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV'); print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, "
      f"storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, "
      f"elapsed {res.elapsed()} seconds")
```
Parameterized queries
```python
import chdb

df = chdb.query(
    "SELECT toDate({base_date:String}) + number AS date "
    "FROM numbers({total_days:UInt64}) "
    "LIMIT {items_per_page:UInt64}",
    "DataFrame",
    params={"base_date": "2025-01-01", "total_days": 10, "items_per_page": 2},
)
print(df)
#          date
# 0  2025-01-01
# 1  2025-01-02
```
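The `{name:Type}` placeholders are ClickHouse server-side query parameters: the type annotation tells ClickHouse how to parse the bound value, so values are never string-interpolated into the SQL text. A hypothetical helper (not part of chdb) that extracts placeholders and checks a params dict against them:

```python
# Hypothetical helper: extract {name:Type} placeholders from a ClickHouse
# query and verify that a params dict supplies each one. Illustrative only.
import re

def placeholders(sql: str) -> dict:
    return dict(re.findall(r"\{(\w+):(\w+)\}", sql))

sql = ("SELECT toDate({base_date:String}) + number AS date "
       "FROM numbers({total_days:UInt64}) LIMIT {items_per_page:UInt64}")
print(placeholders(sql))
# {'base_date': 'String', 'total_days': 'UInt64', 'items_per_page': 'UInt64'}

params = {"base_date": "2025-01-01", "total_days": 10, "items_per_page": 2}
missing = set(placeholders(sql)) - set(params)
assert not missing, f"missing params: {missing}"
```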
Query progress (progress=auto)
```python
import chdb

# Connection API
conn = chdb.connect(":memory:?progress=auto")
conn.query("SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4")
```

```python
import chdb

# One-shot API
res = chdb.query(
    "SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4",
    options={"progress": "auto"},
)
```
`progress=auto` behavior:
- In terminal runs: show textual progress updates in the terminal.
- In Jupyter/Marimo notebooks: render a progress bar in the notebook output.

Other progress options:
- Progress bar:
  - `progress=tty`: write progress to the terminal TTY.
  - `progress=err`: write progress to stderr.
  - `progress=off`: disable progress bar output.
- Progress table (terminal output):
  - `progress-table=tty`: write the progress table to the terminal TTY.
  - `progress-table=err`: write the progress table to stderr.
  - `progress-table=off`: disable progress table output.
Pandas dataframe output
```python
# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')
```
</details>
<details>
<summary><h4>🗂️ Query On Table</h4> (Pandas DataFrame, Parquet file/bytes, Arrow bytes) </summary>
Query On Pandas DataFrame
```python
import chdb.dataframe as cdf
import pandas as pd

# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="sele
```
