chDB
chDB is an in-process SQL OLAP engine powered by ClickHouse.[^1] For more details, see The birth of chDB.
Features
- 🐼 Pandas-compatible DataStore API - Use familiar pandas syntax with ClickHouse performance
- In-process SQL OLAP engine, powered by ClickHouse
- No need to install ClickHouse
- Minimized data copying from C++ to Python via Python `memoryview`
- Input and output support for Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats
- Supports the Python DB-API 2.0
Get Started on Hex
- 📖 <a href="https://app.hex.tech/partnerships/app/chDB-Tutorial-032XsQ4qoKtlXxcw49joav/latest" target="_blank"><b>Getting Started Tutorial</b></a> — set up your first connection
- 🚀 <a href="https://app.hex.tech/signup/clickhouse-30" target="_blank"><b>Extended 30-day Hex Trial</b></a> — full access to ClickHouse integrations
Arch
<div align="center"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/arch-chdb3.png" width="450"> </div>

Installation
Currently, chDB supports Python 3.9+ on macOS and Linux (x86_64 and ARM64).
```bash
pip install chdb
```
<br>
🐼 DataStore: Pandas-Compatible API (Recommended)
DataStore provides a familiar pandas-like API with automatic SQL generation and ClickHouse performance. Write pandas code, get SQL performance - no learning curve required.
Quick Start (30 seconds)
Just change your import - use the pandas API you already know:
```python
import datastore as pd  # That's it! Use the pandas API as usual

# Create a DataFrame - works exactly like pandas
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'LA']
})

# Filter with familiar pandas syntax
result = df[df['age'] > 26]
print(result)
#       name  age city
# 1      Bob   30   LA
# 2  Charlie   35  NYC
# 3    Diana   28   LA

# GroupBy works too
print(df.groupby('city')['age'].mean())
# city
# LA     29.0
# NYC    30.0
```
✨ Zero code changes required. All operations are lazy - they're recorded and compiled into optimized SQL, executed only when results are needed.
Why DataStore?
| Feature | pandas | DataStore |
|---------|--------|-----------|
| API | ✅ Familiar | ✅ Same pandas API |
| Large datasets | ❌ Memory limited | ✅ SQL-optimized |
| Learning curve | ✅ Easy | ✅ None - same syntax |
| Performance | ❌ Single-threaded | ✅ ClickHouse engine |
Architecture
<div align="center"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/datastore_architecture.png" width="700"> </div>

DataStore uses lazy evaluation with dual-engine execution:
- Lazy Operation Chain: Operations are recorded, not executed immediately
- Smart Engine Selection: the QueryPlanner routes each segment to the optimal engine (chDB for SQL, pandas for complex operations)
- Intermediate Caching: Results cached at each step for fast iterative exploration
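The lazy-chain idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the actual DataStore internals: each method call records an operation, and the chain is compiled into a single SQL statement on demand.

```python
# Conceptual sketch of a lazy operation chain compiled to SQL.
# NOT the real DataStore implementation - just the idea.

class LazyFrame:
    def __init__(self, table):
        self.table = table
        self.ops = []  # operations are recorded here, not executed

    def select(self, *cols):
        self.ops.append(("select", cols))
        return self

    def filter(self, cond):
        self.ops.append(("filter", cond))
        return self

    def to_sql(self):
        # Compile the recorded chain into one SQL statement
        cols, where = "*", []
        for op, arg in self.ops:
            if op == "select":
                cols = ", ".join(arg)
            elif op == "filter":
                where.append(arg)
        sql = f"SELECT {cols} FROM {self.table}"
        if where:
            sql += " WHERE " + " AND ".join(where)
        return sql

q = LazyFrame("sales").select("product", "revenue").filter("revenue > 1000")
print(q.to_sql())
# SELECT product, revenue FROM sales WHERE revenue > 1000
```

Nothing runs until `to_sql()` (or, in the real library, a result-producing call like `print` or `head`) forces compilation and execution.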
Working with Files
```python
from datastore import DataStore

# Load any file format
ds = DataStore.from_file("data.parquet")  # or CSV, JSON, ORC...

# Explore your data
print(ds.head())     # Preview first 5 rows
print(ds.shape)      # (rows, columns)
print(ds.columns)    # Column names

# Build queries with method chaining
result = (ds
    .select("product", "revenue", "date")
    .filter(ds.revenue > 1000)
    .sort("revenue", ascending=False)
    .head(10))

print(result)
```
Query Any Data Source
```python
from datastore import DataStore

# S3 (with anonymous access)
ds = DataStore.uri("s3://bucket/data.parquet?nosign=true")

# MySQL
ds = DataStore.uri("mysql://user:pass@localhost:3306/mydb/users")

# PostgreSQL
ds = DataStore.uri("postgresql://user:pass@localhost:5432/mydb/products")

# And more: SQLite, MongoDB, ClickHouse, HDFS, Azure, GCS...
```
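ClickHouse reaches these sources through table functions such as `mysql(...)` and `s3(...)`; a URI is essentially shorthand for one of those calls. The helper below is a hypothetical sketch of that translation (not the actual `DataStore.uri` implementation, which handles many more schemes and options):

```python
# Hypothetical helper: translate a data-source URI into a ClickHouse
# table-function call. Illustrative only.
from urllib.parse import urlparse

def uri_to_table_function(uri: str) -> str:
    u = urlparse(uri)
    if u.scheme == "mysql":
        # ClickHouse: mysql('host:port', 'database', 'table', 'user', 'password')
        db, table = u.path.strip("/").split("/")
        return (f"mysql('{u.hostname}:{u.port}', '{db}', '{table}', "
                f"'{u.username}', '{u.password}')")
    if u.scheme == "s3":
        # ClickHouse: s3('url', 'format')
        path = f"https://{u.hostname}.s3.amazonaws.com{u.path}"
        return f"s3('{path}', 'Parquet')"
    raise ValueError(f"unsupported scheme: {u.scheme}")

print(uri_to_table_function("mysql://user:pass@localhost:3306/mydb/users"))
# mysql('localhost:3306', 'mydb', 'users', 'user', 'pass')
```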
Pandas API Coverage
DataStore implements comprehensive pandas compatibility:
| Category | Coverage |
|----------|----------|
| DataFrame methods | 209 methods |
| Series.str accessor | 56 methods |
| Series.dt accessor | 42+ methods |
| ClickHouse SQL functions | 334 functions |
```python
# All these pandas methods work:
df.drop(columns=['unused'])
df.fillna(0)
df.assign(revenue=lambda x: x['price'] * x['quantity'])
df.sort_values('revenue', ascending=False)
df.groupby('category').agg({'revenue': 'sum', 'quantity': 'mean'})
df.merge(other_df, on='id')
df.pivot_table(values='sales', index='date', columns='product')
# ... and 200+ more
```
String and DateTime Operations
```python
# String operations via .str accessor
ds['name'].str.upper()
ds['email'].str.contains('@gmail')
ds['text'].str.replace('old', 'new')

# DateTime operations via .dt accessor
ds['date'].dt.year
ds['date'].dt.month
ds['timestamp'].dt.hour
```
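Conceptually, each accessor call is translated into the corresponding ClickHouse SQL function. The mapping below is an illustrative sketch with a handful of real ClickHouse function names; the actual translation table in DataStore covers far more methods:

```python
# Illustrative mapping from pandas accessor methods to ClickHouse SQL
# functions (a few examples only; not DataStore's real translation table).
ACCESSOR_TO_SQL = {
    "str.upper":    lambda col: f"upper({col})",
    "str.contains": lambda col, pat: f"position({col}, '{pat}') > 0",
    "dt.year":      lambda col: f"toYear({col})",
    "dt.hour":      lambda col: f"toHour({col})",
}

print(ACCESSOR_TO_SQL["str.upper"]("name"))   # upper(name)
print(ACCESSOR_TO_SQL["dt.year"]("date"))     # toYear(date)
```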
Documentation
- Pandas Compatibility Guide - Full list of supported methods
- Function Reference - 334 ClickHouse SQL functions
- Migration Guide - Step-by-step guide for pandas users
<br>
SQL API
For users who prefer SQL or need advanced ClickHouse features:
Run in command line
```bash
python3 -m chdb SQL [OutputFormat]
python3 -m chdb "SELECT 1,'abc'" Pretty
```
<br>
Data Input
The following methods are available to access on-disk and in-memory data formats:
<details> <summary><h4>🗂️ Connection based API</h4></summary>

```python
import chdb

# Create a connection (in-memory by default)
conn = chdb.connect(":memory:")
# Or use file-based: conn = chdb.connect("test.db")

# Create a cursor
cur = conn.cursor()

# Execute queries
cur.execute("SELECT number, toString(number) as str FROM system.numbers LIMIT 3")

# Fetch data in different ways
print(cur.fetchone())    # Single row: (0, '0')
print(cur.fetchmany(2))  # Multiple rows: ((1, '1'), (2, '2'))

# Get column information
print(cur.column_names())  # ['number', 'str']
print(cur.column_types())  # ['UInt64', 'String']

# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
    print(row)

# Always close resources when done
cur.close()
conn.close()
```
For more details, see examples/connect.py.
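Because the cursor follows the Python DB-API 2.0 pattern, generic DB-API helpers apply to it. Below is a sketch of a helper that yields rows as dicts, demonstrated with the standard library's `sqlite3`; it assumes the cursor exposes the standard `description` attribute (chdb's cursor also offers `column_names()` as shown above):

```python
# Generic DB-API 2.0 helper: yield each row as a dict keyed by column name.
# Demonstrated with sqlite3; any conforming cursor works the same way.
import sqlite3

def dict_rows(cursor):
    cols = [d[0] for d in cursor.description]
    for row in cursor:
        yield dict(zip(cols, row))

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("SELECT 1 AS number, 'one' AS str")
print(list(dict_rows(cur)))
# [{'number': 1, 'str': 'one'}]
conn.close()
```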
</details> <details> <summary><h4>🗂️ Query On File</h4> (Parquet, CSV, JSON, Arrow, ORC and 60+)</summary>

You can execute SQL and return data in the desired format.

```python
import chdb

res = chdb.query('select version()', 'Pretty'); print(res)
```
Work with Parquet or CSV
```python
# See more data type formats in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV'); print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, "
      f"storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, "
      f"elapsed {res.elapsed()} seconds")
```
Parameterized queries
```python
import chdb

df = chdb.query(
    "SELECT toDate({base_date:String}) + number AS date "
    "FROM numbers({total_days:UInt64}) "
    "LIMIT {items_per_page:UInt64}",
    "DataFrame",
    params={"base_date": "2025-01-01", "total_days": 10, "items_per_page": 2},
)
print(df)
#          date
# 0  2025-01-01
# 1  2025-01-02
```
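The `{name:Type}` placeholders are ClickHouse server-side query parameters: the type annotation tells ClickHouse how to parse the bound value, so values are never string-interpolated into the SQL text. A hypothetical helper (not part of chdb) that extracts placeholders and checks a params dict against them:

```python
# Hypothetical helper: extract {name:Type} placeholders from a ClickHouse
# query and verify that a params dict supplies each one. Illustrative only.
import re

def placeholders(sql: str) -> dict:
    return dict(re.findall(r"\{(\w+):(\w+)\}", sql))

sql = ("SELECT toDate({base_date:String}) + number AS date "
       "FROM numbers({total_days:UInt64}) LIMIT {items_per_page:UInt64}")
print(placeholders(sql))
# {'base_date': 'String', 'total_days': 'UInt64', 'items_per_page': 'UInt64'}

params = {"base_date": "2025-01-01", "total_days": 10, "items_per_page": 2}
missing = set(placeholders(sql)) - set(params)
assert not missing, f"missing params: {missing}"
```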
Query progress (progress=auto)
```python
import chdb

# Connection API
conn = chdb.connect(":memory:?progress=auto")
conn.query("SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4")
```

```python
import chdb

# One-shot API
res = chdb.query(
    "SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4",
    options={"progress": "auto"},
)
```
`progress=auto` behavior:
- In terminal runs: show textual progress updates in the terminal.
- In Jupyter/Marimo notebooks: render a progress bar in the notebook output.

Other progress options:
- Progress bar:
  - `progress=tty`: write progress to the terminal TTY.
  - `progress=err`: write progress to stderr.
  - `progress=off`: disable progress bar output.
- Progress table (terminal output):
  - `progress-table=tty`: write the progress table to the terminal TTY.
  - `progress-table=err`: write the progress table to stderr.
  - `progress-table=off`: disable progress table output.
Pandas dataframe output
```python
# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')
```
</details>
<details>
<summary><h4>🗂️ Query On Table</h4> (Pandas DataFrame, Parquet file/bytes, Arrow bytes) </summary>
Query On Pandas DataFrame
```python
import chdb.dataframe as cdf
import pandas as pd

# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="sele
```
