
<div align="center"> <a href="https://clickhouse.com/blog/chdb-joins-clickhouse-family">📢 chDB joins the ClickHouse family 🐍+🚀</a> </div> <div align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://github.com/chdb-io/chdb/raw/main/docs/_static/snake-chdb-dark.png" height="130"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/snake-chdb.png" height="130"> </picture>


</div>

chDB

chDB is an in-process SQL OLAP engine powered by ClickHouse.[^1] For more details, see The birth of chDB.

Features

  • 🐼 Pandas-compatible DataStore API - Use familiar pandas syntax with ClickHouse performance
  • In-process SQL OLAP Engine, powered by ClickHouse
  • No need to install ClickHouse
  • Minimized data copy from C++ to Python with python memoryview
  • Input & output support for Parquet, CSV, JSON, Arrow, ORC, and 60+ more formats
  • Support Python DB API 2.0

Get Started on Hex

  • 📖 <a href="https://app.hex.tech/partnerships/app/chDB-Tutorial-032XsQ4qoKtlXxcw49joav/latest" target="_blank"><b>Getting Started Tutorial</b></a> — set up your first connection
  • 🚀 <a href="https://app.hex.tech/signup/clickhouse-30" target="_blank"><b>Extended 30-day Hex Trial</b></a> — full access to ClickHouse integrations

Arch

<div align="center"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/arch-chdb3.png" width="450"> </div>

Installation

Currently, chDB supports Python 3.9+ on macOS and Linux (x86_64 and ARM64).

pip install chdb
<br>

🐼 DataStore: Pandas-Compatible API (Recommended)

DataStore provides a familiar pandas-like API with automatic SQL generation and ClickHouse performance. Write pandas code, get SQL performance - no learning curve required.

Quick Start (30 seconds)

Just change your import - use the pandas API you already know:

import datastore as pd  # That's it! Use pandas API as usual

# Create a DataFrame - works exactly like pandas
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['NYC', 'LA', 'NYC', 'LA']
})

# Filter with familiar pandas syntax
result = df[df['age'] > 26]
print(result)
#       name  age city
# 1      Bob   30   LA
# 2  Charlie   35  NYC
# 3    Diana   28   LA

# GroupBy works too
print(df.groupby('city')['age'].mean())
# city
# LA     29.0
# NYC    30.0

✨ Zero code changes required. All operations are lazy - they're recorded and compiled into optimized SQL, executed only when results are needed.

Why DataStore?

| Feature | pandas | DataStore |
|---------|--------|-----------|
| API | ✅ Familiar | ✅ Same pandas API |
| Large datasets | ❌ Memory limited | ✅ SQL-optimized |
| Learning curve | ✅ Easy | ✅ None - same syntax |
| Performance | ❌ Single-threaded | ✅ ClickHouse engine |

Architecture

<div align="center"> <img src="https://github.com/chdb-io/chdb/raw/main/docs/_static/datastore_architecture.png" width="700"> </div>

DataStore uses lazy evaluation with dual-engine execution:

  1. Lazy Operation Chain: Operations are recorded, not executed immediately
  2. Smart Engine Selection: a QueryPlanner routes each segment to the optimal engine (chDB for SQL, pandas for complex ops)
  3. Intermediate Caching: Results cached at each step for fast iterative exploration
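The record-then-compile idea behind step 1 can be illustrated with a minimal pure-Python sketch. This is only an illustration of the concept, not DataStore's actual implementation; the `LazyChain` class and its methods are hypothetical:

```python
# Minimal sketch of a lazy operation chain: calls are recorded,
# compiled into SQL text, and "executed" only when results are requested.
class LazyChain:
    def __init__(self, table):
        self.table = table
        self.ops = []  # recorded operations, nothing runs yet

    def filter(self, cond):
        self.ops.append(("WHERE", cond))
        return self

    def sort(self, col, ascending=True):
        order = "ASC" if ascending else "DESC"
        self.ops.append(("ORDER BY", f"{col} {order}"))
        return self

    def head(self, n):
        self.ops.append(("LIMIT", str(n)))
        return self

    def to_sql(self):
        # Execution point: compile the recorded chain into one SQL string.
        sql = f"SELECT * FROM {self.table}"
        for clause, arg in self.ops:
            sql += f" {clause} {arg}"
        return sql

chain = LazyChain("sales").filter("revenue > 1000").sort("revenue", ascending=False).head(10)
print(chain.to_sql())
# SELECT * FROM sales WHERE revenue > 1000 ORDER BY revenue DESC LIMIT 10
```

Because nothing executes until the whole chain is known, the compiler can emit a single optimized query instead of materializing each intermediate step.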

Working with Files

from datastore import DataStore

# Load any file format
ds = DataStore.from_file("data.parquet")  # or CSV, JSON, ORC...

# Explore your data
print(ds.head())       # Preview first 5 rows
print(ds.shape)        # (rows, columns)
print(ds.columns)      # Column names

# Build queries with method chaining
result = (ds
    .select("product", "revenue", "date")
    .filter(ds.revenue > 1000)
    .sort("revenue", ascending=False)
    .head(10))

print(result)

Query Any Data Source

from datastore import DataStore

# S3 (with anonymous access)
ds = DataStore.uri("s3://bucket/data.parquet?nosign=true")

# MySQL
ds = DataStore.uri("mysql://user:pass@localhost:3306/mydb/users")

# PostgreSQL
ds = DataStore.uri("postgresql://user:pass@localhost:5432/mydb/products")

# And more: SQLite, MongoDB, ClickHouse, HDFS, Azure, GCS...

Pandas API Coverage

DataStore implements comprehensive pandas compatibility:

| Category | Coverage |
|----------|----------|
| DataFrame methods | 209 methods |
| Series.str accessor | 56 methods |
| Series.dt accessor | 42+ methods |
| ClickHouse SQL functions | 334 functions |

# All these pandas methods work:
df.drop(columns=['unused'])
df.fillna(0)
df.assign(revenue=lambda x: x['price'] * x['quantity'])
df.sort_values('revenue', ascending=False)
df.groupby('category').agg({'revenue': 'sum', 'quantity': 'mean'})
df.merge(other_df, on='id')
df.pivot_table(values='sales', index='date', columns='product')
# ... and 200+ more

String and DateTime Operations

# String operations via .str accessor
ds['name'].str.upper()
ds['email'].str.contains('@gmail')
ds['text'].str.replace('old', 'new')

# DateTime operations via .dt accessor  
ds['date'].dt.year
ds['date'].dt.month
ds['timestamp'].dt.hour
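Since DataStore mirrors pandas semantics, the accessor calls above should behave like their plain-pandas counterparts. A quick sanity check in pandas itself (assuming pandas is installed; the sample data is made up):

```python
import pandas as pd

# String accessor behavior DataStore is expected to match
s = pd.Series(["alice@gmail.com", "bob@work.org"])
print(s.str.upper().tolist())             # ['ALICE@GMAIL.COM', 'BOB@WORK.ORG']
print(s.str.contains("@gmail").tolist())  # [True, False]

# DateTime accessor behavior
ts = pd.Series(pd.to_datetime(["2025-01-15 09:30", "2025-06-01 18:00"]))
print(ts.dt.year.tolist())   # [2025, 2025]
print(ts.dt.month.tolist())  # [1, 6]
print(ts.dt.hour.tolist())   # [9, 18]
```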

Documentation


<br>

SQL API

For users who prefer SQL or need advanced ClickHouse features:

Run in command line

python3 -m chdb SQL [OutputFormat]

python3 -m chdb "SELECT 1,'abc'" Pretty
<br>

Data Input

The following methods are available to access on-disk and in-memory data formats:

<details> <summary><h4>🗂️ Connection based API</h4></summary>
import chdb

# Create a connection (in-memory by default)
conn = chdb.connect(":memory:")
# Or use file-based: conn = chdb.connect("test.db")

# Create a cursor
cur = conn.cursor()

# Execute queries
cur.execute("SELECT number, toString(number) as str FROM system.numbers LIMIT 3")

# Fetch data in different ways
print(cur.fetchone())    # Single row: (0, '0')
print(cur.fetchmany(2))  # Multiple rows: ((1, '1'), (2, '2'))

# Get column information
print(cur.column_names())  # ['number', 'str']
print(cur.column_types())  # ['UInt64', 'String']

# Use the cursor as an iterator
cur.execute("SELECT number FROM system.numbers LIMIT 3")
for row in cur:
    print(row)

# Always close resources when done
cur.close()
conn.close()

For more details, see examples/connect.py.

</details> <details> <summary><h4>🗂️ Query On File</h4> (Parquet, CSV, JSON, Arrow, ORC and 60+)</summary>

You can execute SQL and return data in the desired format.

import chdb
res = chdb.query('select version()', 'Pretty'); print(res)

Work with Parquet or CSV

# See more data type format in tests/format_output.py
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON'); print(res)
res = chdb.query('select * from file("data.csv", CSV)', 'CSV');  print(res)
print(f"SQL read {res.rows_read()} rows, {res.bytes_read()} bytes, storage read {res.storage_rows_read()} rows, {res.storage_bytes_read()} bytes, elapsed {res.elapsed()} seconds")

Parameterized queries

import chdb

df = chdb.query(
    "SELECT toDate({base_date:String}) + number AS date "
    "FROM numbers({total_days:UInt64}) "
    "LIMIT {items_per_page:UInt64}",
    "DataFrame",
    params={"base_date": "2025-01-01", "total_days": 10, "items_per_page": 2},
)
print(df)
#         date
# 0 2025-01-01
# 1 2025-01-02
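The binding above can be sanity-checked without chDB: `{base_date:String}` is cast to a Date, `number` (0, 1, …) is added to it, and `LIMIT` keeps the first `items_per_page` rows, so the result is consecutive dates starting at `base_date`. A plain-Python re-derivation of the expected column:

```python
from datetime import date, timedelta

# Same parameter values as the chdb.query call above
params = {"base_date": "2025-01-01", "total_days": 10, "items_per_page": 2}

base = date.fromisoformat(params["base_date"])
expected = [base + timedelta(days=n) for n in range(params["total_days"])]
expected = expected[: params["items_per_page"]]

print([d.isoformat() for d in expected])
# ['2025-01-01', '2025-01-02']
```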

Query progress (progress=auto)

import chdb

# Connection API
conn = chdb.connect(":memory:?progress=auto")
conn.query("SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4")
import chdb

# One-shot API
res = chdb.query(
    "SELECT sum(number) FROM numbers_mt(1e10) GROUP BY number % 10 SETTINGS max_threads=4",
    options={"progress": "auto"},
)

progress=auto behavior:

  • Terminal runs: textual progress updates are shown in the terminal.
  • Jupyter/Marimo notebooks: a progress bar is rendered in the notebook output.

Other progress options:

  • Progress bar:
    • progress=tty: write progress to terminal TTY.
    • progress=err: write progress to stderr.
    • progress=off: disable progress bar output.
  • Progress table (terminal output):
    • progress-table=tty: write progress table to terminal TTY.
    • progress-table=err: write progress table to stderr.
    • progress-table=off: disable progress table output.
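The settings above can be folded into a small helper for building connection strings. This is an illustrative sketch: the `make_dsn` helper is not part of chDB; only the option names and the `:memory:?key=value` DSN form come from the examples above, and joining multiple options with `&` is an assumption:

```python
# Illustrative helper: build a chDB-style DSN and reject unknown progress modes.
PROGRESS_MODES = {"auto", "tty", "err", "off"}

def make_dsn(path=":memory:", progress=None, progress_table=None):
    opts = []
    if progress is not None:
        if progress not in PROGRESS_MODES:
            raise ValueError(f"unknown progress mode: {progress!r}")
        opts.append(f"progress={progress}")
    if progress_table is not None:
        # Per the list above, the progress table has no "auto" mode.
        if progress_table not in PROGRESS_MODES - {"auto"}:
            raise ValueError(f"unknown progress-table mode: {progress_table!r}")
        opts.append(f"progress-table={progress_table}")
    return path + ("?" + "&".join(opts) if opts else "")

print(make_dsn(progress="auto"))                       # :memory:?progress=auto
print(make_dsn(progress="err", progress_table="tty"))  # :memory:?progress=err&progress-table=tty
```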

Pandas dataframe output

# See more in https://clickhouse.com/docs/en/interfaces/formats
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')
</details> <details> <summary><h4>🗂️ Query On Table</h4> (Pandas DataFrame, Parquet file/bytes, Arrow bytes) </summary>

Query On Pandas DataFrame

import chdb.dataframe as cdf
import pandas as pd
# Join 2 DataFrames
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': ["one", "two", "three"]})
df2 = pd.DataFrame({'c': [1, 2, 3], 'd': ["①", "②", "③"]})
ret_tbl = cdf.query(sql="select * from __tbl1__ t1 join __tbl2__ t2 on t1.a = t2.c",
                    tbl1=df1, tbl2=df2)
print(ret_tbl)
</details>
No findings