LakeBench

A multi-modal Python library for benchmarking lakehouse engines and ELT scenarios, supporting both industry-standard and novel benchmarks.


🌊 LakeBench

LakeBench is the first Python-based, multi-modal benchmarking framework designed to evaluate performance across multiple lakehouse compute engines and ELT scenarios. Supporting a variety of engines and both industry-standard and novel benchmarks, LakeBench enables comprehensive, apples-to-apples comparisons in a single, extensible Python library.

🚀 The Mission of LakeBench

LakeBench exists to bring clarity, trust, accessibility, and relevance to engine benchmarking by focusing on four core pillars:

  1. End-to-End ELT Workflows Matter

    Most benchmarks focus solely on analytic queries. But in practice, data engineers manage full data pipelines — loading data, transforming it (in batch, incrementally, or even streaming), maintaining tables, and then querying.

    LakeBench proposes that the entire end-to-end data lifecycle managed by data engineers is relevant, not just queries.

  2. Variety in Benchmarks Is Essential

    Real-world pipelines deal with different data shapes, sizes, and patterns. One-size-fits-all benchmarks miss this nuance.

    LakeBench covers a variety of benchmarks that represent diverse workloads — from bulk loads to incremental merges to maintenance jobs to ad-hoc queries — providing a richer picture of engine behavior under different conditions.

  3. Consistency Enables Trustworthy Comparisons

    Somehow, every engine claims to be the fastest at the same benchmark, at the same time. Without a standardized framework that supports many engines, comparisons are hard to trust and even harder to reproduce.

    LakeBench ensures consistent methodology across engines, reducing the likelihood of implementation bias and enabling repeatable, trustworthy results. Engine subject matter experts are encouraged to submit PRs to tune code as needed so that their preferred engine is best represented.

  4. Accessibility starts with pip install

    Most benchmarking toolkits are highly inaccessible to the beginner data engineer: they require the user to build the package or install it via a JAR, and they lack Python bindings.

    LakeBench is intentionally built as a Python-native library, installable via pip from PyPI, so it's easy for any engineer to get started, with no JVM or compilation required. It's so lightweight and approachable, you could even use it just for generating high-quality sample data.

✅ Why LakeBench?

  • Multi-Engine: Benchmark Spark, DuckDB, Polars, Daft, Sail and others, side-by-side
  • Lifecycle Coverage: Ingest, transform, maintain, and query—just like real workloads
  • Diverse Workloads: Test performance across varied data shapes and operations
  • Consistent Execution: One framework, many engines
  • Extensible by Design: Add engines or additional benchmarks with minimal friction
  • Dataset Generation: Out-of-the-box dataset generation for all benchmarks
  • Rich Logs: Automatically logged engine version, compute size, duration, estimated execution cost, etc.

LakeBench empowers data teams to make informed engine decisions based on real workloads, not just marketing claims.

💪 Benchmarks

LakeBench currently supports four benchmarks with more to come:

  • ELTBench: A benchmark that simulates typical ELT workloads:
    • Raw data load (Parquet → Delta)
    • Fact table generation
    • Incremental merge processing
    • Table maintenance (e.g. OPTIMIZE/VACUUM)
    • Ad-hoc analytical queries
  • TPC-DS: An industry-standard benchmark for complex analytical queries, featuring 24 source tables and 99 queries. Designed to simulate decision support systems and analytics workloads.
  • TPC-H: Focuses on ad-hoc decision support with 8 tables and 22 queries, evaluating performance on business-oriented analytical workloads.
  • ClickBench: A benchmark that simulates ad-hoc analytical and real-time queries on clickstream, traffic analysis, web analytics, machine-generated data, structured logs, and events data. The load phase (single flat table) is followed by 43 queries.

Planned

  • TPC-DI: An industry-standard benchmark for data integration workloads, evaluating end-to-end ETL/ELT performance across heterogeneous sources—including data ingestion, transformation, and loading processes.

⚙️ Engine Support Matrix

LakeBench supports multiple lakehouse compute engines. Each benchmark scenario declares which engines it supports via <BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY.

| Engine          | ELTBench | TPC-DS | TPC-H | ClickBench |
|-----------------|:--------:|:------:|:-----:|:----------:|
| Spark (Generic) | ✅ | ✅ | ✅ | ✅ |
| Fabric Spark    | ✅ | ✅ | ✅ | ✅ |
| Synapse Spark   | ✅ | ✅ | ✅ | ✅ |
| HDInsight Spark | ✅ | ✅ | ✅ | ✅ |
| DuckDB          | ✅ | ✅ | ✅ | ✅ |
| Polars          | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Daft            | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Sail            | ✅ | ✅ | ✅ | ✅ |

Legend:
✅ = Supported
⚠️ = Some queries fail due to syntax issues (e.g. Polars doesn't support SQL non-equi joins, and Daft is missing many standard SQL constructs such as DATE_ADD, CROSS JOIN, subqueries, non-equi joins, and CASE with operand).
🔜 = Coming Soon
(Blank) = Not currently supported

For detailed pass rates and per-query failure analysis, see the coverage reports.
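
For readers unfamiliar with the term used in the legend, a non-equi join is a join whose condition uses a comparison other than equality, such as a range check. The sketch below is only an illustration of the SQL shape; it uses Python's built-in sqlite3 module because it ships with the standard library, not because it is a LakeBench engine, and the `orders` and `tiers` tables are hypothetical.

```python
import sqlite3

# In-memory database with two small, made-up tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL);
    CREATE TABLE tiers (tier TEXT, min_amount REAL, max_amount REAL);
    INSERT INTO orders VALUES (1, 25.0), (2, 150.0), (3, 900.0);
    INSERT INTO tiers VALUES
        ('small', 0, 100), ('medium', 100, 500), ('large', 500, 1e9);
""")

# Non-equi join: the ON clause matches rows by range, not by equality
rows = conn.execute("""
    SELECT o.order_id, t.tier
    FROM orders AS o
    JOIN tiers AS t
      ON o.amount >= t.min_amount AND o.amount < t.max_amount
    ORDER BY o.order_id
""").fetchall()

print(rows)  # [(1, 'small'), (2, 'medium'), (3, 'large')]
```

An engine whose SQL layer only accepts equality conditions in the ON clause would reject a query of this shape, which is the kind of failure the ⚠️ marker refers to.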

📊 Engine Coverage Reports

Per-engine coverage reports are auto-generated by the integration test suite and show pass rates with individual query failure details.
To refresh: run the integration tests for your engine of choice (see tests/integration/README.md).

| Engine | Report |
|--------|--------|
| DuckDB | reports/coverage/duckdb.md |
| Polars | reports/coverage/polars.md |
| Daft   | reports/coverage/daft.md |
| Spark  | reports/coverage/spark.md |
| Sail   | reports/coverage/sail.md |

Where Can I Run LakeBench?

Multi-modality doesn't end at benchmarks and engines. LakeBench also supports different runtimes and storage backends:

Runtimes:

  • Local (Windows)
  • Fabric
  • Synapse
  • HDInsight
  • Google Colab ⚠️

Storage Systems:

  • Local filesystem (Windows)
  • OneLake
  • ADLS Gen2 (temporarily only in Fabric, Synapse, and HDInsight)
  • S3 ⚠️
  • GS ⚠️

* ⚠️ denotes experimental runtimes and storage backends

What Table Formats Are Supported?

LakeBench currently only supports Delta Lake.

🔌 Extensibility by Design

LakeBench is designed to be extensible, both for additional engines and benchmarks.

  • You can register new engines without modifying core benchmark logic.
  • You can add new benchmarks that reuse existing engines and shared engine methods.
  • LakeBench extension libraries can be created to extend core LakeBench capabilities with additional custom benchmarks and engines (e.g. MyCustomSynapseSpark(Spark), MyOrgsELT(BaseBenchmark)).

New engines can be added by subclassing an existing engine class. Existing benchmarks can then register support for additional engines as follows:

from lakebench.benchmarks import TPCDS
TPCDS.register_engine(MyNewEngine, None)

register_engine is a class method that updates <BenchmarkClassName>.BENCHMARK_IMPL_REGISTRY. It takes two arguments: the engine class being registered and, if required, the engine-specific benchmark implementation class. Passing None instead falls back to the methods of the generic engine class.
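
To make the registry idea concrete, here is a minimal, self-contained sketch of how a class-level registry with a register_engine class method could work. It is not LakeBench's implementation: `ToyBenchmark`, `ToyEngine`, `MyEngine`, and `Unregistered` are hypothetical names, and the lookup along the engine's class hierarchy is an assumption made so that subclasses of a registered engine are accepted without registering them separately.

```python
class ToyEngine:
    """Hypothetical stand-in for a generic engine class."""


class ToyBenchmark:
    """Hypothetical benchmark holding a class-level engine registry."""

    # Maps engine class -> engine-specific implementation class (or None)
    BENCHMARK_IMPL_REGISTRY = {}

    def __init__(self, engine):
        # Walk the engine's class hierarchy so that subclasses of a
        # registered engine are accepted too (an assumption of this sketch)
        for engine_cls in type(engine).__mro__:
            if engine_cls in self.BENCHMARK_IMPL_REGISTRY:
                self.impl_cls = self.BENCHMARK_IMPL_REGISTRY[engine_cls]
                break
        else:
            raise ValueError(
                f"{type(engine).__name__} is not registered for "
                f"{type(self).__name__}"
            )
        self.engine = engine

    @classmethod
    def register_engine(cls, engine_cls, impl_cls=None):
        # None means "use the generic engine's own methods"
        cls.BENCHMARK_IMPL_REGISTRY[engine_cls] = impl_cls


class MyEngine(ToyEngine):
    """Subclass of a registered engine; needs no registration of its own."""


class Unregistered:
    """Engine-like class that was never registered."""


ToyBenchmark.register_engine(ToyEngine, None)
bench = ToyBenchmark(MyEngine())
print(bench.impl_cls)  # None
```

In this sketch, constructing `ToyBenchmark(Unregistered())` raises a ValueError, because no class in its hierarchy appears in the registry.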

This architecture encourages experimentation, benchmarking innovation, and easy adaptation.

Example:

from lakebench.engines import BaseEngine

class MyCustomEngine(BaseEngine):
    ...

from lakebench.benchmarks.elt_bench import ELTBench
# registering the engine is only required if you aren't subclassing an existing registered engine
ELTBench.register_engine(MyCustomEngine, None)

benchmark = ELTBench(engine=MyCustomEngine(...))
benchmark.run()

Using LakeBench

📦 Installation

Install from PyPI:

pip install lakebench[duckdb,polars,daft,tpcds_datagen,tpch_datagen,sparkmeasure]
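
After installing, it can be handy to confirm which optional engine packages actually landed in the environment. The sketch below uses only the standard library; `installed_modules` is a hypothetical helper, and it assumes the import names `duckdb`, `polars`, and `daft` match the extras of the same name.

```python
from importlib.util import find_spec


def installed_modules(names):
    """Return {module_name: True/False} depending on whether it is importable."""
    # find_spec returns None for a top-level module that cannot be found
    return {name: find_spec(name) is not None for name in names}


status = installed_modules(["lakebench", "duckdb", "polars", "daft"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Any package reported as missing can be added by re-running the pip command with the corresponding extra.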

Example Usage

To run any LakeBench benchmark, first do a one-time generation of the data required for the benchmark and scale factor of interest. LakeBench provides datagen classes to quickly generate the Parquet datasets required by the benchmarks.

Data Generation

  • TPC-H data generation is provided via the [tpchgen-rs](https://github.com/clflushopt/tpchgen-rs) project. The project is currently roughly 10x (or more) faster than the next closest method of generating TPC-H datasets. The TPC-DS version of the project is currently under development.

    The table below shows generation runtimes on a 64 v-core VM writing to OneLake. Scale factors below 1000 can easily be generated on a 2 v-core machine.

    | Scale Factor | Duration (hh:mm:ss) |
    |:------------:|:-------------------:|
    | 1            | 00:00:04            |
    | 10           | 00:00:09            |
    | 100          | 00:01:09            |
    | 1000         | 00:10:15            |

  • **TPC-DS
