ClickBench: a Benchmark For Analytical Databases
https://benchmark.clickhouse.com/
Discussion on Hacker News: https://news.ycombinator.com/item?id=32084571
Overview
This benchmark represents a typical workload in the following areas: clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. It covers the typical queries in ad-hoc analytics and real-time dashboards.
The dataset in this benchmark was obtained from an actual traffic recording of one of the world's largest web analytics platforms. It is anonymized while keeping all the essential distributions of the data. The set of queries was improvised to reflect realistic workloads (the queries are not taken directly from production).
Goals
The main goals of this benchmark are:
Reproducibility
The benchmark allows every test result to be reproduced quickly, in as little as 20 minutes (although some systems may take several hours), in a semi-automated way. The test setup is documented and uses inexpensive cloud VMs. The test process is documented in the form of a shell script covering the installation of every system, loading of the data, running the workload, and collecting the result numbers. The dataset is published and made available for download in multiple formats.
Compatibility
The tables and queries use mostly standard SQL and require minimal or no adaptation for most SQL DBMS. The dataset has been filtered to avoid difficulties with parsing and loading.
Diversity
The benchmark process is easy enough to cover a wide range of systems, including:
- modern and historical self-managed OLAP DBMS,
- traditional OLTP DBMS (for comparison baseline),
- managed database-as-a-service offerings,
- as well as serverless cloud-native databases,
- some NoSQL databases,
- document databases,
- and specialized time-series databases
for reference, even if they don't specialize in the ClickBench workload.
Realism
The dataset is derived from production data. The realistic data distributions allow evaluating compression, indices, codecs, custom data structures, etc., which is not possible with most random dataset generators. The workload consists of 43 queries and tests the efficiency of full scans and filtered scans, as well as index lookups and the main relational operations. It can also test various aspects of hardware: some queries require high storage throughput; some benefit from a large number of CPU cores, some from single-core speed; and some benefit from high main memory bandwidth.
Limitations
The limitations of this benchmark make it easy to reproduce and to include more systems in the comparison. The benchmark represents only a subset of all possible workloads and scenarios. While it aims to be as fair as possible, focusing on a specific subset of workloads may give an advantage to systems that specialize in those workloads.
Note these limitations:
- The dataset is a single flat table. This is different from classical data warehouses, which use a normalized star or snowflake data model. Therefore, classical data warehouses may be at an unfair disadvantage in ClickBench.
- The table consists of exactly 99'997'497 records. This is rather small by modern standards but allows tests to be completed in a reasonable time.
- While the benchmark allows testing distributed systems, and it includes multi-node and serverless cloud-native setups, most of the results so far have been obtained on single-node setups.
- The benchmark runs its queries one after another; it does not test workloads with concurrent queries, nor does it test system capacity. Every query is run only a few times, which allows for some variability in the results.
- Many setups and systems are different enough to make direct comparison tricky. For example, it is not possible to test the efficiency of storage for in-memory databases, or the time of data loading for stateless query engines. The goal of the benchmark is to produce numbers; you need to interpret them on your own.
Tl;dr: All Benchmarks Are ~~Bastards~~ Liars.
Rules and Contribution
How To Add a New Result
To add a new entry, copy-paste one of the existing directories and edit the files accordingly:
- `benchmark.sh`: the main script, which runs the benchmark on a fresh VM; Ubuntu 24.04 or newer should be used by default. For databases that can be installed locally, the script should run in a fully automated manner so it can be used in the benchmark automation (cloud-init). It should output the results in the following format:
  - one or more lines `Load time: 1234` with the time in seconds;
  - a line `Data size: 1234567890` with the data size in bytes; the data size should include indexes and transaction logs if applicable;
  - 43 consecutive lines in the form of `[1.234, 5.678, 9.012],` with the runtimes of every query;
  - the output may include other lines with logs, which are not used for the report.

  For managed databases, if the setup requires clicking in a UI, write a `README.md` instead.
- `README.md`: contains comments and observations if needed. For managed databases, it can describe the setup procedure to be used instead of a shell script.
- `create.sql`: a CREATE TABLE statement. If it's a NoSQL system, another file like `wtf.json` can be used instead.
- `queries.sql`: contains the 43 ClickBench queries to run.
- `run.sh`: a loop that runs the queries; every query is run three times, see the section "Caching" below for details.
- `results/`: put the `.json` files with the results for every hardware configuration into this directory. Please double-check that each file is valid JSON (e.g., no comma errors).
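A minimal sketch of a `run.sh`-style timing loop that produces output in the expected format. The `run_query` helper here is a hypothetical stand-in for invoking the real database client; in an actual script it would run one query from `queries.sql`:

```shell
#!/bin/bash
# Sketch of the expected output format; run_query is a hypothetical
# stand-in for invoking the real database client with one query.
run_query() { sleep 0.01; }

# Print one result line: three runtimes of a single query, in brackets.
report_query_times() {
    local times=()
    for _ in 1 2 3; do
        local start end
        start=$(date +%s.%N)
        run_query
        end=$(date +%s.%N)
        times+=("$(awk -v s="$start" -v e="$end" 'BEGIN{printf "%.3f", e-s}')")
    done
    echo "[${times[0]}, ${times[1]}, ${times[2]}],"
}

echo "Load time: 42"          # load time in seconds
echo "Data size: 1234567890"  # data size in bytes, including indexes
report_query_times            # in a real run: once per each of the 43 queries
```

The logs of a real run may contain extra lines; only lines matching the format above are picked up for the report.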
To introduce a new result for an existing system for a different hardware configuration, add a new file to results.
To introduce a new result for an existing system with a different usage scenario, either copy the whole directory and name it differently (e.g. timescaledb, timescaledb-compression) or add a new file to the results directory.
index.html can be re-generated using ./generate-results.sh. The CI (GitHub Actions) does this automatically, so this step is optional.
All tests were originally run on AWS c6a.4xlarge EC2 VMs with 500 GB gp2 disks. With better automation, more EC2 machines were added later: c6a.2xlarge, c6a.metal, c8g.4xlarge, c6a.xlarge, c7a.metal-48xl, c6a.large, c8g.metal-48xl, and t3a.small. These represent older and modern machines, and small / medium / large systems (CPU and main memory).
Please help us add more systems and run the benchmarks on more types of VMs.
Installation And Fine-Tuning
The systems can be installed or used in any reasonable way: from a binary distribution, from a Docker container, from the package manager, or compiled - whatever is more natural and simple or gives better results.
It's better to use the default settings and avoid fine-tuning. Configuration changes can be applied if it is considered strictly necessary and documented.
Fine-tuning and optimization for the benchmark are not recommended but allowed. In this case, add the results for the vanilla configuration and the tuned results separately (e.g. 'MyDatabase' and 'MyDatabase-tuned').
Data Loading
The dataset is available in CSV, TSV, JSONlines, and Parquet formats at the following links:
- https://datasets.clickhouse.com/hits_compatible/hits.csv.gz
- https://datasets.clickhouse.com/hits_compatible/hits.tsv.gz
- https://datasets.clickhouse.com/hits_compatible/hits.json.gz
- https://datasets.clickhouse.com/hits_compatible/hits.parquet
You can select whichever dataset format is most suitable for your database, at your discretion.
Additional sources for stateless table engines are provided:
- https://datasets.clickhouse.com/hits_compatible/athena/hits.parquet (the same parquet file in its own subdirectory)
- https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{0..99}.parquet (100 files)
The datasets are intentionally "dirty", e.g. the Parquet files have no proper logical data types and no bloom filter indexes. This reflects the real-world nature of the benchmark better (also see https://github.com/ClickHouse/ClickBench/issues/7#issuecomment-3538940698).
To compare insertion times correctly, the dataset should be downloaded and decompressed before loading (if it uses external compression; the Parquet file uses internal compression and can be loaded as is). The dataset should be loaded as a single file in the most straightforward way. Splitting the dataset for parallel loading is not recommended, as it will make comparisons more difficult. Splitting the dataset is possible if the system cannot eat it as a whole due to limitations.
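The decompress-before-timing rule can be sketched as follows. To keep the example self-contained and runnable, a tiny stand-in file is generated locally instead of downloading the real `hits.tsv.gz` from the URLs above, and `wc -l` stands in for the actual INSERT step of your database client:

```shell
# Create a tiny stand-in for the downloaded hits.tsv.gz archive.
printf '1\tabc\n2\tdef\n' > hits.tsv
gzip -f hits.tsv

# Decompress BEFORE starting the timer, so only the load itself is measured
# (the Parquet file has internal compression and needs no such step).
gzip -df hits.tsv.gz

start=$(date +%s.%N)
wc -l < hits.tsv > /dev/null   # stand-in for the actual single-file INSERT
end=$(date +%s.%N)
awk -v s="$start" -v e="$end" 'BEGIN{printf "Load time: %.0f\n", e-s}'
```

In a real `benchmark.sh`, only the timed load step goes between the two timestamps; download and decompression happen earlier and are not reported.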
You should not wait for a cool-down period after data loading, or run OPTIMIZE / VACUUM before the main benchmark queries, unless the database strictly requires it.
The used storage size can be measured without accounting for temporary data that will be removed in the background. Built-in introspection capabilities can be used to measure the storage size, or it can be measured by checking the used space in the filesystem.
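A filesystem-based measurement might look like this sketch, where `DATA_DIR` is a placeholder for the database's data directory (e.g. `/var/lib/clickhouse` for ClickHouse); it defaults to the current directory here only so the sketch runs anywhere:

```shell
# Report the on-disk size of the data directory in bytes, as required
# for the "Data size" line. DATA_DIR is a placeholder for the database's
# actual data directory.
DATA_DIR=${DATA_DIR:-.}
SIZE=$(du -bs "$DATA_DIR" | cut -f1)   # -b: bytes (GNU du), -s: summarize
echo "Data size: $SIZE"
```

Built-in introspection (e.g. a system table that reports table sizes) is equally acceptable, as long as indexes and transaction logs are included where applicable.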
Indexing
The benchmark table has one index: the primary key. The primary key is not necessarily unique. The primary key index can be clustered (table sorting) or non-clustered (an additional data structure, e.g. a B-tree or hash-based index).
Manual creation of other indices is not recommended, although if a database creates additional indexes automatically, that is considered ok.
Preaggregation
The creation of pre-aggregated tables or indices, projections, or materialized views is not recommended for the purpose of this benchmark. Although you can add a fine-tuned setup and its results for reference, they will be out of competition.
If a system is of a "multidimensional OLAP" kind, and so is always or implicitly doing aggregations, it can be a
