# Time Series Benchmark Suite (TSBS)
This repo contains code for benchmarking several time series databases, including TimescaleDB, MongoDB, InfluxDB, CrateDB and Cassandra. This code is based on a fork of work initially made public by InfluxDB at https://github.com/influxdata/influxdb-comparisons.
Current databases supported:
- Akumuli (supplemental docs)
- Cassandra (supplemental docs)
- ClickHouse (supplemental docs)
- CrateDB (supplemental docs)
- InfluxDB (supplemental docs)
- MongoDB (supplemental docs)
- QuestDB (supplemental docs)
- SiriDB (supplemental docs)
- TimescaleDB (supplemental docs)
- Timestream (supplemental docs)
- VictoriaMetrics (supplemental docs)
## Overview
The Time Series Benchmark Suite (TSBS) is a collection of Go programs that are used to generate datasets and then benchmark read and write performance of various databases. The intent is to make the TSBS extensible so that a variety of use cases (e.g., devops, IoT, finance, etc.), query types, and databases can be included and benchmarked. To this end we hope to help prospective database administrators find the best database for their needs and their workloads. Further, if you are the developer of a time series database and want to include your database in the TSBS, feel free to open a pull request to add it!
## Current use cases
Currently, TSBS supports two use cases.
### Dev ops
The 'dev ops' use case comes in two forms. The full form generates, inserts, and measures data from 9 'systems' that could be monitored in a real-world dev ops scenario (e.g., CPU, memory, disk, etc.). Together, these 9 systems generate 100 metrics per reading interval. The alternate form focuses solely on CPU metrics for a simpler, more streamlined use case; it generates 10 CPU metrics per reading.
In addition to metric readings, 'tags' (including the location
of the host, its operating system, etc) are generated for each host
with readings in the dataset. Each unique set of tags identifies
one host in the dataset and the number of different hosts generated is
defined by the scale flag (see below).
### Internet of Things (IoT)
The second use case is meant to simulate the data load in an IoT environment. This use case simulates data streaming from a set of trucks belonging to a fictional trucking company. This use case simulates diagnostic data and metrics from each truck, and introduces environmental factors such as out-of-order data and batch ingestion (for trucks that are offline for a period of time). It also tracks truck metadata and uses this to tie metrics and diagnostics together as part of the query set.
The queries that are generated as part of this use case will cover both real time truck status and analytics that will look at the time series data in an effort to be more predictive about truck behavior. The scale factor with this use case will be based on the number of trucks tracked.
Not all databases implement all use cases. The table below shows which use cases are implemented for each database:
|Database|Dev ops|IoT|
|:---|:---:|:---:|
|Akumuli|X¹||
|Cassandra|X||
|ClickHouse|X||
|CrateDB|X||
|InfluxDB|X|X|
|MongoDB|X||
|QuestDB|X|X|
|SiriDB|X||
|TimescaleDB|X|X|
|Timestream|X||
|VictoriaMetrics|X²||
¹ Does not support the groupby-orderby-limit query
² Does not support the groupby-orderby-limit, lastpoint, high-cpu-1, high-cpu-all queries
## What the TSBS tests
TSBS is used to benchmark bulk load performance and
query execution performance. (It currently does not measure
concurrent insert and query performance, which is a future priority.)
To accomplish this in a fair way, the data to be inserted and the
queries to run are pre-generated and native Go clients are used
wherever possible to connect to each database (e.g., mgo for MongoDB,
aws sdk for Timestream).
Although the data is randomly generated, TSBS data and queries are entirely deterministic. By supplying the same PRNG (pseudo-random number generator) seed to the generation programs, each database is loaded with identical data and queried using identical queries.
## Installation
TSBS is a collection of Go programs (with some auxiliary bash and Python
scripts). The easiest way to get and install the Go programs is to use
`go get` and then `make` to build and install all binaries:
```bash
# Fetch TSBS and its dependencies
$ go get github.com/timescale/tsbs
$ cd $GOPATH/src/github.com/timescale/tsbs
$ make
```
## How to use TSBS
Using TSBS for benchmarking involves 3 phases: data and query generation, data loading/insertion, and query execution.
### Data and query generation
So that benchmarking results are not affected by generating data or queries on the fly, TSBS has you generate the data and queries first; you can then (re)use them as input to the loading and query-execution phases.
#### Data generation
Variables needed:
- a use case. E.g., `iot` (choose from `cpu-only`, `devops`, or `iot`)
- a PRNG seed for deterministic generation. E.g., `123`
- the number of devices/trucks to generate for. E.g., `4000`
- a start time for the data's timestamps. E.g., `2016-01-01T00:00:00Z`
- an end time. E.g., `2016-01-04T00:00:00Z`
- the interval between readings per device. E.g., `10s`
- the database(s) you want to generate for. E.g., `timescaledb` (choose from `cassandra`, `clickhouse`, `cratedb`, `influx`, `mongo`, `questdb`, `siridb`, `timescaledb`, or `victoriametrics`)
Given the above, you can now generate a dataset (or multiple
datasets, if you chose to generate for multiple databases) that can
be used to benchmark data loading of the chosen database(s) using
the `tsbs_generate_data` tool:
```bash
$ tsbs_generate_data --use-case="iot" --seed=123 --scale=4000 \
    --timestamp-start="2016-01-01T00:00:00Z" \
    --timestamp-end="2016-01-04T00:00:00Z" \
    --log-interval="10s" --format="timescaledb" \
    | gzip > /tmp/timescaledb-data.gz

# Each additional database would be a separate call.
```
Note: We pipe the output to gzip to reduce on-disk space. This also requires you to pipe through gunzip when you run your tests.
The example above generates a pseudo-CSV file that can be used to bulk load data into TimescaleDB. Each database has its own format that makes it easiest for its corresponding loader to write data. The above configuration will generate just over 100M rows (1B metrics), which is usually a good starting point. Each additional day of simulated time adds roughly 33M rows, so that, e.g., 30 days would yield a billion rows (10B metrics).
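The row counts quoted above follow from simple arithmetic: one row per device per log interval, with roughly 10 metrics per row in this configuration. A quick sanity check in Go, using the flags from the example:

```go
package main

import "fmt"

func main() {
	const (
		devices      = 4000 // --scale
		intervalSecs = 10   // --log-interval
		days         = 3    // 2016-01-01 through 2016-01-04
	)
	readingsPerDevice := days * 86400 / intervalSecs // intervals in the time range
	rows := devices * readingsPerDevice
	fmt.Println(rows) // 103680000, i.e. "just over 100M rows"
}
```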
#### IoT use case
The main difference between the iot use case and the others is that
it generates data which can contain out-of-order, missing, or empty
entries to better represent the real-life scenarios associated with the use case.
Using a specified seed means that this is done in a deterministic and
reproducible way across multiple runs of data generation.
#### Query generation
Variables needed:
- the same use case, seed, number of devices, and start time as used in data generation
- an end time that is one second after the data generation end time. E.g., for `2016-01-04T00:00:00Z`, use `2016-01-04T00:00:01Z`
- the number of queries to generate. E.g., `1000`
- the type of query you'd like to generate. E.g., `single-groupby-1-1-1` or `last-loc`
For the last step there are numerous queries to choose from, which are
listed in Appendix I. Additionally, the file
`scripts/generate_queries.sh` contains a list of all of them as the
default value for the environment variable `QUERY_TYPES`. If you are
generating more than one type of query, we recommend you use the
helper script.
For generating just one set of queries for a given type:
```bash
$ tsbs_generate_queries --use-case="iot" --seed=123 --scale=4000 \
    --timestamp-start="2016-01-01T00:00:00Z" \
    --timestamp-end="2016-01-04T00:00:01Z" \
    --queries=1000 --query-type="breakdown-frequency" --format="timescaledb" \
    | gzip > /tmp/timescaledb-queries-breakdown-frequency.gz
```
Note: We pipe the output to gzip to reduce on-disk space. This also requires you to pipe through gunzip when you run your tests.
For generating sets of queries for multiple types:
```bash
$ FORMATS="timescaledb" SCALE=4000 SEED=123 \
    TS_START="2016-01-01T00:00:00Z" \
    TS_END="2016-01-04T00:00:01Z" \
    QUERIES=1000 QUERY_TYPES="last-loc low-fuel avg-load" \
    BULK_DATA_DIR="/tmp/bulk_queries" scripts/generate_queries.sh
```
A full list of query types can be found in Appendix I at the end of this README.
### Benchmarking insert/write performance
TSBS has two ways to benchmark insert/write performance:
- on-the-fly simulation and load with `tsbs_load`
- pre-generating data to a file and loading it with either `tsbs_load` or the database-specific `tsbs_load_*` executables
#### Using the unified `tsbs_load` executable
The `tsbs_load` executable can load data into any of the supported databases.
It can use a pregenerated data file as input, or simulate the data on the
fly.
Start by generating a YAML config file populated with the default values for each property:
```bash
$ tsbs_load config --target=<db-name> --data-source=[FILE|SIMULATOR]
```
For example, to generate an example config for TimescaleDB, loading the data from a file:
```bash
$ tsbs_load config --target=timescaledb --data-source=FILE
Wrote example config to: ./config.yaml
```
You can then run `tsbs_load` with the generated config file:
```bash
$ tsbs_load load timescaledb --config=./config.yaml
```
For more details on how to use `tsbs_load`, check out the supplemental docs.