Unopinionated DataScience Benchmark

A comparison of most commonly used Data-Science Python packages and their alternatives. Generally, those alternatives have identical Python interfaces, but come with Multi-Threaded CPU or even GPU backends, implemented in C++, CUDA, Rust and other low-level languages.

To run the default configuration for every folder - the procedure is similar:

cd x
conda env create -f env.yml
python bench.py

Matrices

For Linear Algebra and Digital Signal Processing we synthetically generate square random matrices, mainly of with single-precision floating point numbers. That is different from the default Pythons float that uses the 64-bit representation, more commonly described as double in C-like languages. Participating packages:

NumPy over BLIS
NumPy over OpenBLAS
NumPy over Intel MKL and One API
CuPy over CuBLAS

Graphs or Networks

For Graph Theoretical and Network Science workloads we pick various commonly used datasets from the Stanford Network Repository. All ranging under 1 MB to over 1 GB and 100 million edges. Participating packages:

NetworkX
RetworkX
CuGraph

Tabular Data: NYC Taxi Rides

We took the NYC Taxi Rides dataset as our primary dataset and run the classical 4-query benchmark on its subsets. Participating packages:

Pandas
Modin
CuDF
Dask-CuDF
SQLite
Apache DataFusion

Udsb

Install / Use

README

Unopinionated DataScience Benchmark

Matrices

Graphs or Networks

Tabular Data: NYC Taxi Rides