1BRC: One Billion Row Challenge in Python

Python implementation of Gunnar's 1 billion row challenge:

https://www.morling.dev/blog/one-billion-row-challenge
https://github.com/gunnarmorling/1brc

Creating the measurements file with 1B rows

First install the Python requirements:

python3 -m pip install -r requirements.txt

The script createMeasurements.py will create the measurement file:

usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]

Create measurement file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Measurement file name (default is "measurements.txt")
  -r RECORDS, --records RECORDS
                        Number of records to create (default is 1_000_000_000)

Example:

% python3 createMeasurements.py
Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
100%|█████████████████████████████████████████| 100/100 [01:15<00:00,  1.32it/s]
Created file 'measurements.txt' with 1,000,000,000 measurements in 75.86 seconds

Be patient as it can take more than a minute to have the file generated.

Maybe as another challenge is to speed up the generation of the measurements file :slightly_smiling_face:

Performance (on a MacBook Pro M1 32GB)

| Interpreter | Script | user | system | cpu | total | | ----------- | ------ | ---- | ------ | --- | ----- | | python3 | calculateAveragePolars.py | 77.84 | 3.64 | 703% | 11.585 | | pypy3 | calculateAveragePypy.py | ~~139.15~~ 135.25 | ~~3.02s~~ 2.92 | ~~699%~~ 735% | ~~20.323~~ 18.782 | | python3 | calculateAverageDuckDB.py | 186.78 | 4.21 | 806% | 23.673 | | pypy3 | calculateAverage.py | ~~284.90~~ 242.89 | ~~9.12~~ 6.28 | ~~749%~~ 780% | ~~39.236~~ 31.926 | | python3 | calculateAverage.py | ~~378.54~~ 329.20 | ~~6.94~~ 3.77 | ~~747%~~ 793% | ~~51.544~~ 41.941 | | python3 | calculateAveragePypy.py | ~~573.77~~ 510.93 | ~~2.70~~ 1.88 | ~~787%~~ 793% | ~~73.170~~ 64.660 |

The script calculateAveragePolars.py was suggested by Taufan on this post.

The script calculateAveragePypy.py was created by donalm, a +2x improved version of the initial script (calculateAverage.py) when running in pypy3, even capable of beating the implementation using DuckDB calculateAverageDuckDB.py.

Olivier Scalbert has made a simple but incredible suggestion where performance increased by an average of 15% (table above has been updated), thank you :slightly_smiling_face:

His suggestions were to change from:

if measurement < result[location][0]:
    result[location][0] = measurement
if measurement > result[location][1]:
    result[location][1] = measurement
result[location][2] += measurement
result[location][3] += 1

to:

_result = result[location]
if measurement < _result[0]:
    _result[0] = measurement
if measurement > _result[1]:
    _result[1] = measurement
_result[2] += measurement
_result[3] += 1

Python can be surprising sometimes.

Compare results

Run compare.sh if you want to check that all the scripts produce the same output.

1brc

Install / Use

README

1BRC: One Billion Row Challenge in Python

Creating the measurements file with 1B rows

Performance (on a MacBook Pro M1 32GB)

Compare results