StringWars

Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, edit distances, sketching, and sorting across CPUs and GPUs in Rust 🦀 and Python 🐍.

Text Processing on CPUs & GPUs, in Python & Rust

There are many great libraries for string processing! Most, of course, are written in Assembly, C, and C++, but some are in Rust as well.

Where Rust decimates C and C++ is in the simplicity of dependency management, which makes it great for benchmarking "Systems Software" and lining up apples-to-apples comparisons across native crates and their Python bindings. So, to accelerate the development of the StringZilla C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my and the community's most beloved Rust projects.

Of course, the functionality of the projects differs, as do the APIs and usage patterns. So I focus on the workloads for which StringZilla was designed and compare the throughput of the core operations. Notably, I also favor modern hardware with support for a wider range of SIMD instructions, like the mask-equipped AVX-512 on x86, available since Intel's Skylake-X CPUs, or the more recent predicated variable-length SVE and SVE2 on Arm, which aren't often supported by existing libraries and tooling.

> [!IMPORTANT]
> The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method. Most of them were obtained on Intel Sapphire Rapids (SPR) and Granite Rapids (GNR) CPUs and Nvidia Hopper-based H100 and Blackwell-based RTX 6000 Pro GPUs, using Rust with the -C target-cpu=native optimization flag. To replicate the results, please refer to the Replicating the Results section below.

Benchmarks at a Glance

Hash

Many hashing libraries exist, but they often lack reproducible outputs, streaming support, or cross-language availability. Throughput on short words and long lines:

                    Short Words                  Long Lines
Rust:
stringzilla::hash   ████████████████████ 1.84    ████████████████████ 11.38 GB/s
aHash::hash_one     █████████████▍       1.23    ███████████████▏      8.61 GB/s
xxh3::xxh3_64       ███████████▊         1.08    ████████████████▋     9.48 GB/s
std::hash           ████▋                0.43    ██████▌               3.74 GB/s

Python:
stringzilla.hash    ████████████████████ 0.14    ████████████████████  9.19 GB/s
hash                ██████████████████▌  0.13    █████████▎            4.27 GB/s
xxhash.xxh3_64      █████▋               0.04    █████████████▉        6.38 GB/s

See hash/README.md for details
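The chart above measures one-shot throughput, but the streaming support mentioned earlier is a correctness property too: a hasher must produce the same digest whether it sees the input at once or in chunks. A minimal stdlib-only sketch of that property, using `hashlib.blake2b` as a stand-in rather than any of the benchmarked hashers:

```python
import hashlib

def one_shot_digest(data: bytes) -> str:
    # Hash the whole buffer in a single call.
    return hashlib.blake2b(data).hexdigest()

def streaming_digest(chunks) -> str:
    # Feed the buffer incrementally; a well-behaved streaming hasher
    # must produce the same digest as the one-shot call.
    h = hashlib.blake2b()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

data = b"The quick brown fox jumps over the lazy dog"
assert one_shot_digest(data) == streaming_digest([data[:10], data[10:]])
```

The same invariant is what lets a streaming hasher process data larger than memory without changing its output.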

Case-Insensitive UTF-8 Search

Unicode-aware case-insensitive search with full case folding (ß↔SS, σ↔ς). Throughput searching across ~100MB multilingual corpora:

Rust:
                      English                      German
stringzilla           ████████████████████ 12.79   ████████████████████ 10.67 GB/s
icu                   ▏                     0.08   ▏                     0.08 GB/s

                      Russian                      Korean
stringzilla           ████████████████████  7.12   ████████████████████ 35.10 GB/s
icu                   ▏                     0.14   ▏                     0.23 GB/s

Python:
                      English                      German
stringzilla           ████████████████████  5.61   ████████████████████  6.08 GB/s
regex                 ██▋                   0.77   ███                   0.90 GB/s

                      Russian                      Korean
stringzilla           ████████████████████  5.70   ████████████████████ 20.05 GB/s
regex                 ████████              2.30   ████▋                 4.59 GB/s

See unicode/README.md for details
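The "full case folding" requirement above is stricter than lowercasing: folding may change string length. Python's stdlib exposes the same Unicode semantics through `str.casefold`, which makes for a simple (and much slower) baseline sketch of the benchmarked operation:

```python
# Unicode full case folding: 'ß' folds to 'ss', and final sigma 'ς'
# folds to the same letter as 'σ' - exactly the pairs cited above.
assert "ß".casefold() == "ss"
assert "ς".casefold() == "σ".casefold() == "σ"

def casefold_find(haystack: str, needle: str) -> bool:
    # Naive baseline: fold both sides, then do an exact search.
    # Note: folding changes lengths (ß -> ss), so match offsets in
    # the folded text do not map 1:1 back to the original string.
    return needle.casefold() in haystack.casefold()

assert casefold_find("STRASSE und Straße", "strasse")
```

The offset-mapping caveat in the comment is one reason a dedicated implementation is harder than "fold, then memmem".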

Exact Substring Search

Substring search is offloaded to C's memmem or strstr in most languages, but SIMD-optimized implementations can do better. Throughput on long lines:

                    Left to right                Reverse order
Rust:
memmem::Finder      ████████████████████ 10.99
stringzilla         ███████████████████▋ 10.82   ████████████████████ 10.66 GB/s
std::str            ███████████████████▊ 10.88   ███████████▏          5.94 GB/s

Python:
stringzilla         ████████████████████ 11.79   ████████████████████ 11.56 GB/s
str                 ██                    1.23   ██████▋               3.84 GB/s

See find/README.md for details
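The two columns above correspond to forward and reverse search, which Python exposes as `str.find` and `str.rfind`. A stdlib sketch of the access pattern being measured, counting non-overlapping matches by repeated forward search:

```python
text = "needle in a haystack, and another needle"

# Left-to-right search: offset of the first occurrence.
assert text.find("needle") == 0
# Reverse-order search: offset of the last occurrence.
assert text.rfind("needle") == 34

def count_occurrences(haystack: str, needle: str) -> int:
    # Count non-overlapping matches by restarting the forward search
    # after each hit - the loop the throughput numbers above measure.
    count, pos = 0, 0
    while (pos := haystack.find(needle, pos)) != -1:
        count += 1
        pos += len(needle)
    return count

assert count_occurrences(text, "needle") == 2
```

CPython's `str.find` already calls an optimized two-way search in C, which is why the `str` baseline is competitive left-to-right but falls behind in reverse.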

Byte-Set Search

Searching for character sets (tabs, HTML markup, digits) commonly uses regex or Aho-Corasick automata. Throughput counting all matches on long lines:

Rust:
stringzilla         ████████████████████   8.17 GB/s
regex::find_iter    ████████████▊          5.22 GB/s
aho_corasick        █▏                     0.50 GB/s

Python:
stringzilla         ████████████████████   8.79 GB/s
re.finditer         ▍                      0.19 GB/s

See find/README.md for details
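A byte-set search scans for any byte from a small alphabet. The Python row above uses a regex character class for this; a sketch of that query on HTML-markup delimiters, with a regex-free byte-by-byte baseline for comparison (the pattern and sample text here are illustrative, not taken from the benchmark):

```python
import re

# Character class matching any markup-delimiter byte.
MARKUP = re.compile(r"[<>&'\"]")

text = '<div class="x">a &amp; b</div>'
matches = [m.start() for m in MARKUP.finditer(text)]
assert len(matches) == 7

# Regex-free baseline: test each byte against the set.
byte_set = frozenset(b"<>&'\"")
assert sum(b in byte_set for b in text.encode()) == 7
```

Both formulations answer the same question; the throughput gap above comes entirely from how the scan over the haystack is implemented.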

UTF-8 Processing

Different scripts stress UTF-8 differently: Korean has 3-byte Hangul with single-byte whitespace (representative for tokenization), Arabic uses 2-byte characters, English is mostly 1-byte ASCII. Throughput on AMD Zen5 Turin:

Newline splitting:
                      English                     Arabic
stringzilla           ████████████████ 15.45      ████████████████████ 18.34 GB/s
stdlib                ██                1.90      ██                    1.82 GB/s

Whitespace splitting:
                      English                     Korean
stringzilla           ████████████████████ 0.82   ████████████████████ 1.88 GB/s
stdlib                ██████████████████▊  0.77   ██████████▍          0.98 GB/s
icu::WhiteSpace       ██▋                  0.11   █▌                   0.15 GB/s

Case folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian), plus Chinese for reference. The multiplier next to each language is StringZilla's speedup over the stdlib:

Case folding:
                      English 16x                 German 6x
stringzilla           ████████████████████ 7.53   ████████████████████ 2.59 GB/s
stdlib                ██▌                  0.48   ███▎                 0.43 GB/s

                      Russian 10x                 French 5x
stringzilla           ████████████████████ 2.20   ████████████████████ 1.84 GB/s
stdlib                ██                   0.22   ███▊                 0.35 GB/s

                      Greek 5x                    Armenian 4x
stringzilla           ████████████████████ 1.00   ████████████████████  908 MB/s
stdlib                ████▍                0.22   ████▉                 223 MB/s

                      Vietnamese 1.3x             Chinese 4x
stringzilla           ████████████████████  352   ████████████████████ 1.21 GB/s
stdlib                █████████████▏        265   █████▍                325 MB/s

See unicode/README.md for details
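The byte widths driving these numbers are easy to verify from the stdlib, as is the distinction between the two splitting workloads. A small sketch (the sample strings are illustrative, not the benchmark corpus):

```python
# UTF-8 byte widths per script, as described above.
assert len("한".encode("utf-8")) == 3   # Hangul syllable: 3 bytes
assert len("ع".encode("utf-8")) == 2    # Arabic letter: 2 bytes
assert len("a".encode("utf-8")) == 1    # ASCII: 1 byte

# The two splitting workloads, stdlib style:
text = "첫째 줄\n둘째 줄"
assert text.splitlines() == ["첫째 줄", "둘째 줄"]   # newline splitting
assert text.split() == ["첫째", "줄", "둘째", "줄"]  # whitespace splitting
```

Korean stresses the whitespace path exactly as described: every separator is a single byte surrounded by 3-byte syllables, so a scanner must mix byte-level and multi-byte logic.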

Sequence Operations

Dataframe libraries and search engines rely heavily on string sorting. SIMD-accelerated comparisons and specialized radix sorts can outperform generic algorithms. Throughput on short words:

Rust:
stringzilla         ████████████████████  213.73 M cmp/s
polars::sort        ██████████████████▊   200.34 M cmp/s
arrow::lexsort      ███████████▍          122.20 M cmp/s
std::sort           █████                  54.35 M cmp/s

Python:
polars.sort         ████████████████████  223.38 M cmp/s
stringzilla.sorted  ███████████████▎      171.13 M cmp/s
pyarrow.sort        █████▌                 62.17 M cmp/s
list.sort           ████▏                  47.06 M cmp/s

GPU: cudf on H100 reaches 9,463 M cmp/s on short words.

See sequence/README.md for details
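The "M cmp/s" unit above counts string comparisons per second. One way to see where those comparisons come from is to wrap Python's default ordering in a counting comparator; this is a measurement sketch, not how any of the benchmarked sorters work internally:

```python
from functools import cmp_to_key

def counting_comparator():
    # Wrap three-way string comparison so every call is counted.
    state = {"count": 0}
    def cmp(a: str, b: str) -> int:
        state["count"] += 1
        return (a > b) - (a < b)
    return cmp, state

cmp, state = counting_comparator()
words = ["pear", "apple", "fig", "banana", "cherry"]
assert sorted(words, key=cmp_to_key(cmp)) == sorted(words)
# Any comparison sort needs at least n - 1 comparisons.
assert state["count"] >= len(words) - 1
```

Radix sorts sidestep this metric entirely by never comparing whole strings, which is how specialized implementations pull ahead of generic comparison sorts.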

Random Generation

Random byte generation and lookup tables are common in image processing and bioinformatics. Throughput on long lines:

Rust:
stringzilla         ████████████████████  10.57 GB/s
zeroize             ████████▉              4.73 GB/s
rand_xoshiro        ███████▎               3.85 GB/s

Python:
stringzilla         ████████████████████  20.37 GB/s
pycryptodome        ████████████▉         13.16 GB/s
numpy.Philox        █▌                     1.59 GB/s

See memory/README.md for details
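The two workloads in this section, random buffer generation and 256-entry lookup-table transforms, both have compact stdlib baselines. A sketch using `random.randbytes` and `bytes.translate` (illustrative of the operation, not of any benchmarked library's internals):

```python
import random

# Fill a buffer with pseudo-random bytes.
rng = random.Random(42)            # seeded for reproducibility
buf = rng.randbytes(1 << 20)       # 1 MiB
assert len(buf) == 1 << 20

# Lookup-table transform: remap every byte through a 256-entry table,
# here inverting each byte, as in image negation or similar filters.
table = bytes(255 - i for i in range(256))
inverted = buf.translate(table)
assert inverted.translate(table) == buf   # inversion is its own inverse
```

`bytes.translate` is already a tight C loop, so it is a reasonable single-threaded reference point for lookup-table throughput.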

Similarity Scoring

Edit distance is essential for search engines, data cleaning, NLP, and bioinformatics. It's computationally expensive, with O(n·m) complexity, but GPUs and multi-core parallelism help. Levenshtein distance on ~1,000-byte lines (MCUPS = Million Cell Updates Per Second):

Rust:
                        1 Core                       1 Socket
bio::levenshtein        █▏                      823
rapidfuzz               ████████████████████ 14,316
stringzilla (384x GNR)  ██████████████████▎  13,084  ████████████████████ 3,084,270 MCUPS
stringzilla (B200)                                   ██████▍                998,620 MCUPS
stringzilla (H100)                                   ██████                 925,890 MCUPS

See similarities/README.md for details
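The MCUPS unit follows directly from the O(n·m) dynamic program: one "cell update" per entry of the distance matrix. A minimal two-row Levenshtein sketch makes the cell count explicit (a textbook baseline, not the benchmarked implementations):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic program kept to two rows; each inner-loop step
    # is one "cell update", so a full run costs len(a) * len(b) CUPs.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution or match
            ))
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
# MCUPS for one pair = len(a) * len(b) / elapsed_seconds / 1e6.
```

At ~1,000-byte lines that is ~10^6 cell updates per pair, which is why the table reports MCUPS rather than GB/s.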
