# StringWars: Text Processing on CPUs & GPUs, in Python & Rust

Comparing performance-oriented string-processing libraries for substring search, multi-pattern matching, hashing, edit distances, sketching, and sorting across CPUs and GPUs in Rust 🦀 and Python 🐍.

There are many great libraries for string processing! Mostly, of course, written in Assembly, C, and C++, but some in Rust as well.
Where Rust beats C and C++ is the simplicity of dependency management, which makes it great for benchmarking "Systems Software" and lining up apples-to-apples comparisons across native crates and their Python bindings.
So, to accelerate the development of the StringZilla C, C++, and CUDA libraries (with Rust and Python bindings), I've created this repository to compare it against some of my and the community's most beloved Rust projects, like:
- `memchr` for substring search.
- `rapidfuzz` and `bio` for edit distances and alignments.
- `aHash`, `xxhash-rust`, `foldhash`, and `blake3` for hashing.
- `aho_corasick` and `regex` for multi-pattern search.
- `arrow` and `polars` for collections and sorting.
- `icu` for Unicode processing.
- `ring` and `sodiumoxide` for encryption.
Of course, the functionality of these projects differs, as do their APIs and usage patterns. So, I focus on the workloads for which StringZilla was designed and compare the throughput of the core operations. Notably, I also favor modern hardware supporting a wider range of SIMD instructions, like the mask-equipped AVX-512 on x86 (available since the 2015 Intel Skylake-X CPUs) or the more recent predicated variable-length SVE and SVE2 on Arm, which aren't often targeted by existing libraries and tooling.
> [!IMPORTANT]
> The numbers in the tables below are provided for reference only and may vary depending on the CPU, compiler, dataset, and tokenization method. Most of them were obtained on Intel Sapphire Rapids (SPR) and Granite Rapids (GNR) CPUs and on Nvidia Hopper-based H100 and Blackwell-based RTX 6000 Pro GPUs, using Rust with the `-C target-cpu=native` optimization flag. To replicate the results, please refer to the "Replicating the Results" section below.
## Benchmarks at a Glance
### Hash
Many hashing libraries exist, but they often lack reproducible outputs, streaming support, or cross-language availability. Throughput on short words and long lines:
```
Short Words Long Lines
Rust:
stringzilla::hash ████████████████████ 1.84 ████████████████████ 11.38 GB/s
aHash::hash_one █████████████▍ 1.23 ███████████████▏ 8.61 GB/s
xxh3::xxh3_64 ███████████▊ 1.08 ████████████████▋ 9.48 GB/s
std::hash ████▋ 0.43 ██████▌ 3.74 GB/s
Python:
stringzilla.hash ████████████████████ 0.14 ████████████████████ 9.19 GB/s
hash ██████████████████▌ 0.13 █████████▎ 4.27 GB/s
xxhash.xxh3_64 █████▋ 0.04 █████████████▉ 6.38 GB/s
```
See hash/README.md for details
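The two properties these hashers are compared on, determinism and streaming support, can be illustrated with Python's standard library alone. This is a sketch of the workload using `hashlib.blake2b` as a stand-in, not StringZilla's own API:

```python
import hashlib

# One-shot hash vs. incremental (streaming) updates: a hash usable for
# chunked I/O must produce the same digest either way.
one_shot = hashlib.blake2b(b"hello world").hexdigest()

streaming = hashlib.blake2b()
for chunk in (b"hello", b" ", b"world"):
    streaming.update(chunk)  # feed the input in pieces

assert one_shot == streaming.hexdigest()

# Python's built-in hash() is fast but process-seeded (PYTHONHASHSEED),
# so its output is not reproducible across runs - one reason the chart
# compares dedicated hashing libraries instead.
```

The benchmarked libraries expose the same one-shot and streaming shapes; only the throughput differs.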
### Case-Insensitive UTF-8 Search
Unicode-aware case-insensitive search with full case folding (ß↔SS, σ↔ς). Throughput searching across ~100MB multilingual corpora:
```
Rust:
English German
stringzilla ████████████████████ 12.79 ████████████████████ 10.67 GB/s
icu ▏ 0.08 ▏ 0.08 GB/s
Russian Korean
stringzilla ████████████████████ 7.12 ████████████████████ 35.10 GB/s
icu ▏ 0.14 ▏ 0.23 GB/s
Python:
English German
stringzilla ████████████████████ 5.61 ████████████████████ 6.08 GB/s
regex ██▋ 0.77 ███ 0.90 GB/s
Russian Korean
stringzilla ████████████████████ 5.70 ████████████████████ 20.05 GB/s
regex ████████ 2.30 ████▋ 4.59 GB/s
```
See unicode/README.md for details
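Why full case folding matters, rather than plain lowercasing, is easy to show with Python's built-in `str.casefold`. A minimal containment check (hypothetical helper, offsets shift under folding, so real implementations are more involved):

```python
# Full Unicode case folding: the German sharp s folds to "ss" and the
# final Greek sigma folds to the regular sigma, so simple lowercasing
# is not enough for case-insensitive matching.
assert "ß".casefold() == "ss"
assert "ς".casefold() == "σ".casefold() == "σ"
assert "STRASSE".casefold() == "straße".casefold()  # match only via folding
assert "STRASSE".lower() != "straße".lower()        # lowercasing misses it

def fold_find(haystack: str, needle: str) -> bool:
    """Naive case-insensitive containment check via case folding.
    Folding changes string lengths, so this yields no byte offsets."""
    return needle.casefold() in haystack.casefold()

assert fold_find("Die STRASSE ist lang", "straße")
```

The benchmarked implementations perform this folding on the fly instead of materializing folded copies of a ~100MB corpus.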
### Exact Substring Search

Substring search is offloaded to C's `memmem` or `strstr` in most languages, but SIMD-optimized implementations can do better.
Throughput on long lines:
```
Left to right Reverse order
Rust:
memmem::Finder ████████████████████ 10.99
stringzilla ███████████████████▋ 10.82 ████████████████████ 10.66 GB/s
std::str ███████████████████▊ 10.88 ███████████▏ 5.94 GB/s
Python:
stringzilla ████████████████████ 11.79 ████████████████████ 11.56 GB/s
str ██ 1.23 ██████▋ 3.84 GB/s
```
See find/README.md for details
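The two columns correspond to forward and reverse search, which Python exposes as `find` and `rfind`. A stdlib sketch of the benchmarked operation (the `find_all` helper is illustrative, not a library API):

```python
def find_all(hay: bytes, needle: bytes):
    """Yield offsets of every non-overlapping match, left to right."""
    start = 0
    while (pos := hay.find(needle, start)) != -1:
        yield pos
        start = pos + len(needle)

haystack = b"needle in a haystack, needle again"
assert list(find_all(haystack, b"needle")) == [0, 22]
assert haystack.rfind(b"needle") == 22  # reverse order: last match first
```

Reverse search is handy for parsing from the end of a buffer, e.g. trailing metadata, and is where standard libraries tend to lose the most throughput.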
### Byte-Set Search
Searching for character sets (tabs, HTML markup, digits) commonly uses regex or Aho-Corasick automata. Throughput counting all matches on long lines:
```
Rust:
stringzilla ████████████████████ 8.17 GB/s
regex::find_iter ████████████▊ 5.22 GB/s
aho_corasick █▏ 0.50 GB/s
Python:
stringzilla ████████████████████ 8.79 GB/s
re.finditer ▍ 0.19 GB/s
```
See find/README.md for details
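The workload itself, counting every byte from a given set, is a one-liner with `re`; the example string below is made up for illustration:

```python
import re

line = "col1\tcol2 <b>42</b>\tcol3 777"
# Count every character from a byte set - here tabs, angle brackets,
# and digits - the same counting workload as in the chart above.
byteset = re.compile(r"[\t<>0-9]")
count = sum(1 for _ in byteset.finditer(line))
assert count == 14  # 2 tabs + 4 angle brackets + 8 digits
```

A SIMD implementation can answer the same query with one table lookup per vector of bytes, which is where the order-of-magnitude gap over `re.finditer` comes from.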
### UTF-8 Processing

Different scripts stress UTF-8 differently: Korean mixes 3-byte Hangul characters with single-byte whitespace (representative of tokenization workloads), Arabic uses 2-byte characters, and English is mostly 1-byte ASCII. Throughput on an AMD Zen 5 "Turin" CPU:
```
Newline splitting:
English Arabic
stringzilla ████████████████ 15.45 ████████████████████ 18.34 GB/s
stdlib ██ 1.90 ██ 1.82 GB/s
Whitespace splitting:
English Korean
stringzilla ████████████████████ 0.82 ████████████████████ 1.88 GB/s
stdlib ██████████████████▊ 0.77 ██████████▍ 0.98 GB/s
icu::WhiteSpace ██▋ 0.11 █▌ 0.15 GB/s
```
Case folding on bicameral scripts (Latin, Cyrillic, Greek, Armenian) plus Chinese for reference:
Case folding:
```
English 16x German 6x
stringzilla ████████████████████ 7.53 ████████████████████ 2.59 GB/s
stdlib ██▌ 0.48 ███▎ 0.43 GB/s
Russian 10x French 5x
stringzilla ████████████████████ 2.20 ████████████████████ 1.84 GB/s
stdlib ██ 0.22 ███▊ 0.35 GB/s
Greek 5x Armenian 4x
stringzilla ████████████████████ 1.00 ████████████████████ 908 MB/s
stdlib ████▍ 0.22 ████▉ 223 MB/s
Vietnamese 1.3x Chinese 4x
stringzilla ████████████████████ 352 ████████████████████ 1.21 GB/s
stdlib █████████████▏ 265 █████▍ 325 MB/s
```
See unicode/README.md for details
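The three operations above map directly onto Python's stdlib string methods; a short sketch of the workloads (sample text is illustrative):

```python
korean = "안녕하세요 세계\n여러분 반가워요\n"
# Newline splitting: 3-byte Hangul characters, 1-byte '\n' separators.
lines = korean.splitlines()
assert lines == ["안녕하세요 세계", "여러분 반가워요"]

# Whitespace splitting - the tokenization-like workload from the chart.
tokens = korean.split()
assert tokens == ["안녕하세요", "세계", "여러분", "반가워요"]

# Case folding on a bicameral script (Cyrillic here): only some scripts
# have case at all, which is why the chart annotates per-language ratios.
assert "МОСКВА".casefold() == "москва"
```

The stdlib versions scan byte by byte; the SIMD versions classify whole vectors of UTF-8 code units at once.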
### Sequence Operations
Dataframe libraries and search engines rely heavily on string sorting. SIMD-accelerated comparisons and specialized radix sorts can outperform generic algorithms. Throughput on short words:
```
Rust:
stringzilla ████████████████████ 213.73 M cmp/s
polars::sort ██████████████████▊ 200.34 M cmp/s
arrow::lexsort ███████████▍ 122.20 M cmp/s
std::sort █████ 54.35 M cmp/s
Python:
polars.sort ████████████████████ 223.38 M cmp/s
stringzilla.sorted ███████████████▎ 171.13 M cmp/s
pyarrow.sort █████▌ 62.17 M cmp/s
list.sort ████▏ 47.06 M cmp/s
```
GPU: cudf on H100 reaches 9,463 M cmp/s on short words.
See sequence/README.md for details
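The "M cmp/s" unit counts string comparisons per second. You can instrument Python's own sort to see where those comparisons come from; the `Counted` wrapper below is a hypothetical helper for illustration:

```python
words = ["banana", "Apple", "cherry", "apple"]
# Bytewise (ordinal) ordering, as string columns sort by default:
# uppercase ASCII precedes lowercase.
assert sorted(words) == ["Apple", "apple", "banana", "cherry"]

class Counted(str):
    """str subclass that counts how often the sort compares elements."""
    comparisons = 0
    def __lt__(self, other):
        Counted.comparisons += 1
        return str.__lt__(self, other)

sorted(map(Counted, words))
assert Counted.comparisons > 0  # each __lt__ call is one "cmp" in the chart
```

Comparison-based sorts spend most of their time in exactly these string comparisons, which is why SIMD-accelerated comparators and radix sorts, which avoid per-element comparisons entirely, pull ahead.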
### Random Generation
Random byte generation and lookup tables are common in image processing and bioinformatics. Throughput on long lines:
```
Rust:
stringzilla ████████████████████ 10.57 GB/s
zeroize ████████▉ 4.73 GB/s
rand_xoshiro ███████▎ 3.85 GB/s
Python:
stringzilla ████████████████████ 20.37 GB/s
pycryptodome ████████████▉ 13.16 GB/s
numpy.Philox █▌ 1.59 GB/s
```
See memory/README.md for details
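Both halves of this workload, pseudo-random byte generation and lookup-table transforms, have stdlib equivalents; a sketch (the DNA-complement table is just an illustrative use of a 256-entry LUT):

```python
import random

# Seeded pseudo-random byte generation, as in the benchmark.
rng = random.Random(42)
buffer = rng.randbytes(1 << 20)  # 1 MiB of reproducible pseudo-random bytes
assert len(buffer) == 1 << 20

# Lookup-table transform (common in image processing and bioinformatics):
# remap every byte through a 256-entry table in a single call.
complement = bytes.maketrans(b"ACGT", b"TGCA")
assert b"GATTACA".translate(complement) == b"CTAATGT"
```

`bytes.translate` already runs at memory-like speeds; the generation side is where SIMD PRNGs make the largest difference.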
### Similarity Scoring

Edit distance is essential for search engines, data cleaning, NLP, and bioinformatics. It's computationally expensive, with O(n·m) complexity, but GPUs and multi-core parallelism help. Levenshtein distance on ~1,000-byte lines (MCUPS = million cell updates per second):
```
Rust:
1 Core 1 Socket
bio::levenshtein █▏ 823
rapidfuzz ████████████████████ 14,316
stringzilla (384x GNR) ██████████████████▎ 13,084 ████████████████████ 3,084,270 MCUPS
stringzilla (B200) ██████▍ 998,620 MCUPS
stringzilla (H100) ██████ 925,890 MCUPS
```
See similarities/README.md for details
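To make the MCUPS unit concrete, here is the classic dynamic-programming Levenshtein distance; every inner-loop step is one "cell update" (a reference sketch, not the benchmarked implementations):

```python
def levenshtein(a: str, b: str) -> int:
    """O(len(a) * len(b)) edit distance with a two-row DP table.
    Each inner-loop iteration is one cell update - the unit behind MCUPS."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
# Two ~1,000-byte lines mean ~1,000,000 cell updates per pair, which is
# why the chart reports throughput in million cell updates per second.
```

Each cell depends only on its left, top, and top-left neighbors, so anti-diagonals can be updated in parallel, which is what the SIMD and GPU variants exploit.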
