Libcubwt
libcubwt is a library for GPU accelerated suffix array and burrows wheeler transform construction.
Install / Use
/learn @IlyaGrebnov/LibcubwtREADME
libcubwt
The libcubwt is a library for fast (see Benchmarks below) GPU-based Burrows-Wheeler transform construction and inversion on the CUDA platform using prefix doubling + Skew/DC3 approach as described in the following papers:
- Vitaly Osipov, Parallel Suffix Array Construction for Shared Memory Architectures, 2012
- Leyuan Wang, Sean Baxter, and John D. Owens Fast Parallel Suffix Array on the GPU, 2015
- Florian Büren, Daniel Jünger, Robin Kobus, Christian Hundt, Bertil Schmidt Suffix Array Construction on Multi-GPU Systems, 2019
Copyright (c) 2022-2025 Ilya Grebnov ilya.grebnov@gmail.com
Introduction
The libcubwt provides simple API to construct Burrows-Wheeler transform from a given string over constant-size alphabet using 20.5n bytes of GPU memory.
- The libcubwt works with any CUDA compatible GPU, but I recommend SM 8.9+ Ada Lovelace (CUDA 11.8 and later) due to very large L2 cache.
- The libcubwt is sensitive to fast GPU memory and might not be suitable for some workloads. Please benchmark yourself.
Suffix array computation
- Starting with version 1.5.0, libcubwt algorithm was changed and no longer supports the computation of suffix arrays or inverse suffix arrays. To compute these arrays, please consider using a version prior to 1.5.0.
License
The libcubwt is released under the Apache License Version 2.0 and is considered suitable for production use. However, no warranty or fitness for a particular purpose is expressed or implied.
Changes
- August 13, 2025 (1.6.3)
- Support for CUDA Toolkit 13.0.
- May 5, 2025 (1.6.2)
- Support for CUDA Toolkit 12.9 & CCCL 3.0.
- January 24, 2025 (1.6.1)
- Support for Blackwell architecture.
- January 24, 2024 (1.6.0)
- Inverse Burrows-Wheeler transform.
- March 24, 2023 (1.5.0)
- Reduced memory usage and improved performance.
- February 10, 2023 (1.0.0)
- Initial public release of the libcubwt.
Examples of APIs (see libcubwt.cuh for complete APIs list)
/**
* Allocates storage on the CUDA device that allows reusing allocated memory with each libcubwt operation.
* @param max_length The maximum length of string to support.
* @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
*/
int64_t libcubwt_allocate_device_storage(void ** device_storage, int64_t max_length);
/**
* Destroys the previously allocated storage on the CUDA device.
* @param device_storage The previously allocated storage on the CUDA device.
* @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
*/
int64_t libcubwt_free_device_storage(void * device_storage);
/**
* Constructs the Burrows-Wheeler Transform (BWT) of a given string.
* @param device_storage The previously allocated storage on the CUDA device.
* @param T [0..n-1] The input string.
* @param L [0..n-1] The output string (can be T).
* @param n The length of the input string.
* @return The primary index if no error occurred, libcubwt error code otherwise.
*/
int64_t libcubwt_bwt(void * device_storage, const uint8_t * T, uint8_t * L, int64_t n);
/**
* Reconstructs the original string from a given burrows-wheeler transformed string (BWT) with primary index.
* @param device_storage The previously allocated storage on the CUDA device.
* @param T [0..n-1] The input string.
* @param U [0..n-1] The output string (can be T).
* @param n The length of the given string.
* @param freq [0..255] The input symbol frequency table (can be NULL).
* @param i The primary index.
* @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
*/
int64_t libcubwt_unbwt(void * device_storage, const uint8_t * T, uint8_t * U, int64_t n, const int32_t * freq, int32_t i);
Benchmarks
Methodology
- Input files were capped at 352MB due to GPU memory limit.
- All timings exclude initialization and memory allocations. However, the time for data transfer in and out of GPU is included.
- The timings are minimum of five runs measuring multi-threading performance of Burrows-Wheeler transform construction.
Specification (x86-64 architecture)
- OS: Microsoft Windows 10 Pro 64-Bit
- MB: MSI MPG Z390M GAMING EDGE AC, PCIe 3.0 x16
- CPU: Intel Core i7-9700K Processor (12M Cache, 5GHz all cores)
- RAM: 2x8 GB dual-channel DDR4 (4133 MHz, 17-17-17-37)
- GPU: ASUS TUF Gaming GeForce RTX 4070 Ti 12GB GDDR6X OC Edition
- Compiler: Microsoft Visual C++ compiler v19.34.31937 for x64
- Optimizations (CPU): /MD /DNDEBUG /O2 /GL /arch:AVX2 /openmp
- Optimizations (GPU): -arch=native --use_fast_math
Silesia Corpus
| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | dickens | 10192446 | 0.012 sec ( 844.66 MB/s) | 0.115 sec ( 88.27 MB/s) | 0.317 sec ( 32.12 MB/s) | | mozilla | 51220480 | 0.107 sec ( 479.70 MB/s) | 0.722 sec ( 70.91 MB/s) | 1.296 sec ( 39.51 MB/s) | | mr | 9970564 | 0.044 sec ( 228.12 MB/s) | 0.124 sec ( 80.11 MB/s) | 0.283 sec ( 35.21 MB/s) | | nci | 33553445 | 0.097 sec ( 347.06 MB/s) | 0.378 sec ( 88.67 MB/s) | 1.141 sec ( 29.40 MB/s) | | ooffice | 6152192 | 0.007 sec ( 924.93 MB/s) | 0.085 sec ( 72.66 MB/s) | 0.117 sec ( 52.59 MB/s) | | osdb | 10085684 | 0.012 sec ( 841.83 MB/s) | 0.152 sec ( 66.35 MB/s) | 0.299 sec ( 33.72 MB/s) | | reymont | 6627202 | 0.008 sec ( 834.20 MB/s) | 0.077 sec ( 86.48 MB/s) | 0.199 sec ( 33.23 MB/s) | | samba | 21606400 | 0.044 sec ( 488.45 MB/s) | 0.277 sec ( 77.87 MB/s) | 0.502 sec ( 43.04 MB/s) | | sao | 7251944 | 0.006 sec (1275.49 MB/s) | 0.145 sec ( 50.11 MB/s) | 0.158 sec ( 46.04 MB/s) | | webster | 41458703 | 0.065 sec ( 640.98 MB/s) | 0.603 sec ( 68.71 MB/s) | 1.768 sec ( 23.46 MB/s) | | x-ray | 8474240 | 0.007 sec (1297.70 MB/s) | 0.149 sec ( 56.84 MB/s) | 0.198 sec ( 42.90 MB/s) | | xml | 5345280 | 0.009 sec ( 594.05 MB/s) | 0.054 sec ( 98.87 MB/s) | 0.117 sec ( 45.77 MB/s) |
Large Canterbury Corpus
| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | bible.txt | 4047392 | 0.004 sec (1063.62 MB/s) | 0.042 sec ( 97.15 MB/s) | 0.106 sec ( 38.16 MB/s) | | E.coli | 4638690 | 0.006 sec ( 771.33 MB/s) | 0.040 sec ( 116.30 MB/s) | 0.166 sec ( 27.98 MB/s) | | world192.txt | 2473400 | 0.003 sec ( 885.03 MB/s) | 0.027 sec ( 90.86 MB/s) | 0.053 sec ( 46.41 MB/s) |
Maximum Compression Corpus
| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | A10.jpg | 842468 | 0.001 sec (1338.95 MB/s) | 0.020 sec ( 42.90 MB/s) | 0.018 sec ( 45.71 MB/s) | | AcroRd32.exe | 3870784 | 0.007 sec ( 577.69 MB/s) | 0.059 sec ( 65.08 MB/s) | 0.070 sec ( 55.50 MB/s) | | english.dic | 465211 | 0.001 sec ( 567.47 MB/s) | 0.007 sec ( 70.86 MB/s) | 0.008 sec ( 54.87 MB/s) | | FlashMX.pdf | 4526946 | 0.004 sec (1104.21 MB/s) | 0.099 sec ( 45.58 MB/s) | 0.086 sec ( 52.37 MB/s) | | FP.LOG | 20617071 | 0.047 sec ( 437.20 MB/s) | 0.261 sec ( 79.07 MB/s) | 0.715 sec ( 28.82 MB/s) | | MSO97.DLL | 3782416 | 0.003 sec (1178.36 MB/s) | 0.061 sec ( 62.34 MB/s) | 0.072 sec ( 52.74 MB/s) | | ohs.doc | 4168192 | 0.006 sec ( 674.87 MB/s) | 0.051 sec ( 81.72 MB/s) | 0.094 sec ( 44.17 MB/s) | | rafale.bmp | 4149414 | 0.004 sec (1060.26 MB/s) | 0.043 sec ( 96.65 MB/s) | 0.083 sec ( 49.81 MB/s) | | vcfiu.hlp | 4121418 | 0.005 sec ( 797.61 MB/s) | 0.046 sec ( 89.91 MB/s) | 0.078 sec ( 52.78 MB/s) | | world95.txt | 2988578 | 0.004 sec ( 836.76 MB/s) | 0.033 sec ( 91.77 MB/s) | 0.067 sec ( 44.32 MB/s) |
Large Text Compression Benchmark Corpus
| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | enwik8 | 100000000 | 0.147 sec ( 681.07 MB/s) | 1.633 sec ( 61.22 MB/s) | 4.616 sec ( 21.67 MB/s) | | enwik9 | 369098752 | 0.579 sec ( 637.83 MB/s) | 6.363 sec ( 58.01 MB/s) | 20.633 sec ( 17.89 MB/s) |
The Gauntlet Corpus
| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | abac | 200000 | 0.004 sec ( 47.69 MB/s) | 0.002 sec ( 89.59 MB/s) | 0.002 sec ( 105.77 MB/s) | |
