SkillAgentSearch skills...

Libcubwt

libcubwt is a library for GPU accelerated suffix array and burrows wheeler transform construction.

Install / Use

/learn @IlyaGrebnov/Libcubwt
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

libcubwt

The libcubwt is a library for fast (see Benchmarks below) GPU-based Burrows-Wheeler transform construction and inversion on the CUDA platform using prefix doubling + Skew/DC3 approach as described in the following papers:

  • Vitaly Osipov, Parallel Suffix Array Construction for Shared Memory Architectures, 2012
  • Leyuan Wang, Sean Baxter, and John D. Owens Fast Parallel Suffix Array on the GPU, 2015
  • Florian Büren, Daniel Jünger, Robin Kobus, Christian Hundt, Bertil Schmidt Suffix Array Construction on Multi-GPU Systems, 2019

Copyright (c) 2022-2025 Ilya Grebnov ilya.grebnov@gmail.com

Introduction

The libcubwt provides simple API to construct Burrows-Wheeler transform from a given string over constant-size alphabet using 20.5n bytes of GPU memory.

  • The libcubwt works with any CUDA compatible GPU, but I recommend SM 8.9+ Ada Lovelace (CUDA 11.8 and later) due to very large L2 cache.
  • The libcubwt is sensitive to fast GPU memory and might not be suitable for some workloads. Please benchmark yourself.

Suffix array computation

  • Starting with version 1.5.0, libcubwt algorithm was changed and no longer supports the computation of suffix arrays or inverse suffix arrays. To compute these arrays, please consider using a version prior to 1.5.0.

License

The libcubwt is released under the Apache License Version 2.0 and is considered suitable for production use. However, no warranty or fitness for a particular purpose is expressed or implied.

Changes

  • August 13, 2025 (1.6.3)
    • Support for CUDA Toolkit 13.0.
  • May 5, 2025 (1.6.2)
    • Support for CUDA Toolkit 12.9 & CCCL 3.0.
  • January 24, 2025 (1.6.1)
    • Support for Blackwell architecture.
  • January 24, 2024 (1.6.0)
    • Inverse Burrows-Wheeler transform.
  • March 24, 2023 (1.5.0)
    • Reduced memory usage and improved performance.
  • February 10, 2023 (1.0.0)
    • Initial public release of the libcubwt.

Examples of APIs (see libcubwt.cuh for complete APIs list)

    /**
    * Allocates storage on the CUDA device that allows reusing allocated memory with each libcubwt operation.
    * @param max_length The maximum length of string to support.
    * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_allocate_device_storage(void ** device_storage, int64_t max_length);

    /**
    * Destroys the previously allocated storage on the CUDA device.
    * @param device_storage The previously allocated storage on the CUDA device.
    * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_free_device_storage(void * device_storage);

    /**
    * Constructs the Burrows-Wheeler Transform (BWT) of a given string.
    * @param device_storage The previously allocated storage on the CUDA device.
    * @param T [0..n-1] The input string.
    * @param L [0..n-1] The output string (can be T).
    * @param n The length of the input string.
    * @return The primary index if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_bwt(void * device_storage, const uint8_t * T, uint8_t * L, int64_t n);

    /**
    * Reconstructs the original string from a given burrows-wheeler transformed string (BWT) with primary index.
    * @param device_storage The previously allocated storage on the CUDA device.
    * @param T [0..n-1] The input string.
    * @param U [0..n-1] The output string (can be T).
    * @param n The length of the given string.
    * @param freq [0..255] The input symbol frequency table (can be NULL).
    * @param i The primary index.
    * @return LIBCUBWT_NO_ERROR if no error occurred, libcubwt error code otherwise.
    */
    int64_t libcubwt_unbwt(void * device_storage, const uint8_t * T, uint8_t * U, int64_t n, const int32_t * freq, int32_t i);

Benchmarks

Methodology

  • Input files were capped at 352MB due to GPU memory limit.
  • All timings exclude initialization and memory allocations. However, the time for data transfer in and out of GPU is included.
  • The timings are minimum of five runs measuring multi-threading performance of Burrows-Wheeler transform construction.

Specification (x86-64 architecture)

  • OS: Microsoft Windows 10 Pro 64-Bit
  • MB: MSI MPG Z390M GAMING EDGE AC, PCIe 3.0 x16
  • CPU: Intel Core i7-9700K Processor (12M Cache, 5GHz all cores)
  • RAM: 2x8 GB dual-channel DDR4 (4133 MHz, 17-17-17-37)
  • GPU: ASUS TUF Gaming GeForce RTX 4070 Ti 12GB GDDR6X OC Edition
  • Compiler: Microsoft Visual C++ compiler v19.34.31937 for x64
  • Optimizations (CPU): /MD /DNDEBUG /O2 /GL /arch:AVX2 /openmp
  • Optimizations (GPU): -arch=native --use_fast_math

Silesia Corpus

| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | dickens | 10192446 | 0.012 sec ( 844.66 MB/s) | 0.115 sec ( 88.27 MB/s) | 0.317 sec ( 32.12 MB/s) | | mozilla | 51220480 | 0.107 sec ( 479.70 MB/s) | 0.722 sec ( 70.91 MB/s) | 1.296 sec ( 39.51 MB/s) | | mr | 9970564 | 0.044 sec ( 228.12 MB/s) | 0.124 sec ( 80.11 MB/s) | 0.283 sec ( 35.21 MB/s) | | nci | 33553445 | 0.097 sec ( 347.06 MB/s) | 0.378 sec ( 88.67 MB/s) | 1.141 sec ( 29.40 MB/s) | | ooffice | 6152192 | 0.007 sec ( 924.93 MB/s) | 0.085 sec ( 72.66 MB/s) | 0.117 sec ( 52.59 MB/s) | | osdb | 10085684 | 0.012 sec ( 841.83 MB/s) | 0.152 sec ( 66.35 MB/s) | 0.299 sec ( 33.72 MB/s) | | reymont | 6627202 | 0.008 sec ( 834.20 MB/s) | 0.077 sec ( 86.48 MB/s) | 0.199 sec ( 33.23 MB/s) | | samba | 21606400 | 0.044 sec ( 488.45 MB/s) | 0.277 sec ( 77.87 MB/s) | 0.502 sec ( 43.04 MB/s) | | sao | 7251944 | 0.006 sec (1275.49 MB/s) | 0.145 sec ( 50.11 MB/s) | 0.158 sec ( 46.04 MB/s) | | webster | 41458703 | 0.065 sec ( 640.98 MB/s) | 0.603 sec ( 68.71 MB/s) | 1.768 sec ( 23.46 MB/s) | | x-ray | 8474240 | 0.007 sec (1297.70 MB/s) | 0.149 sec ( 56.84 MB/s) | 0.198 sec ( 42.90 MB/s) | | xml | 5345280 | 0.009 sec ( 594.05 MB/s) | 0.054 sec ( 98.87 MB/s) | 0.117 sec ( 45.77 MB/s) |

Large Canterbury Corpus

| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | bible.txt | 4047392 | 0.004 sec (1063.62 MB/s) | 0.042 sec ( 97.15 MB/s) | 0.106 sec ( 38.16 MB/s) | | E.coli | 4638690 | 0.006 sec ( 771.33 MB/s) | 0.040 sec ( 116.30 MB/s) | 0.166 sec ( 27.98 MB/s) | | world192.txt | 2473400 | 0.003 sec ( 885.03 MB/s) | 0.027 sec ( 90.86 MB/s) | 0.053 sec ( 46.41 MB/s) |

Maximum Compression Corpus

| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | A10.jpg | 842468 | 0.001 sec (1338.95 MB/s) | 0.020 sec ( 42.90 MB/s) | 0.018 sec ( 45.71 MB/s) | | AcroRd32.exe | 3870784 | 0.007 sec ( 577.69 MB/s) | 0.059 sec ( 65.08 MB/s) | 0.070 sec ( 55.50 MB/s) | | english.dic | 465211 | 0.001 sec ( 567.47 MB/s) | 0.007 sec ( 70.86 MB/s) | 0.008 sec ( 54.87 MB/s) | | FlashMX.pdf | 4526946 | 0.004 sec (1104.21 MB/s) | 0.099 sec ( 45.58 MB/s) | 0.086 sec ( 52.37 MB/s) | | FP.LOG | 20617071 | 0.047 sec ( 437.20 MB/s) | 0.261 sec ( 79.07 MB/s) | 0.715 sec ( 28.82 MB/s) | | MSO97.DLL | 3782416 | 0.003 sec (1178.36 MB/s) | 0.061 sec ( 62.34 MB/s) | 0.072 sec ( 52.74 MB/s) | | ohs.doc | 4168192 | 0.006 sec ( 674.87 MB/s) | 0.051 sec ( 81.72 MB/s) | 0.094 sec ( 44.17 MB/s) | | rafale.bmp | 4149414 | 0.004 sec (1060.26 MB/s) | 0.043 sec ( 96.65 MB/s) | 0.083 sec ( 49.81 MB/s) | | vcfiu.hlp | 4121418 | 0.005 sec ( 797.61 MB/s) | 0.046 sec ( 89.91 MB/s) | 0.078 sec ( 52.78 MB/s) | | world95.txt | 2988578 | 0.004 sec ( 836.76 MB/s) | 0.033 sec ( 91.77 MB/s) | 0.067 sec ( 44.32 MB/s) |

Large Text Compression Benchmark Corpus

| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | enwik8 | 100000000 | 0.147 sec ( 681.07 MB/s) | 1.633 sec ( 61.22 MB/s) | 4.616 sec ( 21.67 MB/s) | | enwik9 | 369098752 | 0.579 sec ( 637.83 MB/s) | 6.363 sec ( 58.01 MB/s) | 20.633 sec ( 17.89 MB/s) |

The Gauntlet Corpus

| file | size | libcubwt 1.5.0 (GPU) | libsais 2.7.1 (CPU) | divsufsort 2.0.2 (CPU) | |:---------------:|:-----------:|:------------------------------:|:------------------------------:|:------------------------------:| | abac | 200000 | 0.004 sec ( 47.69 MB/s) | 0.002 sec ( 89.59 MB/s) | 0.002 sec ( 105.77 MB/s) | |

View on GitHub
GitHub Stars39
CategoryDevelopment
Updated4mo ago
Forks1

Languages

Cuda

Security Score

92/100

Audited on Nov 26, 2025

No findings