StormBitmaps

Fast algorithms for computing XX^T for binary matrices

Generate Convert Improve

Install / Use

/learn @mklarqvist/StormBitmaps

About this skill

Quality Score

0/100

README

Storm bitmaps

These algorithms and bitmaps are used to compute XXT for a binary input matrix with dimensions (N,M) using specialized CPU instructions i.e. POPCNT, SSE4.2, AVX2, AVX512BW, NEON. This is equivalent to computing the all-vs-all set intersection cardinality (|Xi ∩ XTj|) for pairs of symmetric integer sets. These algorithms are fast in the worst case and extremely fast when the input matrix is sparse.

screenshot

Using large registers (AVX-512BW), contiguous and aligned memory, and cache-aware blocking, we can achieve ~114 GB/s (~0.2 CPU cycles / 64-bit word) of sustained throughput (~14 billion 64-bit bitmaps / second or up to ~912 billion implicit integers / second) using STORM_contiguous_t when the input data is small (N < 256,000). When input data is large, we can achieve around 0.4-0.6 CPU cycles / 64-bit word using STORM_t while using considerably less memory. Both of these models make use of scalar-bitmap or scalar-scalar comparisons when the data density is small. Storm selects the optimal memory alignment and subroutines given the available SIMD instruction at run-time by using libalgebra.

The core algorithms are described in the papers:

Faster Population Counts using AVX2 Instructions by Daniel Lemire, Nathan Kurz and Wojciech Muła (23 Nov 2016).
Efficient Computation of Positional Population Counts Using SIMD Instructions, by Marcus D. R. Klarqvist, Wojciech Muła, and Daniel Lemire (upcoming)
Consistently faster and smaller compressed bitmaps with Roaring by D. Lemire, G. Ssi-Yan-Kai, and O. Kaser (21 Mar 2016).

Performance

All performance tests were run on a host machine with a 10 nm Cannon Lake Core i3-8121U with gcc (GCC) 8.2.1 20180905 (Red Hat 8.2.1-3). Detailed benchmarking requires the Linux perf subsystem. In all the examples below we do not output the result matrix but instead report its total sum. This was done to restrict our measurements to algorithmic performance while being unaffacted by disk I/O.

Small input matrix

Sample performance metrics (practical upper limit) for dense matrices using a host machine with AVX512BW available. We simulate many data arrays in aligned memory and compute the upper triangular or XXT using the command benchmark 65536 10000

| Set bits | CPU cycles / 64-bit word | MB/s | |----------|--------------------------|-----------| | 32768 | 0.209 | 109915 | | 16384 | 0.21 | 113591 | | 6553 | 0.21 | 114524 | | 2621 | 0.21 | 114256 | | 1310 | 0.21 | 114625 | | 655 | 0.21 | 114709 | | 262 | 0.21 | 114659 | | 65 | 0.21 | 114390 | | 13 | 0.21 | 114726 | | 5 | 0.21 | 114574 | | 1 | 0.21 | 114457 |

Large input matrix

Next we simulated 10,000 vectors with [0,262144] random bits set for each to form a (10000,524288)-dimension matrix with different data densities. The following (misleading) figures represent CPU cycles / 64-bit word equivalent if the analysis was run in uncompressed bitmap space. We compare our simple approach to Roaring bitmaps demonstrating we can achieve performance parity on large bitmaps:

| Set bits | Storm-CPU-cycles | Roaring-CPU-cycles | |----------|------------------|--------------------| | 262144 | 0.483 | 0.567 | | 131072 | 0.477 | 0.567 | | 52428 | 0.474 | 0.567 | | 20971 | 3.431 | 3.342 | | 10485 | 1.765 | 1.756 | | 5242 | 0.935 | 0.945 | | 2097 | 0.43 | 0.451 | | 524 | 0.19 | 0.196 | | 104 | 0.117 | 0.133 | | 5 | 0.017 | 0.029 | | 1 | 0.003 | 0.004 |

API

The interface is found in the file storm.h.

#include "storm.h"

int intcmp(const void* aa, const void* bb) {
    const uint32_t* a = aa, *b = bb;
    return (*a < *b) ? 0 : (*a > *b);
}

int main() {
    uint32_t n_vectors = 10000; // number of rows
    uint32_t n_width = 1000; // number of columns
    uint32_t n_bits_set = 128; // number of bits set per row

    STORM_t* storm = STORM_new(); // STORM model
    STORM_contiguous_t* storm_cont = STORM_contig_new(n_width); // STORM contiguous memory
    uint32_t* vals = malloc(n_bits_set*sizeof(uint32_t)); // array for random values

    // Foreach vector (row)
    for (int i = 0; i < n_vectors; ++i) {
        // Generate random bits for each row
        for (int j = 0; j < n_bits_set; ++j) {
            vals[j] = rand() % n_width; // draw random number
        }
        // Input data must be sorted.
        qsort(vals, n_bits_set, sizeof(uint32_t), intcmp);
        
        // Add data to either model.
        STORM_add(storm, &vals[0], n_bits_set);
        STORM_contig_add(storm_cont, &vals[0], n_bits_set);
    }

    uint64_t cstorm      = STORM_pairw_intersect_cardinality(storm);
    uint64_t cstorm_cont = STORM_contig_pairw_intersect_cardinality(storm_cont);

    printf("contig=%llu storm=%llu\n",cstorm_cont,cstorm);

    free(vals);
    STORM_free(storm);
    STORM_contig_free(storm_cont);
    return 1;
}

Compilation

StormBitmaps can be compiled using the standard cmake workflow:

cmake .
make

As with all cmake projects, you can specify the compilers you wish to use by adding (for example) -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ to the cmake command line. We tell the compiler to target the architecture of the build machine by using the -march=native flag. This can be disabled by passing the -DSTORM_DISABLE_NATIVE=ON argument to cmake.

On Linux and MacOSX, and when running with the native compilation flag, we do not need to specify the target hardware instructions set. This is not the case on Windows where we need to set these flags:

| CMake Flag | Description | |--------------------------|----------------------------| | STORM_ENABLE_SIMD_AVX512 | Enable AVX512 instructions | | STORM_ENABLE_SIMD_AVX2 | Enable AVX256 instructions | | STORM_ENABLE_SIMD_SSE4_2 | Enable SSE4.2 instructions |

For example, we can run cmake -DSTORM_ENABLE_SIMD_SSE4_2="ON" . to enable SSE4.2 instructions.

and run ./benchmark.

Note

This is a collaborative effort between Marcus D. R. Klarqvist (@mklarqvist) and Daniel Lemire (@lemire).

History

These functions were originally developed for Tomahawk for computing genome-wide linkage-disequilibrium but can be applied to any intersect-count problem.

Related Skills

node-connect

344.1k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

96.8k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

344.1k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

344.1k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。