Multians
Massively Parallel ANS Decoding on GPUs
An implementation of a novel algorithm for ANS (Asymmetric Numeral Systems) decoding on GPUs.
For a detailed description of the concept, please refer to our conference paper.
The algorithm decodes raw (unpartitioned) ANS-encoded data streams of variable size at very high throughput.
The method does not require any vendor-specific features. Although this implementation uses the CUDA toolkit, porting it to related parallel programming frameworks, such as OpenCL, should be straightforward.
State count and alphabet size are configurable. In its current version, the decoder supports input data encoded with a single table and a radix of b = 2 (i.e. the encoder emits single bits during renormalization), and alphabet sizes of up to 256 symbols. An implementation supporting multiple tables or multiple states is the subject of future work.
The source code also includes a (very basic) single-state tANS encoder for testing, as well as a multicore CPU implementation of the method for comparison with the GPU version.
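To make the single-table, radix-2 setting concrete, here is a minimal serial tANS round trip in C++. It is only a sketch of the textbook scheme, not the parallel GPU algorithm and not the code shipped in this repository. All names (TansTable, buildTable, encode, decode), the simple symbol spread, and the one-bit-per-byte bit stack are assumptions chosen for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One decoding-table entry: the symbol, the number of bits to read, and the
// base of the next state (all hypothetical names, not taken from the repository).
struct DecodeEntry { uint8_t symbol; uint8_t nbBits; uint32_t base; };

struct TansTable {
    uint32_t R = 0, L = 0;                    // L = 2^R states
    std::vector<uint32_t> freq;               // symbol frequencies, sum == L
    std::vector<DecodeEntry> dec;             // indexed by state in [0, L)
    std::vector<std::vector<uint32_t>> enc;   // enc[s][xs - freq[s]] -> state
};

static uint32_t floorLog2(uint32_t v) { uint32_t r = 0; while (v >>= 1) ++r; return r; }

TansTable buildTable(const std::vector<uint32_t>& freq, uint32_t R) {
    TansTable t;
    t.R = R; t.L = 1u << R; t.freq = freq;
    uint32_t sum = 0;
    for (uint32_t f : freq) sum += f;
    assert(sum == t.L);

    // Assumption: a simple odd-step spread; any permutation gives a valid code.
    std::vector<uint32_t> spread(t.L);
    const uint32_t step = (t.L >> 1) | 1u;
    uint32_t pos = 0;
    for (uint32_t s = 0; s < freq.size(); ++s)
        for (uint32_t i = 0; i < freq[s]; ++i) { spread[pos] = s; pos = (pos + step) & (t.L - 1); }

    std::vector<uint32_t> next = freq;        // xs runs over [freq[s], 2*freq[s])
    t.dec.resize(t.L);
    t.enc.resize(freq.size());
    for (uint32_t s = 0; s < freq.size(); ++s) t.enc[s].resize(freq[s]);
    for (uint32_t X = 0; X < t.L; ++X) {
        const uint32_t s = spread[X];
        const uint32_t xs = next[s]++;
        const uint32_t nb = R - floorLog2(xs);
        t.dec[X] = { static_cast<uint8_t>(s), static_cast<uint8_t>(nb), (xs << nb) - t.L };
        t.enc[s][xs - freq[s]] = X;
    }
    return t;
}

struct Encoded { std::vector<uint8_t> bits; uint32_t state; };  // bits used as a stack

// Encodes back to front so that the decoder emits symbols front to back.
Encoded encode(const TansTable& t, const std::vector<uint8_t>& msg) {
    Encoded out{{}, 0};
    for (std::size_t i = msg.size(); i-- > 0;) {
        const uint32_t s = msg[i];
        const uint32_t full = out.state + t.L;               // state in [L, 2L)
        uint32_t nb = 0;
        while ((full >> nb) >= 2 * t.freq[s]) ++nb;          // renormalize
        for (uint32_t b = 0; b < nb; ++b)                    // radix 2: single bits
            out.bits.push_back(static_cast<uint8_t>((full >> b) & 1u));
        out.state = t.enc[s][(full >> nb) - t.freq[s]];
    }
    return out;
}

std::vector<uint8_t> decode(const TansTable& t, Encoded in, std::size_t count) {
    std::vector<uint8_t> msg;
    uint32_t X = in.state;
    for (std::size_t i = 0; i < count; ++i) {
        const DecodeEntry& e = t.dec[X];
        msg.push_back(e.symbol);
        uint32_t v = 0;
        for (uint32_t b = 0; b < e.nbBits; ++b) { v = (v << 1) | in.bits.back(); in.bits.pop_back(); }
        X = e.base + v;                                      // next state in [0, L)
    }
    return msg;
}
```

Calling buildTable with frequencies that sum to 2^R, then encode followed by decode with the original symbol count, should reproduce the input message. The sketch requires every encoded symbol to have a non-zero frequency.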
Requirements
- CUDA-enabled GPU with compute capability 3.0 or higher
- GNU/Linux
- CUDA SDK 9 or higher
- latest proprietary graphics drivers
Compilation process
Configuration
Please edit the Makefile:
Set ARCH to the compute capability of your GPU, e.g. ARCH = 35 for compute capability 3.5. To compile the decoder for multiple generations of GPUs, edit NVCC_FLAGS accordingly.
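As an illustration, a multi-architecture setup might look like the fragment below. The variable names ARCH and NVCC_FLAGS come from the text above; how the shipped Makefile actually combines them is an assumption, so adapt the lines to the real file. The -gencode option itself is a standard nvcc flag.

```make
# Hypothetical excerpt; check the shipped Makefile for the real variable usage.
# Single target: compute capability 3.5
ARCH = 35

# Several GPU generations (assumed example targets: 3.5, 6.1 and 7.0)
NVCC_FLAGS = -gencode arch=compute_35,code=sm_35 \
             -gencode arch=compute_61,code=sm_61 \
             -gencode arch=compute_70,code=sm_70
```

The architectures listed must be supported by the installed CUDA toolkit version.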
Test program
The test program generates multiple random datasets (alphabet of 256 symbols) of user-specified size. The symbols are exponentially distributed with increasing rate parameters (λ), yielding different compression ratios for different sets.
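The following C++ sketch shows one way such datasets could be produced and why a larger λ leads to better compressibility. It is not the generator used by the test program. The function names, the seed parameter, and the choice to truncate each sample to an integer and clamp it to 255 are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draws n byte symbols from an exponential distribution with rate lambda.
// Assumption: samples are truncated to integers and clamped to the range 0..255.
std::vector<uint8_t> makeDataset(std::size_t n, double lambda, uint32_t seed) {
    std::mt19937 rng(seed);
    std::exponential_distribution<double> dist(lambda);
    std::vector<uint8_t> data(n);
    for (uint8_t& b : data)
        b = static_cast<uint8_t>(std::min(dist(rng), 255.0));
    return data;
}

// Empirical Shannon entropy in bits per symbol; a lower value means the
// data can be compressed more strongly by an entropy coder such as tANS.
double entropyBits(const std::vector<uint8_t>& data) {
    std::array<std::size_t, 256> hist{};
    for (uint8_t b : data) ++hist[b];
    double h = 0.0;
    for (std::size_t c : hist) {
        if (c == 0) continue;
        const double p = static_cast<double>(c) / static_cast<double>(data.size());
        h -= p * std::log2(p);
    }
    return h;
}
```

With a larger rate parameter the symbols concentrate near zero, so the entropy per symbol drops and the achievable compression ratio rises.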
For each dataset, the program will:
- encode the data into a single compressed stream using tANS
- copy the compressed data to a specified GPU and decode it there
- decode the compressed data using a specified number of CPU threads
- print the time elapsed for each decoding process
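The timing in the last step can be sketched with a small wall-clock helper like the one below. This is a hypothetical helper, not the measurement code of the test program. For GPU work, the callable would also need to wait for the device to finish before returning, since kernel launches are asynchronous.

```cpp
#include <chrono>
#include <utility>

// Runs the given callable once and returns the elapsed wall-clock time in
// milliseconds. Hypothetical helper name; not part of the repository.
template <typename F>
double elapsedMilliseconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Passing the GPU decoding call and the CPU decoding call as separate lambdas gives one timing per decoder, which can then be printed side by side.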
Compiling the test program
To compile the test program, configure the Makefile as described above, then run:
make
Running the test program
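./bin/demo <compute device index> <size of input in megabytes> <number of CPU threads>
For example, ./bin/demo 0 100 8 runs the test on compute device 0 with 100 MB of input data and 8 CPU threads.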
