# LibFWHT
High-performance C99 library for computing the Fast Walsh-Hadamard Transform (FWHT), a fundamental tool in cryptanalysis and Boolean function analysis. The library provides multiple backend implementations (vectorized single-threaded CPU, OpenMP, and CUDA) with automatic selection based on problem size, offering optimal performance across different hardware configurations.
<p align="center"> <img src="examples/butterfly.svg" alt="FWHT Butterfly Diagram" width="200"> </p>

## Key Features
- **Multiple Backends**: Vectorized CPU (AVX2/SSE2/NEON), OpenMP multi-threading, CUDA GPU acceleration
- **Multi-Precision GPU**: int32, fp64, fp32 (30× faster), fp16 Tensor Cores (up to 54× faster with PyTorch DLPack) with automatic precision selection
- **Automatic Backend Selection**: Chooses the optimal implementation based on problem size and available hardware
- **Memory Efficient**: In-place algorithm with `O(log n)` stack space, cache-aligned allocations
- **High Performance**:
  - CPU: Up to 5 GOps/s with SIMD (AVX2/NEON)
  - OpenMP: Near-linear scaling on multi-core systems
  - GPU: Up to 1115 GOps/s on RTX 4090 (fp16 Tensor Cores with PyTorch DLPack)
  - Persistent GPU contexts eliminate malloc/free overhead (5-10× speedup)
- **Vectorized Batch Processing**: SIMD-accelerated batch API processes multiple transforms simultaneously (ideal for cryptanalysis)
- **Bit-packed Boolean WHT**: High-level API to compute the WHT from 1-bit packed truth tables (32× memory savings), with CUDA support for n ≤ 64K
- **Boolean GPU Contexts**: `fwht_gpu_boolean_context_{create,compute,destroy}` keep packed truth tables on the device, so repeated S-box transforms skip PCIe transfers and `cudaMalloc`
- **Overflow Safety**: Optional runtime overflow detection for int32 transforms with `fwht_i32_safe()`
- **Flexible API**: In-place transforms, out-of-place helpers, batch processing, Boolean function utilities, device-pointer APIs
- **Bit-packed Boolean GPU Path**: `fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU)` keeps inputs packed over PCIe and expands to ±1 on the device for `n ≤ 64K`
- **Production Ready**: Comprehensive test suite, numerical stability guarantees, precision warnings, command-line tool included
- **Easy Integration**: C99 standard, minimal dependencies, Python bindings available via PyPI with DLPack support
## Algorithm
The Fast Walsh-Hadamard Transform computes the Walsh spectrum of k-variable Boolean functions using butterfly operations:
- **Time complexity**: `O(n log n) = O(k × 2^k)` for truth tables of size `n = 2^k`
- **Space complexity**: Input/output buffer of size `O(n)` (inevitable for storing the signal) with only `O(1)` auxiliary workspace; the production kernels are fully in-place/iterative, so no large temporaries are allocated
- **Divide-and-conquer**: Recursive with cache-efficient base cases (512-element cutoff fits in L1)
- **Multi-backend architecture**:
  - CPU: SIMD vectorization (AVX2/SSE2/NEON auto-detected), software prefetching, cache-aligned allocations
  - OpenMP: Task-based recursive parallelism for multi-core scaling
  - GPU: Persistent buffers, async transfers, shared-memory kernels
- **Auto-tuning**: Runtime backend selection based on problem size and hardware
## Build and Install
- **Prerequisites**: C99 compiler, `make`; optional OpenMP toolchain; optional CUDA toolkit when GPU support is desired
- **Default build** (library + regression tests): `make`
- **Focused targets**: `make lib`, `make test`, `make test-gpu`, `make openmp`, `make NO_CUDA=1`
- **CUDA architectures**: `make` now emits SASS for `sm_70 75 80 86 89 90` plus a PTX fallback by default; override with `CUDA_ARCH_LIST="80 90" make` (or any space-separated list) to target a custom subset, or set it empty to fall back to the historical `-arch=sm_80` default
- **Installation** (optional): `sudo make install` installs headers and libraries into `/usr/local`
Build outputs are placed in `build/` (executables) and `lib/` (libraries).
## Library Usage
```c
#include <fwht.h>
#include <stdio.h>

int main(void) {
    /* Boolean truth table: 0 → +1, 1 → -1 */
    int32_t data[8] = {1, -1, -1, 1, -1, 1, 1, -1};
    fwht_status_t status = fwht_i32(data, 8);
    if (status != FWHT_SUCCESS) {
        fprintf(stderr, "%s\n", fwht_error_string(status));
        return 1;
    }
    printf("WHT[0] = %d\n", data[0]);
    return 0;
}
```
Compile with `gcc example.c -lfwht -lm` (or link directly against `libfwht.a` in `lib/`).
## Core API Highlights
- `fwht_i32`: in-place transform for `int32_t` data (default entry point for Boolean spectra)
  - Safe for all n if `|input[i]| ≤ 1`; general rule: `n × max(|input|) < 2^31`
- `fwht_i32_safe`: overflow-safe variant with runtime detection (returns `FWHT_ERROR_OVERFLOW` on overflow)
  - ~5-10% slower but guarantees detection of integer overflow
  - Use when input magnitudes are large or unknown
- `fwht_f64`: in-place transform for `double` data when fractional coefficients matter
  - Relative error typically `< log₂(n) × 2.22e-16` (e.g., `< 2e-15` for n = 2^20)
- `fwht_i8`: in-place transform for `int8_t` data to minimize memory footprint
- `fwht_boolean_packed`: compute the WHT directly from a bit-packed Boolean truth table (`uint64` words); ideal for S-box analysis and now GPU-accelerated via `fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU)` for `n ≤ 65536`
- `fwht_boolean_batch`: batch processing of multiple bit-packed Boolean functions (vectorial S-box components)
  - Note: only safe for `n ≤ 64` with `|input| = 1` (watch for overflow)
- `fwht_i32_backend`, `fwht_f64_backend`: same transforms with explicit backend selection (`AUTO`, `CPU`, `CPU_SAFE`, `OPENMP`, `GPU`)
- `fwht_compute_i32`, `fwht_compute_f64`: out-of-place transforms returning cache-aligned memory
  - Important: must be deallocated with `fwht_free()` instead of `free()`
- `fwht_from_bool`: convert a Boolean truth table to signed Walsh coefficients before transforming
- `fwht_correlations`: normalize Walsh coefficients to per-mask correlation values
- `fwht_has_gpu`, `fwht_has_openmp`, `fwht_backend_name`: query runtime capabilities and the selected backend
- `fwht_gpu_get_device_name`, `fwht_gpu_get_compute_capability`, `fwht_gpu_get_sm_count`: query GPU architecture details
## GPU Batch Processing and Multi-Precision Support (CUDA)
The library provides comprehensive GPU acceleration with multiple precision modes:
```c
#ifdef USE_CUDA
/* Multi-precision batch processing */
size_t n = 1024;
size_t batch_size = 100;

/* Integer precision (bit-exact) */
int32_t* data_i32 = malloc(n * batch_size * sizeof(int32_t));
fwht_batch_i32_cuda(data_i32, n, batch_size);

/* Float64 precision (cryptographic, ~1e-15 error) */
double* data_f64 = malloc(n * batch_size * sizeof(double));
fwht_batch_f64_cuda(data_f64, n, batch_size);

/* Float32 precision (balanced, 25× faster than fp64) */
float* data_f32 = malloc(n * batch_size * sizeof(float));
fwht_batch_f32_cuda(data_f32, n, batch_size);

/* FP16 Tensor Cores (ML, up to 54× faster with PyTorch DLPack, requires SM 7.0+) */
uint16_t* d_in_fp16;   /* Device pointer to fp16 data */
uint16_t* d_out_fp16;  /* Device pointer for results */
cudaMalloc(&d_in_fp16, n * batch_size * sizeof(uint16_t));
cudaMalloc(&d_out_fp16, n * batch_size * sizeof(uint16_t));
/* ... copy data to device ... */
int status = fwht_batch_f16_cuda_device(d_in_fp16, d_out_fp16, n, batch_size);
#endif
```
Device-pointer APIs for zero-copy GPU-resident buffers (no H2D/D2H transfers):
- `fwht_batch_i32_cuda_device()` - int32 on device
- `fwht_batch_f64_cuda_device()` - float64 on device
- `fwht_batch_f32_cuda_device()` - float32 on device (25× faster)
- `fwht_batch_f16_cuda_device()` - float16 Tensor Cores (up to 54× faster with PyTorch DLPack)
FP16 Tensor Core Notes:
- Requires NVIDIA GPU with SM 7.0+ (Volta, Turing, Ampere, Ada, Hopper)
- Provides 30-54× speedup vs fp64 (54× with PyTorch DLPack zero-copy) with small precision loss (~12% of results differ by ±1-4)
- Ideal for ML training/inference, NOT for cryptanalysis requiring bit-exact results
- Automatic runtime warning on first use (suppressible via `FWHT_SILENCE_FP16_WARNING=1`)
## GPU Context API (Persistent Memory)
For applications computing many WHTs repeatedly, persistent GPU contexts eliminate malloc/free overhead:
```c
#ifdef USE_CUDA
/* Create context with max transform size and batch size */
fwht_gpu_context_t* ctx = fwht_gpu_context_create(1024, 100);

/* Reuse pre-allocated GPU memory for many transforms */
for (int i = 0; i < 10000; i++) {
    fwht_gpu_context_compute_i32(ctx, data, 256, 1);
}

/* Clean up */
fwht_gpu_context_destroy(ctx);
#endif
```
### Bit-packed Boolean GPU Context
Boolean workflows now have their own lightweight context that keeps the packed bitset and ±1 buffer resident on the device. It mirrors the floating-point context but fixes the batch size at 1 and caps the transform at 64K so the unpack kernel can stay in shared memory.
```c
#ifdef USE_CUDA
fwht_gpu_boolean_context_t* bctx = fwht_gpu_boolean_context_create(32768);
fwht_status_t st = fwht_gpu_boolean_context_compute(bctx, packed_bits, spectrum, 32768);
fwht_gpu_boolean_context_destroy(bctx);
#endif
```
`fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU)` uses a shared singleton of this context behind the scenes, so existing call sites automatically skip repeated `cudaMalloc`/`cudaMemcpy` traffic and see the same 5-10× reduction in overhead as the float/int context API. Use the explicit context when you need deterministic lifetime control or multiple concurrent contexts per device.
Performance benefit: 5-10× speedup for cryptanalysis workloads with many small transforms.
## GPU Configuration and Tuning
```c
#ifdef USE_CUDA
/* Query GPU capabilities */
if (fwht_has_gpu()) {
    printf("GPU: %s\n", fwht_gpu_get_device_name());
}
#endif
```