LibFWHT

License: GPL v3

High-performance C99 library for computing the Fast Walsh-Hadamard Transform (FWHT), a fundamental tool in cryptanalysis and Boolean function analysis. The library provides multiple backend implementations (vectorized single-threaded CPU, OpenMP, and CUDA) with automatic selection based on problem size, offering optimal performance across different hardware configurations.

<p align="center"> <img src="examples/butterfly.svg" alt="FWHT Butterfly Diagram" width="200"> </p>

Key Features

  • Multiple Backends: Vectorized CPU (AVX2/SSE2/NEON), OpenMP multi-threading, CUDA GPU acceleration
  • Multi-Precision GPU: int32, fp64, fp32 (30× faster), fp16 Tensor Cores (up to 54× faster with PyTorch DLPack) with automatic precision selection
  • Automatic Backend Selection: Chooses optimal implementation based on problem size and available hardware
  • Memory Efficient: In-place algorithm with O(log n) stack space, cache-aligned allocations
  • High Performance:
    • CPU: Up to 5 GOps/s with SIMD (AVX2/NEON)
    • OpenMP: Near-linear scaling on multi-core systems
    • GPU: Up to 1115 GOps/s on RTX 4090 (fp16 Tensor Cores with PyTorch DLPack)
    • Persistent GPU contexts eliminate malloc/free overhead (5-10× speedup)
  • Vectorized Batch Processing: SIMD-accelerated batch API processes multiple transforms simultaneously (ideal for cryptanalysis)
  • Bit-packed Boolean WHT: High-level API to compute WHT from 1-bit packed truth tables (32× memory savings) with CUDA support for n ≤ 64K
  • Boolean GPU contexts: fwht_gpu_boolean_context_{create,compute,destroy} keep packed truth tables on device so repeated S-box transforms skip PCIe transfers and cudaMalloc
  • Overflow Safety: Optional runtime overflow detection for int32 transforms with fwht_i32_safe()
  • Flexible API: In-place transforms, out-of-place helpers, batch processing, Boolean function utilities, device-pointer APIs
  • Bit-packed Boolean GPU path: fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU) keeps inputs packed over PCIe and expands to ±1 on device for n ≤ 64K
  • Production Ready: Comprehensive test suite, numerical stability guarantees, precision warnings, command-line tool included
  • Easy Integration: C99 standard, minimal dependencies, Python bindings available via PyPI with DLPack support

Algorithm

The Fast Walsh-Hadamard Transform computes the Walsh spectrum of k-variable Boolean functions using butterfly operations:

  • Time complexity: O(n log n) = O(k × 2^k) for truth tables of size n = 2^k
  • Space complexity: operates in place on the O(n) input/output buffer (unavoidable for storing the signal) with only O(log n) recursion-stack overhead; no large temporaries are allocated.
  • Divide-and-conquer: Recursive with cache-efficient base cases (512-element cutoff fits in L1)
  • Multi-backend architecture:
    • CPU: SIMD vectorization (AVX2/SSE2/NEON auto-detected), software prefetching, cache-aligned allocations
    • OpenMP: Task-based recursive parallelism for multi-core scaling
    • GPU: Persistent buffers, async transfers, shared memory kernels
    • Auto-tuning: Runtime backend selection based on problem size and hardware

Build and Install

  • Prerequisites: C99 compiler, make; optional OpenMP toolchain; optional CUDA toolkit when GPU support is desired
  • Default build (library + regression tests): make
  • Focused targets: make lib, make test, make test-gpu, make openmp, make NO_CUDA=1
  • CUDA architectures: by default, make emits SASS for sm_70/75/80/86/89/90 plus a PTX fallback; override with CUDA_ARCH_LIST="80 90" make (any space-separated list) to target a custom subset, or set it empty to fall back to the historical -arch=sm_80 default.
  • Installation (optional): sudo make install installs headers and libraries into /usr/local

Build outputs are placed in build/ (executables) and lib/ (libraries).

Library Usage

#include <fwht.h>
#include <stdio.h>

int main(void) {
    /* Boolean truth table: 0 → +1, 1 → -1 */
    int32_t data[8] = {1, -1, -1, 1, -1, 1, 1, -1};

    fwht_status_t status = fwht_i32(data, 8);
    if (status != FWHT_SUCCESS) {
        fprintf(stderr, "%s\n", fwht_error_string(status));
        return 1;
    }

    printf("WHT[0] = %d\n", data[0]);
    return 0;
}

Compile with gcc example.c -lfwht -lm (or link directly against libfwht.a in lib/).

Core API Highlights

  • fwht_i32: in-place transform for int32_t data (default entry point for Boolean spectra)
    • Safe for all n if |input[i]| ≤ 1; general rule: n × max(|input|) < 2^31
  • fwht_i32_safe: overflow-safe variant with runtime detection (returns FWHT_ERROR_OVERFLOW on overflow)
    • ~5-10% slower but guarantees detection of integer overflow
    • Use when input magnitudes are large or unknown
  • fwht_f64: in-place transform for double data when fractional coefficients matter
    • Relative error typically < log₂(n) × 2.22e-16 (e.g., < 2e-15 for n=2^20)
  • fwht_i8: in-place transform for int8_t data to minimize memory footprint
  • fwht_boolean_packed: compute WHT directly from a bit-packed Boolean truth table (uint64 words); ideal for S-box analysis and now GPU-accelerated via fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU) for n ≤ 65536
  • fwht_boolean_batch: batch processing of multiple bit-packed Boolean functions (vectorial S-box components)
    • Note: Only safe for n ≤ 64 with |input| = 1 (watch for overflow)
  • fwht_i32_backend, fwht_f64_backend: same transforms with explicit backend selection (AUTO, CPU, CPU_SAFE, OPENMP, GPU)
  • fwht_compute_i32, fwht_compute_f64: out-of-place transforms returning cache-aligned memory
    • Important: Must use fwht_free() instead of free() to deallocate results
  • fwht_from_bool: convert a Boolean truth table to signed Walsh coefficients before transforming
  • fwht_correlations: normalize Walsh coefficients to per-mask correlation values
  • fwht_has_gpu, fwht_has_openmp, fwht_backend_name: query runtime capabilities and selected backend
  • fwht_gpu_get_device_name, fwht_gpu_get_compute_capability, fwht_gpu_get_sm_count: query GPU architecture details

GPU Batch Processing and Multi-Precision Support (CUDA)

The library provides comprehensive GPU acceleration with multiple precision modes:

#ifdef USE_CUDA
/* Multi-precision batch processing */
size_t n = 1024;
size_t batch_size = 100;

/* Integer precision (bit-exact) */
int32_t* data_i32 = malloc(n * batch_size * sizeof(int32_t));
fwht_batch_i32_cuda(data_i32, n, batch_size);

/* Float64 precision (cryptographic, ~1e-15 error) */
double* data_f64 = malloc(n * batch_size * sizeof(double));
fwht_batch_f64_cuda(data_f64, n, batch_size);

/* Float32 precision (balanced, 25× faster than fp64) */
float* data_f32 = malloc(n * batch_size * sizeof(float));
fwht_batch_f32_cuda(data_f32, n, batch_size);

/* FP16 Tensor Cores (ML, up to 54× faster with PyTorch DLPack, requires SM 7.0+) */
uint16_t* d_in_fp16;   /* Device pointer to fp16 data */
uint16_t* d_out_fp16;  /* Device pointer for results */
cudaMalloc(&d_in_fp16, n * batch_size * sizeof(uint16_t));
cudaMalloc(&d_out_fp16, n * batch_size * sizeof(uint16_t));
/* ... copy data to device ... */
int status = fwht_batch_f16_cuda_device(d_in_fp16, d_out_fp16, n, batch_size);
/* Clean up host and device buffers */
free(data_i32);
free(data_f64);
free(data_f32);
cudaFree(d_in_fp16);
cudaFree(d_out_fp16);
#endif

Device-pointer APIs for zero-copy GPU-resident buffers (no H2D/D2H transfers):

  • fwht_batch_i32_cuda_device() - int32 on device
  • fwht_batch_f64_cuda_device() - float64 on device
  • fwht_batch_f32_cuda_device() - float32 on device (25× faster)
  • fwht_batch_f16_cuda_device() - float16 Tensor Cores (up to 54× faster with PyTorch DLPack)

FP16 Tensor Core Notes:

  • Requires NVIDIA GPU with SM 7.0+ (Volta, Turing, Ampere, Ada, Hopper)
  • Provides 30-54× speedup vs fp64 (54× with PyTorch DLPack zero-copy) with small precision loss (~12% of results differ by ±1-4)
  • Ideal for ML training/inference, NOT for cryptanalysis requiring bit-exact results
  • Automatic runtime warning on first use (suppressible via FWHT_SILENCE_FP16_WARNING=1)

GPU Context API (Persistent Memory)

For applications computing many WHTs repeatedly, persistent GPU contexts eliminate malloc/free overhead:

#ifdef USE_CUDA
/* Create context with max transform size and batch size */
fwht_gpu_context_t* ctx = fwht_gpu_context_create(1024, 100);

/* Reuse pre-allocated GPU memory for many transforms */
for (int i = 0; i < 10000; i++) {
    fwht_gpu_context_compute_i32(ctx, data, 256, 1);
}

/* Clean up */
fwht_gpu_context_destroy(ctx);
#endif

Bit-packed Boolean GPU Context

Boolean workflows now have their own lightweight context that keeps the packed bitset and ±1 buffer resident on the device. It mirrors the floating-point context but fixes the batch size at 1 and caps the transform at 64K so the unpack kernel can stay in shared memory.

#ifdef USE_CUDA
fwht_gpu_boolean_context_t* bctx = fwht_gpu_boolean_context_create(32768);
fwht_status_t st = fwht_gpu_boolean_context_compute(bctx, packed_bits, spectrum, 32768);
fwht_gpu_boolean_context_destroy(bctx);
#endif

fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU) uses a shared singleton of this context behind the scenes, so existing call sites automatically skip repeated cudaMalloc/cudaMemcpy traffic and see the same 5–10× reduction in overhead as the float/int context API. Use the explicit context when you need deterministic lifetime control or multiple concurrent contexts per device.

Performance benefit: 5-10× speedup for cryptanalysis workloads with many small transforms.

GPU Configuration and Tuning

#ifdef USE_CUDA
/* Query GPU capabilities before choosing a backend */
if (fwht_has_gpu()) {
    printf("GPU: %s\n", fwht_gpu_get_device_name());
}
#endif
