LibFWHT

License: GPL v3

High-performance C99 library for computing the Fast Walsh-Hadamard Transform (FWHT), a fundamental tool in cryptanalysis and Boolean function analysis. The library provides multiple backend implementations (vectorized single-threaded CPU, OpenMP, and CUDA) with automatic selection based on problem size, offering optimal performance across different hardware configurations.

<p align="center"> <img src="examples/butterfly.svg" alt="FWHT Butterfly Diagram" width="200"> </p>

Key Features

  • Multiple Backends: Vectorized CPU (AVX2/SSE2/NEON), OpenMP multi-threading, CUDA GPU acceleration
  • Multi-Precision GPU: int32, fp64, fp32 (30× faster), fp16 Tensor Cores (up to 54× faster with PyTorch DLPack) with automatic precision selection
  • Automatic Backend Selection: Chooses optimal implementation based on problem size and available hardware
  • Memory Efficient: In-place algorithm with O(log n) stack space, cache-aligned allocations
  • High Performance:
    • CPU: Up to 5 GOps/s with SIMD (AVX2/NEON)
    • OpenMP: Near-linear scaling on multi-core systems
    • GPU: Up to 1115 GOps/s on RTX 4090 (fp16 Tensor Cores with PyTorch DLPack)
    • Persistent GPU contexts eliminate malloc/free overhead (5-10× speedup)
  • Vectorized Batch Processing: SIMD-accelerated batch API processes multiple transforms simultaneously (ideal for cryptanalysis)
  • Bit-packed Boolean WHT: High-level API to compute WHT from 1-bit packed truth tables (32× memory savings) with CUDA support for n ≤ 64K
  • Boolean GPU contexts: fwht_gpu_boolean_context_{create,compute,destroy} keep packed truth tables on device so repeated S-box transforms skip PCIe transfers and cudaMalloc
  • Overflow Safety: Optional runtime overflow detection for int32 transforms with fwht_i32_safe()
  • Flexible API: In-place transforms, out-of-place helpers, batch processing, Boolean function utilities, device-pointer APIs
  • Bit-packed Boolean GPU path: fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU) keeps inputs packed over PCIe and expands to ±1 on device for n ≤ 64K
  • Production Ready: Comprehensive test suite, numerical stability guarantees, precision warnings, command-line tool included
  • Easy Integration: C99 standard, minimal dependencies, Python bindings available via PyPI with DLPack support

Algorithm

The Fast Walsh-Hadamard Transform computes the Walsh spectrum of k-variable Boolean functions using butterfly operations:

  • Time complexity: O(n log n) = O(k × 2^k) for truth tables of size n = 2^k
  • Space complexity: operates in place on the O(n) input/output buffer (unavoidable for storing the signal) with only O(log n) recursion-stack overhead; no large temporaries are allocated.
  • Divide-and-conquer: Recursive with cache-efficient base cases (512-element cutoff fits in L1)
  • Multi-backend architecture:
    • CPU: SIMD vectorization (AVX2/SSE2/NEON auto-detected), software prefetching, cache-aligned allocations
    • OpenMP: Task-based recursive parallelism for multi-core scaling
    • GPU: Persistent buffers, async transfers, shared memory kernels
    • Auto-tuning: Runtime backend selection based on problem size and hardware

Build and Install

  • Prerequisites: C99 compiler, make; optional OpenMP toolchain; optional CUDA toolkit when GPU support is desired
  • Default build (library + regression tests): make
  • Focused targets: make lib, make test, make test-gpu, make openmp, make NO_CUDA=1
  • CUDA architectures: by default, make emits SASS for sm_70/75/80/86/89/90 plus a PTX fallback; override with CUDA_ARCH_LIST="80 90" make (any space-separated list) to target a custom subset, or set it empty to fall back to the historical -arch=sm_80 default.
  • Installation (optional): sudo make install installs headers and libraries into /usr/local

Build outputs are placed in build/ (executables) and lib/ (libraries).

Library Usage

#include <fwht.h>
#include <stdio.h>

int main(void) {
    /* Boolean truth table: 0 → +1, 1 → -1 */
    int32_t data[8] = {1, -1, -1, 1, -1, 1, 1, -1};

    fwht_status_t status = fwht_i32(data, 8);
    if (status != FWHT_SUCCESS) {
        fprintf(stderr, "%s\n", fwht_error_string(status));
        return 1;
    }

    printf("WHT[0] = %d\n", data[0]);
    return 0;
}

Compile with gcc example.c -lfwht -lm (or link directly against libfwht.a in lib/).

Core API Highlights

  • fwht_i32: in-place transform for int32_t data (default entry point for Boolean spectra)
    • Safe for all n if |input[i]| ≤ 1; general rule: n × max(|input|) < 2^31
  • fwht_i32_safe: overflow-safe variant with runtime detection (returns FWHT_ERROR_OVERFLOW on overflow)
    • ~5-10% slower but guarantees detection of integer overflow
    • Use when input magnitudes are large or unknown
  • fwht_f64: in-place transform for double data when fractional coefficients matter
    • Relative error typically < log₂(n) × 2.22e-16 (e.g., < 2e-15 for n=2^20)
  • fwht_i8: in-place transform for int8_t data to minimize memory footprint
  • fwht_boolean_packed: compute WHT directly from a bit-packed Boolean truth table (uint64 words); ideal for S-box analysis and now GPU-accelerated via fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU) for n ≤ 65536
  • fwht_boolean_batch: batch processing of multiple bit-packed Boolean functions (vectorial S-box components)
    • Note: Only safe for n ≤ 64 with |input| = 1 (watch for overflow)
  • fwht_i32_backend, fwht_f64_backend: same transforms with explicit backend selection (AUTO, CPU, CPU_SAFE, OPENMP, GPU)
  • fwht_compute_i32, fwht_compute_f64: out-of-place transforms returning cache-aligned memory
    • Important: Must use fwht_free() instead of free() to deallocate results
  • fwht_from_bool: convert a Boolean truth table to signed Walsh coefficients before transforming
  • fwht_correlations: normalize Walsh coefficients to per-mask correlation values
  • fwht_has_gpu, fwht_has_openmp, fwht_backend_name: query runtime capabilities and selected backend
  • fwht_gpu_get_device_name, fwht_gpu_get_compute_capability, fwht_gpu_get_sm_count: query GPU architecture details

GPU Batch Processing and Multi-Precision Support (CUDA)

The library provides comprehensive GPU acceleration with multiple precision modes:

#ifdef USE_CUDA
/* Multi-precision batch processing */
size_t n = 1024;
size_t batch_size = 100;

/* Integer precision (bit-exact) */
int32_t* data_i32 = malloc(n * batch_size * sizeof(int32_t));
fwht_batch_i32_cuda(data_i32, n, batch_size);

/* Float64 precision (cryptographic, ~1e-15 error) */
double* data_f64 = malloc(n * batch_size * sizeof(double));
fwht_batch_f64_cuda(data_f64, n, batch_size);

/* Float32 precision (balanced, 25× faster than fp64) */
float* data_f32 = malloc(n * batch_size * sizeof(float));
fwht_batch_f32_cuda(data_f32, n, batch_size);

/* FP16 Tensor Cores (ML, up to 54× faster with PyTorch DLPack, requires SM 7.0+) */
uint16_t* d_in_fp16;   /* Device pointer to fp16 data */
uint16_t* d_out_fp16;  /* Device pointer for results */
cudaMalloc(&d_in_fp16, n * batch_size * sizeof(uint16_t));
cudaMalloc(&d_out_fp16, n * batch_size * sizeof(uint16_t));
/* ... copy data to device ... */
int status = fwht_batch_f16_cuda_device(d_in_fp16, d_out_fp16, n, batch_size);
/* Clean up host and device buffers */
free(data_i32);
free(data_f64);
free(data_f32);
cudaFree(d_in_fp16);
cudaFree(d_out_fp16);
#endif

Device-pointer APIs for zero-copy GPU-resident buffers (no H2D/D2H transfers):

  • fwht_batch_i32_cuda_device() - int32 on device
  • fwht_batch_f64_cuda_device() - float64 on device
  • fwht_batch_f32_cuda_device() - float32 on device (25× faster)
  • fwht_batch_f16_cuda_device() - float16 Tensor Cores (up to 54× faster with PyTorch DLPack)

FP16 Tensor Core Notes:

  • Requires NVIDIA GPU with SM 7.0+ (Volta, Turing, Ampere, Ada, Hopper)
  • Provides 30-54× speedup vs fp64 (54× with PyTorch DLPack zero-copy) with small precision loss (~12% of results differ by ±1-4)
  • Ideal for ML training/inference, NOT for cryptanalysis requiring bit-exact results
  • Automatic runtime warning on first use (suppressible via FWHT_SILENCE_FP16_WARNING=1)

GPU Context API (Persistent Memory)

For applications computing many WHTs repeatedly, persistent GPU contexts eliminate malloc/free overhead:

#ifdef USE_CUDA
/* Create context with max transform size and batch size */
fwht_gpu_context_t* ctx = fwht_gpu_context_create(1024, 100);

/* Reuse pre-allocated GPU memory for many transforms */
for (int i = 0; i < 10000; i++) {
    fwht_gpu_context_compute_i32(ctx, data, 256, 1);
}

/* Clean up */
fwht_gpu_context_destroy(ctx);
#endif

Bit-packed Boolean GPU Context

Boolean workflows now have their own lightweight context that keeps the packed bitset and ±1 buffer resident on the device. It mirrors the floating-point context but fixes the batch size at 1 and caps the transform at 64K so the unpack kernel can stay in shared memory.

#ifdef USE_CUDA
fwht_gpu_boolean_context_t* bctx = fwht_gpu_boolean_context_create(32768);
fwht_status_t st = fwht_gpu_boolean_context_compute(bctx, packed_bits, spectrum, 32768);
fwht_gpu_boolean_context_destroy(bctx);
#endif

fwht_boolean_packed_backend(..., FWHT_BACKEND_GPU) uses a shared singleton of this context behind the scenes, so existing call sites automatically skip repeated cudaMalloc/cudaMemcpy traffic and see the same 5–10× reduction in overhead as the float/int context API. Use the explicit context when you need deterministic lifetime control or multiple concurrent contexts per device.

Performance benefit: 5-10× speedup for cryptanalysis workloads with many small transforms.

GPU Configuration and Tuning

#ifdef USE_CUDA
/* Query GPU capabilities before choosing a backend */
if (fwht_has_gpu()) {
    printf("GPU: %s\n", fwht_gpu_get_device_name());
}
#endif
