
NumKong

SIMD-accelerated distances, dot products, matrix ops, and geospatial & geometric kernels for 16 numeric types (from 6-bit floats to 64-bit complex) across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go 📐


NumKong: Mixed Precision for All

NumKong (previously SimSIMD) is a portable mixed-precision math library with over 2000 kernels for x86, Arm, RISC-V, and WASM. It covers numeric types from 6-bit floats to 64-bit complex numbers, hardened against in-house 118-bit extended-precision baselines. Built alongside the USearch vector-search engine, it provides wider accumulators to avoid the overflow and precision loss typical of naive same-type arithmetic.

NumKong banner

Latency, Throughput, & Numerical Stability

Most libraries return dot products in the same type as the input: Float16 × Float16 → Float16, Int8 × Int8 → Int8. This leads to quiet overflow: a 2048-dimensional i8 dot product can reach ±10 million, but i8 maxes out at 127. NumKong promotes to wider accumulators (Float16 → Float32, BFloat16 → Float32, Int8 → Int32, Float32 → Float64) so results stay in range.
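The wrap-around is easy to reproduce with NumPy alone. A minimal sketch (random vectors, not a NumKong API call) contrasting a same-type int8 accumulator with a widened int32 one:

```python
import numpy as np

# Hypothetical 2048-dimensional i8 vectors, mirroring the example above.
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=2048, dtype=np.int8)
b = rng.integers(-128, 128, size=2048, dtype=np.int8)

wrapped = np.dot(a, b)  # int8 in, int8 out: products and sums wrap modulo 256
widened = np.dot(a.astype(np.int32), b.astype(np.int32))  # int32 accumulator: exact

print(wrapped, widened)  # the wrapped result is confined to [-128, 127]
```

Because modular addition composes, the wrapped result equals the exact sum reduced into the int8 range, so all information above the low 8 bits is silently lost.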

Single 2048-d dot product on Intel Sapphire Rapids, single-threaded. Each cell shows gso/s and mean relative error vs a higher-precision reference. gso/s stands for Giga Scalar Operations per Second, a more suitable unit than GFLOP/s when counting both integer and floating-point work. NumPy 2.4, PyTorch 2.10, JAX 0.9.

| Input | NumPy + OpenBLAS | PyTorch + MKL | JAX | NumKong |
| :---- | ---------------: | ------------: | ------------------: | -------------------: |
| f64 | 2.0 gso/s, 1e-15 err | 0.6 gso/s, 1e-15 err | 0.4 gso/s, 1e-14 err | 5.8 gso/s, 1e-16 err |
| f32 | 1.5 gso/s, 2e-6 err | 0.6 gso/s, 2e-6 err | 0.4 gso/s, 5e-6 err | 7.1 gso/s, 2e-7 err |
| bf16 | — | 0.5 gso/s, 1.9% err | 0.5 gso/s, 1.9% err | 9.7 gso/s, 1.8% err |
| f16 | 0.2 gso/s, 0.25% err | 0.5 gso/s, 0.25% err | 0.4 gso/s, 0.25% err | 11.5 gso/s, 0.24% err |
| e5m2 | — | 0.7 gso/s, 4.6% err | 0.5 gso/s, 4.6% err | 7.1 gso/s, 0% err |
| i8 | 1.1 gso/s, overflow | 0.5 gso/s, overflow | 0.5 gso/s, overflow | 14.8 gso/s, 0% err |

A fair objection: PyTorch and JAX are designed for throughput, not single-call latency. They lower execution graphs through XLA or vendored BLAS libraries like Intel MKL and Nvidia cuBLAS. So here's the same comparison on a throughput-oriented workload, matrix multiplication:

Matrix multiplication of two 2048 × 2048 matrices on Intel Sapphire Rapids, single-threaded. Same cell format (gso/s, mean relative error) and same library versions as above.

| Input | NumPy + OpenBLAS | PyTorch + MKL | JAX | NumKong |
| :---- | ---------------: | ------------: | --------------------: | -------------------: |
| f64 | 65.5 gso/s, 1e-15 err | 68.2 gso/s, 1e-15 err | ~14.3 gso/s, 1e-15 err | 8.6 gso/s, 1e-16 err |
| f32 | 140 gso/s, 9e-7 err | 145 gso/s, 1e-6 err | ~60.5 gso/s, 1e-6 err | 37.7 gso/s, 4e-7 err |
| bf16 | — | 851 gso/s, 1.8% err | ~25.8 gso/s, 3.4% err | 458 gso/s, 3.6% err |
| f16 | 0.3 gso/s, 0.25% err | 140 gso/s, 0.37% err | ~26.1 gso/s, 0.35% err | 103 gso/s, 0.26% err |
| e5m2 | — | 0.4 gso/s, 4.6% err | ~26.4 gso/s, 4.6% err | 398 gso/s, 0% err |
| i8 | 0.4 gso/s, overflow | 50.0 gso/s, overflow | ~0.0 gso/s, overflow | 1279 gso/s, 0% err |

For f64, compensated "Dot2" summation reduces error by 10–50× compared to naive Float64 accumulation, depending on vector length. For f32, widening to Float64 gives 5–10× lower error. The library ships as a relatively small binary:
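To see why accumulator width matters, here is a plain-Python sketch (an illustration, not NumKong's actual Dot2 kernel) that runs the same sequential f32 dot product with f32 and f64 accumulators and compares both against a correctly rounded reference:

```python
import math
import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal(2048).astype(np.float32)
b = rng.standard_normal(2048).astype(np.float32)

# Reference: each f32*f32 product is exactly representable in f64,
# and math.fsum adds those exact products with correct rounding.
ref = math.fsum(float(x) * float(y) for x, y in zip(a, b))

def seq_dot(a, b, acc):
    """Sequential dot product with an explicit accumulator type."""
    s = acc(0)
    for x, y in zip(a, b):
        s += acc(x) * acc(y)
    return float(s)

err_f32 = abs(seq_dot(a, b, np.float32) - ref) / abs(ref)
err_f64 = abs(seq_dot(a, b, np.float64) - ref) / abs(ref)
print(f"f32 accumulator: {err_f32:.1e}  f64 accumulator: {err_f64:.1e}")
```

The f64 accumulator keeps every intermediate product exact and leaves only summation rounding, which is why widening alone recovers most of the accuracy; compensated schemes like Dot2 go further by also tracking the rounding residue of each addition.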

| Package | Size | Parallelism & Memory | Available For |
| :--------------- | -----: | :------------------------------------------------ | :---------------- |
| PyTorch + MKL | 705 MB | Vector & Tile SIMD, OpenMP Threads, Hidden Allocs | Python, C++, Java |
| JAX + jaxlib | 357 MB | Vector SIMD, XLA Threads, Hidden Allocs | Python |
| NumPy + OpenBLAS | 30 MB | Vector SIMD, Built-in Threads, Hidden Allocs | Python |
| mathjs | 9 MB | No SIMD, No Threads, Many Allocs | JS |
| NumKong | 5 MB | Vector & Tile SIMD, Your Threads, Your Allocs | 7 languages |

Every kernel is validated against 118-bit extended-precision baselines with per-type ULP budgets across log-normal, uniform, and Cauchy input distributions. Tests check triangle inequality, Cauchy-Schwarz bounds, NaN propagation, overflow detection, and probability-simplex constraints for each ISA variant. Results are cross-validated against OpenBLAS, Intel MKL, and Apple Accelerate. A broader throughput comparison is maintained in NumWars.
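Per-type ULP budgets can be checked with the usual ordered-integer view of IEEE 754 floats. A sketch of one such measurement for float32 (a common technique, not NumKong's internal test harness):

```python
import numpy as np

def ulp_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """ULP distance between float32 arrays: reinterpret the bits as
    integers, then remap negative floats so integer order matches
    float order; the distance is then a plain integer difference."""
    def ordered(x):
        i = np.asarray(x, dtype=np.float32).view(np.int32).astype(np.int64)
        return np.where(i < 0, np.int64(-2**31) - i, i)
    return np.abs(ordered(a) - ordered(b))

x = np.float32(1.0)
y = np.nextafter(x, np.float32(2.0))  # the next representable float32 above 1.0
print(ulp_distance(np.array([x]), np.array([y])))  # prints [1]
```

Adjacent representable values are exactly 1 ULP apart, and +0.0 and -0.0 map to the same ordered integer, so a "budget" of a few ULPs can be asserted uniformly across the whole range, including across zero.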

Quick Start

| Language | Install | Compatible with | Guide |
| :------- | :------------------------- | :----------------------------- | :------------------- |
| C / C++ | CMake, headers, & prebuilt | Linux, macOS, Windows, Android | include/README.md |
| Python | pip install | Linux, macOS, Windows | python/README.md |
| Rust | cargo add | Linux, macOS, Windows | rust/README.md |
| JS | npm install & import | Node.js, Bun, Deno & browsers | javascript/README.md |
| Swift | Swift Package Manager | macOS, iOS, tvOS, watchOS | swift/README.md |
| Go | go get | Linux, macOS, Windows via cGo | golang/README.md |

What's Inside

NumKong covers 16 numeric types, from 6-bit floats to 64-bit complex numbers, across dozens of operations and 30+ SIMD backends, with hardware-aware defaults: Arm prioritizes f16, x86 prioritizes bf16.

**Operations**

- Vector-Vector: [dot](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#dot-products) · [angular](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#dense-distances) · [euclidean](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#dense-distances) · hamming · kld · jsd · …
- [Matrix-Matrix](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#packed-matrix-kernels-for-gemm-like-workloads): [dots_packed](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#packed-matrix-kernels-for-gemm-like-workloads) · [dots_symmetric](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#symmetric-kernels-for-syrk-like-workloads) · [euclideans_packed](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#packed-matrix-kernels-for-gemm-like-workloads) · …
- Quadratic: [bilinear](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#curved-metrics) · mahalanobis
- [Geospatial](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#geospatial-metrics) & [Geometric](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#geometric-mesh-alignment) · …

**Datatypes**

- [Bits & Ints](#numeric-types): u1 · u4 · u8 · i4 · i8
- [Mini-floats](#mini-floats-e4m3-e5m2-e3m2--e2m3): e2m3 · e3m2 · e4m3 · e5m2
- [Half & Classic](#float16--bfloat16-half-precision): f16 · bf16 · f32 · f64

**Backends**

- [x86](#compile-time-and-run-time-dispatch): Haswell · Skylake · Ice Lake · Alder Lake · Sapphire Rapids · Sierra Forest · Granite Rapids · Genoa · Turin
- [Arm](#compile-time-and-run-time-dispatch): NEON · NEONHalf · NEONFhm · NEONBFDot · NEONSDot · SVE · SVEHalf · SVEBfDot

**Ecosystems**

- Core: [C 99](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#the-c-abi)
- Primary: [C++ 23](https://github.com/ashvardanian/NumKong/blob/main/include/README.md#the-c-layer) · [Python 3](https://github.com/ashvardanian/NumKong/blob/main/python/README.md) · [Rust](https://github.com/ashvardanian/NumKong/blob/main/rust/README.md)
- Additional: [Swift](https://github.com/ashvardanian/NumKong/blob/main/swift/README.md) · [JS](https://github.com/ashvardanian/NumKong/blob/main/javascript/README.md) · [Go](https://github.com/ashvardanian/NumKong/blob/main/golang/README.md)
