NumKong
SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types (from 6-bit floats to 64-bit complex) across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go
NumKong: Mixed Precision for All
NumKong (previously SimSIMD) is a portable mixed-precision math library with over 2000 kernels for x86, Arm, RISC-V, and WASM. It covers numeric types from 6-bit floats to 64-bit complex numbers, hardened against in-house 118-bit extended-precision baselines. Built alongside the USearch vector-search engine, it provides wider accumulators to avoid the overflow and precision loss typical of naive same-type arithmetic.

Latency, Throughput, & Numerical Stability
Most libraries return dot products in the same type as the input: Float16 × Float16 → Float16, Int8 × Int8 → Int8.
This leads to quiet overflow: a 2048-dimensional i8 dot product can reach ±10 million, but i8 maxes out at 127.
NumKong promotes to wider accumulators (Float16 → Float32, BFloat16 → Float32, Int8 → Int32, Float32 → Float64) so results stay in range.
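The wraparound is easy to reproduce with plain NumPy, which keeps i8 × i8 accumulation in i8. The sketch below shows the failure mode and the widening fix; it uses NumPy throughout, not NumKong's own API:

```python
import numpy as np

# Two 2048-dimensional int8 vectors of all 100s: the true dot product is
# 2048 * 100 * 100 = 20,480,000, far beyond int8's maximum of 127.
a = np.full(2048, 100, dtype=np.int8)
b = np.full(2048, 100, dtype=np.int8)

naive = a @ b  # int8 in, int8 out: every product and sum wraps modulo 256
widened = a.astype(np.int32) @ b.astype(np.int32)  # promote first, then accumulate

print(naive, widened)  # 0 20480000 -- the wrapped result is silently meaningless
```

Promoting before the multiply is exactly the accumulator-widening strategy described above, done by hand; NumKong performs the equivalent promotion inside its kernels.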
Single 2048-d dot product on Intel Sapphire Rapids, single-threaded. Each cell shows gso/s and mean relative error vs a higher-precision reference. gso/s = Giga Scalar Operations per Second, a more suitable name than GFLOP/s when counting both integer and floating-point work. NumPy 2.4, PyTorch 2.10, JAX 0.9.
| Input | NumPy + OpenBLAS | PyTorch + MKL | JAX | NumKong |
| :----- | ----------------------: | ----------------------: | ----------------------: | --------------------: |
| f64 | 2.0 gso/s, 1e-15 err | 0.6 gso/s, 1e-15 err | 0.4 gso/s, 1e-14 err | 5.8 gso/s, 1e-16 err |
| f32 | 1.5 gso/s, 2e-6 err | 0.6 gso/s, 2e-6 err | 0.4 gso/s, 5e-6 err | 7.1 gso/s, 2e-7 err |
| bf16 | — | 0.5 gso/s, 1.9% err | 0.5 gso/s, 1.9% err | 9.7 gso/s, 1.8% err |
| f16 | 0.2 gso/s, 0.25% err | 0.5 gso/s, 0.25% err | 0.4 gso/s, 0.25% err | 11.5 gso/s, 0.24% err |
| e5m2 | — | 0.7 gso/s, 4.6% err | 0.5 gso/s, 4.6% err | 7.1 gso/s, 0% err |
| i8 | 1.1 gso/s, overflow | 0.5 gso/s, overflow | 0.5 gso/s, overflow | 14.8 gso/s, 0% err |
A fair objection: PyTorch and JAX are designed for throughput, not single-call latency. They lower execution graphs through XLA or vendored BLAS libraries like Intel MKL and Nvidia cuBLAS. So here's the same comparison on a throughput-oriented workload, matrix multiplication:
Matrix multiplication (2048 × 2048) × (2048 × 2048) on Intel Sapphire Rapids, single-threaded. gso/s = Giga Scalar Operations per Second, same format and library versions as above.
| Input | NumPy + OpenBLAS | PyTorch + MKL | JAX | NumKong |
| :----- | ----------------------: | -----------------------: | -----------------------: | -------------------: |
| f64 | 65.5 gso/s, 1e-15 err | 68.2 gso/s, 1e-15 err | ~14.3 gso/s, 1e-15 err | 8.6 gso/s, 1e-16 err |
| f32 | 140 gso/s, 9e-7 err | 145 gso/s, 1e-6 err | ~60.5 gso/s, 1e-6 err | 37.7 gso/s, 4e-7 err |
| bf16 | — | 851 gso/s, 1.8% err | ~25.8 gso/s, 3.4% err | 458 gso/s, 3.6% err |
| f16 | 0.3 gso/s, 0.25% err | 140 gso/s, 0.37% err | ~26.1 gso/s, 0.35% err | 103 gso/s, 0.26% err |
| e5m2 | — | 0.4 gso/s, 4.6% err | ~26.4 gso/s, 4.6% err | 398 gso/s, 0% err |
| i8 | 0.4 gso/s, overflow | 50.0 gso/s, overflow | ~0.0 gso/s, overflow | 1279 gso/s, 0% err |
For f64, compensated "Dot2" summation reduces error by 10โ50ร compared to naive Float64 accumulation, depending on vector length.
For f32, widening to Float64 gives 5โ10ร lower error.
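The idea behind compensated accumulation can be sketched in pure Python with the textbook Ogita-Rump-Oishi "Dot2" algorithm, built from Knuth's TwoSum and Dekker's TwoProduct error-free transformations. This is an illustrative reimplementation of the published algorithm, not NumKong's actual f64 kernel:

```python
def two_sum(a: float, b: float):
    """Knuth's error-free transformation: a + b == s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def split(a: float):
    """Dekker's splitting of a double into two 26-bit halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a: float, b: float):
    """Error-free product: a * b == p + err exactly, no FMA required."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(xs, ys):
    """Compensated dot product with roughly twice-working-precision accuracy."""
    p, s = two_prod(xs[0], ys[0])
    for x, y in zip(xs[1:], ys[1:]):
        h, r = two_prod(x, y)
        p, q = two_sum(p, h)
        s += q + r  # carry the rounding errors in a separate running term
    return p + s

# An ill-conditioned input: naive accumulation cancels the 1.0 away entirely.
x, y = [1e16, 1.0, -1e16], [1.0, 1.0, 1.0]
print(sum(a * b for a, b in zip(x, y)), dot2(x, y))  # 0.0 vs 1.0
```

The compensation term `s` is what buys the 10-50× error reduction on long vectors: each rounding error from the main accumulator is captured exactly and folded back in at the end.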
The library ships as a relatively small binary:
| Package | Size | Parallelism & Memory | Available For |
| :--------------- | -----: | :------------------------------------------------ | :---------------- |
| PyTorch + MKL | 705 MB | Vector & Tile SIMD, OpenMP Threads, Hidden Allocs | Python, C++, Java |
| JAX + jaxlib | 357 MB | Vector SIMD, XLA Threads, Hidden Allocs | Python |
| NumPy + OpenBLAS | 30 MB | Vector SIMD, Built-in Threads, Hidden Allocs | Python |
| mathjs | 9 MB | No SIMD, No Threads, Many Allocs | JS |
| NumKong | 5 MB | Vector & Tile SIMD, Your Threads, Your Allocs | 7 languages |
Every kernel is validated against 118-bit extended-precision baselines with per-type ULP budgets across log-normal, uniform, and Cauchy input distributions. Tests check triangle inequality, Cauchy-Schwarz bounds, NaN propagation, overflow detection, and probability-simplex constraints for each ISA variant. Results are cross-validated against OpenBLAS, Intel MKL, and Apple Accelerate. A broader throughput comparison is maintained in NumWars.
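A per-type ULP budget simply bounds how many representable floating-point values separate a kernel's result from the reference. A hypothetical checker (not NumKong's actual harness) fits in a few lines with Python's `math.ulp`:

```python
import math

def ulp_error(approx: float, reference: float) -> float:
    """How many units-in-the-last-place separate approx from reference."""
    return abs(approx - reference) / math.ulp(reference)

# A result one representable step above 1.0 is exactly 1 ULP off:
print(ulp_error(1.0 + 2**-52, 1.0))  # 1.0
```

Measuring error in ULPs rather than relative terms keeps the budget meaningful across magnitudes, since the spacing of representable values scales with the exponent.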
Quick Start
| Language | Install | Compatible with | Guide |
| :------- | :------------------------- | :----------------------------- | :------------------------------------------- |
| C / C++ | CMake, headers, & prebuilt | Linux, macOS, Windows, Android | include/README.md |
| Python | pip install | Linux, macOS, Windows | python/README.md |
| Rust | cargo add | Linux, macOS, Windows | rust/README.md |
| JS | npm install & import | Node.js, Bun, Deno & browsers | javascript/README.md |
| Swift | Swift Package Manager | macOS, iOS, tvOS, watchOS | swift/README.md |
| Go       | go get                     | Linux, macOS, Windows via cgo  | golang/README.md                             |
What's Inside
NumKong covers 16 numeric types (from 6-bit floats to 64-bit complex numbers) across dozens of operations and 30+ SIMD backends, with hardware-aware defaults: Arm prioritizes f16, x86 prioritizes bf16.