NumKong
SIMD-accelerated distances, dot products, matrix ops, geospatial & geometric kernels for 16 numeric types (from 6-bit floats to 64-bit complex) across x86, Arm, RISC-V, and WASM, with bindings for Python, Rust, C, C++, Swift, JS, and Go
NumKong: Mixed Precision for All
NumKong (previously SimSIMD) is a portable mixed-precision math library with over 2000 kernels for x86, Arm, RISC-V, and WASM. It covers numeric types from 6-bit floats to 64-bit complex numbers, hardened against in-house 118-bit extended-precision baselines. Built alongside the USearch vector-search engine, it provides wider accumulators to avoid the overflow and precision loss typical of naive same-type arithmetic.

Latency, Throughput, & Numerical Stability
Most libraries return dot products in the same type as the input: Float16 × Float16 → Float16, Int8 × Int8 → Int8.
This leads to quiet overflow: a 2048-dimensional i8 dot product can reach ±10 million, but i8 maxes out at 127.
NumKong promotes to wider accumulators (Float16 → Float32, BFloat16 → Float32, Int8 → Int32, Float32 → Float64) so results stay in range.
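The wraparound is easy to reproduce with plain NumPy, which keeps i8 × i8 accumulation in i8. The sketch below shows the failure mode and the widening fix; it uses NumPy throughout, not NumKong's own API:

```python
import numpy as np

# Two 2048-dimensional int8 vectors of all 100s: the true dot product is
# 2048 * 100 * 100 = 20,480,000, far beyond int8's maximum of 127.
a = np.full(2048, 100, dtype=np.int8)
b = np.full(2048, 100, dtype=np.int8)

naive = a @ b  # int8 in, int8 out: every product and sum wraps modulo 256
widened = a.astype(np.int32) @ b.astype(np.int32)  # promote first, then accumulate

print(naive, widened)  # 0 20480000 -- the wrapped result is silently meaningless
```

Promoting before the multiply is exactly the accumulator-widening strategy described above, done by hand; NumKong performs the equivalent promotion inside its kernels.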
Single 2048-d dot product on Intel Sapphire Rapids, single-threaded. Each cell shows gso/s and mean relative error vs a higher-precision reference. gso/s = Giga Scalar Operations per Second, a more suitable name than GFLOP/s when counting both integer and floating-point work. NumPy 2.4, PyTorch 2.10, JAX 0.9.
| Input | NumPy + OpenBLAS | PyTorch + MKL | JAX | NumKong |
| :----- | ----------------------: | ----------------------: | ----------------------: | --------------------: |
| f64 | 2.0 gso/s, 1e-15 err | 0.6 gso/s, 1e-15 err | 0.4 gso/s, 1e-14 err | 5.8 gso/s, 1e-16 err |
| f32 | 1.5 gso/s, 2e-6 err | 0.6 gso/s, 2e-6 err | 0.4 gso/s, 5e-6 err | 7.1 gso/s, 2e-7 err |
| bf16 | — | 0.5 gso/s, 1.9% err | 0.5 gso/s, 1.9% err | 9.7 gso/s, 1.8% err |
| f16 | 0.2 gso/s, 0.25% err | 0.5 gso/s, 0.25% err | 0.4 gso/s, 0.25% err | 11.5 gso/s, 0.24% err |
| e5m2 | — | 0.7 gso/s, 4.6% err | 0.5 gso/s, 4.6% err | 7.1 gso/s, 0% err |
| i8 | 1.1 gso/s, overflow | 0.5 gso/s, overflow | 0.5 gso/s, overflow | 14.8 gso/s, 0% err |
A fair objection: PyTorch and JAX are designed for throughput, not single-call latency. They lower execution graphs through XLA or vendored BLAS libraries like Intel MKL and Nvidia cuBLAS. So here's the same comparison on a throughput-oriented workload, matrix multiplication:
Matrix multiplication (2048 × 2048) × (2048 × 2048) on Intel Sapphire Rapids, single-threaded. gso/s = Giga Scalar Operations per Second, same format and library versions as above.
| Input | NumPy + OpenBLAS | PyTorch + MKL | JAX | NumKong |
| :----- | ----------------------: | -----------------------: | -----------------------: | -------------------: |
| f64 | 65.5 gso/s, 1e-15 err | 68.2 gso/s, 1e-15 err | ~14.3 gso/s, 1e-15 err | 8.6 gso/s, 1e-16 err |
| f32 | 140 gso/s, 9e-7 err | 145 gso/s, 1e-6 err | ~60.5 gso/s, 1e-6 err | 37.7 gso/s, 4e-7 err |
| bf16 | — | 851 gso/s, 1.8% err | ~25.8 gso/s, 3.4% err | 458 gso/s, 3.6% err |
| f16 | 0.3 gso/s, 0.25% err | 140 gso/s, 0.37% err | ~26.1 gso/s, 0.35% err | 103 gso/s, 0.26% err |
| e5m2 | — | 0.4 gso/s, 4.6% err | ~26.4 gso/s, 4.6% err | 398 gso/s, 0% err |
| i8 | 0.4 gso/s, overflow | 50.0 gso/s, overflow | ~0.0 gso/s, overflow | 1279 gso/s, 0% err |
For f64, compensated "Dot2" summation reduces error by 10โ50ร compared to naive Float64 accumulation, depending on vector length.
For f32, widening to Float64 gives 5โ10ร lower error.
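The idea behind compensated accumulation can be sketched in pure Python with the textbook Ogita-Rump-Oishi "Dot2" algorithm, built from Knuth's TwoSum and Dekker's TwoProduct error-free transformations. This is an illustrative reimplementation of the published algorithm, not NumKong's actual f64 kernel:

```python
def two_sum(a: float, b: float):
    """Knuth's error-free transformation: a + b == s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def split(a: float):
    """Dekker's splitting of a double into two 26-bit halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a: float, b: float):
    """Error-free product: a * b == p + err exactly, no FMA required."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dot2(xs, ys):
    """Compensated dot product with roughly twice-working-precision accuracy."""
    p, s = two_prod(xs[0], ys[0])
    for x, y in zip(xs[1:], ys[1:]):
        h, r = two_prod(x, y)
        p, q = two_sum(p, h)
        s += q + r  # carry the rounding errors in a separate running term
    return p + s

# An ill-conditioned input: naive accumulation cancels the 1.0 away entirely.
x, y = [1e16, 1.0, -1e16], [1.0, 1.0, 1.0]
print(sum(a * b for a, b in zip(x, y)), dot2(x, y))  # 0.0 vs 1.0
```

The compensation term `s` is what buys the 10-50× error reduction on long vectors: each rounding error from the main accumulator is captured exactly and folded back in at the end.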
The library ships as a relatively small binary:
| Package | Size | Parallelism & Memory | Available For |
| :--------------- | -----: | :------------------------------------------------ | :---------------- |
| PyTorch + MKL | 705 MB | Vector & Tile SIMD, OpenMP Threads, Hidden Allocs | Python, C++, Java |
| JAX + jaxlib | 357 MB | Vector SIMD, XLA Threads, Hidden Allocs | Python |
| NumPy + OpenBLAS | 30 MB | Vector SIMD, Built-in Threads, Hidden Allocs | Python |
| mathjs | 9 MB | No SIMD, No Threads, Many Allocs | JS |
| NumKong | 5 MB | Vector & Tile SIMD, Your Threads, Your Allocs | 7 languages |
Every kernel is validated against 118-bit extended-precision baselines with per-type ULP budgets across log-normal, uniform, and Cauchy input distributions. Tests check triangle inequality, Cauchy-Schwarz bounds, NaN propagation, overflow detection, and probability-simplex constraints for each ISA variant. Results are cross-validated against OpenBLAS, Intel MKL, and Apple Accelerate. A broader throughput comparison is maintained in NumWars.
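A per-type ULP budget simply bounds how many representable floating-point values separate a kernel's result from the reference. A hypothetical checker (not NumKong's actual harness) fits in a few lines with Python's `math.ulp`:

```python
import math

def ulp_error(approx: float, reference: float) -> float:
    """How many units-in-the-last-place separate approx from reference."""
    return abs(approx - reference) / math.ulp(reference)

# A result one representable step above 1.0 is exactly 1 ULP off:
print(ulp_error(1.0 + 2**-52, 1.0))  # 1.0
```

Measuring error in ULPs rather than relative terms keeps the budget meaningful across magnitudes, since the spacing of representable values scales with the exponent.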
Quick Start
| Language | Install | Compatible with | Guide |
| :------- | :------------------------- | :----------------------------- | :------------------------------------------- |
| C / C++ | CMake, headers, & prebuilt | Linux, macOS, Windows, Android | include/README.md |
| Python | pip install | Linux, macOS, Windows | python/README.md |
| Rust | cargo add | Linux, macOS, Windows | rust/README.md |
| JS | npm install & import | Node.js, Bun, Deno & browsers | javascript/README.md |
| Swift | Swift Package Manager | macOS, iOS, tvOS, watchOS | swift/README.md |
| Go       | go get                     | Linux, macOS, Windows via cgo  | golang/README.md                             |
What's Inside
NumKong covers 16 numeric types (from 6-bit floats to 64-bit complex numbers) across dozens of operations and 30+ SIMD backends, with hardware-aware defaults: Arm prioritizes f16, x86 prioritizes bf16.