# AiDotNet.Tensors

The fastest .NET tensor library: beats MathNet, NumSharp, and TensorPrimitives, and matches TorchSharp's CPU backend in pure managed C#, using hand-tuned AVX2/FMA SIMD kernels and JIT-compiled machine code.
## Features

- **Zero Allocations**: In-place operations with `ArrayPool<T>` and `Span<T>` for hot paths
- **Hand-Tuned SIMD**: Custom AVX2/FMA kernels with 4x loop unrolling, not just `Vector<T>` wrappers
- **JIT-Compiled Kernels**: Runtime x86-64 machine code generation for size-specialized operations
- **BLIS-Style GEMM**: Tiled matrix multiply with an FMA micro-kernel and cache-aware panel packing
- **GPU Acceleration**: Optional CUDA, HIP/ROCm, and OpenCL support via separate packages
- **Multi-Target**: Supports .NET 10.0 and .NET Framework 4.7.1
- **Generic Math**: Works with any numeric type via the `INumericOperations<T>` interface
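The zero-allocation path can be used roughly as sketched below. The `AddInPlace` name appears in the benchmark tables later in this README, but the exact signature shown here is an assumption, not confirmed API documentation.

```csharp
using AiDotNet.Tensors.LinearAlgebra;

// Sketch of the zero-allocation path. AddInPlace appears in the
// benchmark tables below; the exact signature is an assumption.
var a = new Vector<double>(new[] { 1.0, 2.0, 3.0, 4.0 });
var b = new Vector<double>(new[] { 5.0, 6.0, 7.0, 8.0 });

// Writes the result into 'a' instead of allocating a new vector,
// keeping hot loops free of GC pressure.
a.AddInPlace(b);
```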
## Installation

```bash
# Core package (CPU SIMD acceleration)
dotnet add package AiDotNet.Tensors

# Optional: OpenBLAS for optimized CPU BLAS operations
dotnet add package AiDotNet.Native.OpenBLAS

# Optional: CLBlast for OpenCL GPU acceleration (AMD/Intel/NVIDIA)
dotnet add package AiDotNet.Native.CLBlast

# Optional: CUDA for NVIDIA GPU acceleration (requires an NVIDIA GPU)
dotnet add package AiDotNet.Native.CUDA
```
## Quick Start

```csharp
using AiDotNet.Tensors.LinearAlgebra;

// Create vectors
var v1 = new Vector<double>(new[] { 1.0, 2.0, 3.0, 4.0 });
var v2 = new Vector<double>(new[] { 5.0, 6.0, 7.0, 8.0 });

// SIMD-accelerated operations
var sum = v1 + v2;
var dot = v1.Dot(v2);

// Create matrices
var m1 = new Matrix<double>(3, 3);
var m2 = Matrix<double>.Identity(3);

// Matrix operations
var product = m1 * m2;
var transpose = m1.Transpose();
```
## CPU Benchmarks

All benchmarks were run on an AMD Ryzen 9 3950X under .NET 10.0 using BenchmarkDotNet; AVX-512 was not used.
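A minimal BenchmarkDotNet harness in the style of these measurements might look like the sketch below. The benchmark class, its setup, and the AiDotNet types it exercises are illustrative assumptions, not the project's actual benchmark source.

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using AiDotNet.Tensors.LinearAlgebra;

// Hypothetical benchmark class; names and setup are assumptions.
[MemoryDiagnoser]
public class MatMulBenchmark
{
    private Matrix<float> _a = null!;
    private Matrix<float> _b = null!;

    [GlobalSetup]
    public void Setup()
    {
        _a = new Matrix<float>(256, 256);
        _b = new Matrix<float>(256, 256);
    }

    [Benchmark]
    public Matrix<float> MatMul256() => _a * _b;
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<MatMulBenchmark>();
}
```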
### vs TorchSharp CPU (Tensor Operations, float)
Head-to-head against TorchSharp's libtorch C++ backend on identical data sizes.
| Operation | AiDotNet | TorchSharp | Speedup | Result |
|-----------|----------|------------|---------|--------|
| MatMul 256x256 | 95 us | 125 us | 1.3x faster | WIN |
| MatMul 512x512 | 427 us | 533 us | 1.2x faster | WIN |
| Mean 1M | 194 us | 224 us | 1.2x faster | WIN |
| Add 100K | 30 us | 30 us | tied | TIED |
| Multiply 100K | 42 us | 42 us | tied | TIED |
| Sum 1M | 200 us | 183 us | 0.9x | Close |
| Sigmoid 1M | 222 us | 196 us | 0.9x | Close |
| Add 1M | 209 us | 182 us | 0.9x | Close |
| ReLU 1M | 196 us | 169 us | 0.9x | Close |
AiDotNet wins or matches TorchSharp CPU on the majority of operations using pure managed C# with hand-tuned SIMD; no native C++ dependencies are required.
### vs MathNet.Numerics (Linear Algebra, double, N=1000)

| Operation | AiDotNet | MathNet | Speedup |
|-----------|----------|---------|---------|
| Matrix Multiply 1000x1000 | 8.3 ms | 49.2 ms | 6x faster |
| Matrix Add | 1.87 ms | 2.50 ms | 1.3x faster |
| Matrix Subtract | 2.08 ms | 2.47 ms | 1.2x faster |
| Matrix Scalar Multiply | 1.66 ms | 2.14 ms | 1.3x faster |
| Transpose | 2.85 ms | 3.68 ms | 1.3x faster |
| Dot Product | 97 ns | 817 ns | 8.4x faster |
| L2 Norm | 92 ns | 11,552 ns | 125x faster |
### vs NumSharp (N=1000)

| Operation | AiDotNet | NumSharp | Speedup |
|-----------|----------|----------|---------|
| Matrix Multiply 1000x1000 | 8.3 ms | 26.5 s | 3,200x faster |
| Matrix Add | 1.87 ms | 1.98 ms | 1.1x faster |
| Transpose | 2.85 ms | 13.7 ms | 4.8x faster |
| Vector Add | 1.47 us | 54.5 us | 37x faster |
### vs System.Numerics.Tensors.TensorPrimitives (N=1000)
In-place operations (zero allocation) compared to raw TensorPrimitives calls.
| Operation | AiDotNet | TensorPrimitives | Speedup |
|-----------|----------|------------------|---------|
| Dot Product | 97 ns | 185 ns | 1.9x faster |
| L2 Norm | 92 ns | 187 ns | 2.0x faster |
| Vector AddInPlace | 154 ns | 117 ns | 0.8x |
| Vector SubtractInPlace | 116 ns | 118 ns | tied |
| Vector ScalarMulInPlace | 105 ns | 75 ns | 0.7x |
| Vector Add to Span | 116 ns | 119 ns | tied |
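For reference, raw TensorPrimitives calls of the kind being compared look like the sketch below (it requires the `System.Numerics.Tensors` NuGet package); this is a usage illustration, not the benchmark code itself.

```csharp
using System;
using System.Numerics.Tensors;

// Raw TensorPrimitives calls of the kind compared above.
float[] x = { 1f, 2f, 3f, 4f };
float[] y = { 5f, 6f, 7f, 8f };

// Dot product and L2 norm over spans.
float dot = TensorPrimitives.Dot(x, y);   // 1*5 + 2*6 + 3*7 + 4*8 = 70
float norm = TensorPrimitives.Norm(x);    // sqrt(1 + 4 + 9 + 16) ≈ 5.477

// Element-wise add into a caller-provided destination span (no allocation).
float[] dest = new float[4];
TensorPrimitives.Add(x, y, dest);

Console.WriteLine($"dot={dot} norm={norm:F3} sum=[{string.Join(", ", dest)}]");
```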
### Small Matrix Multiply (double)

| Size | AiDotNet | MathNet | NumSharp |
|------|----------|---------|----------|
| 4x4 | 172 ns | 165 ns | 2,198 ns |
| 16x16 | 2.1 us | 2.9 us | 107.5 us |
| 32x32 | 10.5 us | 36.2 us | 774.8 us |

Against MathNet, AiDotNet is 1.4x faster at 16x16 and 3.4x faster at 32x32.
## SIMD Instruction Support

The library automatically detects and uses the best available SIMD instruction set:

| Instruction Set | Vector Width | Minimum Runtime |
|-----------------|--------------|-----------------|
| AVX-512 | 512-bit (16 floats) | .NET 8+ |
| AVX2 + FMA | 256-bit (8 floats) | .NET 6+ |
| AVX | 256-bit (8 floats) | .NET 6+ |
| SSE4.2 | 128-bit (4 floats) | .NET 6+ |
| ARM NEON | 128-bit (4 floats) | .NET 6+ |
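Independent of this library, you can confirm what the runtime itself reports on a given machine with the standard `System.Runtime.Intrinsics` APIs (the `Avx512F` check requires .NET 8+):

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics.Arm;

// Query the standard .NET intrinsics APIs to see which instruction
// sets the runtime reports on this machine.
Console.WriteLine($"AVX-512F: {Avx512F.IsSupported}");   // .NET 8+
Console.WriteLine($"AVX2:     {Avx2.IsSupported}");
Console.WriteLine($"FMA:      {Fma.IsSupported}");
Console.WriteLine($"SSE4.2:   {Sse42.IsSupported}");
Console.WriteLine($"ARM NEON: {AdvSimd.IsSupported}");

// Vector<T> reflects the widest acceleration the JIT uses by default.
Console.WriteLine($"Vector<float> width: {Vector<float>.Count * 32} bits");
```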
### Check Available Acceleration

```csharp
using AiDotNet.Tensors.Engines;

var caps = PlatformDetector.Capabilities;

// SIMD capabilities
Console.WriteLine($"AVX2: {caps.HasAVX2}");
Console.WriteLine($"AVX-512: {caps.HasAVX512F}");

// GPU support
Console.WriteLine($"CUDA: {caps.HasCudaSupport}");
Console.WriteLine($"OpenCL: {caps.HasOpenCLSupport}");

// Native library availability
Console.WriteLine($"OpenBLAS: {caps.HasOpenBlas}");
Console.WriteLine($"CLBlast: {caps.HasClBlast}");

// Or get a full status summary
Console.WriteLine(NativeLibraryDetector.GetStatusSummary());
```
## Optional Acceleration Packages

### AiDotNet.Native.OpenBLAS

Provides optimized CPU BLAS operations using OpenBLAS:

```bash
dotnet add package AiDotNet.Native.OpenBLAS
```

**Performance**: Accelerated BLAS operations for matrix multiply and decompositions.
### AiDotNet.Native.CLBlast

Provides GPU acceleration via OpenCL (works on AMD, Intel, and NVIDIA GPUs):

```bash
dotnet add package AiDotNet.Native.CLBlast
```

**Performance**: 10x+ faster for large matrix operations on GPU.
### AiDotNet.Native.CUDA

Provides GPU acceleration via NVIDIA CUDA (NVIDIA GPUs only):

```bash
dotnet add package AiDotNet.Native.CUDA
```

**Performance**: 30,000+ GFLOPS for matrix operations on modern NVIDIA GPUs.

**Requirements**:

- NVIDIA GPU (GeForce, Quadro, or Tesla)
- NVIDIA display driver 525.60+ (includes the CUDA driver)

Usage with helpful error messages:

```csharp
using AiDotNet.Tensors.Engines.DirectGpu.CUDA;

// Recommended: throws a beginner-friendly exception if CUDA is unavailable
using var cuda = CudaBackend.CreateOrThrow();

// Or check availability first
if (CudaBackend.IsCudaAvailable)
{
    using var backend = new CudaBackend();
    // Use CUDA acceleration
}
```
If CUDA is not available, you'll get detailed troubleshooting steps explaining exactly what's missing and how to fix it.
## Requirements

- .NET 10.0 or .NET Framework 4.7.1+
- Windows x64, Linux x64, or macOS x64/arm64

## License

Apache 2.0 - See LICENSE for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.