# Silarray

Silicon Array: Numerical Computing Library for Apple Silicon
- Header-only C++23 library: just `#include <silarray.h>`
- Switchable CPU/GPU backend via `sil::use_cpu()` / `sil::use_mps()` (default: GPU)
- CPU: Accelerate framework (vDSP, CBLAS, NEON)
- GPU: Metal Shading Language (MSL) with STEEL SGEMM kernel, implicit GEMM conv2d, and optimized reduction/softmax/layer_norm kernels
- Lazy evaluation with expression templates and affine fusion for chained elementwise operations
- Data types: `float`, `int`, `bool`
## Requirements
- macOS with Apple Silicon
- Xcode Command Line Tools (clang++ with C++23 support)
- Frameworks: Metal, Accelerate, MetalPerformanceShaders, Foundation
## Example

```cpp
#include <silarray.h>

auto a = sil::ones<float>({1000, 1000});
auto b = sil::ones<float>({1000, 1000});

auto c = a + b;     // runs on GPU (default)
auto d = a.dot(b);

sil::use_cpu();     // switch to CPU backend
auto e = a + b;     // runs on CPU
```
## Operations

### CPU/GPU switchable
| Category | Operations |
|----------|-----------|
| Arithmetic | + - * / pow (elementwise, with broadcasting) |
| In-place | += -= *= /= |
| Linear algebra | dot (STEEL SGEMM on GPU, CBLAS on CPU) |
| Activations | sigmoid relu softmax layer_norm |
| Fused ops | linear (dot + bias), linear_sigmoid (dot + bias + sigmoid) |
| Reduction | sum sum(axis) min max argmax |
| Convolution | conv2d (implicit GEMM on GPU, NHWC layout) |
### CPU only
| Category | Operations |
|----------|-----------|
| Comparison | == != > < >= <= |
| Shape | clone transpose reshape broadcast |
| Creation | empty zeros ones random constants |
| Reduction | mean mean(axis) count all |
| NN utilities | mean_square_error one_hot sigmoid_backward |
| Selection | where(condition, x, y) |
| Testing | array_equal allclose |
## Performance
Competitive with MLX across most operations on Apple M1 Pro:
| Category | vs MLX |
|----------|--------|
| Elementwise (add, mul, div, pow) | Same speed |
| Reduction (sum, min, max) | 1.0–1.6x faster |
| Softmax, Layer Norm | 1.0–3.7x faster |
| SGEMM (square, 1024–4096) | Same speed |
| SGEMM (small-batch) | Up to 2.1x faster |
| Conv2d (ResNet mid) | Same speed |
| Transformer inference | 1.0–1.5x faster |
| MLP inference (batch=1024) | Same speed |
| Training (backward pass) | 2–5x slower (eager dispatch model) |
See bench/README.md for detailed results.
## Build and Run

### Unit tests

```sh
cd test
make
```

Tests can be run in different device modes:

```sh
./test        # GPU mode (default)
./test --gpu  # explicit GPU
./test --cpu  # CPU mode
```
### MNIST

```sh
cd test
make mnist
./mnist
```
### Benchmarks

Benchmarks compare against Eigen, MLX, libtorch, and ggml.

```sh
just bench-all    # all benchmarks
just bench-micro  # micro only

# or manually:
cd bench
make run    # all benchmarks
make table  # Markdown table output
```
See bench/README.md for setup instructions and full results.
## Architecture

```
include/
  silarray.h        Main header (includes all below)
  array.h           Core array class with expression templates
  cpu.h             CPU backend (Accelerate: vDSP, CBLAS, NEON)
  gpu.h             GPU backend (Metal/MSL kernels + MPS fallback)
  device.h          Device selection (CPU/MPS switch)
  types.h           Type concepts (float, int, bool)
  objc.h            Objective-C bridge for Metal API
  unified_memory.h  GPU shared memory management
```
## License
MIT license (c) 2026 Yuji Hirose