Silicon Array

Numerical Computing Library for Apple Silicon

  • Header-only C++23 library -- just #include <silarray.h>
  • Switchable CPU/GPU backend via sil::use_cpu() / sil::use_mps() (default: GPU)
  • CPU: Accelerate framework (vDSP, CBLAS, NEON)
  • GPU: Metal Shading Language (MSL) with STEEL SGEMM kernel, implicit GEMM conv2d, and optimized reduction/softmax/layer_norm kernels
  • Lazy evaluation with expression templates and affine fusion for chained elementwise operations
  • Data types: float, int, bool

Requirements

  • macOS with Apple Silicon
  • Xcode Command Line Tools (clang++ with C++23 support)
  • Frameworks: Metal, Accelerate, MetalPerformanceShaders, Foundation

Example

#include <silarray.h>

int main() {
  auto a = sil::ones<float>({1000, 1000});
  auto b = sil::ones<float>({1000, 1000});

  auto c = a + b;      // runs on GPU (default)
  auto d = a.dot(b);   // matrix multiply

  sil::use_cpu();      // switch to CPU backend
  auto e = a + b;      // runs on CPU
}

Operations

CPU/GPU switchable

| Category | Operations |
|----------|-----------|
| Arithmetic | + - * / pow (elementwise, with broadcasting) |
| In-place | += -= *= /= |
| Linear algebra | dot (STEEL SGEMM on GPU, CBLAS on CPU) |
| Activations | sigmoid relu softmax layer_norm |
| Fused ops | linear (dot + bias), linear_sigmoid (dot + bias + sigmoid) |
| Reduction | sum sum(axis) min max argmax |
| Convolution | conv2d (implicit GEMM on GPU, NHWC layout) |

CPU only

| Category | Operations |
|----------|-----------|
| Comparison | == != > < >= <= |
| Shape | clone transpose reshape broadcast |
| Creation | empty zeros ones random constants |
| Reduction | mean mean(axis) count all |
| NN utilities | mean_square_error one_hot sigmoid_backward |
| Selection | where(condition, x, y) |
| Testing | array_equal allclose |

Performance

Competitive with MLX across most operations on an Apple M1 Pro:

| Category | vs MLX |
|----------|--------|
| Elementwise (add, mul, div, pow) | Same speed |
| Reduction (sum, min, max) | 1.0–1.6x faster |
| Softmax, Layer Norm | 1.0–3.7x faster |
| SGEMM (square, 1024–4096) | Same speed |
| SGEMM (small-batch) | Up to 2.1x faster |
| Conv2d (ResNet mid) | Same speed |
| Transformer inference | 1.0–1.5x faster |
| MLP inference (batch=1024) | Same speed |
| Training (backward pass) | 2–5x slower (eager dispatch model) |

See bench/README.md for detailed results.

Build and Run

Unit tests

cd test
make

Tests can be run in different device modes:

./test          # GPU mode (default)
./test --gpu    # explicit GPU
./test --cpu    # CPU mode

MNIST

cd test
make mnist
./mnist

Benchmarks

Benchmarks compare against Eigen, MLX, libtorch, and ggml.

just bench-all          # all benchmarks
just bench-micro        # micro only

# or manually:
cd bench
make run                # all benchmarks
make table              # Markdown table output

See bench/README.md for setup instructions and full results.

Architecture

include/
  silarray.h          Main header (includes all below)
  array.h             Core array class with expression templates
  cpu.h               CPU backend (Accelerate: vDSP, CBLAS, NEON)
  gpu.h               GPU backend (Metal/MSL kernels + MPS fallback)
  device.h            Device selection (CPU/MPS switch)
  types.h             Type concepts (float, int, bool)
  objc.h              Objective-C bridge for Metal API
  unified_memory.h    GPU shared memory management

License

MIT license (c) 2026 Yuji Hirose
