# NCPU

nCPU: model-native and tensor-optimized CPU research runtimes with organized workloads, tools, and docs.
## Install / Use

```
/learn @robertcprice/NCPUREADME
```
## Start in 60 Seconds

nCPU is most compelling when you treat it as a program-by-examples and text-by-examples machine first, then explore the deeper GPU and coprocessor stack.

```bash
# Best first-time install
pip install -e ".[demo,dev]"

# See the guided demo map
python -m ncpu.lab demos --verbose

# Flagship interactive experiences
python -m ncpu.lab discover
python -m ncpu.lab text --interactive
```
What works today:

| Experience | Status | Best platform |
|------------|--------|---------------|
| Interactive program discovery | Ready now | Cross-platform |
| Neural text machine | Ready now | Cross-platform |
| GPU BusyBox / Alpine demos | Ready now | macOS / Apple Silicon |
| Coprocessor demo | Available with heavier deps | Cross-platform with model stack |
Recommended path:
- Discover a program from examples
- Discover a text transform or cipher
- Try the GPU systems demos
- Explore the coprocessor and deeper research modules
Tiny terminal preview:

```text
$ python -m ncpu.lab discover
ncpu> preset fib
ncpu> synthesize
ncpu> summary
ncpu> test 13, 21

$ python -m ncpu.lab text --interactive
text> cipher hello khoor
text> summary
text> apply world
```
Further guides:

- `demos/README.md` — curated demo map and starter transcripts
- `docs/REPO_HYGIENE.md` — what should stay in git vs stay local
- `docs/MAINTAINER_CLEANUP_CHECKLIST.md` — pre-push cleanup checklist
## Four Big Ideas

### 1. A Fully Differentiable CPU
Every ALU operation is a trained neural network --- addition, subtraction, multiplication, bitwise logic, shifts, division. Because the entire computation graph is differentiable, you can backpropagate through execution: optimize programs via gradient descent, discover better algorithms, tune instruction schedules. No conventional CPU can do this. The trained neural ALU achieves 100% accuracy on 32-bit integer arithmetic, exhaustively verified over every possible input --- not sampled, proven.
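The trained models themselves live in the repo, but the core idea --- arithmetic built from smooth, differentiable logic --- can be sketched in a few lines. The sketch below is illustrative only (a plain ripple-carry adder over "bit probabilities"; nCPU's actual adder is a trained neural network), and every name in it is hypothetical:

```python
# Soft-logic gates: smooth polynomial functions of bit "probabilities" in [0, 1].
# Because every operation is differentiable, gradients can flow through the
# whole adder --- and on hard 0/1 inputs the result is exact.

def soft_and(a, b):
    return a * b

def soft_or(a, b):
    return a + b - a * b

def soft_xor(a, b):
    return a + b - 2 * a * b

def soft_full_adder(a, b, cin):
    """One differentiable full-adder cell: returns (sum_bit, carry_out)."""
    s = soft_xor(soft_xor(a, b), cin)
    cout = soft_or(soft_and(a, b), soft_and(soft_xor(a, b), cin))
    return s, cout

def soft_ripple_add(xs, ys):
    """Add two little-endian bit vectors; exact on hard 0/1 inputs."""
    out, carry = [], 0.0
    for a, b in zip(xs, ys):
        s, carry = soft_full_adder(a, b, carry)
        out.append(s)
    return out
```

For example, `soft_ripple_add([1, 1, 0], [1, 0, 0])` adds 3 and 1 (little-endian) and yields the bit pattern for 4, while remaining differentiable for soft inputs in between.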
### 2. A Complete AI Computer --- Fully Differentiable from Source Code to Execution
Not "AI running on a computer" --- an AI that is the computer, end to end, and every layer supports gradient flow. The neural ALU computes. The neural OS (neurOS) manages memory, schedules processes, compiles code --- 11 trained models, zero fallbacks, 93.7--100% accuracy. The full pipeline is differentiable: source code -> neural compiler -> neural assembler -> neural CPU -> result, all through trained models. This means you can optimize not just programs but the OS itself via gradient descent.
### 3. GPU as Self-Sufficient Computer
A single GPU chip running an entire computer --- no CPU required beyond initial bootstrap. The Metal compute shader executes ARM64 natively at 1.9M+ IPS, boots a multi-process UNIX OS with fork/pipe/wait, compiles C, loads and runs real Linux ELF binaries (BusyBox/Alpine Linux), and even runs a 2-instruction Turing-complete VM (MUXLEQ) that boots eForth. The GPU isn't an accelerator here. It's the whole machine --- complete with a self-hosting C compiler, 13+ compiled applications, and debugging tools impossible on conventional hardware.
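The MUXLEQ VM mentioned above ships inside the Metal shader; the flavor of such a minimal Turing-complete machine can be conveyed with its better-known one-instruction cousin, SUBLEQ (subtract and branch if ≤ 0). This toy interpreter is a sketch for intuition only, not nCPU's implementation:

```python
def subleq(mem, pc=0, max_steps=10_000):
    """Run a SUBLEQ program in-place.

    Each instruction is three cells (a, b, c): mem[b] -= mem[a], then
    branch to c if the result is <= 0, else fall through. A negative pc
    halts. One instruction is enough for Turing completeness.
    """
    while 0 <= pc and pc + 2 < len(mem) and max_steps > 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
        max_steps -= 1
    return mem

# ADD A,B in three SUBLEQ instructions, using a scratch cell Z:
#   Z -= A;  B -= Z (i.e. B += A);  Z -= Z (clear Z, branch to -1 = halt)
prog = [9, 11, 3,   11, 10, 6,   11, 11, -1,
        7,           # A (address 9)
        5,           # B (address 10)
        0]           # Z (address 11)
subleq(prog)
# prog[10] now holds 12
```

MUXLEQ extends this scheme with a second (multiplex) instruction, which is what the GPU shader boots eForth on.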
### 4. Teaching Transformers to Compute --- The Differentiable Coprocessor
nCPU's trained neural ALU can be injected directly into any transformer's forward pass as a differentiable coprocessor. The coprocessor replaces MLP sublayers with a routed mixture: a learned per-token gate decides whether each token flows through the original MLP or through nCPU's neural ALU. Neural truth tables provide differentiable logic (AND/OR/XOR) via bilinear soft indexing, tensor ops provide differentiable arithmetic (ADD/SUB/MUL), and a confidence-aware gating mechanism modulates routing based on the model's own uncertainty. The entire path --- including the discrete logic operations --- supports gradient flow, so the transformer learns when to use the coprocessor through standard backpropagation.
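The routing mechanism described above can be sketched as follows. This is a minimal scalar-gate version with hypothetical names and shapes (the real coprocessor operates on transformer hidden states and adds confidence-aware modulation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def routed_layer(x, mlp, alu, gate_w, gate_b):
    """Per-token soft routing between two sublayer paths.

    A learned scalar gate g in (0, 1) blends the original MLP output with
    the neural-ALU output. Both branches and the gate are smooth, so the
    router is trainable end to end with standard backpropagation.
    """
    g = sigmoid(sum(wi * xi for wi, xi in zip(gate_w, x)) + gate_b)
    m, a = mlp(x), alu(x)
    return [g * mi + (1.0 - g) * ai for mi, ai in zip(m, a)]
```

With a strongly negative gate bias, tokens flow almost entirely through the ALU path; during training, gradient descent moves `gate_w` and `gate_b` so each token picks the path that lowers the loss.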
An 11-model scaling sweep across the Qwen 2.5/3/3.5 families demonstrates the effect:
| Model | Synthetic Arithmetic Gain | Best Result |
|-------|--------------------------|-------------|
| Qwen3.5-2B (instruct) | 14.5% -> 71.0% (+56.5%) | Best overall |
| Qwen3.5-2B (base) | 15.5% -> 63.0% (+47.5%) | 100% on ADD/SUB/MUL/DIV |
| Qwen3.5-4B | +51.0% delta | Largest base sweep gain (tied) |
| Qwen3.5-9B | +51.0% delta | Largest base sweep gain (tied) |
| Qwen3.5-9B (instruct) | 8.0% -> 58.5% (+50.5%) | |
Real-world transfer on matched models (Qwen3.5-2B): coding preserved (60%), reasoning improved (0% -> 10%), +5% average with no degradation.
See the research paper, the standalone GPU debugging toolkit paper draft, and the wiki for detailed analysis.
## Three CPU Modes
nCPU provides three complete execution modes --- each a different point in the design space, each fully functional:
| Mode | What Runs | Backend | Differentiable? | Speed |
|------|-----------|---------|-----------------|-------|
| Neural | 13 trained .pt models | PyTorch on GPU | Yes --- full gradient flow through every operation | ~5K IPS |
| Fast | Native tensor ops | PyTorch tensors | Yes --- standard autograd | ~5K IPS |
| Compute | Rust + Metal shader | Apple Silicon GPU | No (discrete hardware) | ~1.9M IPS |
Neural mode is the research core: every arithmetic operation, every OS decision, every compiler pass flows through trained neural networks. Addition uses a Kogge-Stone carry-lookahead adder built from neural full adders (8 passes). Multiplication uses a 256x256 byte-pair lookup tensor. Bitwise logic uses learned truth tables. The entire pipeline from source assembly to computed result is differentiable.
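The "learned truth tables" for bitwise logic rest on the same soft-indexing trick used by the coprocessor: a small table queried with probabilistic indices. A minimal sketch of that bilinear lookup, with illustrative names (the trained models replace these hand-written tables):

```python
# A 2x2 truth table T[i][j], queried with soft bits a, b in [0, 1]:
#   result = sum_ij  P(a=i) * P(b=j) * T[i][j]
# The expression is bilinear in a and b, hence differentiable everywhere,
# and it reproduces the exact table entry when a and b are hard 0/1.

AND_TABLE = [[0.0, 0.0], [0.0, 1.0]]
XOR_TABLE = [[0.0, 1.0], [1.0, 0.0]]

def soft_lookup(table, a, b):
    pa = (1.0 - a, a)   # P(a=0), P(a=1)
    pb = (1.0 - b, b)   # P(b=0), P(b=1)
    return sum(pa[i] * pb[j] * table[i][j]
               for i in range(2) for j in range(2))
```

Applied bitwise across a register, the same lookup gives differentiable AND/OR/XOR over whole words.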
Fast mode skips the trained models and uses native PyTorch tensor operations for the same ISA --- same differentiability guarantees, without the overhead of model inference. Useful for rapid prototyping and as a correctness oracle.
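Using fast mode as a correctness oracle amounts to differential testing: run two implementations of the same ALU operation on the same inputs and flag any disagreement. A hedged sketch (the callables and names are hypothetical, not nCPU's API):

```python
import random

def cross_check(op_a, op_b, trials=1_000, bits=32, seed=0):
    """Differential test between two implementations of one ALU op.

    Draws random fixed-width operands and returns the first disagreement
    as (x, y, result_a, result_b), or None if all sampled inputs agree.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.getrandbits(bits)
        y = rng.getrandbits(bits)
        ra, rb = op_a(x, y), op_b(x, y)
        if ra != rb:
            return (x, y, ra, rb)
    return None

# e.g. cross_check(neural_add, lambda x, y: (x + y) & 0xFFFFFFFF)
# where neural_add is the model-backed implementation under test.
```

Random sampling like this complements, rather than replaces, the exhaustive verification the neural ALU ships with.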
Compute mode is the performance path: a Rust + Metal kernel executes ~200 ARM64 instructions (integer + floating-point) on the GPU at ~1.9M IPS with zero-copy StorageModeShared memory. This is where the UNIX OS boots, the compiler self-hosts, BusyBox runs, and Alpine Linux comes alive. ~500x faster compilation than the Python path.
All three modes execute the same programs and produce the same results. The neural and fast modes are fully differentiable; the compute mode trades gradient flow for raw speed.
## Quick Start

Install paths:

```bash
# Best first-time install for the flagship interactive demos
pip install -e ".[demo,dev]"

# Broader local environment for coprocessor / training work
pip install -e ".[demo,model,train,dev]"
```
First commands to try:

```bash
# Unified launcher
python -m ncpu.lab demos
python -m ncpu.lab discover
python -m ncpu.lab text --interactive

# Direct demo entrypoints
PYTHONPATH=. python demos/interactive_discovery.py
PYTHONPATH=. python demos/neural_text_machine.py --interactive

# Neural mode --- all arithmetic through trained neural networks
python main.py --program programs/fibonacci.asm

# GPU compute mode --- Metal shader, ~1.9M IPS
python main.py --program programs/fibonacci.asm --compute

# GPU UNIX OS --- 25-command shell with fork/pipe/wait on Metal
python ncpu/os/gpu/demo.py --multiproc

# Run real BusyBox on the GPU
python demos/busybox_gpu_demo.py --interactive

# Alpine Linux on GPU
python demos/alpine_gpu.py --demo

# Rust-native launcher --- standalone Rust path (ELF or boot image)
cd kernels/rust_metal
cargo run --bin ncpu_run -- --elf ../../demos/busybox.elf --rootfs -- echo hello
cargo run --bin ncpu_run -- --elf ../../demos/busybox.elf --inspect --json-report
cargo run --bin ncpu_run -- ../../path/to/image.bin

# Benchmark mode --- run 3x with aggregate statistics
cargo run --bin ncpu_run -- --elf ../../demos/busybox.elf --benchmark --rootfs -- echo hello

# Custom repeat count with JSON aggregate output
cargo run --bin ncpu_run -- --elf ../../demos/busybox.elf --repeat 10 --json-report --rootfs -- echo hello

# Differentiable coprocessor --- inject nCPU into a transformer
python ncpu/coprocessor/train.py   # Train on synthetic arithmetic + GSM8K
```

Note: `cargo check --bin ncpu_run` currently passes in this workspace; a direct `cargo run` is still subject to the local PyO3/Python link environment.
## The Full Stack
| Layer | Implementation | What It Proves |
|-------|---------------|----------------|
| ALU | 13 trained .pt models (neural) or native tensor ops (fast) | Neural nets do exact 32-bit integer arithmetic --- exhaustively verified, 100% accuracy |
| OS | 11 neural models (neurOS), zero fallbacks | Learned MMU, TLB, cache, scheduler, assembler, compiler --- the OS is differentiable |
| GPU Compute | Rust Metal kernel, ~200 ARM64 insns (int + FP) | GPU executes arbitrary programs at ~1.9M IPS, zero-copy StorageModeShared |
| UNIX OS | Compiled C on Metal | Fork/pipe/wait, 25-command shell, 28 syscalls |
