# EEmicroGPT

The most extreme way to train a GPT in pure, dependency-free C. Up to 19,000x faster than Python per training sample, optimized for Apple Silicon with SME2.
> "This file is the complete algorithm. Everything else is just efficiency." — Andrej Karpathy, microgpt.py

This file is the *everything else*.
EEmicroGPT is a single-file, dependency-free C implementation of GPT training — forward pass, backward pass, Adam optimizer, and autoregressive generation — optimized from the ground up for Apple Silicon. It trains a character-level name generator on the same architecture and dataset as Karpathy's microgpt.py, producing identical learning dynamics up to 19,000x faster per training sample.
The name stands for "Everything Else" (or "Extreme Efficiency") — the half of the equation that microgpt.py intentionally leaves on the table.
## Quick start

```sh
# Scalar Neon path (any Apple Silicon Mac)
clang -O3 -ffast-math -o eemicrogpt eemicrogpt.c -lm
./eemicrogpt

# SME2 path (M4/M5+, ~2x faster at d_model >= 64)
clang -O3 -mcpu=native+sme2 -ffast-math -o eemicrogpt eemicrogpt.c -lm
./eemicrogpt
```
Requires `names.txt` in the working directory (the Karpathy names dataset, ~32K names).

Configurable at compile time:

```sh
# Smaller/faster model
clang -O3 -ffast-math -DD_MODEL=16 -DN_STEPS=5000 -DLR_INIT=0.008 -o eemicrogpt eemicrogpt.c -lm

# Larger model with SME2
clang -O3 -mcpu=native+sme2 -ffast-math -DD_MODEL=128 -DN_HEADS=8 -DN_STEPS=10000 -DLR_INIT=0.003 -o eemicrogpt eemicrogpt.c -lm
```
## Architecture
A 1-layer GPT matching Karpathy's microgpt exactly:
| Component | Details |
|---|---|
| Layers | 1 transformer block |
| d_model | 64 (configurable: 16, 32, 64, 128) |
| Heads | 4 (configurable) |
| d_ff | 4 * d_model |
| Vocab | 27 (a-z + boundary token) |
| Max seq | 16 |
| Norm | RMS norm (pre-attention, pre-FFN) |
| Activation | ReLU |
| Optimizer | Adam (beta1=0.85, beta2=0.99) |
| LR schedule | Linear decay to zero |
The forward pass: embed → rms_norm → rms_norm → QKV → causal attention → O proj + residual → rms_norm → FFN (expand → ReLU → contract) + residual → LM head → softmax → cross-entropy loss
## Performance
All benchmarks on Apple M5, single P-core.
bpc@1s: best quality reachable within ~1 second of training (batch=16).
| d_model | backend | us/step | steps/1s | loss |
|---------|---------|---------|----------|------|
| 16 | scalar | 57.1 | 16,700 | 2.0869 |
| 32 | scalar | 181 | 5,150 | 2.0747 |
| 64 | scalar | 832 | 1,100 | 2.1384 |
| 64 | SME2 | 589 | 1,700 | 2.0974 |
| 128 | scalar | 4,779 | 220 | 2.2633 |
| 128 | SME2 | 1,904 | 545 | 2.1645 |
**Winner:** d32 scalar at LR=0.007 → loss 2.0747
d32 hits the sweet spot: enough capacity to learn well, small enough to run thousands of steps per second. At d16 the model is capacity-limited; at d64+ the per-step cost eats into the time budget. The fused streaming-mode backward (optimization #10) gives the d64 SME2 path 200 extra steps, closing the gap with d32.
### Convergence reference (long runs, tuned LR)
| d_model | steps | LR | loss |
|---------|-------|-----|------|
| 16 | 100k | 0.006 | ~2.06 (capacity floor) |
| 32 | 500k | 0.002 | 1.92 |
| 64 (SME2) | 1M | 0.0007 | 1.74 (~10 min wall time) |
### Work-equivalent comparison (per training sample, same architecture)
All implementations train the exact same model on the same data. CPython, PyPy, and microgpt.cpp use the autograd Value class approach with batch=1 and f64. rust-microgpt uses the same autograd approach with batch=1 but f32. EEmicroGPT uses explicit forward/backward, batch=16, f32.
d16 (n_embd=16, block_size=16, 10K training samples):
| Implementation | Wall time | us/sample | Speedup |
|---|---|---|---|
| CPython 3.14 | 490s | 49,000 | 1x |
| PyPy 7.3.17 | 176.4s | 17,640 | 2.8x |
| microgpt.cpp | 2.70s | 270 | 181x |
| rust-microgpt | 1.18s | 118 | 415x |
| EEmicroGPT | 0.030s | 3.0 | 16,333x |
d64 (n_embd=64, block_size=16, 1K training samples):
| Implementation | Wall time | us/sample | Speedup |
|---|---|---|---|
| CPython 3.14 | 713s | 713,200 | 1x |
| PyPy 7.3.17 | 301.4s | 301,400 | 2.4x |
| microgpt.cpp | 3.26s | 3,260 | 219x |
| rust-microgpt | 1.62s | 1,620 | 440x |
| EEmicroGPT scalar | 0.047s | 46.8 | 15,239x |
| EEmicroGPT SME2 | 0.037s | 36.8 | 19,380x |
~44x faster than rust-microgpt (the fastest autograd implementation) at both sizes. The gap comes from batched GEMM (16 samples amortize weight loads), float vs double, explicit gradients (no autograd graph), and Neon/SME2 SIMD.
## Why this likely beats a $40K GPU
This isn't just faster than Python — a single M5 P-core likely beats an NVIDIA B200 Blackwell for this workload. Here's the math.
### The compute is trivial
Total FLOPs per training step (forward + backward + Adam), with batch=16 and ~104 active positions:
| d_model | FLOPs/step | params | weight memory |
|---------|-----------|--------|---------------|
| 16 | 2.3M | 4,192 | 16 KB |
| 32 | 8.5M | 14,528 | 57 KB |
| 64 | 32.5M | 53,632 | 210 KB |
| 128 | 127M | 205,568 | 803 KB |
The B200 delivers 90 TFLOPS FP32. At peak throughput, these steps would take:
| d_model | B200 @ 100% | B200 @ 5% | EEmicroGPT (M5) |
|---------|-------------|-----------|-----------------|
| 16 | 0.026 us | 0.5 us | 47 us |
| 32 | 0.094 us | 1.9 us | 182 us |
| 64 | 0.36 us | 7.2 us | 589 us |
| 128 | 1.4 us | 28 us | 1,904 us |
If the B200 could sustain even 5% utilization, it would crush us. It can't. The problem isn't compute — it's everything around the compute.
### The killer microsecond problem
A training step requires ~40 distinct GPU kernel launches: 7 forward GEMMs, 6 backward input-grad GEMMs, 7 weight-grad GEMMs, plus ~20 non-GEMM kernels (embeddings, 6 RMS norms, attention, ReLU, residuals, softmax, loss, Adam).
Each kernel launch has overhead. Measured CUDA launch latency is ~10–20 us host-to-device; back-to-back kernel-to-kernel gaps are ~3–5 us even with CUDA graphs. This is what NVIDIA calls the "killer microsecond" problem — Blackwell executes small kernels faster than the overhead of launching them.
| Scenario | Kernel overhead | Compute | Total | vs M5 d32 |
|----------|---------------|---------|-------|-----------|
| CUDA graphs (best case) | 40 × 3 us = 120 us | ~10 us | ~130 us | 0.7x |
| Async launch (typical) | 40 × 10 us = 400 us | ~10 us | ~410 us | 2.3x slower |
| PyTorch (framework tax) | 40 × 15 us = 600 us | ~10 us | ~610 us | 3.4x slower |
At d32, even the most optimistic GPU scenario (CUDA graphs with custom CUDA kernels) is roughly tied with a single M5 P-core. In practice, nobody writes custom CUDA training loops with CUDA graphs for a 57 KB model — they use PyTorch, which adds ~15 us per operation for Python dispatch, autograd graph construction, and memory management.
### GPU utilization is <5% for these matrices
Our largest GEMM is FFN expand at d128: [512, 128] @ [128, 104] — that's 53K output elements. cuBLAS tiles this into 128×128 thread blocks: ceil(512/128) × ceil(104/128) = 4 × 1 = 4 thread blocks. The B200 has 148 SMs. Four of them do useful work; 144 sit idle.
At d32, FFN expand is [128, 32] @ [32, 104] — one or two thread blocks on 148 SMs. Under 1.5% utilization. The GPU's 20,480 CUDA cores and 8 TB/s HBM3e bandwidth are designed for matrices with M, N, K > 1024. Our matrices are 100x too small.
### CUDA graphs don't fully help
CUDA graphs pre-record a kernel sequence to eliminate host-side launch overhead. But they require static shapes — every tensor dimension must be fixed at graph capture time. Our variable-length sequences (names average ~6.5 chars but range from 2 to 16) break this: we'd need to pad everything to MAX_SEQ=16, throwing away the 41% savings from padding skip, or capture multiple graphs for different lengths.
### The cache advantage
On the M5, our entire training state fits close to the compute:
| d_model | Weights | L1 hit? | Total state | L2 hit? |
|---------|---------|---------|-------------|---------|
| 16 | 16 KB | yes (128 KB L1) | ~180 KB | yes (32 MB L2) |
| 32 | 57 KB | yes | ~630 KB | yes |
| 64 | 210 KB | partial | ~2.3 MB | yes |
| 128 | 803 KB | no | ~8.8 MB | yes |
The M5 L1 serves data at 3-cycle latency (~1.5 ns). The B200's L1 is 39 cycles / ~20 ns per SM, and its L2 is ~150 ns. More importantly, GPU data in L2 doesn't persist across kernel boundaries — each of the 40 kernels reloads from HBM or L2 cache. CPU data stays in cache across the entire training step because there's no kernel boundary, no context switch, no scheduler — just one thread running straight-line code.
### Energy: a 67x power gap

The B200 draws 1000W TDP; the M5 MacBook runs this workload at ~15W, a 67x difference in power draw. Energy per step is power times wall time:
| d_model | M5 time | M5 energy | B200 time (best) | B200 energy | M5 efficiency |
|---------|---------|-----------|-------------------|-------------|---------------|
| 32 | 182 us | 2.7 mJ | ~130 us | 130 mJ | 48x |
| 64 | 589 us | 8.8 mJ | ~140 us | 140 mJ | 16x |
| 128 | 1,904 us | 29 mJ | ~170 us | 170 mJ | 5.9x |
Even at d128 where the B200 wins on wall time, the M5 uses 6x less energy per training step. The GPU pays 1000W whether its 148 SMs are busy or idle.
### Where the GPU wins
The crossover is real — and lower than these theoretical estimates suggest. Our measurements on the same M5 show th
