# AdderBoard

*Smallest transformer that can add two 10-digit numbers.*

## Install / Use

`/learn @anadim/AdderBoardREADME`
<p align="center"> <img src="adderboard.png" width="500" alt="AdderBoard"> </p>

**Challenge:** build the smallest transformer that can add two 10-digit numbers with ≥ 99% accuracy on a held-out 10K test set.
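The repo's exact harness isn't reproduced here, but the challenge spec above can be sketched as a minimal evaluation loop. This assumes uniform sampling of 10-digit operands and exact string matching; the function names `make_test_set` and `accuracy` are illustrative, not this repo's API.

```python
import random

def make_test_set(n=10_000, digits=10, seed=0):
    """Sample n pairs of uniform 10-digit operands with their true sums."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    return [(a, b, a + b)
            for a, b in ((rng.randint(lo, hi), rng.randint(lo, hi))
                         for _ in range(n))]

def accuracy(predict, test_set):
    """Fraction of problems where the predicted sum string is exactly right."""
    correct = sum(predict(a, b) == str(s) for a, b, s in test_set)
    return correct / len(test_set)

# Sanity check: an oracle predictor scores 1.0 on the held-out set.
tests = make_test_set()
print(accuracy(lambda a, b: str(a + b), tests))  # 1.0
```

A submission passes when `accuracy(model_predict, tests) >= 0.99` on the held-out 10K set.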
This started with Addition Under Pressure, where I gave Claude Code and Codex the same prompt: train the smallest possible transformer that can do 10-digit addition with at least 99% accuracy. Claude Code came back with 6,080 parameters and Codex came back with 1,644. The community has since pushed this dramatically lower.
Maintained by Dimitris Papailiopoulos (@dimitrispapail).
We track two categories:
- Trained — weights learned from data by any training algorithm (SGD, Adam, evolutionary search, etc.). The algorithm must be generic — it should work with any model and dataset, not just this specific problem. This encourages creative ideas around data format, tokenization, curriculum learning, and architecture search.
- Hand-coded — weights set analytically. This is a constructive proof that the architecture can represent addition, regardless of whether SGD would find it.
Both are valid. Both are interesting.
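The trained category explicitly rewards creativity in data format and tokenization. As one hypothetical illustration of the kind of trick this permits (a generic sketch, not any particular submission's pipeline): emitting digits least-significant-first lets an autoregressive model resolve each carry before it has to print the digit the carry affects.

```python
def encode(a: int, b: int) -> str:
    """Serialize an addition prompt with digits reversed (least significant
    first), so the answer can be generated in carry-propagation order."""
    return f"{str(a)[::-1]}+{str(b)[::-1]}="

def target(a: int, b: int) -> str:
    """The answer string the model should emit, also reversed."""
    return str(a + b)[::-1]

# Tiny vocabulary: ten digits plus the two separator symbols.
VOCAB = {ch: i for i, ch in enumerate("0123456789+=")}

def tokenize(s: str) -> list[int]:
    return [VOCAB[ch] for ch in s]

print(encode(1234567890, 9876543210))  # 0987654321+0123456789=
print(target(1234567890, 9876543210))  # 00111111111
```

Because such formatting choices are model-agnostic, they stay within the "generic training algorithm" rule above.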
## Leaderboard
### Hand-Coded Weights (Constructive Proofs)
| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|------|--------|----------|--------|------------|--------------|------------|------|
| 1 | 6* | 100% | zcbtrak | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 (fixed Q, folded norm) | RoPE period-19, hardcoded Q_proj (PE exemption), norm weights folded into tied output head, tied carry hinge gate, shared carry-scale scalar | gist |
| 2 | 8 | 100% | kswain98 | | 1L Qwen-style decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, phase-tied Q projection (1 param), coupled quadratic embedding (1 param), tied carry hinge gate, shared carry-scale scalar | repo |
| 3 | 10 | 100% | lokimorty | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, parametric tied embedding, gate tying via algebraic identity, merged carry scalar | gist |
| 4 | 12 | 100% | lokimorty | | 1L Qwen-derived decoder, d=2, 1h, hd=2, ff=2 | RoPE period-19, parametric tied embedding, sparse attention/MLP, constructive carry hinge | gist |
| 5 | 20 | 100% | yieldthought | | 1L decoder, d=2, 1h, hd=2 | Quadratic tied embedding + tied output head, RoPE-19 digit routing, sparse tied V/O, two-hinge ReLU MLP, parameterless pre-norm | gist |
| 6 | 27 | 100% | Wonderfall (@w0nderfall) | | 1L decoder, d=2, 1h, hd=2 | Tied Q/K + V/O, cross-tied W_vo as MLP w2, factorized quadratic embedding, compressed MLP w1, RoPE period-19 | gist |
| 7 | 28 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, sparse O-proj, tied MLP, matrix broadcast | gist |
| 8 | 31 | 100% | Arch222 | | 1L decoder, d=3, 4h/1kv, hd=2, ff=4 | RoPE offset-targeted queries, sparse O-proj, SwiGLU carry detection, tied embed decode | repo |
| 9 | 33 | 100% | fblissjr | Claude Code + Gemini | 1L decoder, d=3, 3h (d_head=1), ff=4 | ALiBi prefix sum for carry, e^80 softmax anchoring, residual cancellation head, 2-hinge ReLU step, parabolic LM head, float64 | repo |
| 10 | 36 | 100% | alexlitz | | 2L decoder, d=5, 5h+1h | ALiBi slope=log(10) for base-10 weighting, sparse embed, gated ReLU FFN, float64 | gist |
| 11 | 50 | 100% | lichengliu03 | | 1L custom GPT, d=4, 2h, hd=2 | Factorized embed, rotation Q (2 angles), tied embed+V dir, rank-1 MLP, parabolic head, sinusoidal PE (period 11) | repo |
| 12 | 66 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rotation Q (2 angles), sparse c_proj (2 nonzero), parabolic lm_head, factorized embed, sinusoidal PE (period 11) | gist |
| 13 | 87 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Cross-layer sharing, rank-1 projections, sparse gate, low-rank head, frozen scaling params | gist |
| 14 | 93 | 100% | jacobli99 | | 1L decoder, d=2, 5h (MQA), hd=2, ff=4 | Tied parabolic decode, RoPE digit routing, ReLU carry detection | gist |
| 15 | 111 | 100% | corbensorenson | Codex | 1L decoder, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE, SwiGLU, GQA | repo |
| 16 | 116 | 100% | nino | | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, shared RMSNorm vectors, RoPE (hd=2) | gist |
| 17 | 121 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=2 | Tied embed, RoPE digit routing, carry via final norm, SiLU wrap detection | gist |
| 18 | 130 | 100% | cosminscn | | 1L nanoGPT, d=4, 2h | Rank-1 linear, factorized embed, sinusoidal PE (period 11), ReLU carry detection, parabolic logit decoding | gist |
| 19 | 130 | 100% | Wonderfall (@w0nderfall) | Codex | 1L Qwen3, d=3, 4h/1kv, hd=2, ff=3 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 20 | 139 | 100% | Wonderfall (@w0nderfall) | GPT-5.2 Pro + Codex | 1L Qwen3, d=3, 4h/1kv, hd=2 | Tied embed, RoPE digit routing, SiLU carry logic | gist |
| 21 | 148 | 100% | bingbangboom-lab | | 2L Qwen3, d=5, 2h/1kv, hd=2, ff=3 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head, cross-layer sharing | gist |
| 22 | 177 | 100% | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm, low-rank head | gist |
| 23 | 197 | ~100%** | xangma (@xangma) | GPT + Codex | 2L Qwen3, d=5, 2h/1kv, hd=2 | Rank-1 linear, factorized embed, sparse gate, param-free norm | gist |
\* Parameter count debated: the 6 counted parameters sit within an architecture that has 4 additional hardcoded weight values (Q projection and RMSNorm weights) that were counted as parameters in the parent 10-param submission. Under strict counting this model has 10 unique weight values; under the submitter's accounting, 6. See #75 for discussion. We may be approaching the practical minimum for this architecture family.

\*\* Passed 8,192 random tests; not yet independently verified on our 10K test suite.
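Several entries above cite a "carry hinge" or "two-hinge ReLU step". The constructions differ per submission, but the shared idea is that the difference of two shifted ReLUs is an exact 0/1 step on integer digit sums. A standalone sketch of the identity (illustrative only, not any specific submission's weights):

```python
def relu(x):
    return max(0.0, x)

def carry(s):
    """Hard 0/1 carry from an integer column sum s = a_i + b_i + c_in.
    relu(s - 9) - relu(s - 10) is 0 for s <= 9 and exactly 1 for s >= 10,
    since for s >= 10 the two slopes cancel, leaving the constant 1."""
    return relu(s - 9) - relu(s - 10)

def add_with_hinges(a, b, digits=10):
    """Grade-school addition driven entirely by the hinge carry."""
    c, out = 0, []
    for i in range(digits + 1):
        s = (a // 10**i) % 10 + (b // 10**i) % 10 + c
        c = int(carry(s))
        out.append(s - 10 * c)
    return int("".join(map(str, reversed(out))))

print(add_with_hinges(9999999999, 1))  # 10000000000
```

In the hand-coded models this step is wired into an MLP so that the carry signal can be added back into the digit logits; the loop above just shows why two hinges suffice.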
### Trained Weights (Learned from Data)
| Rank | Params | Accuracy | Author | Built with | Architecture | Key Tricks | Link |
|------|--------|----------|--------|------------|--------------|------------|------|
| 1 | 36 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Circular arc embedding (3 params), K=rotation(Q), V=Q, tied O=Q^T, all RMSNorms shared, tied QK norms, down=rotation(up^T) | repo |
| 2 | 39 | 99.91% | lokimorty | | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Circular arc embedding, tied K=V, tied Q/O readout, shared RMSNorms, shared anti-quarter QK norm, repeat-mix shared block | gist |
| 3 | 41 | 100% | tbukic | SuperchargeAI + Claude Code | 1L Qwen3, d=3, 1h/1kv, hd=4, ff=2, RoPE θ=3, SwiGLU | Circular arc embedding (3 params), K=rotation(Q), V=Q, tied O=Q^T, all RMSNorms shared, tied QK norms | repo |
| 4 | 44 | 100% | tbukic
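Parameter counts on both boards follow each submitter's own accounting, and as the footnote on the hand-coded board shows, "total stored values" and "distinct weight values" can disagree once weights are tied or hardcoded. A toy sketch of the two conventions (the helper names and nested-list "tensors" are illustrative, not the board's official counter):

```python
from itertools import chain

def flatten(t):
    """Yield scalars from an arbitrarily nested list of numbers."""
    for x in t:
        if isinstance(x, list):
            yield from flatten(x)
        else:
            yield x

def total_params(tensors):
    """Every stored value counted once: the usual accounting."""
    return sum(1 for t in tensors for _ in flatten(t))

def unique_params(tensors):
    """Distinct values only: the 'strict' accounting from the footnote,
    under which tied or repeated weights count a single time."""
    return len(set(chain.from_iterable(flatten(t) for t in tensors)))

# Toy model: a symmetric 2x2 matrix stores 4 values but only 2 distinct ones.
W = [[0.5, 1.0], [1.0, 0.5]]
print(total_params([W]), unique_params([W]))  # 4 2
```

Neither convention is "the" right one; submissions should just state which they use.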
