PicoLM
Run a 1-billion parameter LLM on a $10 board with 256MB RAM
The Perfect Match: PicoLM + PicoClaw
<div align="center"> <img src="picolm.jpg" alt="PicoLM — Run a 1-billion parameter LLM on a $10 board" width="640"> <br><br> </div>PicoLM was built as the local brain for PicoClaw — an ultra-lightweight AI assistant in Go that runs on $10 hardware. Together, they form a fully offline AI agent — no cloud, no API keys, no internet, no monthly bills.
<table align="center"> <tr align="center"> <td><b>The Hardware</b></td> <td><b>The Architecture</b></td> </tr> <tr> <td align="center"><img src="https://raw.githubusercontent.com/sipeed/picoclaw/main/assets/licheervnano.png" alt="$9.90 LicheeRV Nano" width="360"></td> <td align="center"><img src="https://raw.githubusercontent.com/sipeed/picoclaw/main/assets/arch.jpg" alt="PicoClaw architecture — PicoLM sits in the LLM box" width="420"></td> </tr> <tr> <td align="center"><em>$9.90 — that's the entire server</em></td> <td align="center"><em>PicoLM powers the LLM box in PicoClaw's agent loop</em></td> </tr> </table>Every other LLM provider needs the internet. PicoLM doesn't.
Why they're a perfect fit
| | Cloud Provider (OpenAI, etc.) | PicoLM (Local) |
|---|---|---|
| Cost | Pay per token, forever | Free forever |
| Privacy | Your data sent to servers | Everything stays on-device |
| Internet | Required for every request | Not needed at all |
| Latency | Network round-trip + inference | Inference only |
| Hardware | Needs a $599 Mac Mini | Runs on a $10 board |
| Binary | N/A | ~80KB single file |
| RAM | N/A | 45 MB total |
How it works
PicoClaw's agent loop spawns PicoLM as a subprocess. Messages come in from Telegram, Discord, or CLI — PicoClaw formats them into a chat template, pipes the prompt to picolm via stdin, and reads the response from stdout. When tools are needed, --json grammar mode guarantees valid JSON even from a 1B model.
```text
Telegram / Discord / CLI
          │
          ▼
    ┌──────────┐    stdin: prompt     ┌───────────┐
    │ PicoClaw │ ──────────────────►  │  picolm   │
    │   (Go)   │ ◄──────────────────  │    (C)    │
    └──────────┘   stdout: response   │  + model  │
          │                           └───────────┘
          ▼                             45 MB RAM
    User gets reply                    No internet
```
Quick setup
```sh
# 1. Build PicoLM
cd picolm && make native    # or: make pi (Raspberry Pi)

# 2. Download model (one-time, 638 MB)
make model

# 3. Build PicoClaw
cd ../picoclaw && make deps && make build
```

Step 4 — configure `~/.picoclaw/config.json`:

```json
{
  "agents": {
    "defaults": {
      "provider": "picolm",
      "model": "picolm-local"
    }
  },
  "providers": {
    "picolm": {
      "binary": "~/.picolm/bin/picolm",
      "model": "~/.picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
      "max_tokens": 256,
      "threads": 4,
      "template": "chatml"
    }
  }
}
```

Step 5 — chat, fully offline:

```sh
picoclaw agent -m "What is photosynthesis?"
```
Or install everything in one line
```sh
curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash
```
Performance on real hardware
| Device | Price | Generation Speed | RAM Used |
|--------|-------|------------------|----------|
| Pi 5 (4-core) | $60 | ~10 tok/s | 45 MB |
| Pi 4 (4-core) | $35 | ~8 tok/s | 45 MB |
| Pi 3B+ | $25 | ~4 tok/s | 45 MB |
| Pi Zero 2W | $15 | ~2 tok/s | 45 MB |
| LicheeRV Nano | $10 | ~1 tok/s | 45 MB |
JSON tool calling
PicoClaw automatically activates --json grammar mode when it needs structured output. This guarantees syntactically valid JSON even from a 1B parameter model — essential for reliable tool calling on tiny hardware:
```sh
picoclaw agent -m "Search for weather in Tokyo"
# → PicoLM generates: {"tool_calls": [{"function": {"name": "web_search", "arguments": "{\"query\": \"weather Tokyo\"}"}}]}
```
For the full PicoClaw documentation, see the PicoClaw README.
What is PicoLM?
PicoLM is a minimal, from-scratch LLM inference engine written in ~2,500 lines of C11. It runs TinyLlama 1.1B (and other LLaMA-architecture models in GGUF format) on hardware that most inference frameworks won't even consider:
- Raspberry Pi Zero 2W ($15, 512MB RAM, ARM Cortex-A53)
- Sipeed LicheeRV ($12, 512MB RAM, RISC-V)
- Raspberry Pi 3/4/5 (1-8GB RAM, ARM NEON SIMD)
- Any Linux/Windows/macOS x86-64 machine
The model file (638MB) stays on disk. PicoLM memory-maps it and streams one layer at a time through RAM. Total runtime memory: ~45MB including the FP16 KV cache.
```text
              ┌──────────────────────────────────────────┐
 What goes    │            45 MB Runtime RAM             │
 in RAM       │ ┌─────────┐ ┌──────────┐ ┌───────────┐  │
              │ │ Buffers │ │ FP16 KV  │ │ Tokenizer │  │
              │ │ 1.2 MB  │ │  Cache   │ │  4.5 MB   │  │
              │ │         │ │  ~40 MB  │ │           │  │
              │ └─────────┘ └──────────┘ └───────────┘  │
              └──────────────────────────────────────────┘

              ┌──────────────────────────────────────────┐
 What stays   │         638 MB Model on Disk             │
 on disk      │     (mmap — OS pages in layers           │
 (via mmap)   │        as needed, ~1 at a time)          │
              └──────────────────────────────────────────┘
```
Features
| Feature | Description |
|---------|-------------|
| GGUF Native | Reads GGUF v2/v3 files directly — no conversion needed |
| K-Quant Support | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32 |
| mmap Layer Streaming | Model weights stay on disk; OS pages in one layer at a time |
| FP16 KV Cache | Halves KV cache memory (44MB vs 88MB for 2048 context) |
| Flash Attention | Online softmax — no O(seq_len) attention buffer needed |
| Pre-computed RoPE | cos/sin lookup tables eliminate transcendentals from hot loop |
| SIMD Acceleration | ARM NEON (Pi 3/4/5) and x86 SSE2 (Intel/AMD) auto-detected |
| Fused Dot Products | Dequantize + dot-product in one pass — no intermediate buffer |
| Multi-threaded matmul | Parallel matrix-vector multiply across CPU cores |
| Grammar-Constrained JSON | --json flag forces valid JSON output (for tool calling) |
| KV Cache Persistence | --cache saves/loads prompt state — skip prefill on re-runs |
| BPE Tokenizer | Score-based byte-pair encoding, loaded from GGUF metadata |
| Top-p Sampling | Temperature + nucleus sampling with configurable seed |
| Pipe-friendly | Reads prompts from stdin: echo "Hello" \| ./picolm model.gguf |
| Zero Dependencies | Only libc, libm, libpthread. No external libraries. |
| Cross-platform | Linux, Windows (MSVC), macOS. ARM, x86-64, RISC-V. |
Quick Start
One-liner install (Raspberry Pi / Linux)
```sh
curl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash
```
This will:
- Detect your platform (ARM64, ARMv7, x86-64)
- Install build dependencies (`gcc`, `make`, `curl`)
- Build PicoLM with optimal SIMD flags for your CPU
- Download TinyLlama 1.1B Q4_K_M (638 MB)
- Run a quick test
- Generate a PicoClaw config
- Add `picolm` to your PATH
Build from source
```sh
git clone https://github.com/rightnow-ai/picolm.git
cd picolm/picolm

# Auto-detect CPU (enables SSE2/AVX on x86, NEON on ARM)
make native

# Download a model
make model

# Run it
./picolm /opt/picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
  -p "The meaning of life is" -n 100
```
Build on Windows (MSVC)
```bat
cd picolm
build.bat
picolm.exe model.gguf -p "Hello world" -n 50
```
Platform-specific builds
```sh
make native    # x86/ARM auto-detect (recommended for local machine)
make pi        # Raspberry Pi 3/4/5 (64-bit ARM + NEON SIMD)
make pi-arm32  # Pi Zero / Pi 1 (32-bit ARM)
make cross-pi  # Cross-compile for Pi from x86 (static binary)
make riscv     # RISC-V (Sipeed LicheeRV, etc.)
make static    # Static binary for single-file deployment
make debug     # Debug build with symbols, no optimization
```
Usage
```text
PicoLM — ultra-lightweight LLM inference engine

Usage: picolm <model.gguf> [options]

Generation options:
  -p <prompt>   Input prompt (or pipe via stdin)
  -n <int>      Max tokens to generate (default: 256)
  -t <float>    Temperature (default: 0.8, 0=greedy)
  -k <float>    Top-p / nucleus sampling (default: 0.9)
  -s <int>      RNG seed (default: 42)
  -c <int>      Context length override
  -j <int>      Number of threads (default: 4)

Advanced options:
  --json           Grammar-constrained JSON output mode
  --cache <file>   KV cache file (saves/loads prompt state)
```
Examples
Basic generation:
```sh
./picolm model.gguf -p "Once upon a time" -n 200
```
**Greedy decoding** (deterministic, temperature=0):
