Cactus
Low-latency AI engine for mobile devices & wearables
Install / Use
/learn @cactus-compute/CactusREADME
Cactus
<img src="assets/banner.jpg" alt="Logo" style="border-radius: 30px; width: 100%;">[![Docs][docs-shield]][docs-url] [![Website][website-shield]][website-url] [![GitHub][github-shield]][github-url] [![HuggingFace][hf-shield]][hf-url] [![Reddit][reddit-shield]][reddit-url] [![Blog][blog-shield]][blog-url]
A low-latency AI engine for mobile devices & wearables. Main features:
- Fast: fastest inference on ARM CPU
- Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
- Multimodal: one SDK for speech, vision, and language models
- Cloud fallback: automatically route requests to cloud models if needed
- Energy-efficient: NPU-accelerated prefill
┌─────────────────┐
│ Cactus Engine │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘ Chat, vision, STT, RAG, tool call, cloud handoff
│
┌─────────────────┐
│ Cactus Graph │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘ Custom models, optimised for RAM & quantisation
│
┌─────────────────┐
│ Cactus Kernels │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘ Custom attention, KV-cache quant, chunked prefill
Quick Demo (Mac)
- Step 1:
brew install cactus-compute/cactus/cactus - Step 2:
cactus transcribeorcactus run
Cactus Engine
#include "cactus.h"
cactus_model_t model = cactus_init(
"path/to/weight/folder",
"path to txt or dir of txts for auto-rag",
false
);
const char* messages = R"([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "My name is Henry Ndubuaku"}
])";
const char* options = R"({
"max_tokens": 50,
"stop_sequences": ["<|im_end|>"]
})";
char response[4096];
int result = cactus_complete(
model, // model handle
messages, // JSON chat messages
response, // response buffer
sizeof(response), // buffer size
options, // generation options
nullptr, // tools JSON
nullptr, // streaming callback
nullptr // user data
);
Example response from Gemma3-270m
{
"success": true, // generation succeeded
"error": null, // error details if failed
"cloud_handoff": false, // true if cloud model used
"response": "Hi there!",
"function_calls": [], // parsed tool calls
"confidence": 0.8193, // model confidence
"time_to_first_token_ms": 45.23,
"total_time_ms": 163.67,
"prefill_tps": 1621.89,
"decode_tps": 168.42,
"ram_usage_mb": 245.67,
"prefill_tokens": 28,
"decode_tokens": 50,
"total_tokens": 78
}
Cactus Graph
#include "cactus.h"
CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);
auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);
float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);
graph.execute();
void* output_data = graph.get_output(result);
graph.hard_reset();
API & SDK References
| Reference | Language | Description | |-----------|----------|-------------| | Engine API | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff | | Graph API | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions | | Python SDK | Python | Mac, Linux | | Swift SDK | Swift | iOS, macOS, tvOS, watchOS, Android | | Kotlin SDK | Kotlin | Android, iOS (via KMP) | | Flutter SDK | Dart | iOS, macOS, Android | | Rust SDK | Rust | Mac, Linux | | React Native | JavaScript | iOS, Android |
Model weights: Pre-converted weights for all supported models at huggingface.co/Cactus-Compute.
Benchmarks (CPU-only, no GPU)
- All weights INT4 quantised
- LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
- LFM-VL: 256px input, values are latency / decode tps
- Parakeet: 20s audio input, values are latency / decode tps
- Missing latency = no NPU support yet
| Device | LFM 1.2B | LFMVL 1.6B | Parakeet 1.1B | VL RAM Usage | |--------|----------|------------|---------------|-----| | Mac M4 Pro | 582/100 | 0.2s/98 | 0.1s/900k+ | 76MB | | iPad/Mac M3 | 350/60 | 0.3s/69 | 0.3s/800k+ | 70MB | | iPhone 17 Pro | 327/48 | 0.3s/48 | 0.3s/300k+ | 108MB | | iPhone 13 Mini | 148/34 | 0.3s/35 | 0.7s/90k+ | 1GB | | Galaxy S25 Ultra | 255/37 | -/34 | -/250k+ | 1.5GB | | Pixel 6a | 70/15 | -/15 | -/17k+ | 1GB | | Galaxy A17 5G | 32/10 | -/11 | -/40k+ | 727MB | | CMF Phone 2 Pro | - | - | - | - | | Raspberry Pi 5 | 69/11 | 13.3s/11 | 4.5s/180k+ | 869MB |
Supported Transcription Model
- STT: 20s audio input on Macbook Air M3 chip
- Benchmark dataset: internal evals with production users
| Model | Params | End2End ms | Latency ms | Decode toks/sec | NPU | RTF | WER | |-------|--------|------------|------------|------------|-----|-----|-----| | UsefulSensors/moonshine-base | 61M | 361 | 182 | 262 | yes | 0.0180 | 0.1395 | | openai/whisper-tiny | 39M | 232 | 137 | 581 | yes | 0.0116 | 0.1860 | | openai/whisper-base | 74M | 329 | 178 | 358 | yes | 0.0164 | 0.1628 | | openai/whisper-small | 244M | 856 | 332 | 108 | yes | 0.0428 | 0.0930 | | openai/whisper-medium | 769M | 2085 | 923 | 49 | yes | 0.1041 | 0.0930 | | nvidia/parakeet-ctc-0.6b | 600M | 201 | 201 | 5214285 | yes | 0.0101 | 0.0930 | | nvidia/parakeet-tdt-0.6b-v3 | 600M | 718 | 718 | 3583333 | yes | 0.0359 | 0.0465 | | nvidia/parakeet-ctc-1.1b | 1.1B | 279 | 278 | 4562500 | yes | 0.0139 | 0.1628 | | snakers4/silero-vad | - | - | - | - | - | - | - |
Supported LLMs
- Gemma weights are often gated on HuggingFace, needs tokens
- Run
huggingface-cli loginand input your huggingface token
| Model | Features |
|-------|----------|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | tools |
| google/gemma-3-1b-it | completion, gated |
| google/gemma-3n-E2B-it | completion, tools |
| google/gemma-3n-E4B-it | completion, tools |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
| Qwen/Qwen3.5-0.8B | vision, completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| Qwen/Qwen3.5-2B | vision, completion, tools, embed |
| LiquidAI/LFM2-350M | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| tencent/Youtu-LLM-2B | completion, tools, embed |
| nomic-ai/nomic-embed-text-v2-moe | embed |
Roadmap
| Date | Status | Milestone | |------|--------|-----------| | Sep 2025 | Done | Released v1 | | Oct 2025 | Done | Chunked prefill, KVCache Quant (2x prefill) | | Nov 2025 | Done | Cactus Attention (10 & 1k prefill = same decode) | | Dec 2025 | Done | Team grows to +6 Research Engineers | | Jan 2026 | Done | Apple NPU/RAM, 5-11x faster iOS/Mac | | Feb 2026 | Done | Hybrid inference, INT4, lossless Quant (1.5x) | | Mar 2026 | Coming | Qualcomm/Google NPUs, 5-11x faster Android | | Apr 2026 | Coming | Mediatek/Exynos NPUs, Cactus@ICLR | | May 2026 | Coming | Wearables & custom chips optimisations | | Jun 2026 | Coming | Torch/JAX model transpilers |
Using this repo
┌──────────────────────────────────────────────────────────────────────────────┐
│ │
│ Step 0: if on Linux (Ubuntu/Debian) │
│ sudo apt-get install python3 python3-venv python3-pip cmake │
│ build-essential libcurl4-openssl-dev │
│ │
│ Step 1: clone and setup │
│ git clone https://github.com/cactus-compute/cactus && cd cactus │
│ source ./setup │
│ │
│ Step 2: use the commands │
│──────────────────────────────────────────────────────────────────────────────│
│ │
│ cactus auth manage Cloud API key │
│ --status show key status │
│ --clear remove saved key │
│ │
│ cactus run <model> opens playground (auto downloads) │
│ --precision INT4|INT8|FP16 quantization (default: INT4) │
│ --token <token> HF token (gated models) │
│ --reconvert force reconversion from source │
│ │
│ cactus transcribe [model] live mic transcription (parakeet-1.1b) │
│ --file <audio.wav> transcribe file instead of mic │
│ --precision INT4|INT8|FP16
Related Skills
node-connect
334.5kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
82.2kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
334.5kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
82.2kCommit, push, and open a PR
