Cactus

A low-latency AI engine for mobile devices & wearables. Main features:

- Fast: fastest inference on ARM CPU
- Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
- Multimodal: one SDK for speech, vision, and language models
- Cloud fallback: automatically route requests to cloud models if needed
- Energy-efficient: NPU-accelerated prefill

┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill

Quick Demo (Mac)

Step 1: `brew install cactus-compute/cactus/cactus`
Step 2: `cactus transcribe` or `cactus run`

Cactus Engine

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle
    messages,         // JSON chat messages
    response,         // response buffer
    sizeof(response), // buffer size
    options,          // generation options
    nullptr,          // tools JSON
    nullptr,          // streaming callback
    nullptr           // user data
);
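
The call above takes the chat history as a JSON string, so any user-supplied text has to be escaped before it is embedded. A minimal sketch of one way to build that string follows; `json_escape` and `build_messages` are hypothetical helpers, not part of `cactus.h`, and a full JSON library is the safer choice in production.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Escapes a string for use inside a JSON string literal.
std::string json_escape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {  // remaining control characters
                    char buf[7];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Builds a JSON array of {"role", "content"} objects from (role, content) pairs.
std::string build_messages(
    const std::vector<std::pair<std::string, std::string>>& turns) {
    std::string json = "[";
    for (std::size_t i = 0; i < turns.size(); ++i) {
        if (i > 0) json += ",";
        json += "{\"role\":\"" + json_escape(turns[i].first) +
                "\",\"content\":\"" + json_escape(turns[i].second) + "\"}";
    }
    json += "]";
    return json;
}
```

The result of `build_messages(turns)` could then be passed via `.c_str()` as the `messages` argument, provided the string outlives the call.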

Example response from Gemma3-270m

{
    "success": true,        // generation succeeded
    "error": null,          // error details if failed
    "cloud_handoff": false, // true if cloud model used
    "response": "Hi there!",
    "function_calls": [],   // parsed tool calls
    "confidence": 0.8193,   // model confidence
    "time_to_first_token_ms": 45.23,
    "total_time_ms": 163.67,
    "prefill_tps": 1621.89,
    "decode_tps": 168.42,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
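
To read a metric out of the response buffer without pulling in a JSON dependency, something like the sketch below may be enough for quick experiments. `json_number` is a hypothetical helper, not part of the Cactus API. It assumes a flat object and does not guard against the key also appearing inside the `response` text, so a real JSON parser is preferable for anything beyond logging.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Looks up `key` in a flat JSON object and stores its numeric value in `out`.
// Returns false if the key is missing or its value is not a number.
bool json_number(const std::string& json, const std::string& key, double& out) {
    const std::string needle = "\"" + key + "\"";
    std::size_t pos = json.find(needle);
    if (pos == std::string::npos) return false;
    pos = json.find(':', pos + needle.size());
    if (pos == std::string::npos) return false;
    const char* start = json.c_str() + pos + 1;
    char* end = nullptr;
    const double value = std::strtod(start, &end);  // skips leading whitespace
    if (end == start) return false;  // value was true, null, a string, etc.
    out = value;
    return true;
}
```

In the example response above, `prefill_tokens` (28) plus `decode_tokens` (50) equals `total_tokens` (78), which makes a simple sanity check once the fields are extracted.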

Cactus Graph

#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset();

API & SDK References

| Reference | Language | Description |
|-----------|----------|-------------|
| Engine API | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff |
| Graph API | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Python SDK | Python | Mac, Linux |
| Swift SDK | Swift | iOS, macOS, tvOS, watchOS, Android |
| Kotlin SDK | Kotlin | Android, iOS (via KMP) |
| Flutter SDK | Dart | iOS, macOS, Android |
| Rust SDK | Rust | Mac, Linux |
| React Native | JavaScript | iOS, Android |

Model weights: pre-converted weights for all supported models are available at huggingface.co/Cactus-Compute.

Benchmarks (CPU-only, no GPU)

- All weights INT4 quantised
- LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
- LFM-VL: 256px input, values are latency / decode tps
- Parakeet: 20s audio input, values are latency / decode tps
- Missing latency = no NPU support yet

| Device | LFM 1.2B | LFMVL 1.6B | Parakeet 1.1B | VL RAM Usage |
|--------|----------|------------|---------------|--------------|
| Mac M4 Pro | 582/100 | 0.2s/98 | 0.1s/900k+ | 76MB |
| iPad/Mac M3 | 350/60 | 0.3s/69 | 0.3s/800k+ | 70MB |
| iPhone 17 Pro | 327/48 | 0.3s/48 | 0.3s/300k+ | 108MB |
| iPhone 13 Mini | 148/34 | 0.3s/35 | 0.7s/90k+ | 1GB |
| Galaxy S25 Ultra | 255/37 | -/34 | -/250k+ | 1.5GB |
| Pixel 6a | 70/15 | -/15 | -/17k+ | 1GB |
| Galaxy A17 5G | 32/10 | -/11 | -/40k+ | 727MB |
| CMF Phone 2 Pro | - | - | - | - |
| Raspberry Pi 5 | 69/11 | 13.3s/11 | 4.5s/180k+ | 869MB |

Supported Transcription Models

STT: 20s audio input on a MacBook Air (M3)
Benchmark dataset: internal evals with production users

| Model | Params | End-to-end (ms) | Latency (ms) | Decode (tokens/s) | NPU | RTF | WER |
|-------|--------|-----------------|--------------|-------------------|-----|-----|-----|
| UsefulSensors/moonshine-base | 61M | 361 | 182 | 262 | yes | 0.0180 | 0.1395 |
| openai/whisper-tiny | 39M | 232 | 137 | 581 | yes | 0.0116 | 0.1860 |
| openai/whisper-base | 74M | 329 | 178 | 358 | yes | 0.0164 | 0.1628 |
| openai/whisper-small | 244M | 856 | 332 | 108 | yes | 0.0428 | 0.0930 |
| openai/whisper-medium | 769M | 2085 | 923 | 49 | yes | 0.1041 | 0.0930 |
| nvidia/parakeet-ctc-0.6b | 600M | 201 | 201 | 5214285 | yes | 0.0101 | 0.0930 |
| nvidia/parakeet-tdt-0.6b-v3 | 600M | 718 | 718 | 3583333 | yes | 0.0359 | 0.0465 |
| nvidia/parakeet-ctc-1.1b | 1.1B | 279 | 278 | 4562500 | yes | 0.0139 | 0.1628 |
| snakers4/silero-vad | - | - | - | - | - | - | - |

Supported LLMs

Gemma weights are often gated on Hugging Face and require an access token.
Run `huggingface-cli login` and enter your Hugging Face token.

| Model | Features |
|-------|----------|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | tools |
| google/gemma-3-1b-it | completion, gated |
| google/gemma-3n-E2B-it | completion, tools |
| google/gemma-3n-E4B-it | completion, tools |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
| Qwen/Qwen3.5-0.8B | vision, completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| Qwen/Qwen3.5-2B | vision, completion, tools, embed |
| LiquidAI/LFM2-350M | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| tencent/Youtu-LLM-2B | completion, tools, embed |
| nomic-ai/nomic-embed-text-v2-moe | embed |

Roadmap

| Date | Status | Milestone |
|------|--------|-----------|
| Sep 2025 | Done | Released v1 |
| Oct 2025 | Done | Chunked prefill, KVCache Quant (2x prefill) |
| Nov 2025 | Done | Cactus Attention (10 & 1k prefill = same decode) |
| Dec 2025 | Done | Team grows to +6 Research Engineers |
| Jan 2026 | Done | Apple NPU/RAM, 5-11x faster iOS/Mac |
| Feb 2026 | Done | Hybrid inference, INT4, lossless Quant (1.5x) |
| Mar 2026 | Coming | Qualcomm/Google NPUs, 5-11x faster Android |
| Apr 2026 | Coming | Mediatek/Exynos NPUs, Cactus@ICLR |
| May 2026 | Coming | Wearables & custom chips optimisations |
| Jun 2026 | Coming | Torch/JAX model transpilers |

Using this repo

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│ Step 0: if on Linux (Ubuntu/Debian)                                          │
│ sudo apt-get install python3 python3-venv python3-pip cmake                  │
│   build-essential libcurl4-openssl-dev                                       │
│                                                                              │
│ Step 1: clone and setup                                                      │
│ git clone https://github.com/cactus-compute/cactus && cd cactus              │
│ source ./setup                                                               │
│                                                                              │
│ Step 2: use the commands                                                     │
│──────────────────────────────────────────────────────────────────────────────│
│                                                                              │
│  cactus auth                         manage Cloud API key                    │
│    --status                          show key status                         │
│    --clear                           remove saved key                        │
│                                                                              │
│  cactus run <model>                  opens playground (auto downloads)       │
│    --precision INT4|INT8|FP16        quantization (default: INT4)            │
│    --token <token>                   HF token (gated models)                 │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus transcribe [model]           live mic transcription (parakeet-1.1b)  │
│    --file <audio.wav>                transcribe file instead of mic          │
│    --precision INT4|INT8|FP16
