Cactus

<img src="assets/banner.jpg" alt="Logo" style="border-radius: 30px; width: 100%;">

[![Docs][docs-shield]][docs-url] [![Website][website-shield]][website-url] [![GitHub][github-shield]][github-url] [![HuggingFace][hf-shield]][hf-url] [![Reddit][reddit-shield]][reddit-url] [![Blog][blog-shield]][blog-url]

A low-latency AI engine for mobile devices & wearables. Main features:

  • Fast: fastest inference on ARM CPU
  • Low RAM: zero-copy memory mapping ensures 10x lower RAM use than other engines
  • Multimodal: one SDK for speech, vision, and language models
  • Cloud fallback: automatically route requests to cloud models if needed
  • Energy-efficient: NPU-accelerated prefill
┌─────────────────┐
│  Cactus Engine  │ ←── OpenAI-compatible APIs for all major languages
└─────────────────┘     Chat, vision, STT, RAG, tool call, cloud handoff
         │
┌─────────────────┐
│  Cactus Graph   │ ←── Zero-copy computation graph (PyTorch for mobile)
└─────────────────┘     Custom models, optimised for RAM & quantisation
         │
┌─────────────────┐
│ Cactus Kernels  │ ←── ARM SIMD kernels (Apple, Snapdragon, Exynos, etc)
└─────────────────┘     Custom attention, KV-cache quant, chunked prefill
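
The zero-copy memory mapping listed among the features can be illustrated with a generic POSIX sketch. This is not Cactus's actual loader: `MappedFile`, `map_file`, and `unmap_file` are hypothetical names, and the sketch only shows the general idea of reading file bytes through the OS page cache instead of copying them into a heap buffer.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only view over a file's bytes, backed by the OS page cache.
// Hypothetical type for illustration; not part of the Cactus API.
struct MappedFile {
    const unsigned char* data = nullptr;
    std::size_t size = 0;
};

// Maps `path` read-only. Returns an empty MappedFile on failure.
MappedFile map_file(const std::string& path) {
    MappedFile mf;
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return mf;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            mf.data = static_cast<const unsigned char*>(p);
            mf.size = len;
        }
    }
    close(fd);  // the mapping stays valid after the descriptor is closed
    return mf;
}

// Releases the mapping and resets the view to empty.
void unmap_file(MappedFile& mf) {
    if (mf.data) munmap(const_cast<unsigned char*>(mf.data), mf.size);
    mf = MappedFile{};
}
```

Pages are faulted in only when touched, so a large weight file does not need to be resident in RAM all at once.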

Quick Demo (Mac)

  • Step 1: brew install cactus-compute/cactus/cactus
  • Step 2: cactus transcribe or cactus run

Cactus Engine

#include "cactus.h"

cactus_model_t model = cactus_init(
    "path/to/weight/folder",
    "path to txt or dir of txts for auto-rag",
    false
);

const char* messages = R"([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Henry Ndubuaku"}
])";

const char* options = R"({
    "max_tokens": 50,
    "stop_sequences": ["<|im_end|>"]
})";

char response[4096];
int result = cactus_complete(
    model,            // model handle
    messages,         // JSON chat messages
    response,         // response buffer
    sizeof(response), // buffer size
    options,          // generation options
    nullptr,          // tools JSON
    nullptr,          // streaming callback
    nullptr           // user data
);

Example response from Gemma3-270m

{
    "success": true,        // generation succeeded
    "error": null,          // error details if failed
    "cloud_handoff": false, // true if cloud model used
    "response": "Hi there!",
    "function_calls": [],   // parsed tool calls
    "confidence": 0.8193,   // model confidence
    "time_to_first_token_ms": 45.23,
    "total_time_ms": 163.67,
    "prefill_tps": 1621.89,
    "decode_tps": 168.42,
    "ram_usage_mb": 245.67,
    "prefill_tokens": 28,
    "decode_tokens": 50,
    "total_tokens": 78
}
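
A caller will usually want to check the `success` and `cloud_handoff` fields before using the response text. The helper below is a minimal sketch of that check, assuming the response buffer holds a flat JSON object like the one above. `json_bool` is a hypothetical helper, not part of `cactus.h`, and its naive scan stands in for a real JSON parser.

```cpp
#include <cctype>
#include <string>

// Returns true if `key` maps to the literal `true` in a flat JSON object.
// Naive scan for illustration only: it does not handle nesting or keys
// that also appear inside string values.
bool json_bool(const std::string& json, const std::string& key) {
    const std::string quoted = "\"" + key + "\"";
    std::size_t pos = json.find(quoted);
    if (pos == std::string::npos) return false;

    pos = json.find(':', pos + quoted.size());
    if (pos == std::string::npos) return false;

    ++pos;  // step past the colon, then skip whitespace
    while (pos < json.size() &&
           std::isspace(static_cast<unsigned char>(json[pos]))) {
        ++pos;
    }
    return json.compare(pos, 4, "true") == 0;
}
```

With the `response` buffer from the earlier call, `json_bool(response, "success")` would gate further processing, and `json_bool(response, "cloud_handoff")` would tell whether a cloud model produced the answer.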

Cactus Graph

#include "cactus.h"

CactusGraph graph;
auto a = graph.input({2, 3}, Precision::FP16);
auto b = graph.input({3, 4}, Precision::INT8);

auto x1 = graph.matmul(a, b, false);
auto x2 = graph.transpose(x1);
auto result = graph.matmul(b, x2, true);

float a_data[6] = {1.1f, 2.3f, 3.4f, 4.2f, 5.7f, 6.8f};
float b_data[12] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};

graph.set_input(a, a_data, Precision::FP16);
graph.set_input(b, b_data, Precision::INT8);

graph.execute();
void* output_data = graph.get_output(result);

graph.hard_reset(); 

API & SDK References

| Reference | Language | Description |
|-----------|----------|-------------|
| Engine API | C | Chat completion, streaming, tool calling, transcription, embeddings, RAG, vision, VAD, vector index, cloud handoff |
| Graph API | C++ | Tensor operations, matrix multiplication, attention, normalization, activation functions |
| Python SDK | Python | Mac, Linux |
| Swift SDK | Swift | iOS, macOS, tvOS, watchOS, Android |
| Kotlin SDK | Kotlin | Android, iOS (via KMP) |
| Flutter SDK | Dart | iOS, macOS, Android |
| Rust SDK | Rust | Mac, Linux |
| React Native | JavaScript | iOS, Android |

Model weights: Pre-converted weights for all supported models at huggingface.co/Cactus-Compute.

Benchmarks (CPU-only, no GPU)

  • All weights INT4 quantised
  • LFM: 1k-prefill / 100-decode, values are prefill tps / decode tps
  • LFM-VL: 256px input, values are latency / decode tps
  • Parakeet: 20s audio input, values are latency / decode tps
  • Missing latency = no NPU support yet

| Device | LFM 1.2B | LFM-VL 1.6B | Parakeet 1.1B | VL RAM Usage |
|--------|----------|-------------|---------------|--------------|
| Mac M4 Pro | 582/100 | 0.2s/98 | 0.1s/900k+ | 76MB |
| iPad/Mac M3 | 350/60 | 0.3s/69 | 0.3s/800k+ | 70MB |
| iPhone 17 Pro | 327/48 | 0.3s/48 | 0.3s/300k+ | 108MB |
| iPhone 13 Mini | 148/34 | 0.3s/35 | 0.7s/90k+ | 1GB |
| Galaxy S25 Ultra | 255/37 | -/34 | -/250k+ | 1.5GB |
| Pixel 6a | 70/15 | -/15 | -/17k+ | 1GB |
| Galaxy A17 5G | 32/10 | -/11 | -/40k+ | 727MB |
| CMF Phone 2 Pro | - | - | - | - |
| Raspberry Pi 5 | 69/11 | 13.3s/11 | 4.5s/180k+ | 869MB |

Supported Transcription Models

  • STT: 20s audio input on a MacBook Air with an M3 chip
  • Benchmark dataset: internal evals with production users

| Model | Params | End2End ms | Latency ms | Decode toks/sec | NPU | RTF | WER |
|-------|--------|------------|------------|-----------------|-----|-----|-----|
| UsefulSensors/moonshine-base | 61M | 361 | 182 | 262 | yes | 0.0180 | 0.1395 |
| openai/whisper-tiny | 39M | 232 | 137 | 581 | yes | 0.0116 | 0.1860 |
| openai/whisper-base | 74M | 329 | 178 | 358 | yes | 0.0164 | 0.1628 |
| openai/whisper-small | 244M | 856 | 332 | 108 | yes | 0.0428 | 0.0930 |
| openai/whisper-medium | 769M | 2085 | 923 | 49 | yes | 0.1041 | 0.0930 |
| nvidia/parakeet-ctc-0.6b | 600M | 201 | 201 | 5214285 | yes | 0.0101 | 0.0930 |
| nvidia/parakeet-tdt-0.6b-v3 | 600M | 718 | 718 | 3583333 | yes | 0.0359 | 0.0465 |
| nvidia/parakeet-ctc-1.1b | 1.1B | 279 | 278 | 4562500 | yes | 0.0139 | 0.1628 |
| snakers4/silero-vad | - | - | - | - | - | - | - |

Supported LLMs

  • Gemma weights are often gated on HuggingFace and require an access token
  • Run huggingface-cli login and enter your HuggingFace token

| Model | Features |
|-------|----------|
| google/gemma-3-270m-it | completion |
| google/functiongemma-270m-it | tools |
| google/gemma-3-1b-it | completion, gated |
| google/gemma-3n-E2B-it | completion, tools |
| google/gemma-3n-E4B-it | completion, tools |
| Qwen/Qwen3-0.6B | completion, tools, embed |
| Qwen/Qwen3-Embedding-0.6B | embed |
| Qwen/Qwen3.5-0.8B | vision, completion, tools, embed |
| Qwen/Qwen3-1.7B | completion, tools, embed |
| Qwen/Qwen3.5-2B | vision, completion, tools, embed |
| LiquidAI/LFM2-350M | completion, tools, embed |
| LiquidAI/LFM2-700M | completion, tools, embed |
| LiquidAI/LFM2-8B-A1B | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Thinking | completion, tools, embed |
| LiquidAI/LFM2.5-1.2B-Instruct | completion, tools, embed |
| LiquidAI/LFM2-2.6B | completion, tools, embed |
| LiquidAI/LFM2-VL-450M | vision, txt & img embed, Apple NPU |
| LiquidAI/LFM2.5-VL-1.6B | vision, txt & img embed, Apple NPU |
| tencent/Youtu-LLM-2B | completion, tools, embed |
| nomic-ai/nomic-embed-text-v2-moe | embed |

Roadmap

| Date | Status | Milestone |
|------|--------|-----------|
| Sep 2025 | Done | Released v1 |
| Oct 2025 | Done | Chunked prefill, KVCache Quant (2x prefill) |
| Nov 2025 | Done | Cactus Attention (10 & 1k prefill = same decode) |
| Dec 2025 | Done | Team grows to +6 Research Engineers |
| Jan 2026 | Done | Apple NPU/RAM, 5-11x faster iOS/Mac |
| Feb 2026 | Done | Hybrid inference, INT4, lossless Quant (1.5x) |
| Mar 2026 | Coming | Qualcomm/Google NPUs, 5-11x faster Android |
| Apr 2026 | Coming | Mediatek/Exynos NPUs, Cactus@ICLR |
| May 2026 | Coming | Wearables & custom chips optimisations |
| Jun 2026 | Coming | Torch/JAX model transpilers |

Using this repo

┌──────────────────────────────────────────────────────────────────────────────┐
│                                                                              │
│ Step 0: if on Linux (Ubuntu/Debian)                                          │
│ sudo apt-get install python3 python3-venv python3-pip cmake                  │
│   build-essential libcurl4-openssl-dev                                       │
│                                                                              │
│ Step 1: clone and setup                                                      │
│ git clone https://github.com/cactus-compute/cactus && cd cactus              │
│ source ./setup                                                               │
│                                                                              │
│ Step 2: use the commands                                                     │
│──────────────────────────────────────────────────────────────────────────────│
│                                                                              │
│  cactus auth                         manage Cloud API key                    │
│    --status                          show key status                         │
│    --clear                           remove saved key                        │
│                                                                              │
│  cactus run <model>                  opens playground (auto downloads)       │
│    --precision INT4|INT8|FP16        quantization (default: INT4)            │
│    --token <token>                   HF token (gated models)                 │
│    --reconvert                       force reconversion from source          │
│                                                                              │
│  cactus transcribe [model]           live mic transcription (parakeet-1.1b)  │
│    --file <audio.wav>                transcribe file instead of mic          │
│    --precision INT4|INT8|FP16  
