Crane

A pure-Rust inference engine for LLM, VLM, VLA, TTS, and OCR models, powered by Candle. An alternative to llama.cpp that is much simpler and cleaner.

Crane 🦩

Crane focuses on accelerating LLM inference with the kernels in the Candle framework while reducing development overhead, so that models run fast and portably on both CPU and GPU.

Crane (🦩) - Candle-based Rust Accelerated Neural Engine. A high-performance inference framework leveraging Rust's Candle for maximum speed on CPU/GPU.

Supported Models:

  • [x] Qwen3 (0.6B ~ 30B+)
  • [x] Qwen 2.5 (0.5B ~ 72B)
  • [x] Hunyuan Dense
  • [x] Qwen3 VL (2B, 4B)
  • [x] PaddleOCR VL 0.9B / 1.5
  • [x] Moonshine ASR
  • [x] Silero VAD
  • [x] 🎙️ Qwen3-TTS (12Hz, 24kHz, 16-codebook RVQGAN + native Candle decoder, voice cloning)
  • [ ] 🎙️ TTS: Spark-TTS | Orpheus-TTS (WIP)

Submit your models to make them easier for other users to run!

You can run Qwen3-VL 2B locally at high speed: 50x faster than native PyTorch on M1/M2/M3.

Key Advantages:

  • 🚀 Blazing-Fast Inference: Outperforms native PyTorch with Candle's optimized kernels
  • 🦀 Rust-Powered: Eliminate C++ complexity while maintaining native performance
  • 🍎 Apple Silicon Optimized: Achieve GPU acceleration via Metal on macOS devices
  • 🤖 Hardware Agnostic: Unified codebase for CPU/CUDA/Metal execution
  • 🌐 OpenAI compatible API: Supports OpenAI and SGLang interfaces

Crane may be the fastest framework (in both runtime speed and development speed) you can use to build your AI applications!

Crane uses Candle as its only dependency and runs inference at high speed across CPUs and GPUs. Your code compiles into a binary, just as with llama.cpp, but stays much cleaner and simpler.

Most important: Crane is not a low-level SDK. You can call AI abilities out of the box with ease.

We include:

  • Basic LLM chat;
  • VLM chat;
  • OCR with VLM;
  • VLA (on the way);
  • TTS;
  • ASR;
  • VAD;
  • ... (any other ability you want to power with AI.)

🔥 Updates

  • 2026.02.23: 🎙️ Qwen3-TTS support added — full Talker + Code Predictor transformer in Candle, native speech-tokenizer decoder (ONNX fallback), voice cloning (Base model ICL), OpenAI /v1/audio/speech endpoint in crane-oai;
  • 2026.02.18: ⚡ Qwen3 & Hunyuan Dense inference optimization: pre-allocated KV cache, GQA 4D matmul, fused RoPE with cache pre-growth, GGUF quantization, batched decode, smart sampling fallback for large vocabularies;
  • 2026.01.30: PaddleOCR-VL-1.5 supported now! model: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5/;
  • 2025.03.21: 🔥 Qwen2.5 supported with a more transformers-like Rust interface; you can now use Crane much as you would in Python;
  • 2025.03.19: 🔥 project initialized;

AI Abilities Out of the Box

1. OCR

2. more to come

🧐 Why Choose Crane?

While traditional approaches face limitations:

  • PyTorch's suboptimal inference performance
  • llama.cpp's complex C++ codebase and model integration

Crane bridges the gap through:

  1. Candle Framework: Combines Rust's efficiency with PyTorch-like ergonomics
  2. Cross-Platform Acceleration: Metal GPU support achieves 3-5x speedup over CPU-only
  3. Simplified Deployment: Add new models with <100 LOC in most cases

💡 Pro Tip: For macOS developers, Crane delivers performance comparable to llama.cpp with significantly lower maintenance overhead. You can use it out of the box, with no GGUF conversion and no need to install llama.cpp.

Speed up LLM inference on M-series Apple Silicon devices by up to 6x with code that looks almost the same as your Python (no quantization needed!):


```rust
use clap::Parser;
use crane_core::{
    Msg,
    autotokenizer::AutoTokenizer,
    chat::Role,
    generation::{GenerationConfig, based::ModelForCausalLM, streamer::TextStreamer},
    models::{DType, Device, qwen25::Model as Qwen25Model},
};

#[derive(Parser, Debug)]
#[clap(about, version, author)]
struct Args {
    /// Directory containing the downloaded checkpoint.
    #[clap(short('m'), long, default_value = "checkpoints/Qwen2.5-0.5B-Instruct")]
    model_path: String,
}

fn main() {
    crane_core::utils::utils::print_candle_build_info();

    let args = Args::parse();
    let dtype = DType::F16;
    let device = Device::Cpu;

    // Load the tokenizer and the model weights from the same directory.
    let tokenizer = AutoTokenizer::from_pretrained(&args.model_path, None).unwrap();
    let mut model = Qwen25Model::new(&args.model_path, &device, &dtype).unwrap();

    // Decoding settings; `do_sample: false` selects greedy decoding.
    let gen_config = GenerationConfig {
        max_new_tokens: 235,
        temperature: Some(0.67),
        top_p: Some(1.0),
        repetition_penalty: 1.1,
        repeat_last_n: 1,
        do_sample: false,
        pad_token_id: tokenizer.get_token("<|end_of_text|>"),
        eos_token_id: tokenizer.get_token("<|im_end|>"),
        report_speed: true,
    };

    // Conversation history, rendered through the model's chat template.
    let chats = [
        Msg!(Role::User, "hello"),
        Msg!(Role::Assistant, "Hi, how are you?"),
        Msg!(Role::User, "I am OK, tell me some truth about Yoga."),
    ];
    let prompt = tokenizer.apply_chat_template(&chats, true).unwrap();
    println!("prompt templated: {:?}\n", prompt);

    let input_ids = model.prepare_inputs(&prompt).unwrap();
    // Warm up the model before generating.
    let _ = model.warnmup();

    // Stream decoded text to stdout while tokens are generated.
    let mut streamer = TextStreamer {
        tokenizer: tokenizer.clone(),
        buffer: String::new(),
    };
    let output_ids = model
        .generate(&input_ids, &gen_config, Some(&mut streamer))
        .map_err(|e| format!("Generation failed: {}", e))
        .unwrap();

    // Decode the full output once generation has finished.
    let res = tokenizer.decode(&output_ids, false).unwrap();
    println!("Output: {}", res);
}
```

That is all the code you need to run end-to-end chat with Qwen2.5 in pure Rust, with no extra overhead compared with llama.cpp.

Your LLM inference is then 6x faster on Mac without quantization, and enabling quantization could make it faster still!

For CLI chat, run:

```bash
# download models of Qwen2.5
mkdir -p checkpoints/
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct --local-dir checkpoints/Qwen2.5-0.5B-Instruct
cargo run --bin qwenchat --release
```

📖 Usage

The workspace is organized as follows:

  • crane-core: the library crate; all model implementations live here;
  • crane: all apps (runnable AI pipelines such as Qwen2-Chat, Spark-TTS, Qwen2.5-VL, etc.); you can build your own apps inside it, and each app is a binary for demonstration purposes;
  • crane-oai: OpenAI & SGLang compatible API server with continuous batching; see crane-oai/README.md for full documentation.

To build:

  1. Make sure the latest Rust is installed;

  2. Build (choose based on your hardware):

    # CPU
    cargo build --release
    
    # CUDA (GPU)
    cargo build --release --features cuda
    

That's it!

OpenAI API Server

Start a server compatible with OpenAI SDK and SGLang client:

```bash
# Build
# CPU
cargo build -p crane-oai --release
# CUDA
cargo build -p crane-oai --release --features cuda

# Start (auto-detect model type and device)
./target/release/crane-oai --model-path /path/to/Qwen2.5-7B-Instruct

# Or run directly
cargo run -p crane-oai --release -- --model-path /path/to/model --port 8000
```

Then use it with any OpenAI-compatible client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
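The chat completions endpoint is listed among the supported endpoints as handling streaming as well. A minimal sketch of consuming such a stream with only the Python standard library is shown here. It assumes the server emits OpenAI-style server-sent events (lines of the form data: {json}, terminated by data: [DONE]), which this README does not spell out. The helper names iter_sse_deltas and stream_chat are illustrative and not part of Crane.

```python
import json
import urllib.request


def iter_sse_deltas(lines):
    """Yield text fragments from OpenAI-style SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel, as in OpenAI's API
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt, base_url="http://localhost:8000/v1",
                model="Qwen2.5-7B-Instruct"):
    """POST a streaming chat request and yield text as it arrives."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The HTTP response object iterates line by line, yielding bytes.
    with urllib.request.urlopen(req) as resp:
        yield from iter_sse_deltas(resp)
```

With a server running, something like for piece in stream_chat("Hello!"): print(piece, end="", flush=True) would print tokens as they arrive. Keeping the parser separate from the HTTP call makes it easy to check against recorded lines.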

Supported endpoints:

| Family | Endpoint | Description |
|--------|----------|-------------|
| OpenAI | POST /v1/chat/completions | Chat completions (streaming & non-streaming) |
| OpenAI | POST /v1/completions | Text completions |
| OpenAI | POST /v1/audio/speech | Text-to-speech (Qwen3-TTS) |
| OpenAI | GET /v1/models | List models |
| OpenAI | POST /v1/tokenize | Tokenize text |
| OpenAI | POST /v1/detokenize | Detokenize tokens |
| SGLang | POST /generate | Native text generation |
| SGLang | GET /model_info | Model metadata |
| SGLang | GET /server_info | Server stats |
| SGLang | GET /health_generate | Deep health check |
| Mgmt | GET /health | Health check |
| Mgmt | GET /v1/stats | Engine statistics |
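When scripting against the server, it can help to wait on the GET /health endpoint listed above before sending requests. The sketch below assumes /health answers with HTTP 200 once the server is ready; the response body is not documented here, so it is ignored. The names http_ok and wait_until_ready are hypothetical helpers, not part of Crane.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or HTTP error status


def wait_until_ready(probe, timeout_s=60.0, interval_s=0.5):
    """Call `probe()` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, wait_until_ready(lambda: http_ok("http://localhost:8000/health")) blocks for up to a minute while a large model loads. Passing the probe as a callable keeps the retry loop independent of the network.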

Text-to-Speech (Qwen3-TTS): For TTS models, the server adds a /v1/audio/speech endpoint (OpenAI-compatible). Both CustomVoice (predefined speakers) and Base (voice cloning via reference audio) models are supported. response_format currently supports wav and pcm (other formats return 400). See crane-oai/README.md for full TTS API documentation.
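A request to the speech endpoint might be built as follows. This is a sketch that assumes the body follows OpenAI's audio/speech schema (model, input, voice, response_format). The model and voice values are left to the caller because the accepted names are not listed here, and crane-oai/README.md remains the reference for the fields Crane actually accepts. The client-side check mirrors the wav/pcm restriction stated above; build_speech_request and synthesize_to_file are hypothetical helpers.

```python
import json
import urllib.request

# Formats this README lists as supported; others are said to return 400.
SUPPORTED_FORMATS = ("wav", "pcm")


def build_speech_request(text, voice, model,
                         response_format="wav",
                         base_url="http://localhost:8000/v1"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    if response_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    body = {
        "model": model,
        "input": text,
        "voice": voice,  # assumed field name, taken from OpenAI's schema
        "response_format": response_format,
    }
    return urllib.request.Request(
        base_url + "/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(request, path):
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(request) as resp, open(path, "wb") as out:
        out.write(resp.read())
```

Rejecting unsupported formats before sending avoids a round trip that would end in a 400 response.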

TTS Examples

```bash
# CustomVoice: predefined speakers
cargo run --bin tts_custom_voice --release -- vendor/Qwen3-TTS-12Hz-0.6B-CustomVoice

# Voice Clone: clone speech from reference audio (Base model)
cargo run --bin tts_voice_clone --release -- vendor/Qwen3-TTS-12Hz-0.6B-Base

# Auto-detect model type
cargo run --bin tts_simple --release -- vendor/Qwen3-TTS-12Hz-0.6B-Base
```

All TTS examples save generated audio files to data/audio/output.

Multimodal & Vision support: For models like PaddleOCR-VL, the endpoints accept either OpenAI's structured message content, where messages[].content is a list containing entries such as {type: "image_url", image_url: {url: "..."}}, or SGLang's image_url field. See crane-oai/README.md for full API documentation with request/response examples.

Now you can run LLMs extremely fast (about 6x faster than vanilla transformers on M1)!
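For illustration, an OpenAI-style multimodal message can be assembled as below. The sketch assumes the server accepts base64 data URLs in image_url.url, as OpenAI's API does; whether Crane also fetches remote http(s) URLs is not stated here. The helpers image_message and chat_payload are hypothetical names.

```python
import base64
import json


def image_message(image_bytes, prompt, mime="image/png"):
    """Return a user message carrying one image part and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            {"type": "text", "text": prompt},
        ],
    }


def chat_payload(model, messages):
    """Serialize a /v1/chat/completions request body to JSON bytes."""
    return json.dumps({"model": model, "messages": messages}).encode("utf-8")
```

The resulting bytes can be posted to /v1/chat/completions with any HTTP client, or the message dictionary can be passed directly as an element of messages to the OpenAI SDK.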

📁 Project Structure

Crane/
├── crane-core/          # Core library: model implementations, tokenizer, generation
│   └── src/models/      # Model architectures (Qwen 2.5, Qwen 3, Hunyuan, etc.)
├── crane/               # High-level SDK: C
