candle

Candle is a minimalist ML framework for Rust with a focus on performance (including GPU support) and ease of use. Try our online demos: whisper, LLaMA2, T5, yolo, Segment Anything.

Get started

Make sure that you have candle-core correctly installed as described in Installation.

Let's see how to run a simple matrix multiplication. Write the following to your myapp/src/main.rs file:

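use candle_core::{Device, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Run the computation on the CPU.
    let device = Device::Cpu;

    // Random matrices drawn from a normal distribution (mean 0, std 1):
    // a has shape (2, 3), b has shape (3, 4).
    let a = Tensor::randn(0f32, 1., (2, 3), &device)?;
    let b = Tensor::randn(0f32, 1., (3, 4), &device)?;

    // Matrix product; the result has shape (2, 4).
    let c = a.matmul(&b)?;
    println!("{c}");
    Ok(())
}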

cargo run should display a tensor of shape Tensor[[2, 4], f32].
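
To see why the result has shape (2, 4), here is a minimal sketch of the 2-D matrix product shape rule in plain Rust. It does not use candle, and `matmul_2d` is a hypothetical helper written only for this illustration: an (m, k) matrix times a (k, n) matrix yields an (m, n) matrix, and mismatched inner dimensions are reported as an error.

```rust
/// Multiply two row-major matrices stored as flat slices.
/// Returns the output buffer together with its shape, or an error message
/// when the shapes are incompatible. Illustration only; this is not candle code.
fn matmul_2d(
    a: &[f32],
    a_shape: (usize, usize),
    b: &[f32],
    b_shape: (usize, usize),
) -> Result<(Vec<f32>, (usize, usize)), String> {
    let (m, k) = a_shape;
    let (k2, n) = b_shape;
    // The inner dimensions must agree: (m, k) x (k, n) -> (m, n).
    if k != k2 {
        return Err(format!(
            "shape mismatch: ({}, {}) x ({}, {})",
            m, k, k2, n
        ));
    }
    // Each buffer must hold exactly rows * cols elements.
    if a.len() != m * k || b.len() != k2 * n {
        return Err("buffer length does not match shape".to_string());
    }
    let mut out = vec![0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of `a` with column j of `b`.
            let mut acc = 0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    Ok((out, (m, n)))
}
```

Calling `matmul_2d(&a, (2, 3), &b, (3, 4))` with buffers of 6 and 12 elements returns a buffer of 8 elements with shape (2, 4), the same shape as in the candle example above.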

If you installed candle with CUDA support, define the device to be on the GPU:

- let device = Device::Cpu;
+ let device = Device::new_cuda(0)?;

For more advanced examples, please have a look at the following section.

Check out our examples

These online demos run entirely in your browser:

yolo: pose estimation and object recognition.
whisper: speech recognition.
LLaMA2: text generation.
T5: text generation.
Phi-1.5 and Phi-2: text generation.
Segment Anything Model: image segmentation.
BLIP: image captioning.

We also provide some command-line examples using state-of-the-art models:

LLaMA v1, v2, and v3: general LLM, includes the SOLAR-10.7B variant.
Falcon: general LLM.
Codegeex4: Code completion, code interpreter, web search, function calling, repository-level
GLM4: Open Multilingual Multimodal Chat LMs by THUDM.
Gemma v1 and v2: 2b and 7b+/9b general LLMs from Google Deepmind.
RecurrentGemma: 2b and 7b Griffin-based models from Google that mix attention with an RNN-like state.
Phi-1, Phi-1.5, Phi-2, and Phi-3: 1.3b, 2.7b, and 3.8b general LLMs with performance on par with 7b models.
StableLM-3B-4E1T: a 3b general LLM pre-trained on 1T tokens of English and code datasets. Also supports StableLM-2, a 1.6b LLM trained on 2T tokens, as well as the code variants.
Mamba: an inference-only implementation of the Mamba state space model.
Mistral7b-v0.1: a 7b general LLM with better performance than all publicly available 13b models as of 2023-09-28.
Mixtral8x7b-v0.1: a sparse mixture-of-experts 8x7b general LLM with better performance than a Llama 2 70B model and much faster inference.
StarCoder and StarCoder2: LLMs specialized for code generation.
Qwen1.5: Bilingual (English/Chinese) LLMs.
RWKV v5 and v6: an RNN with transformer-level LLM performance.
Replit-code-v1.5: a 3.3b LLM specialized for code completion.
Yi-6B / Yi-34B: two bilingual (English/Chinese) general LLMs with 6b and 34b parameters.
Quantized LLaMA: quantized version of the LLaMA model using the same quantization techniques as llama.cpp.
Quantized Qwen3 MoE: supports GGUF-quantized Qwen3 MoE models.
Stable Diffusion: text to image generative model, support for the 1.5, 2.1, SDXL 1.0 and Turbo versions.
Wuerstchen: another text to image generative model.
yolo-v3 and yolo-v8: object detection and pose estimation models.
segment-anything: image segmentation model with prompt.
SegFormer: transformer based semantic segmentation model.
Whisper: speech recognition model.
EnCodec: high-quality audio compression model using residual vector quantization.
MetaVoice: foundational model for text-to-speech.
Parler-TTS: large text-to-speech model.
T5, Bert, JinaBert: useful for sentence embeddings.
DINOv2: computer vision model trained using self-supervision (can be used for imagenet classification, depth evaluation, segmentation).
VGG, RepVGG: computer vision models.
BLIP: image to text model, can be used to generate captions for an image.
CLIP: multi-modal vision and language model.
TrOCR: a transformer OCR model, with dedicated submodels for handwriting and printed text recognition.
Marian-MT: neural machine translation model, generates the translated text from the input text.
Moondream: tiny computer-vision model that can answer real-world questions about images.

Run them using commands like:

cargo run --example quantized --release

To use CUDA, add --features cuda to the example command line. If you have cuDNN installed, use --features cudnn for even more speedups.

There are also some wasm examples for whisper and llama2.c. You can either build them with trunk or try them online: whisper, llama2, T5, Phi-1.5 and Phi-2, Segment Anything Model.

For LLaMA2, run the following command to retrieve the weight files and start a test server:

cd candle-wasm-examples/llama2-c
wget https://huggingface.co/spaces/lmz/candle-llama2/resolve/main/model.bin
wget https://huggingface.co/spaces/lmz/candle-llama2/resolve/main/tokenizer.json
trunk serve --release --port 8081

And then head over to http://localhost:8081/.

Useful External Resources

candle-tutorial: A very detailed tutorial showing how to convert a PyTorch model to Candle.
candle-lora: Efficient and ergonomic LoRA implementation for Candle. candle-lora has out-of-the-box LoRA support for many models from Candle, which can be found here.
candle-video: Rust library for text-to-video generation (LTX-Video and related models) built on Candle, focused on fast, Python-free inferen
