<div align="center">

# mllm

**Fast Multimodal LLM on Mobile Devices**

Fast and lightweight multimodal LLM inference engine for mobile and edge devices

📚 Documentation • 🚀 Quick Start • 💡 Examples • 🛠️ Installation

</div>

## Latest News
- [2026 Mar 18] 🔥🔥🔥 `pymllm` now supports CUDA on Jetson Orin and Jetson Thor devices (experimental; still under active development).
- [2026 Feb 03] 🔥🔥🔥 MLLM QNN AOT support for full-graph execution on NPU! Quick Start, Technical Report
- [2025 Nov 27] Android Demo Update: Enabled stable Qwen3 and DeepSeek-OCR streaming on Android via a novel In-App Go Server Architecture.
- [2025 Nov 23] MLLM v2 released!
- [2025 Aug 28] Support for MLLM V1 is ending soon. Before retirement, V1 will receive one final feature: GPT-OSS support. MLLM will then transition to V2 (available on the V2 branch), which brings brand-new capabilities:
- A more Pythonic model authoring approach with eager execution
- Compilation support for easier NPU integration
- Support for parallel execution of multiple models
- A more refined engineering implementation
- [2025 Jul 30] Added a rotation quantization method for QNN backend models and support for Qwen2-VL 2B (ViT profiling will be integrated in v2).
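The "Pythonic model authoring with eager execution" planned for V2 can be illustrated with a toy sketch. The class and method names below are hypothetical, not the actual mllm v2 API; the point is only the style: layers are plain Python objects, and `forward` runs immediately, so intermediate tensors can be inspected line by line.

```python
# Toy illustration of eager-execution model authoring (hypothetical API,
# NOT the actual mllm v2 interface): every call computes immediately.
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((in_dim, out_dim)) * 0.02
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.w + self.b

class TinyMLP:
    """Eager-style model: composing layers is just calling them."""
    def __init__(self, dim=8, hidden=16):
        self.up = Linear(dim, hidden)
        self.down = Linear(hidden, dim)

    def __call__(self, x):
        h = np.maximum(self.up(x), 0.0)   # ReLU, computed immediately
        return self.down(h)

model = TinyMLP()
y = model(np.ones((1, 8)))
print(y.shape)  # (1, 8)
```

Because there is no deferred graph, debugging a new model is ordinary Python debugging; a compiler pass can still trace such code later for NPU integration.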
## Android Demo & Architecture
We have refactored the Android implementation to use a robust Client-Server architecture entirely on-device.
<table width="100%"> <tr> <td width="50%"> <video src="https://github.com/user-attachments/assets/33581025-3368-4b38-98e8-6a2628b32408" controls="controls" style="max-width: 100%;"></video> </td> <td width="50%"> <video src="https://github.com/user-attachments/assets/edcfb568-4415-41fc-91b0-136a4b9a20e2" controls="controls" style="max-width: 100%;"></video> </td> </tr> </table>

Unlike traditional JNI integration, we introduce an In-App Server layer built with Golang (`mllm_server.aar`). This design decouples the UI from the heavy inference computation.
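The client-server pattern behind the Android demo can be sketched in a few lines. The real in-app server is the Go `mllm_server.aar`; the Python stand-in below only illustrates the decoupling idea: inference runs in its own thread behind a local HTTP endpoint, and the UI acts as a plain HTTP client, so a slow decode never blocks it. The endpoint path and token list are made up for the example.

```python
# Minimal sketch of the in-app client/server pattern: the "server" runs
# in its own thread and streams tokens over loopback HTTP, so the UI
# (the client here) stays responsive. Not the actual mllm server API.
import http.server, threading, urllib.request

TOKENS = ["Hello", " from", " the", " in-app", " server"]

class ChatHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        for tok in TOKENS:              # stand-in for streaming decode
            self.wfile.write(tok.encode())
            self.wfile.flush()

    def log_message(self, *args):       # keep the sketch quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), ChatHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/generate"  # hypothetical route
with urllib.request.urlopen(url) as resp:
    text = resp.read().decode()
server.shutdown()
print(text)  # Hello from the in-app server
```

On Android, the same boundary means the inference process can be restarted or upgraded independently of the UI, which is what makes stable streaming for models like Qwen3 and DeepSeek-OCR practical.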
## Key Features
- Pythonic eager execution – Rapid model development
- Unified hardware support – Arm CPU, OpenCL GPU, QNN NPU
- Advanced optimizations – Quantization, pruning, speculative execution
- NPU-ready IR – Seamless integration with NPU frameworks
- Deployment toolkit – SDK + CLI inference tool
## The Role of MLLM
MLLM is the central hub of the AI inference stack: it connects optimization algorithms such as speculative decoding, pruning, and quantization above with AI compiler/runtime layers (CANN, CUDA, MLIR) below for hardware execution. Highlighted in red in the figure, MLLM bridges algorithm innovation and hardware optimization, making it the node that links the software ecosystem to hardware acceleration.
<div align="center"> <img src="./assets/mllm_role.png" width="80%"> </div>

The mllm framework integrates seamlessly with checkpoints from popular community frameworks. Through mllm-convertor, it directly ingests PyTorch and SafeTensors models, then quantizes and converts them into the mllm format, which is loaded and executed by the mllm Runtime.
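The quantize step in that convertor pipeline can be sketched with plain numpy. mllm-convertor's actual schemes (e.g. w4a8) and file layout are more involved; this shows only the core scale/round/clip arithmetic of symmetric per-tensor int8 quantization.

```python
# Toy version of a convertor-style quantize step: symmetric per-tensor
# int8 quantization of a float weight matrix (illustrative only; not
# the actual mllm-convertor scheme or format).
import numpy as np

def quantize_int8(w):
    amax = np.abs(w).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.27], [0.01, 1.0]], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max() <= scale / 2)  # True: error within half a step
```

Quantizing offline in the convertor, rather than at load time, is what lets the Runtime memory-map a compact checkpoint and start decoding immediately on device.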
<div align="center"> <img src="./assets/mllm_workflow.png" width="80%"> </div>

## Supported Models
### mllm v2
| Model (v2) | CPU | Hexagon NPU <br> INT8 |
|---|---|---|
| Qwen3-0.6B | ✔️ w4a8 | |
| Qwen3-1.7B | ✔️ w4a8 | W4A16-SM8650 |
| Qwen3-4B | ✔️ w4a8 | |
| DeepSeek-OCR | ✔️ w4a8 | |
| SmolLM3 | ✔️ w4a8 | |
| Qwen2-VL-2B-Instruct | ✔️ w4a8 | |
| Qwen2-VL-7B-Instruct | ✔️ w4a8 | |
| Qwen2.5-VL-3B-Instruct | ✔️ w4a8 | |
| Qwen2.5-VL-7B-Instruct | ✔️ w4a8 | |
### mllm v1
| Model (v1) | CPU <br> FP32 | CPU <br> INT4 | Hexagon NPU <br> INT8 |
|---|---|---|---|
| LLaMA 2 7B | ✔️ | ✔️ | |
| LLaMA 3 1B | ✔️ | ✔️ | |
| LLaMA 3 3B | ✔️ | ✔️ | |
| Alpaca 7B | ✔️ | ✔️ | |
| TinyLLaMA 1.1B | ✔️ | ✔️ | |
| LLaVA 7B | ✔️ | ✔️ | |
| Gemma 2B | ✔️ | ✔️ | |
| Gemma 2 2B | ✔️ | ✔️ | |
| Qwen 1.5 0.5B | ✔️ | ✔️ | ✔️ |
| Qwen 1.5 1.8B | ✔️ | ✔️ | ✔️ |
| Qwen 2.5 1.5B | ✔️ | ✔️ | ✔️ |
| Qwen 3 0.6B | ✔️ | ✔️ | |
| Mistral 7B | ✔️ | ✔️ | |
| Yi 6B | ✔️ | ✔️ | |
| StableLM 2 1.6B | ✔️ | ✔️ | |
| OPT 1.3B | ✔️ | ✔️ | |
| Phi 3 mini 3.8B | ✔️ | ✔️ | |
| MiniCPM 2B | ✔️ | ✔️ | |
| [MiniCPM 3 4B](h
