<div align="center">

# mllm

**Fast Multimodal LLM on Mobile Devices**

Fast and lightweight multimodal LLM inference engine for mobile and edge devices

📚 Documentation • 🚀 Quick Start • 💡 Examples • 🛠️ Installation

</div>

## Latest News
- [2026 Mar 18] 🔥🔥🔥 `pymllm` now supports CUDA on Jetson Orin and Jetson Thor devices (experimental; still under active development).
- [2026 Feb 03] 🔥🔥🔥 MLLM QNN AOT support for full-graph execution on NPU! Quick Start, Technical Report
- [2025 Nov 27] Android Demo Update: Enabled stable Qwen3 and DeepSeek-OCR streaming on Android via a novel In-App Go Server Architecture.
- [2025 Nov 23] MLLM v2 released!
- [2025 Aug 28] Support for MLLM V1 is ending soon. Before retirement, V1 will receive one final feature: GPT-OSS support. MLLM will then transition to V2 (available on the V2 branch), which brings brand-new capabilities:
- A more Pythonic model authoring approach with eager execution
- Compilation support for easier NPU integration
- Support for parallel execution of multiple models
- A more refined engineering implementation
- [2025 Jul 30] Added a rotation quantization method for QNN backend models and support for Qwen2-VL 2B (ViT profiling will be integrated in v2).
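The "Pythonic model authoring with eager execution" planned for V2 can be illustrated with a toy sketch. The class and method names below are hypothetical, not the actual mllm v2 API; the point is only the style: layers are plain Python objects, and `forward` runs immediately, so intermediate tensors can be inspected line by line.

```python
# Toy illustration of eager-execution model authoring (hypothetical API,
# NOT the actual mllm v2 interface): every call computes immediately.
import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.standard_normal((in_dim, out_dim)) * 0.02
        self.b = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.w + self.b

class TinyMLP:
    """Eager-style model: composing layers is just calling them."""
    def __init__(self, dim=8, hidden=16):
        self.up = Linear(dim, hidden)
        self.down = Linear(hidden, dim)

    def __call__(self, x):
        h = np.maximum(self.up(x), 0.0)   # ReLU, computed immediately
        return self.down(h)

model = TinyMLP()
y = model(np.ones((1, 8)))
print(y.shape)  # (1, 8)
```

Because there is no deferred graph, debugging a new model is ordinary Python debugging; a compiler pass can still trace such code later for NPU integration.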
## Android Demo & Architecture
We have refactored the Android implementation to use a robust Client-Server architecture entirely on-device.
<table width="100%"> <tr> <td width="50%"> <video src="https://github.com/user-attachments/assets/33581025-3368-4b38-98e8-6a2628b32408" controls="controls" style="max-width: 100%;"></video> </td> <td width="50%"> <video src="https://github.com/user-attachments/assets/edcfb568-4415-41fc-91b0-136a4b9a20e2" controls="controls" style="max-width: 100%;"></video> </td> </tr> </table>

Unlike traditional JNI integration, we introduce an In-App Server layer built with Golang (`mllm_server.aar`). This design decouples the UI from the heavy inference computation.
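The client-server pattern behind the Android demo can be sketched in a few lines. The real in-app server is the Go `mllm_server.aar`; the Python stand-in below only illustrates the decoupling idea: inference runs in its own thread behind a local HTTP endpoint, and the UI acts as a plain HTTP client, so a slow decode never blocks it. The endpoint path and token list are made up for the example.

```python
# Minimal sketch of the in-app client/server pattern: the "server" runs
# in its own thread and streams tokens over loopback HTTP, so the UI
# (the client here) stays responsive. Not the actual mllm server API.
import http.server, threading, urllib.request

TOKENS = ["Hello", " from", " the", " in-app", " server"]

class ChatHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        for tok in TOKENS:              # stand-in for streaming decode
            self.wfile.write(tok.encode())
            self.wfile.flush()

    def log_message(self, *args):       # keep the sketch quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), ChatHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/generate"  # hypothetical route
with urllib.request.urlopen(url) as resp:
    text = resp.read().decode()
server.shutdown()
print(text)  # Hello from the in-app server
```

On Android, the same boundary means the inference process can be restarted or upgraded independently of the UI, which is what makes stable streaming for models like Qwen3 and DeepSeek-OCR practical.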
## Key Features
- Pythonic eager execution – Rapid model development
- Unified hardware support – Arm CPU, OpenCL GPU, QNN NPU
- Advanced optimizations – Quantization, pruning, speculative execution
- NPU-ready IR – Seamless integration with NPU frameworks
- Deployment toolkit – SDK + CLI inference tool
## The Role of MLLM
MLLM is the central hub of the AI inference stack: it connects optimization algorithms such as speculative decoding, pruning, and quantization above with AI compiler/runtime layers (CANN, CUDA, MLIR) below for hardware execution. Highlighted in red in the figure, MLLM bridges algorithm innovation and hardware optimization, making it the node that links the software ecosystem to hardware acceleration.
<div align="center"> <img src="./assets/mllm_role.png" width="80%"> </div>

The mllm framework integrates seamlessly with checkpoints from popular community frameworks. Through mllm-convertor, it directly ingests PyTorch and SafeTensors models, then quantizes and converts them into the mllm format, which is loaded and executed by the mllm Runtime.
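The quantize step in that convertor pipeline can be sketched with plain numpy. mllm-convertor's actual schemes (e.g. w4a8) and file layout are more involved; this shows only the core scale/round/clip arithmetic of symmetric per-tensor int8 quantization.

```python
# Toy version of a convertor-style quantize step: symmetric per-tensor
# int8 quantization of a float weight matrix (illustrative only; not
# the actual mllm-convertor scheme or format).
import numpy as np

def quantize_int8(w):
    amax = np.abs(w).max()
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([[0.5, -1.27], [0.01, 1.0]], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max() <= scale / 2)  # True: error within half a step
```

Quantizing offline in the convertor, rather than at load time, is what lets the Runtime memory-map a compact checkpoint and start decoding immediately on device.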
<div align="center"> <img src="./assets/mllm_workflow.png" width="80%"> </div>

## Supported Models
### mllm v2
| Model (v2) | CPU | Hexagon NPU <br> INT8 |
|---|---|---|
| Qwen3-0.6B | ✔️ w4a8 | |
| Qwen3-1.7B | ✔️ w4a8 | W4A16-SM8650 |
| Qwen3-4B | ✔️ w4a8 | |
| DeepSeek-OCR | ✔️ w4a8 | |
| SmolLM3 | ✔️ w4a8 | |
| Qwen2-VL-2B-Instruct | ✔️ w4a8 | |
| Qwen2-VL-7B-Instruct | ✔️ w4a8 | |
| Qwen2.5-VL-3B-Instruct | ✔️ w4a8 | |
| Qwen2.5-VL-7B-Instruct | ✔️ w4a8 | |
### mllm v1
| Model (v1) | CPU <br> FP32 | CPU <br> INT4 | Hexagon NPU <br> INT8 |
|---|---|---|---|
| LLaMA 2 7B | ✔️ | ✔️ | |
| LLaMA 3 1B | ✔️ | ✔️ | |
| LLaMA 3 3B | ✔️ | ✔️ | |
| Alpaca 7B | ✔️ | ✔️ | |
| TinyLLaMA 1.1B | ✔️ | ✔️ | |
| LLaVA 7B | ✔️ | ✔️ | |
| Gemma 2B | ✔️ | ✔️ | |
| Gemma 2 2B | ✔️ | ✔️ | |
| Qwen 1.5 0.5B | ✔️ | ✔️ | ✔️ |
| Qwen 1.5 1.8B | ✔️ | ✔️ | ✔️ |
| Qwen 2.5 1.5B | ✔️ | ✔️ | ✔️ |
| Qwen 3 0.6B | ✔️ | ✔️ | |
| Mistral 7B | ✔️ | ✔️ | |
| Yi 6B | ✔️ | ✔️ | |
| StableLM 2 1.6B | ✔️ | ✔️ | |
| OPT 1.3B | ✔️ | ✔️ | |
| Phi 3 mini 3.8B | ✔️ | ✔️ | |
| MiniCPM 2B | ✔️ | ✔️ | |
| [MiniCPM 3 4B](h
