ARMOR
ARMOR: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
Jianwen Sun<sup>1,2*</sup> · Yukang Feng<sup>1,2*</sup> · Chuanhao Li<sup>5</sup>· Fanrui Zhang<sup>2,3</sup> <br> Zizhen Li<sup>1,2</sup> · Jiaxin Ai<sup>2,4</sup> · Sizhuo Zhou<sup>2,3</sup> · Pengfei Zhou<sup>5</sup> <br> Yu Dai<sup>1</sup> · Shenglin Zhang<sup>1</sup> · Kaipeng Zhang<sup>2,5†</sup>
<sup>1</sup>Nankai University <sup>2</sup>Shanghai Innovation Institute<br> <sup>3</sup>University of Science and Technology of China <sup>4</sup>Wuhan University <br><sup>5</sup>Shanghai AI Laboratory<br> <br> *equal contribution †corresponding author
<a href="https://arxiv.org/abs/2503.06542"><img src='https://img.shields.io/badge/arXiv-ARMOR-red' alt='Paper PDF'></a> <a href=""><img src='https://img.shields.io/badge/Project_Page-ARMOR-green' alt='Project Page'></a> <a href=""><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
💡 News
2025/03/09: The technical report of ARMOR is released! Our code will be released soon.
📖 Introduction
This repo implements ARMOR, a unified multimodal understanding and generation model. It operates within a single autoregressive framework to enable interleaved image-text inputs and outputs, autonomously selecting the most appropriate response modality for each query.
📖 Overview

ARMOR is a unified understanding and generation model built on a pretrained multimodal large language model (MLLM). It operates within a single autoregressive framework to enable interleaved image-text inputs and outputs, autonomously selecting the most appropriate response modality for each query.

Building on the pretrained MLLM, ARMOR employs a unified embedding space to represent both textual and visual information, reducing model complexity, and introduces an asymmetric encoder-decoder to unify generation and understanding. Trained on a meticulously curated, high-quality dataset of interleaved text and images with our proposed What or How to Generate (WoHG) method, ARMOR preserves much of the original model's capabilities while achieving impressive image-generation performance. Coupled with a forward-switching mechanism, ARMOR produces highly natural text-image interleaved output while requiring minimal computational resources.

These findings indicate that enhancing a pretrained MLLM with an autoregressive architecture and an asymmetric encoder-decoder holds substantial potential for building unified understanding and generation models, and they reaffirm that a fully autoregressive approach remains a promising foundation for unified large-scale model architectures.
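To make the forward-switching idea above concrete, here is a minimal toy sketch (not the released ARMOR code; all names and the logit-based decision rule are our own illustrative assumptions): at each step the decoder first makes a modality decision, then emits a token from the chosen vocabulary, yielding an interleaved text-image sequence.

```python
TEXT, IMAGE = "text", "image"

def forward_switch(switch_logits):
    """Toy modality decision: pick the modality with the highest switch
    logit (a stand-in for the learned What-or-How-to-Generate choice)."""
    return max(switch_logits, key=switch_logits.get)

def generate_interleaved(steps):
    """Toy decoding loop: each step is (switch_logits, token); the output
    is an interleaved sequence of (modality, token) pairs."""
    out = []
    for switch_logits, token in steps:
        out.append((forward_switch(switch_logits), token))
    return out

# Example: a text token followed by a discrete image code.
steps = [({"text": 2.0, "image": 0.5}, "hello"),
         ({"text": 0.1, "image": 3.0}, 1024)]
print(generate_interleaved(steps))  # [('text', 'hello'), ('image', 1024)]
```

In the real model the switch decision and the token distributions would come from the same autoregressive transformer; this sketch only shows the control flow of interleaved emission.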
🏆 Experiments
Understanding Performance

Generation Performance
<img src="assets/gen_result.png" width="450px">
🚀 Usage
- Download VAE ckpt
huggingface-cli download --resume-download HerzogFL/chameleon_vae --local-dir code/VAE --local-dir-use-symlinks False
- coming soon...
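If you prefer the Python API to the CLI, the same checkpoint can be fetched with `huggingface_hub` (a sketch assuming the repo id and target directory from the command above; the helper name is ours):

```python
def download_vae(local_dir="code/VAE"):
    """Download the Chameleon VAE checkpoint into `local_dir`
    (Python-API equivalent of the huggingface-cli command above)."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id="HerzogFL/chameleon_vae", local_dir=local_dir)
```

`snapshot_download` resumes partial downloads by default, so it behaves like the `--resume-download` flag in the CLI command.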
📞 Contact
- Jianwen Sun: sunjianwen@mail.nankai.edu.cn
- Yukang Feng: yukangfeng@mail.nankai.edu.cn
- Kaipeng Zhang: zhangkaipeng@pjlab.org.cn
🖊️ Citation
If you find ARMOR useful in your project or research, please use the following BibTeX entry to cite our paper. Thanks!
@misc{sun2025armorempoweringmultimodalunderstanding,
title={ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability},
author={Jianwen Sun and Yukang Feng and Chuanhao Li and Fanrui Zhang and Zizhen Li and Jiaxin Ai and Sizhuo Zhou and Yu Dai and Shenglin Zhang and Kaipeng Zhang},
year={2025},
eprint={2503.06542},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.06542},
}