ARMOR
ARMOR: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
Jianwen Sun<sup>1,2*</sup> · Yukang Feng<sup>1,2*</sup> · Chuanhao Li<sup>5</sup>· Fanrui Zhang<sup>2,3</sup> <br> Zizhen Li<sup>1,2</sup> · Jiaxin Ai<sup>2,4</sup> · Sizhuo Zhou<sup>2,3</sup> · Pengfei Zhou<sup>5</sup> <br> Yu Dai<sup>1</sup> · Shenglin Zhang<sup>1</sup> · Kaipeng Zhang<sup>2,5†</sup>
<sup>1</sup>Nankai University <sup>2</sup>Shanghai Innovation Institute<br> <sup>3</sup>University of Science and Technology of China <sup>4</sup>Wuhan University <br><sup>5</sup>Shanghai AI Laboratory<br> <br> *equal contribution †corresponding author
<a href="https://arxiv.org/abs/2503.06542"><img src='https://img.shields.io/badge/arXiv-ARMOR-red' alt='Paper PDF'></a> <a href=""><img src='https://img.shields.io/badge/Project_Page-ARMOR-green' alt='Project Page'></a> <a href=""><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
💡 News
2025/03/09: The technical report of ARMOR is released! Our code will be released soon.
📖 Introduction
This repo implements ARMOR, a unified multimodal understanding and generation model. It operates within a single autoregressive framework to enable interleaved image-text inputs and outputs, autonomously selecting the most appropriate response modality for each query.
📖 Overview

ARMOR is a unified understanding and generation model built on a pretrained multimodal large language model (MLLM). It operates within a single autoregressive framework to enable interleaved image-text inputs and outputs, autonomously selecting the most appropriate response modality for each query.

Building on the pretrained MLLM, ARMOR employs a unified embedding space to represent both textual and visual information, reducing model complexity, and introduces an asymmetric encoder-decoder to unify generation and understanding. Trained on a meticulously curated, high-quality dataset of interleaved text and images with our proposed What or How to Generate (WoHG) method, ARMOR preserves much of the original model's capabilities while achieving impressive image-generation performance. Coupled with a forward-switching mechanism, ARMOR produces highly natural text-image interleaved output while requiring minimal computational resources.

These findings indicate that enhancing a pretrained MLLM with an autoregressive architecture and an asymmetric encoder-decoder holds substantial potential for building unified understanding and generation models, and they reaffirm that a fully autoregressive approach remains a promising foundation for unified large-scale model architectures.
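To make the forward-switching idea above concrete, here is a minimal toy sketch (not the released ARMOR code; all names and the logit-based decision rule are our own illustrative assumptions): at each step the decoder first makes a modality decision, then emits a token from the chosen vocabulary, yielding an interleaved text-image sequence.

```python
TEXT, IMAGE = "text", "image"

def forward_switch(switch_logits):
    """Toy modality decision: pick the modality with the highest switch
    logit (a stand-in for the learned What-or-How-to-Generate choice)."""
    return max(switch_logits, key=switch_logits.get)

def generate_interleaved(steps):
    """Toy decoding loop: each step is (switch_logits, token); the output
    is an interleaved sequence of (modality, token) pairs."""
    out = []
    for switch_logits, token in steps:
        out.append((forward_switch(switch_logits), token))
    return out

# Example: a text token followed by a discrete image code.
steps = [({"text": 2.0, "image": 0.5}, "hello"),
         ({"text": 0.1, "image": 3.0}, 1024)]
print(generate_interleaved(steps))  # [('text', 'hello'), ('image', 1024)]
```

In the real model the switch decision and the token distributions would come from the same autoregressive transformer; this sketch only shows the control flow of interleaved emission.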
🏆 Experiments
Understanding Performance

Generation Performance
<img src="assets/gen_result.png" width="450px">
🚀 Usage
- Download VAE ckpt
huggingface-cli download --resume-download HerzogFL/chameleon_vae --local-dir code/VAE --local-dir-use-symlinks False
- coming soon...
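If you prefer the Python API to the CLI, the same checkpoint can be fetched with `huggingface_hub` (a sketch assuming the repo id and target directory from the command above; the helper name is ours):

```python
def download_vae(local_dir="code/VAE"):
    """Download the Chameleon VAE checkpoint into `local_dir`
    (Python-API equivalent of the huggingface-cli command above)."""
    from huggingface_hub import snapshot_download  # pip install huggingface_hub
    return snapshot_download(repo_id="HerzogFL/chameleon_vae", local_dir=local_dir)
```

`snapshot_download` resumes partial downloads by default, so it behaves like the `--resume-download` flag in the CLI command.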
📞 Contact
- Jianwen Sun: sunjianwen@mail.nankai.edu.cn
- Yukang Feng: yukangfeng@mail.nankai.edu.cn
- Kaipeng Zhang: zhangkaipeng@pjlab.org.cn
🖊️ Citation
If you find ARMOR useful in your project or research, please use the following BibTeX entry to cite our paper. Thanks!
@misc{sun2025armorempoweringmultimodalunderstanding,
title={ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability},
author={Jianwen Sun and Yukang Feng and Chuanhao Li and Fanrui Zhang and Zizhen Li and Jiaxin Ai and Sizhuo Zhou and Yu Dai and Shenglin Zhang and Kaipeng Zhang},
year={2025},
eprint={2503.06542},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.06542},
}