Cheers <img src="fig/logo.png" width="25"> : Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
<p align="center"> 🤗 <a href="https://huggingface.co/ai9stars/Cheers">Hugging Face</a>   |   📑 <a href="https://arxiv.org/abs/2603.12793">Paper</a>   </p>

Yichen Zhang<sup>1*</sup>, Da Peng<sup>2*</sup>, Zonghao Guo<sup>1†</sup>, Zijian Zhang<sup>3</sup>, Xuesong Yang<sup>3</sup>,
Tong Sun<sup>3</sup>, Shichu Sun<sup>3</sup>, Yidan Zhang<sup>3</sup>, Yanghao Li<sup>1</sup>, Haiyan Zhao<sup>1</sup>, Wang Xu<sup>1</sup>,
Qi Shi<sup>1</sup>, Yangang Sun<sup>1</sup>, Chi Chen<sup>1</sup>, Shuo Wang<sup>1</sup>, Yukun Yan<sup>1</sup>, Xu Han<sup>1</sup>,
Qiang Ma<sup>1</sup>, Wei Ke<sup>2</sup>, Liang Wang<sup>3</sup>, Zhiyuan Liu<sup>1</sup>, Maosong Sun<sup>1</sup>
<sup>1</sup>Tsinghua University, <sup>2</sup>Xi'an Jiaotong University, <sup>3</sup>University of Chinese Academy of Sciences
* Equal contribution † Corresponding author
</div>

<img src="fig/case.png" width="100%">

🔥 News
- [2026/03/25] 🔥 Our training framework now supports image editing training. Please refer to format.jsonl for the data organization.
- [2026/03/19] 🎉 Demo is now available on Hugging Face. Thanks to Prithiv Sakthi for setting it up!
- [2026/03/16] 📢 The Cheers paper is officially released.
- [2026/03/16] 🛠 We open-source the evaluation code and training pipeline. Our codebase is highly efficient: training on 3.8M samples takes only about two days on a single machine with 8×A100 GPUs.
- [2026/03/16] 📦 The model checkpoints of Cheers are now available.
🌟 What is Cheers?
A cutting-edge topic in multimodal modeling is unifying visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making joint optimization within a shared feature space non-trivial. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that first decodes visual semantics and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on the popular benchmarks GenEval and MMBench while requiring only 20% of the training cost, demonstrating effective and efficient (i.e., 4× token compression) unified multimodal modeling.
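To make the "semantically gated detail residual" idea concrete, here is a minimal NumPy sketch of the fusion step. The per-dimension linear gate (`gate_w`, `gate_b`) is a hypothetical stand-in for illustration, not the actual Cheers head: semantics are decoded first, then patch-level detail residuals are scaled by a semantics-conditioned gate in [0, 1] before being added back.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_detail_fusion(semantic, detail, gate_w, gate_b):
    """Inject detail residuals, gated by the semantic representation.

    semantic: (n_tokens, dim) decoded semantic features
    detail:   (n_tokens, dim) patch-level detail residuals
    gate_w, gate_b: hypothetical gate parameters (a per-dim linear gate)
    """
    # Gate in [0, 1], conditioned on each token's semantics.
    gate = sigmoid(semantic @ gate_w + gate_b)  # (n_tokens, dim)
    # Refined output: semantics plus gated high-frequency detail.
    return semantic + gate * detail

rng = np.random.default_rng(0)
sem = rng.normal(size=(16, 8))   # 16 semantic tokens, dim 8
det = rng.normal(size=(16, 8))   # matching detail residuals
w = rng.normal(size=(8, 8)) * 0.1
b = np.zeros(8)
out = gated_detail_fusion(sem, det, w, b)
print(out.shape)  # (16, 8)
```

When the detail residual is zero, the output reduces to the semantic features alone, which is what keeps the semantic pathway stable for understanding tasks.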
🧱 Model Architecture
<img src="fig/model.png" width="100%">

🚀 Quick Start
Set up a new virtual environment
conda create -n cheers python=3.11 -y
conda activate cheers
pip install -r requirements.txt
Inference
import os
import torch
from PIL import Image
from torchvision.utils import save_image
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "Your Local Checkpoints Path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model.to(device)
model = model.to(torch.bfloat16)
model.eval()
1️⃣ Text-to-image generation
content = "Your instruction."
images_batch = [None]
2️⃣ Image Understanding
content = "<im_start><image><im_end>\n Your instruction."
img = Image.open("image_path")
images_batch = [img,]
3️⃣ Text-only Question Answering
content = "Your instruction."
images_batch = [None]
Then run the following code:
messages_batch = [[{"role": "user", "content": content}],]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}
gen_config = {
"max_length": 300,
"cfg_scale": 9.5, # if generation
"temperature": 0.0,
"num_inference_steps": 80, # if generation
"alpha": 0.5, # if generation
"edit_image": False # if generation
}
inputs.update(gen_config)
generated = model.generate(**inputs)
input_ids = generated["input_ids"]
images = generated["images"][0] # if generation
current_img = images[0] # if generation
current_img = current_img.clamp(0.0, 1.0) # if generation
save_image(current_img, "outputs/case_.png") # if generation
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)) # if understanding or text-only qa
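The three modes above differ only in how `content` and `images_batch` are prepared. A small helper (hypothetical, not part of the Cheers API) makes that explicit; for image understanding the caller supplies an already-opened PIL image:

```python
def build_inputs(mode, instruction, image=None):
    """Prepare (content, images_batch) for the three inference modes.

    mode:  "t2i" (text-to-image), "understand", or "qa" (text-only QA)
    image: a PIL.Image for "understand" mode, otherwise None
    """
    if mode == "understand":
        if image is None:
            raise ValueError("understanding mode requires an image")
        # Image tokens are wrapped in <im_start> ... <im_end> in the prompt.
        return f"<im_start><image><im_end>\n {instruction}", [image]
    if mode in ("t2i", "qa"):
        # Generation and text-only QA use a plain instruction and no image.
        return instruction, [None]
    raise ValueError(f"unknown mode: {mode}")

content, images_batch = build_inputs("t2i", "A corgi surfing a wave.")
print(content, images_batch)  # A corgi surfing a wave. [None]
```

The returned pair drops straight into the `processor(text=..., images=...)` call shown above.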
Alternatively, you can directly run the code in Inference/ for a quick demo.
🔧 Training
Please follow the VeOmni framework guidelines to set up the training environment. The training workspace is located in the Training/ directory. Alternatively, you can enter the Training/ directory and run pip install -e . to quickly set up the training environment. Then run the following scripts:
bash train_align.sh # for alignment
or
bash train_sft.sh # for training all parameters except the VAE.
Notably, the training data format can follow the template at Training/data/format.jsonl. Please also remember to update the training configuration in Training/configs/multimodal/cheers/und_gen_train/.
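Before launching training, it can help to sanity-check the data file: each line of a .jsonl file must be one self-contained JSON object. A generic validator (independent of the exact Cheers schema, which is defined by Training/data/format.jsonl) might look like:

```python
import json

def check_jsonl(path):
    """Verify that every non-empty line parses as a JSON object.

    Returns the number of valid records; raises on the first bad line.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            count += 1
    return count
```

For example, `check_jsonl("Training/data/format.jsonl")` returns the record count if the file is well-formed, and fails fast on the first malformed line otherwise.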
💡 After training, please replace the config.json file in the output directory with Training/cheers_config/config.json to ensure correct evaluation, and also copy Training/cheers_config/modeling_umm.py into the output directory.
📊 Evaluation
Understanding
Please follow the VLMEvalKit framework guidelines to set up the evaluation environment. The evaluation workspace is located in the Evaluation_Understanding/ directory. Then you can run the following scripts:
torchrun --master_port=19507 --nproc-per-node=8 run.py --data \
MathVista_MINI MMBench_DEV_EN_V11 SEEDBench_IMG \
MMStar POPE RealWorldQA MMMU_DEV_VAL ScienceQA_TEST \
AI2D_TEST OCRBench TextVQA_VAL ChartQA_TEST \
--model Cheers --verbose
Similarly, you can directly run the following script to perform the evaluation:
bash eval.sh
Please make sure to update the dataset path in eval.sh and the model path in vlmeval/config.py before running the script.
GenEval
Please follow the GenEval framework guidelines to set up the GenEval evaluation environment. The evaluation workspace is located in the Evaluation_GenEval/ directory. Before running the evaluation, please download the Mask2Former object detector and place it in the models/ directory. Then you can run the following scripts:
bash generation/run_short.sh
or
bash generation/run_long.sh # for rewritten prompts
Use the following script to get the final score:
bash calculate.sh
DPGBench
Please follow the ELLA framework guidelines to set up the DPGBench evaluation environment. Before running the evaluation, please download the mPLUG VQA model and place it in benchmarks/dpg/mplug_visual-question-answering_coco_large_en. Then you can run the following scripts:
bash Evaluation_DPGBench/scripts/dpg_gen.sh
Then:
bash Evaluation_DPGBench/scripts/dpg_eval.sh # Remember to replace the image folder
🧩 To-Do List
- [x] Release the Inference Scripts and Checkpoints
- [x] Release the Training Scripts using the VeOmni framework
- [x] Release the Evaluation Scripts
- [ ] Release the Training Data
- [ ] Release Cheers v1.1 — maintaining strong understanding performance while further improving generation quality
🙏 Acknowledgement
This repo benefits from VeOmni, VLMEvalKit, GenEval, and ELLA. Thanks for their wonderful work.
📬 Contact
For any questions or collaborations, feel free to contact us : )
<p align="left"> 📧 <a href="mailto:guozonghao96@outlook.com">guozonghao96@outlook.com</a>    |    📧 <a href="mailto:yichen0zhang@gmail.com">yichen0zhang@gmail.com</a>    |    📧 <a href="mailto:MetaPDa@gmail.com">MetaPDa@gmail.com</a>    </p>

📖 Citation
If you find Cheers useful, please consider citing our paper.
