<div align="center">

Cheers <img src="fig/logo.png" width="25"> : Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

<p align="center"> 🤗 <a href="https://huggingface.co/ai9stars/Cheers">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2603.12793">Paper</a>&nbsp;&nbsp; </p>

Yichen Zhang<sup>1*</sup>, Da Peng<sup>2*</sup>, Zonghao Guo<sup>1†</sup>, Zijian Zhang<sup>3</sup>, Xuesong Yang<sup>3</sup>,

Tong Sun<sup>3</sup>, Shichu Sun<sup>3</sup>, Yidan Zhang<sup>3</sup>, Yanghao Li<sup>1</sup>, Haiyan Zhao<sup>1</sup>, Wang Xu<sup>1</sup>,

Qi Shi<sup>1</sup>, Yangang Sun<sup>1</sup>, Chi Chen<sup>1</sup>, Shuo Wang<sup>1</sup>, Yukun Yan<sup>1</sup>, Xu Han<sup>1</sup>,

Qiang Ma<sup>1</sup>, Wei Ke<sup>2</sup>, Liang Wang<sup>3</sup>, Zhiyuan Liu<sup>1</sup>, Maosong Sun<sup>1</sup>

<sup>1</sup>Tsinghua University, <sup>2</sup>Xi'an Jiaotong University, <sup>3</sup>University of Chinese Academy of Sciences

* Equal contribution † Corresponding author

</div> <img src="fig/case.png" width="100%">

🔥 News

  • [2026/03/25] 🔥 Our training framework supports image editing training. Please refer to format.jsonl for data organization.
  • [2026/03/19] 🎉 Demo is now available on Hugging Face. Thanks to Prithiv Sakthi for setting it up!
  • [2026/03/16] 📢 The Cheers paper is officially released.
  • [2026/03/16] 🛠 We open-source the evaluation code and training pipeline. Our codebase is highly efficient: training on 3.8M samples takes only about two days on a single machine with 8×A100 GPUs.
  • [2026/03/16] 📦 The model checkpoints of Cheers are now available.

🌟 What is Cheers?

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making joint optimization within a shared feature space non-trivial. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating effective and efficient (4× token compression) unified multimodal modeling.
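To make the "semantically gated detail residual" idea concrete, here is a minimal conceptual sketch in PyTorch. All names and layer choices below are illustrative assumptions, not the actual Cheers implementation: a gate computed from the decoded semantic tokens decides where patch-level detail features are injected.

```python
import torch
import torch.nn as nn

class GatedDetailResidual(nn.Module):
    """Conceptual sketch (not the official Cheers code): fuse patch-level
    detail features into decoded semantics via a semantically derived gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, semantic: torch.Tensor, detail: torch.Tensor) -> torch.Tensor:
        # The gate lies in [0, 1] and is computed from the semantic tokens,
        # so detail is injected only where the semantics call for it.
        g = self.gate(semantic)
        return semantic + g * self.proj(detail)

fuse = GatedDetailResidual(64)
sem = torch.randn(2, 16, 64)   # decoded semantic tokens (batch, tokens, dim)
det = torch.randn(2, 16, 64)   # patch-level detail features from the tokenizer
out = fuse(sem, det)
print(out.shape)  # torch.Size([2, 16, 64])
```

The additive residual form keeps the semantic pathway intact when the gate saturates toward zero, which matches the paper's goal of stabilizing semantics for understanding while refining high-frequency content for generation.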

🧱 Model Architecture

<img src="fig/model.png" width="100%">

🚀 Quick Start

Set up a new virtual environment

conda create -n cheers python=3.11 -y
conda activate cheers
pip install -r requirements.txt

Inference

import os
import torch
from PIL import Image  # needed for image-understanding inputs
from torchvision.utils import save_image  # needed to save generated images
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "Your Local Checkpoints Path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model = model.to(device=device, dtype=torch.bfloat16)
model.eval()

1️⃣ Text-to-image generation

content = "Your instruction."
images_batch = [None]

2️⃣ Image Understanding

content = "<im_start><image><im_end>\n Your instruction."
img = Image.open("image_path")
images_batch = [img,]

3️⃣ Text-only Question Answering

content = "Your instruction."
images_batch = [None]

Then run the following code:

messages_batch = [[{"role": "user", "content": content}]]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

gen_config = {
    "max_length": 300,
    "cfg_scale": 9.5,  # if generation
    "temperature": 0.0,
    "num_inference_steps": 80,  # if generation
    "alpha": 0.5,  # if generation
    "edit_image": False,  # if generation
}

inputs.update(gen_config)
generated = model.generate(**inputs)
input_ids = generated["input_ids"]

os.makedirs("outputs", exist_ok=True)  # if generation
images = generated["images"][0]  # if generation
current_img = images[0]  # if generation
current_img = current_img.clamp(0.0, 1.0)  # if generation
save_image(current_img, "outputs/case_.png")  # if generation
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)) # if understanding or text-only qa

Alternatively, you can directly run the code in Inference/ for a quick demo.
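The `cfg_scale` entry in `gen_config` suggests classifier-free guidance is used during image generation. The snippet below sketches the standard CFG combination rule as a general illustration; the exact rule inside Cheers' `generate()` may differ.

```python
import numpy as np

def cfg_combine(uncond_pred, cond_pred, cfg_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction along the conditional direction. General sketch only --
    not necessarily the exact formulation used inside Cheers."""
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)

uncond = np.array([0.1, 0.2])  # prediction with the prompt dropped
cond = np.array([0.3, 0.6])    # prediction with the prompt
# Larger cfg_scale (e.g. 9.5 in gen_config) pushes the output further
# toward what the prompt asks for, at some cost to diversity.
print(cfg_combine(uncond, cond, 9.5))
```

With `cfg_scale = 1.0` the rule reduces to the conditional prediction, and with `0.0` to the unconditional one, which is a quick way to sanity-check an implementation.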

🔧 Training

Please follow the VeOmni framework guidelines to set up the training environment. The training workspace is located in the Training/ directory. Alternatively, you can enter the Training/ directory and run pip install -e . to quickly set up the environment. Then run the following scripts:

bash train_align.sh # for alignment

or

bash train_sft.sh # for training all parameters except the VAE.

The training data format follows the template at Training/data/format.jsonl. Please also remember to update the training configuration in Training/configs/multimodal/cheers/und_gen_train/.
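As a rough illustration of what a JSON Lines training record might look like, the snippet below builds and serializes one sample. The authoritative schema is Training/data/format.jsonl; every field name here is a hypothetical placeholder, not the repo's actual format.

```python
import json

# Hypothetical record shape -- check Training/data/format.jsonl for the real
# schema; these field names are illustrative only.
sample = {
    "images": ["path/to/source.png"],
    "conversations": [
        {"role": "user", "content": "<image>\nDescribe this image."},
        {"role": "assistant", "content": "A red bicycle leaning against a wall."},
    ],
}

# JSONL stores one JSON object per line, so a dataset file is just
# many such lines appended together.
line = json.dumps(sample, ensure_ascii=False)
print(line)
```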

💡 After training, replace the config.json file in the output directory with Training/cheers_config/config.json to ensure correct evaluation, and also copy Training/cheers_config/modeling_umm.py into the output directory.

📊 Evaluation

Understanding

Please follow the VLMEvalKit framework guidelines to set up the evaluation environment. The evaluation workspace is located in the Evaluation_Understanding/ directory. Then you can run the following scripts:

torchrun --master_port=19507 --nproc-per-node=8  run.py --data \
    MathVista_MINI MMBench_DEV_EN_V11 SEEDBench_IMG \
    MMStar POPE RealWorldQA MMMU_DEV_VAL ScienceQA_TEST  \
    AI2D_TEST OCRBench TextVQA_VAL ChartQA_TEST \
    --model Cheers --verbose 

Similarly, you can directly run the following script to perform the evaluation:

bash eval.sh

Please make sure to update the dataset path in eval.sh and the model path in vlmeval/config.py before running the script.

GenEval

Please follow the GenEval framework guidelines to set up the GenEval evaluation environment. The evaluation workspace is located in the Evaluation_GenEval/ directory. Before running the evaluation, please download the Mask2Former object detector and place it in models. Then you can run the following scripts:

bash generation/run_short.sh

or

bash generation/run_long.sh # for rewritten prompts

Use the following script to get the final score:

bash calculate.sh

DPGBench

Please follow the ELLA framework guidelines to set up the DPGBench evaluation environment. Before running the evaluation, please download the mPLUG model and place it in benchmarks/dpg/mplug_visual-question-answering_coco_large_en. Then you can run the following scripts:

bash Evaluation_DPGBench/scripts/dpg_gen.sh

Then:

bash Evaluation_DPGBench/scripts/dpg_eval.sh # Remember to replace the image folder

🧩 To-Do List

  • [x] Release the Inference Scripts and Checkpoints
  • [x] Release the Training Scripts using the VeOmni framework
  • [x] Release the Evaluation Scripts
  • [ ] Release the Training Data
  • [ ] Release Cheers v1.1 — maintaining strong understanding performance while further improving generation quality

🙏 Acknowledgement

This repo benefits from VeOmni, VLMEvalKit, GenEval, and ELLA. Thanks for their wonderful work.

📬 Contact

For any questions or collaborations, feel free to contact us : )

<p align="left"> 📧 <a href="mailto:guozonghao96@outlook.com">guozonghao96@outlook.com</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📧 <a href="mailto:yichen0zhang@gmail.com">yichen0zhang@gmail.com</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📧 <a href="mailto:MetaPDa@gmail.com">MetaPDa@gmail.com</a> </p>

📖 Citation

If you find Cheers useful, please consider citing our paper.
