<div align="center">

Cheers <img src="fig/logo.png" width="25"> : Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation

<p align="center"> 🤗 <a href="https://huggingface.co/ai9stars/Cheers">Hugging Face</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://arxiv.org/abs/2603.12793">Paper</a>&nbsp;&nbsp; </p>

Yichen Zhang<sup>1*</sup>, Da Peng<sup>2*</sup>, Zonghao Guo<sup>1†</sup>, Zijian Zhang<sup>3</sup>, Xuesong Yang<sup>3</sup>,

Tong Sun<sup>3</sup>, Shichu Sun<sup>3</sup>, Yidan Zhang<sup>3</sup>, Yanghao Li<sup>1</sup>, Haiyan Zhao<sup>1</sup>, Wang Xu<sup>1</sup>,

Qi Shi<sup>1</sup>, Yangang Sun<sup>1</sup>, Chi Chen<sup>1</sup>, Shuo Wang<sup>1</sup>, Yukun Yan<sup>1</sup>, Xu Han<sup>1</sup>,

Qiang Ma<sup>1</sup>, Wei Ke<sup>2</sup>, Liang Wang<sup>3</sup>, Zhiyuan Liu<sup>1</sup>, Maosong Sun<sup>1</sup>

<sup>1</sup>Tsinghua University, <sup>2</sup>Xi'an Jiaotong University, <sup>3</sup>University of Chinese Academy of Sciences

* Equal contribution † Corresponding author

</div> <img src="fig/case.png" width="100%">

🔥 News

  • [2026/03/25] 🔥 Our training framework supports image editing training. Please refer to format.jsonl for data organization.
  • [2026/03/19] 🎉 Demo is now available on Hugging Face. Thanks to Prithiv Sakthi for setting it up!
  • [2026/03/16] 📢 The Cheers paper is officially released.
  • [2026/03/16] 🛠 We open-source the evaluation code and training pipeline. Our codebase is highly efficient: training on 3.8M samples takes only about two days on a single machine with 8×A100 GPUs.
  • [2026/03/16] 📦 The model checkpoints of Cheers are now available.

🌟 What is Cheers?

A recent cutting-edge topic in multimodal modeling is to unify visual comprehension and generation within a single model. However, the two tasks demand mismatched decoding regimes and visual representations, making joint optimization within a shared feature space non-trivial. In this work, we present Cheers, a unified multimodal model that decouples patch-level details from semantic representations, thereby stabilizing semantics for multimodal understanding and improving fidelity for image generation via gated detail residuals. Cheers includes three key components: (i) a unified vision tokenizer that encodes and compresses image latent states into semantic tokens for efficient LLM conditioning, (ii) an LLM-based Transformer that unifies autoregressive decoding for text generation and diffusion decoding for image generation, and (iii) a cascaded flow matching head that decodes visual semantics first and then injects semantically gated detail residuals from the vision tokenizer to refine high-frequency content. Experiments on popular benchmarks demonstrate that Cheers matches or surpasses advanced UMMs in both visual understanding and generation. Notably, Cheers outperforms Tar-1.5B on the popular GenEval and MMBench benchmarks while requiring only 20% of the training cost, indicating effective and efficient (4× token compression) unified multimodal modeling.
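To make the "semantically gated detail residual" idea concrete, here is a minimal conceptual sketch in PyTorch. All names and layer choices below are illustrative assumptions, not the actual Cheers implementation: a gate computed from the decoded semantic tokens decides where patch-level detail features are injected.

```python
import torch
import torch.nn as nn

class GatedDetailResidual(nn.Module):
    """Conceptual sketch (not the official Cheers code): fuse patch-level
    detail features into decoded semantics via a semantically derived gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, semantic: torch.Tensor, detail: torch.Tensor) -> torch.Tensor:
        # The gate lies in [0, 1] and is computed from the semantic tokens,
        # so detail is injected only where the semantics call for it.
        g = self.gate(semantic)
        return semantic + g * self.proj(detail)

fuse = GatedDetailResidual(64)
sem = torch.randn(2, 16, 64)   # decoded semantic tokens (batch, tokens, dim)
det = torch.randn(2, 16, 64)   # patch-level detail features from the tokenizer
out = fuse(sem, det)
print(out.shape)  # torch.Size([2, 16, 64])
```

The additive residual form keeps the semantic pathway intact when the gate saturates toward zero, which matches the paper's goal of stabilizing semantics for understanding while refining high-frequency content for generation.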

🧱 Model Architecture

<img src="fig/model.png" width="100%">

🚀 Quick Start

Set up a new virtual environment

conda create -n cheers python=3.11 -y
conda activate cheers
pip install -r requirements.txt

Inference

import os
import torch
from PIL import Image  # needed for image-understanding inputs
from torchvision.utils import save_image  # needed to save generated images
from transformers import AutoModelForCausalLM, AutoProcessor

ckpt = "Your Local Checkpoints Path"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = AutoProcessor.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(ckpt, trust_remote_code=True)
model = model.to(device=device, dtype=torch.bfloat16)
model.eval()

1️⃣ Text-to-image generation

content = "Your instruction."
images_batch = [None]

2️⃣ Image Understanding

content = "<im_start><image><im_end>\n Your instruction."
img = Image.open("image_path")
images_batch = [img,]

3️⃣ Text-only Question Answering

content = "Your instruction."
images_batch = [None]

Then run the following code:

messages_batch = [[{"role": "user", "content": content}]]
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True) for msg in messages_batch]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", add_im_start_id=False)
inputs = {k: (v.to(device=device) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

gen_config = {
    "max_length": 300,
    "cfg_scale": 9.5,  # if generation
    "temperature": 0.0,
    "num_inference_steps": 80,  # if generation
    "alpha": 0.5,  # if generation
    "edit_image": False,  # if generation
}

inputs.update(gen_config)
generated = model.generate(**inputs)
input_ids = generated["input_ids"]

os.makedirs("outputs", exist_ok=True)  # if generation
images = generated["images"][0]  # if generation
current_img = images[0]  # if generation
current_img = current_img.clamp(0.0, 1.0)  # if generation
save_image(current_img, "outputs/case_.png")  # if generation
print(processor.tokenizer.batch_decode(input_ids, skip_special_tokens=True)) # if understanding or text-only qa

Alternatively, you can directly run the code in Inference/ for a quick demo.
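The `cfg_scale` entry in `gen_config` suggests classifier-free guidance is used during image generation. The snippet below sketches the standard CFG combination rule as a general illustration; the exact rule inside Cheers' `generate()` may differ.

```python
import numpy as np

def cfg_combine(uncond_pred, cond_pred, cfg_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction along the conditional direction. General sketch only --
    not necessarily the exact formulation used inside Cheers."""
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)

uncond = np.array([0.1, 0.2])  # prediction with the prompt dropped
cond = np.array([0.3, 0.6])    # prediction with the prompt
# Larger cfg_scale (e.g. 9.5 in gen_config) pushes the output further
# toward what the prompt asks for, at some cost to diversity.
print(cfg_combine(uncond, cond, 9.5))
```

With `cfg_scale = 1.0` the rule reduces to the conditional prediction, and with `0.0` to the unconditional one, which is a quick way to sanity-check an implementation.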

🔧 Training

Please follow the VeOmni framework guidelines to set up the training environment. The training workspace is located in the Training/ directory. Alternatively, you can enter the Training/ directory and run pip install -e . to quickly set up the environment. Then run the following scripts:

bash train_align.sh # for alignment

or

bash train_sft.sh # for training all parameters except the VAE.

The training data format follows the template at Training/data/format.jsonl. Please also remember to update the training configuration in Training/configs/multimodal/cheers/und_gen_train/.
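As a rough illustration of what a JSON Lines training record might look like, the snippet below builds and serializes one sample. The authoritative schema is Training/data/format.jsonl; every field name here is a hypothetical placeholder, not the repo's actual format.

```python
import json

# Hypothetical record shape -- check Training/data/format.jsonl for the real
# schema; these field names are illustrative only.
sample = {
    "images": ["path/to/source.png"],
    "conversations": [
        {"role": "user", "content": "<image>\nDescribe this image."},
        {"role": "assistant", "content": "A red bicycle leaning against a wall."},
    ],
}

# JSONL stores one JSON object per line, so a dataset file is just
# many such lines appended together.
line = json.dumps(sample, ensure_ascii=False)
print(line)
```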

💡 After training, replace the config.json file in the output directory with Training/cheers_config/config.json to ensure correct evaluation, and also copy Training/cheers_config/modeling_umm.py into the output directory.

📊 Evaluation

Understanding

Please follow the VLMEvalKit framework guidelines to set up the evaluation environment. The evaluation workspace is located in the Evaluation_Understanding/ directory. Then you can run the following scripts:

torchrun --master_port=19507 --nproc-per-node=8  run.py --data \
    MathVista_MINI MMBench_DEV_EN_V11 SEEDBench_IMG \
    MMStar POPE RealWorldQA MMMU_DEV_VAL ScienceQA_TEST  \
    AI2D_TEST OCRBench TextVQA_VAL ChartQA_TEST \
    --model Cheers --verbose 

Similarly, you can directly run the following script to perform the evaluation:

bash eval.sh

Please make sure to update the dataset path in eval.sh and the model path in vlmeval/config.py before running the script.

GenEval

Please follow the GenEval framework guidelines to set up the GenEval evaluation environment. The evaluation workspace is located in the Evaluation_GenEval/ directory. Before running the evaluation, please download the Mask2Former object detector and place it in models. Then you can run the following scripts:

bash generation/run_short.sh

or

bash generation/run_long.sh # for rewritten prompts

Use the following script to get the final score:

bash calculate.sh

DPGBench

Please follow the ELLA framework guidelines to set up the DPGBench evaluation environment. Before running the evaluation, please download the mPLUG model and place it in benchmarks/dpg/mplug_visual-question-answering_coco_large_en. Then you can run the following scripts:

bash Evaluation_DPGBench/scripts/dpg_gen.sh

Then:

bash Evaluation_DPGBench/scripts/dpg_eval.sh # Remember to replace the image folder

🧩 To-Do List

  • [x] Release the Inference Scripts and Checkpoints
  • [x] Release the Training Scripts using the VeOmni framework
  • [x] Release the Evaluation Scripts
  • [ ] Release the Training Data
  • [ ] Release Cheers v1.1 — maintaining strong understanding performance while further improving generation quality

🙏 Acknowledgement

This repo benefits from VeOmni, VLMEvalKit, GenEval, and ELLA. Thanks for their wonderful work.

📬 Contact

For any questions or collaborations, feel free to contact us : )

<p align="left"> 📧 <a href="mailto:guozonghao96@outlook.com">guozonghao96@outlook.com</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📧 <a href="mailto:yichen0zhang@gmail.com">yichen0zhang@gmail.com</a> &nbsp;&nbsp; | &nbsp;&nbsp; 📧 <a href="mailto:MetaPDa@gmail.com">MetaPDa@gmail.com</a> </p>

📖 Citation

If you find Cheers useful, please consider citing our paper.
