# UniToken

[CVPRW 2025] UniToken is an auto-regressive generation model that combines discrete and continuous representations to encode visual inputs, enabling it to handle both visual understanding and image generation tasks seamlessly.
Yang Jiao<sup>1,2</sup>, Haibo Qiu<sup>3</sup>, Zequn Jie<sup>3</sup>, Shaoxiang Chen<sup>3</sup>, Jingjing Chen<sup>1,2</sup>, <br> Lin Ma<sup>3</sup>, Yu-Gang Jiang<sup>1,2</sup>

<sup>1</sup>Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University <br> <sup>2</sup>Shanghai Collaborative Innovation Center on Intelligent Visual Computing <br> <sup>3</sup>Meituan
<a href='https://huggingface.co/OceanJay/UniToken-AnyRes-StageII'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a><br>
## 📣 News
- [2025-04-02] 🎉🎉🎉 UniToken paper is accepted to CVPR 2025 workshop! 🎉🎉🎉
- [2025-04-01] 🎉🎉🎉 We release the recaptioned text prompts of GenEval and T2I-Compbench! 🎉🎉🎉
- [2025-04-01] 🎉🎉🎉 UniToken paper and training codes are released! 🎉🎉🎉
## 🛠️ Installation

See [INSTALL.md](INSTALL.md) for detailed instructions.
## 🎓 Training

## 🤖 Inference
### Preparation
Download the original VQ-VAE weights, Lumina-mGPT-512, and SigLIP checkpoints, and place them in the following directory structure:
```
UniToken
- unitoken/
  - ckpts/
    - chameleon/
      - tokenizer/
        - text_tokenizer.json
        - vqgan.yaml
        - vqgan.ckpt
    - Lumina-mGPT-7B-512/
    - SigLIP/
- xllmx/
- ...
```
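Before running inference, it can save time to confirm the layout above is complete. The helper below is a hypothetical sketch (not part of the repo); the `REQUIRED` list mirrors the tokenizer files shown above, and the model directory names are taken from the tree:

```python
from pathlib import Path

# Required checkpoint paths, relative to the unitoken/ directory (from the tree above).
REQUIRED = [
    "ckpts/chameleon/tokenizer/text_tokenizer.json",
    "ckpts/chameleon/tokenizer/vqgan.yaml",
    "ckpts/chameleon/tokenizer/vqgan.ckpt",
    "ckpts/Lumina-mGPT-7B-512",
    "ckpts/SigLIP",
]

def missing_ckpts(root="unitoken"):
    """Return the required checkpoint paths that are absent under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]
```

If `missing_ckpts()` returns a non-empty list, re-download the corresponding files before proceeding.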
### Simple Inference

The simplest code for UniToken inference:
```python
from inference_solver_anyres import FlexARInferenceSolverAnyRes
from PIL import Image

# ******************** Image Generation ********************
inference_solver = FlexARInferenceSolverAnyRes(
    model_path="OceanJay/UniToken-AnyRes-StageII",
    precision="bf16",
    target_size=512,
)

q1 = "Generate an image according to the following prompt:\n" \
     "A majestic phoenix with fiery wings soaring above a tranquil mountain lake, " \
     "casting shimmering reflections on the water. Sparks and embers trail behind it " \
     "as the sky glows with hues of orange and gold."

# generated: tuple of (generated response, list of generated images)
generated = inference_solver.generate_img(
    images=[],
    qas=[[q1, None]],
    max_gen_len=1536,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=3.0, image_top_k=4000),
)
a1, new_image = generated[0], generated[1][0]

# ******************* Image Understanding ******************
inference_solver = FlexARInferenceSolverAnyRes(
    model_path="OceanJay/UniToken-AnyRes-StageII",
    precision="bf16",
    target_size=512,
)

# The "<|image|>" symbol will be replaced with a sequence of image tokens before being fed to the LLM.
q1 = "<|image|>Please describe the details of the image as much as possible."
images = [Image.open("../assets/1.png").convert('RGB')]
qas = [[q1, None]]

# `len(images)` should equal the number of occurrences of "<|image|>" in qas.
generated = inference_solver.generate(
    images=images,
    qas=qas,
    max_gen_len=512,
    temperature=1.0,
    logits_processor=inference_solver.create_logits_processor(cfg=4.0, image_top_k=2000),
)
a1 = generated[0]
# generated[1], the list of newly generated images, should typically be empty here.
```
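The example above requires that the number of `"<|image|>"` placeholders across the questions matches `len(images)`. A small sanity check, written as a hypothetical helper (not part of the repo), can catch mismatches before the call:

```python
def validate_qas(images, qas):
    """Check that the number of "<|image|>" placeholders in qas matches len(images).

    images: list of input images; qas: list of [question, answer] pairs,
    matching the format expected by the inference solver above.
    """
    n_placeholders = sum(q.count("<|image|>") for q, _ in qas)
    return n_placeholders == len(images)
```

For example, a pure text-to-image query (`images=[]`, no placeholder) and an understanding query (one image, one placeholder) both pass, while a placeholder without a corresponding image fails.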
## 🤗 Checkpoints

| Model | Huggingface |
| ------------ | ---------------------------------------------------------------- |
| UniToken-base-StageI | [OceanJay/UniToken-base-StageI](https://huggingface.co/OceanJay/UniToken-base-StageI) |
| UniToken-base-StageII | [OceanJay/UniToken-base-StageII](https://huggingface.co/OceanJay/UniToken-base-StageII) |
| UniToken-AnyRes-StageI | [OceanJay/UniToken-AnyRes-StageI](https://huggingface.co/OceanJay/UniToken-AnyRes-StageI) |
| UniToken-AnyRes-StageII | [OceanJay/UniToken-AnyRes-StageII](https://huggingface.co/OceanJay/UniToken-AnyRes-StageII) |
## 📚 Datasets

We have observed that existing text-to-image generation models struggle with the short text prompts used in benchmarks such as GenEval and T2I-Compbench++. To address this, we revised these prompts to be more descriptive and are sharing the enhanced version on Hugging Face. We encourage you to try it and see the improvements on your own model!
## 🙏 Acknowledgement

We sincerely appreciate Lumina-mGPT for providing high-quality training code, as well as Emu3 and Janus for releasing pretrained checkpoints for evaluation.
## 📄 Citation

```bibtex
@article{jiao2025unitoken,
  title={UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding},
  author={Jiao, Yang and Qiu, Haibo and Jie, Zequn and Chen, Shaoxiang and Chen, Jingjing and Ma, Lin and Jiang, Yu-Gang},
  journal={arXiv preprint arXiv:2504.04423},
  year={2025}
}
```
