UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
<p align="left"> <a href="https://arxiv.org/abs/2506.17202"> <img src="https://img.shields.io/badge/UniFork-Paper-red?logo=arxiv&logoColor=red" alt="UniFork Paper on arXiv" /> </a> </p>

Official implementation of UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
Teng Li, Quanfeng Lu, Lirui Zhao, Hao Li, Xizhou Zhu, Yu Qiao, Jun Zhang, Wenqi Shao
Updates
- [2025/06/20] We release the training, inference, and evaluation code of UniFork.
Introduction
This paper presents UniFork, a Y-shaped architecture for unified image generation and understanding:
- We analyze task-specific modality alignment patterns in expert models, highlighting the differing needs of image understanding and generation and providing insights for unified model design.
- We propose UniFork, a Y-shaped architecture that decouples task-specific learning in the later layers while retaining shared semantic representation learning in the early layers. This design enables cross-task learning and alleviates performance conflicts between tasks.
<img src="assets/method.png" alt="method" style="zoom: 20%;" />
Installation
Environment setup
git clone https://github.com/tliby/UniFork.git
cd UniFork
conda create -n unifork python=3.10
conda activate unifork
pip install -r requirements.txt
Download pretrained models for training
Our code is based on Qwen2.5-0.5B LLM and VILA-U-256 tokenizer. Please download the pretrained weights:
We provide a modified tokenizer configuration in configs/config.json that adjusts the size of the image head. Replace the default tokenizer config with this file before launching training.
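As a sketch, the swap can be done with a simple copy; `TOKENIZER_DIR` below is a placeholder for wherever you downloaded the VILA-U-256 tokenizer:

```shell
# Back up the tokenizer's shipped config, then swap in the modified one.
# TOKENIZER_DIR is a placeholder -- point it at your VILA-U-256 download.
TOKENIZER_DIR=/path/to/vila-u-256
cp "$TOKENIZER_DIR/config.json" "$TOKENIZER_DIR/config.json.bak"
cp configs/config.json "$TOKENIZER_DIR/config.json"
```

Keeping the `.bak` copy lets you restore the original tokenizer behavior if needed.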
Prepare training datasets
Stage 1 training of UniFork is conducted on the following datasets:
By default, our pipeline expects the annotations for each dataset to be organized as a folder containing .jsonl or .txt files. To use your own dataset, modify the dataset loading logic in unifork/train/data_utils.py.
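As a minimal sketch of this layout, the snippet below writes and reads back a one-file annotation folder. The field names (`image`, `caption`) are hypothetical; check unifork/train/data_utils.py for the schema the loader actually expects.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical annotation schema: one JSON object per line, holding an
# image path and its caption. The real field names may differ -- see
# unifork/train/data_utils.py.
records = [
    {"image": "images/000001.jpg", "caption": "a red bicycle leaning on a wall"},
    {"image": "images/000002.jpg", "caption": "two dogs playing in the snow"},
]

ann_dir = Path(tempfile.mkdtemp())       # stand-in for your annotation folder
ann_file = ann_dir / "train_part0.jsonl"
ann_file.write_text("\n".join(json.dumps(r) for r in records) + "\n")

# Loading mirrors what a custom loader would do: scan every .jsonl file
# in the folder and parse it line by line.
samples = []
for path in sorted(ann_dir.glob("*.jsonl")):
    with open(path) as f:
        samples.extend(json.loads(line) for line in f if line.strip())

print(len(samples))  # 2
```

A .txt-based dataset would follow the same pattern, with one caption (or tab-separated pair) per line instead of a JSON object.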
Training
We provide all the scripts in scripts/train. Assuming you have access to a SLURM cluster, you can run the following command to start training:
sbatch scripts/train/s1_imagenet.sh
Inference
Once the training is complete, you can run inference using the following command:
Image generation
python infer_t2i.py \
--model-path /path/to/model \
--prompt "your prompt"
Image understanding
python infer_mmu.py \
--model-path /path/to/model \
--image-path /path/to/your/image \
--query "your query"
Evaluation
Image generation
We provide sampling scripts for the MJHQ-30K and Geneval benchmarks. You need to download the annotation files: [Geneval prompt] [MJHQ-30K prompt]. Then run the following command:
python scripts/eval_gen/sample_geneval_batch.py \
--model-path /path/to/model \
--metadata-file geneval/<PROMPT_FOLDER>/evaluation_metadata.jsonl \
--outdir geneval/<IMAGE_FOLDER>
After generation, clone the [Geneval] repo and follow their instructions to compute accuracy-based metrics.
python scripts/eval_gen/sample_mjhq_batch.py \
--model-path /path/to/model \
--metadata-file mjhq-30k/meta_data.json \
--outdir output/generated_samples_mjhq
After generation, download the [MJHQ-30K images], clone the [pytorch-fid] repo and follow their instructions to compute the FID score.
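As a rough sketch, pytorch-fid can also be installed from pip and invoked as a module; the two directory paths below are placeholders for your local reference and generated-sample folders:

```shell
# Compare generated samples against the reference MJHQ-30K images.
# Both directory paths are placeholders for your local folders.
pip install pytorch-fid
python -m pytorch_fid mjhq-30k/reference_images output/generated_samples_mjhq
```

The command prints a single FID value; lower is better.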
Image Understanding
Our evaluation framework is based on the [LLaVA] codebase. We provide scripts on common benchmarks:
bash scripts/eval_und/mme.sh
bash scripts/eval_und/pope.sh
bash scripts/eval_und/seed.sh
bash scripts/eval_und/vqav2.sh
For evaluation on more benchmarks, we recommend integrating your model into [VLMEvalKit], a comprehensive evaluation toolkit for vision-language models.
Acknowledgement
Our code is built on LLaVA, LlamaGen and Qwen2.5. Thanks for their efforts!
BibTeX
@article{li2025unifork,
title={UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation},
author={Li, Teng and Lu, Quanfeng and Zhao, Lirui and Li, Hao and Zhu, Xizhou and Qiao, Yu and Zhang, Jun and Shao, Wenqi},
journal={arXiv preprint arXiv:2506.17202},
year={2025}
}