SkillAgentSearch skills...

DreamEngine

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think!

Install / Use

/learn @chenllliang/DreamEngine
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<h1 align="center"> DreamEngine</h1> <p align="center"> <a href="https://arxiv.org/abs/2502.20172"> <img alt="Static Badge" src="https://img.shields.io/badge/Paper-arXiv-red"> </a> <a href="https://huggingface.co/leonardPKU/DreamEngine-ObjectFusion"> <img alt="Static Badge" src="https://img.shields.io/badge/Model-HuggingFace-yellow"> </a> </p> <img width="898" alt="截屏2025-02-23 22 38 04" src="https://github.com/user-attachments/assets/cdeb00d3-2c6a-459a-b884-7946a51075b8" />

DreamEngine is a unified framework that integrates multimodal encoders like QwenVL with diffusion models through a two-stage training approach, enabling advanced text-image interleaved control and achieving state-of-the-art performance in generating images with complex, concept-merged inputs.

https://github.com/user-attachments/assets/f070c0e8-4ea2-4294-b878-63f564da1b79

Updates:

  • 2025-03-03: Release checkpoint and a demo for text-guided object fusion.

Run the Demo locally

bash setup.sh

# setup the paths in demo.py
python src/scripts/eval/demo.py

Model Structure

<img width="1136" alt="截屏2025-02-27 23 14 47" src="https://github.com/user-attachments/assets/ce5bf658-e571-440a-ad10-6b4e68fb6c4a" />

Training

<img width="1136" alt="截屏2025-02-27 23 15 16" src="https://github.com/user-attachments/assets/bb47b53b-e512-43b1-a791-ee066ee52927" />

Demos

<img width="1136" alt="截屏2025-02-27 23 15 03" src="https://github.com/user-attachments/assets/025af313-9360-4602-a4cc-78c7a4fb7137" /> <img width="1136" alt="截屏2025-02-27 23 15 24" src="https://github.com/user-attachments/assets/94fcf73f-3f1d-4baa-8a81-edb9d3fc3c42" /> <img width="1136" alt="截屏2025-02-27 23 15 30" src="https://github.com/user-attachments/assets/3e624f79-7d24-4b14-aa9d-46a8eee0c271" />

Citation

If you feel the work helpful, please kindly cite

@misc{chen2025multimodalrepresentationalignmentimage,
      title={Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think}, 
      author={Liang Chen and Shuai Bai and Wenhao Chai and Weichu Xie and Haozhe Zhao and Leon Vinci and Junyang Lin and Baobao Chang},
      year={2025},
      eprint={2502.20172},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20172}, 
}
View on GitHub
GitHub Stars122
CategoryContent
Updated16d ago
Forks4

Languages

Python

Security Score

80/100

Audited on Mar 15, 2026

No findings