# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
<div align="center">
  <a href="https://github.com/rongyaofang/GoT"><img src="https://img.shields.io/badge/Project-Homepage-green" alt="Home"></a>
  <a href="https://arxiv.org/abs/2503.10639"><img src="https://img.shields.io/badge/ArXiv-2503.10639-red"></a>

Rongyao Fang<sup>1*</sup>, Chengqi Duan<sup>2*</sup>, Kun Wang<sup>3</sup>, Linjiang Huang<sup>6</sup>, Hao Li<sup>1,4</sup>, Shilin Yan, Hao Tian<sup>3</sup>, Xingyu Zeng<sup>3</sup>, Rui Zhao<sup>3</sup>, Jifeng Dai<sup>4,5</sup>, Xihui Liu<sup>2 :envelope:</sup>, Hongsheng Li<sup>1 :envelope:</sup>
<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University
*Equal contribution, :envelope:Corresponding authors
</div>

<div align="center">
  <img src="figures/teaser.jpg" width="100%" alt="GoT Framework" />
</div>

<hr>

<div align="center" style="line-height: 1.2;">
  <a href="https://arxiv.org/abs/2503.10639" target="_blank"><b>Paper</b></a> •
  <a href="#introduction">Introduction</a> •
  <a href="#released-datasets">Datasets</a> •
  <a href="#released-model-got-framework">Model</a> •
  <a href="#results">Results</a> •
  <a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">🤗 Hugging Face</a> •
  <a href="#license">License</a>
</div>

## 🔥 News
- [2025-9-19] 📝 Our GoT paper has been accepted by NeurIPS 2025!
- [2025-9-12] 🎉 We open-sourced FLUX-Reason-6M, a high-quality text-to-image reasoning dataset constructed with 15,000 A100 GPU days of FLUX generation. Check it out at FLUX-Reason-6M!
## Introduction
We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:
- Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
- Unified Framework: Handles both image generation and editing with the same architecture
## Released Datasets
| Dataset | Link | Amount |
|---------|------|--------|
| Laion-Aesthetics-High-Resolution-GoT | 🤗 HuggingFace | 3.77M |
| JourneyDB-GoT | 🤗 HuggingFace | 4.09M |
| OmniEdit-GoT | 🤗 HuggingFace | 736K |
| FLUX-Reason-6M | 🤗 HuggingFace | 6M |
### Dataset Features

#### Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered for resolutions larger than 512 pixels
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average
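The per-dataset statistics above can be recomputed from the released annotation records. A minimal sketch, assuming each record exposes `prompt`, `got_description`, and `boxes` fields (field names are hypothetical and may differ in the actual files):

```python
def dataset_stats(records):
    """Average prompt length, GoT description length, and box count."""
    n = len(records)
    avg = lambda xs: sum(xs) / n
    return {
        "avg_prompt_chars": avg([len(r["prompt"]) for r in records]),
        "avg_got_chars": avg([len(r["got_description"]) for r in records]),
        "avg_boxes": avg([len(r["boxes"]) for r in records]),
    }

# Toy records for illustration only.
records = [
    {"prompt": "a red cube",
     "got_description": "The scene contains a red cube in the center.",
     "boxes": [[0, 0, 256, 256]]},
    {"prompt": "two dogs playing",
     "got_description": "Two dogs occupy the left and right halves of the frame.",
     "boxes": [[0, 0, 512, 512], [512, 0, 1024, 512]]},
]
stats = dataset_stats(records)
```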
#### JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from the JourneyDB dataset
#### OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from the OmniEdit dataset
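The spatial coordinate annotations for editing regions can be rasterized into a binary mask that restricts where an edit applies. A minimal sketch, assuming a box is given as `(x1, y1, x2, y2)` pixel coordinates (the actual annotation format in the release may differ):

```python
def box_to_mask(box, width, height):
    """Rasterize one (x1, y1, x2, y2) box into a row-major binary mask."""
    x1, y1, x2, y2 = box
    # Clamp to the image bounds so out-of-range boxes stay valid.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [
        [1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
        for y in range(height)
    ]

# A 2x2 editing region inside a 4x4 image.
mask = box_to_mask((1, 1, 3, 3), width=4, height=4)
```

Such a mask can then be resized to the diffusion model's latent resolution to gate the edit to the annotated region.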
#### FLUX-Reason-6M
- A 6-million-sample text-to-image reasoning dataset constructed purely with FLUX generation
- Built using 15,000 A100 GPU days for superior quality and reasoning capabilities
- Comprehensive reasoning chains for complex visual generation tasks
- Designed to enhance multimodal reasoning in visual generation models
## Released Model: GoT Framework
| Model | Link | Architecture |
|--------|------|----------------------|
| GoT-6B | 🤗 HuggingFace | Qwen2.5-VL-3B + SDXL |
### Model Features
<div align="center">
  <img src="figures/architecture.jpg" width="100%" alt="GoT Architecture" />
</div>

Our GoT framework consists of two key components:
- Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
- SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs
The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:
- Semantic Guidance: Captures relationships and attributes
- Spatial Guidance: Controls precise object placement
- Reference Guidance: Provides context for editing tasks
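Conceptually, the MLLM's reasoning chain interleaves text with grounded phrases and their coordinates, which the SSGM then consumes as spatial guidance. A minimal parsing sketch, assuming a hypothetical `<ref>phrase</ref><box>(x1,y1),(x2,y2)</box>` markup rather than the model's actual output format:

```python
import re

# Hypothetical markup for grounded phrases in the reasoning chain.
PATTERN = re.compile(
    r"<ref>(.+?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)

def parse_reasoning(chain):
    """Extract (phrase, (x1, y1, x2, y2)) pairs from a reasoning chain."""
    return [
        (m.group(1), tuple(int(m.group(i)) for i in range(2, 6)))
        for m in PATTERN.finditer(chain)
    ]

chain = ("A park scene. <ref>a red kite</ref><box>(120,40),(360,200)</box> "
         "flies above <ref>a child</ref><box>(200,300),(420,760)</box>.")
pairs = parse_reasoning(chain)
```

The extracted (phrase, box) pairs are the kind of semantic-plus-spatial signal the guidance pathways above describe.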
## Results

### Text-to-Image Generation
GoT achieves state-of-the-art performance on the GenEval benchmark, particularly excelling in composition tasks:
<div align="center">

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| GoT Framework | Unet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |
</div>

### Image Editing
Our approach also demonstrates superior performance on image editing benchmarks:
<div align="center">

| Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
|--------|-----------------|-----------------|------------------------|--------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |
</div>

## Usage
### Dependencies

- Python >= 3.8 (Anaconda recommended)
- PyTorch >= 2.0.1
- NVIDIA GPU with CUDA
### Installation

Clone the repo and install the dependencies:

```shell
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
```
### Model Weights
Place the required model weights in the ./pretrained directory as follows:
- GoT-6B model weights
- Qwen2.5-VL-3B-Instruct
- Stable Diffusion XL Base 1.0
Your directory structure should match the following:

```
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
├── ...
```
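Before running inference it can be useful to confirm the weights are in place; a small sketch checking for the directory names in the tree above:

```python
from pathlib import Path

REQUIRED = ["GoT-6B", "Qwen2.5-VL-3B-Instruct", "stable-diffusion-xl-base-1.0"]

def missing_weights(root="./pretrained"):
    """Return the names of required weight directories that are absent."""
    root = Path(root)
    return [name for name in REQUIRED if not (root / name).is_dir()]

missing = missing_weights()
if missing:
    print(f"Missing weights under ./pretrained: {', '.join(missing)}")
```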
### Inference

Follow the instructions in the inference notebook.
## License
This code is released under the MIT License.
## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{fang2025got,
  title={GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing},
  author={Fang, Rongyao and Duan, Chengqi and Wang, Kun and Huang, Linjiang and Li, Hao and Yan, Shilin and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Liu, Xihui and Li, Hongsheng},
  journal={arXiv preprint arXiv:2503.10639},
  year={2025}
}
```
## Contact
If you have any questions, please raise an issue or contact us at rongyaofang@gmail.com.
