# GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing
<div align="center">
  <a href="https://github.com/rongyaofang/GoT"><img src="https://img.shields.io/badge/Project-Homepage-green" alt="Home"></a>
  <a href="https://arxiv.org/abs/2503.10639"><img src="https://img.shields.io/badge/ArXiv-2503.10639-red"></a>

Rongyao Fang<sup>1*</sup>, Chengqi Duan<sup>2*</sup>, Kun Wang<sup>3</sup>, Linjiang Huang<sup>6</sup>, Hao Li<sup>1,4</sup>, Shilin Yan, Hao Tian<sup>3</sup>, Xingyu Zeng<sup>3</sup>, Rui Zhao<sup>3</sup>, Jifeng Dai<sup>4,5</sup>, Xihui Liu<sup>2 :envelope:</sup>, Hongsheng Li<sup>1 :envelope:</sup>
<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University
*Equal contribution, :envelope:Corresponding authors
</div>

<div align="center">
  <img src="figures/teaser.jpg" width="100%" alt="GoT Framework" />
</div>

<hr>

<div align="center" style="line-height: 1.2;">
  <a href="https://arxiv.org/abs/2503.10639" target="_blank"><b>Paper</b></a> •
  <a href="#introduction">Introduction</a> •
  <a href="#released-datasets">Datasets</a> •
  <a href="#released-model-got-framework">Model</a> •
  <a href="#results">Results</a> •
  <a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">🤗 Hugging Face</a> •
  <a href="#license">License</a>
</div>

## 🔥 News
- [2025-9-19] 📝 Our GoT paper has been accepted by NeurIPS 2025!
- [2025-9-12] 🎉 We open-sourced FLUX-Reason-6M, a high-quality text-to-image reasoning dataset constructed with 15,000 A100 GPU days of FLUX generation. Check it out at FLUX-Reason-6M!
## Introduction
We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.
GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:
- Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
- Unified Framework: Handles both image generation and editing with the same architecture
## Released Datasets
| Dataset | Link | Amount |
|---------|------|--------|
| Laion-Aesthetics-High-Resolution-GoT | 🤗 HuggingFace | 3.77M |
| JourneyDB-GoT | 🤗 HuggingFace | 4.09M |
| OmniEdit-GoT | 🤗 HuggingFace | 736K |
| FLUX-Reason-6M | 🤗 HuggingFace | 6M |
### Dataset Features

#### Laion-Aesthetics-High-Resolution-GoT
- 3.77 million high-quality images from Laion-Aesthetics, filtered for resolutions larger than 512 pixels
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 110.81 characters
- GoT descriptions averaging 811.56 characters
- 3.78 bounding boxes per image on average
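The per-dataset statistics above can be recomputed from the released annotation records. A minimal sketch, assuming each record exposes `prompt`, `got_description`, and `boxes` fields (field names are hypothetical and may differ in the actual files):

```python
def dataset_stats(records):
    """Average prompt length, GoT description length, and box count."""
    n = len(records)
    avg = lambda xs: sum(xs) / n
    return {
        "avg_prompt_chars": avg([len(r["prompt"]) for r in records]),
        "avg_got_chars": avg([len(r["got_description"]) for r in records]),
        "avg_boxes": avg([len(r["boxes"]) for r in records]),
    }

# Toy records for illustration only.
records = [
    {"prompt": "a red cube",
     "got_description": "The scene contains a red cube in the center.",
     "boxes": [[0, 0, 256, 256]]},
    {"prompt": "two dogs playing",
     "got_description": "Two dogs occupy the left and right halves of the frame.",
     "boxes": [[0, 0, 512, 512], [512, 0, 1024, 512]]},
]
stats = dataset_stats(records)
```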
#### JourneyDB-GoT
- 4.09 million high-quality AI-generated images
- Prompts and GoT descriptions from Qwen2-VL
- Prompts averaging 149.78 characters
- GoT descriptions averaging 906.01 characters
- 4.09 bounding boxes per image on average
- Please download the images from the JourneyDB dataset
#### OmniEdit-GoT
- 736K high-quality image editing samples from OmniEdit
- Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
- Detailed reasoning chains with step-by-step editing processes
- Precise spatial coordinate annotations for editing regions
- Please download the images from the OmniEdit dataset
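The spatial coordinate annotations for editing regions can be rasterized into a binary mask that restricts where an edit applies. A minimal sketch, assuming a box is given as `(x1, y1, x2, y2)` pixel coordinates (the actual annotation format in the release may differ):

```python
def box_to_mask(box, width, height):
    """Rasterize one (x1, y1, x2, y2) box into a row-major binary mask."""
    x1, y1, x2, y2 = box
    # Clamp to the image bounds so out-of-range boxes stay valid.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(width, x2), min(height, y2)
    return [
        [1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
        for y in range(height)
    ]

# A 2x2 editing region inside a 4x4 image.
mask = box_to_mask((1, 1, 3, 3), width=4, height=4)
```

Such a mask can then be resized to the diffusion model's latent resolution to gate the edit to the annotated region.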
#### FLUX-Reason-6M
- A 6-million-sample text-to-image reasoning dataset constructed purely with FLUX generation
- Built using 15,000 A100 GPU days for superior quality and reasoning capabilities
- Comprehensive reasoning chains for complex visual generation tasks
- Designed to enhance multimodal reasoning in visual generation models
## Released Model: GoT Framework
| Model | Link | Architecture |
|--------|------|----------------------|
| GoT-6B | 🤗 HuggingFace | Qwen2.5-VL-3B + SDXL |
### Model Features
<div align="center">
  <img src="figures/architecture.jpg" width="100%" alt="GoT Architecture" />
</div>

Our GoT framework consists of two key components:
- Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
- SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs
The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:
- Semantic Guidance: Captures relationships and attributes
- Spatial Guidance: Controls precise object placement
- Reference Guidance: Provides context for editing tasks
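Conceptually, the MLLM's reasoning chain interleaves text with grounded phrases and their coordinates, which the SSGM then consumes as spatial guidance. A minimal parsing sketch, assuming a hypothetical `<ref>phrase</ref><box>(x1,y1),(x2,y2)</box>` markup rather than the model's actual output format:

```python
import re

# Hypothetical markup for grounded phrases in the reasoning chain.
PATTERN = re.compile(
    r"<ref>(.+?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)

def parse_reasoning(chain):
    """Extract (phrase, (x1, y1, x2, y2)) pairs from a reasoning chain."""
    return [
        (m.group(1), tuple(int(m.group(i)) for i in range(2, 6)))
        for m in PATTERN.finditer(chain)
    ]

chain = ("A park scene. <ref>a red kite</ref><box>(120,40),(360,200)</box> "
         "flies above <ref>a child</ref><box>(200,300),(420,760)</box>.")
pairs = parse_reasoning(chain)
```

The extracted (phrase, box) pairs are the kind of semantic-plus-spatial signal the guidance pathways above describe.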
## Results

### Text-to-Image Generation
GoT achieves state-of-the-art performance on the GenEval benchmark, particularly excelling in composition tasks:
<div align="center">

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| GoT Framework | Unet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |
</div>

### Image Editing
Our approach also demonstrates superior performance on image editing benchmarks:
<div align="center">

| Method | Emu-Edit CLIP-I | Emu-Edit CLIP-T | ImagenHub GPT-4o Eval. | Reason-Edit GPT-4o Eval. |
|--------|-----------------|-----------------|------------------------|--------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |
</div>

## Usage
### Dependencies

- Python >= 3.8 (Anaconda recommended)
- PyTorch >= 2.0.1
- NVIDIA GPU with CUDA
### Installation

Clone the repo and install the dependencies:

```shell
git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt
```
### Model Weights
Place the required model weights in the ./pretrained directory as follows:
- GoT-6B model weights
- Qwen2.5-VL-3B-Instruct
- Stable Diffusion XL Base 1.0
Your directory structure should match the following:

```
GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
├── ...
```
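Before running inference it can be useful to confirm the weights are in place; a small sketch checking for the directory names in the tree above:

```python
from pathlib import Path

REQUIRED = ["GoT-6B", "Qwen2.5-VL-3B-Instruct", "stable-diffusion-xl-base-1.0"]

def missing_weights(root="./pretrained"):
    """Return the names of required weight directories that are absent."""
    root = Path(root)
    return [name for name in REQUIRED if not (root / name).is_dir()]

missing = missing_weights()
if missing:
    print(f"Missing weights under ./pretrained: {', '.join(missing)}")
```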
### Inference

Follow the instructions in the inference notebook.
## License
This code is released under the MIT License.
## Citation

If you find this work helpful, please consider citing:

```bibtex
@article{fang2025got,
  title={GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing},
  author={Fang, Rongyao and Duan, Chengqi and Wang, Kun and Huang, Linjiang and Li, Hao and Yan, Shilin and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Liu, Xihui and Li, Hongsheng},
  journal={arXiv preprint arXiv:2503.10639},
  year={2025}
}
```
## Contact
If you have any questions, please raise an issue or contact us at rongyaofang@gmail.com.
