GoT

Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

<div align="center"> <a href="https://github.com/rongyaofang/GoT"><img src="https://img.shields.io/badge/Project-Homepage-green" alt="Home"></a> <a href="https://arxiv.org/abs/2503.10639"><img src="https://img.shields.io/badge/ArXiv-2503.10639-red"></a>

Rongyao Fang<sup>1*</sup>, Chengqi Duan<sup>2*</sup>, Kun Wang<sup>3</sup>, Linjiang Huang<sup>6</sup>, Hao Li<sup>1,4</sup>, Shilin Yan, Hao Tian<sup>3</sup>, Xingyu Zeng<sup>3</sup>, Rui Zhao<sup>3</sup>, Jifeng Dai<sup>4,5</sup>, Xihui Liu<sup>2 :envelope:</sup>, Hongsheng Li<sup>1 :envelope:</sup>

<sup>1</sup>CUHK MMLab, <sup>2</sup>HKU MMLab, <sup>3</sup>SenseTime, <sup>4</sup>Shanghai AI Laboratory, <sup>5</sup>Tsinghua University, <sup>6</sup>Beihang University

*Equal contribution, :envelope:Corresponding authors

</div> <div align="center"> <img src="figures/teaser.jpg" width="100%" alt="GoT Framework" /> </div> <hr> <div align="center" style="line-height: 1.2;"> <a href="https://arxiv.org/abs/2503.10639" target="_blank"><b>Paper</b></a> • <a href="#introduction">Introduction</a> • <a href="#released-datasets">Datasets</a> • <a href="#released-model-got-framework">Model</a> • <a href="#results">Results</a> • <a href="https://huggingface.co/LucasFang/GoT-6B" target="_blank">🤗 Hugging Face</a> • <a href="#license">License</a> </div>

🔥 News

  • [2025-9-19] 📝 Our GoT paper has been accepted by NeurIPS 2025!
  • [2025-9-12] 🎉 We open-sourced our latest work FLUX-Reason-6M dataset! This high-quality text-to-image reasoning dataset was constructed using 15,000 A100 GPU days with FLUX generation. Check it out at FLUX-Reason-6M!

Introduction

We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images. This approach transforms conventional text-to-image generation and editing into a reasoning-guided framework that analyzes semantic relationships and spatial arrangements.

GoT pioneers a new direction for reasoning-driven visual generation and editing, producing images that better align with human intent through:

  • Semantic-Spatial Reasoning: Integrates both semantic understanding and explicit spatial coordinates
  • Unified Framework: Handles both image generation and editing with the same architecture
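To make the idea of a reasoning chain with explicit spatial coordinates concrete, the sketch below pulls object phrases and boxes out of a reasoning string. The chain text, the `[phrase] (x1,y1),(x2,y2)` notation, and the names `Box` and `extract_boxes` are illustrative assumptions, not the format GoT actually emits; consult the released datasets for the real annotation layout.

```python
import re
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


# Hypothetical notation: "[phrase] (x1,y1),(x2,y2)".
# This is an assumption for illustration, not the confirmed GoT output format.
_BOX_PATTERN = re.compile(
    r"\[([^\]]+)\]\s*\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)"
)


def extract_boxes(chain: str) -> List[Box]:
    """Return every labelled box found in a reasoning chain, in reading order."""
    boxes = []
    for match in _BOX_PATTERN.finditer(chain):
        label = match.group(1).strip()
        x1, y1, x2, y2 = (int(value) for value in match.groups()[1:])
        boxes.append(Box(label, x1, y1, x2, y2))
    return boxes


chain = (
    "A [red apple] (100,200),(300,400) sits to the left of "
    "a [glass vase] (500,150),(700,600) on a wooden table."
)
for box in extract_boxes(chain):
    print(box)
```

Keeping the semantic phrase and its coordinates together in one record is what lets a downstream module use both the description and the placement of each object.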

Released Datasets

| Dataset | Link | Amount |
|---------|------|--------|
| Laion-Aesthetics-High-Resolution-GoT | 🤗 HuggingFace | 3.77M |
| JourneyDB-GoT | 🤗 HuggingFace | 4.09M |
| OmniEdit-GoT | 🤗 HuggingFace | 736K |
| FLUX-Reason-6M | 🤗 HuggingFace | 6M |

Dataset Features

Laion-Aesthetics-High-Resolution-GoT

  • 3.77 million high-quality images from Laion-Aesthetics, filtered to sizes larger than 512 pixels
  • Prompts and GoT descriptions from Qwen2-VL
  • Prompts averaging 110.81 characters
  • GoT descriptions averaging 811.56 characters
  • 3.78 bounding boxes per image on average

JourneyDB-GoT

  • 4.09 million high-quality AI-generated images
  • Prompts and GoT descriptions from Qwen2-VL
  • Prompts averaging 149.78 characters
  • GoT descriptions averaging 906.01 characters
  • 4.09 bounding boxes per image on average
  • Please download the images from the JourneyDB dataset

OmniEdit-GoT

  • 736K high-quality image editing samples from OmniEdit
  • Diverse editing operations (addition, removal, swap, attribute changes, style transfer)
  • Detailed reasoning chains with step-by-step editing processes
  • Precise spatial coordinate annotations for editing regions
  • Please download the images from the OmniEdit dataset

FLUX-Reason-6M

  • 6 million high-quality text-to-image reasoning samples, constructed purely with FLUX generation
  • Built using 15,000 A100 GPU days for superior quality and reasoning capabilities
  • Comprehensive reasoning chains for complex visual generation tasks
  • Designed to enhance multimodal reasoning in visual generation models

Released Model: GoT Framework

| Model | Link | Architecture |
|------------|------|----------------------|
| GoT-6B | 🤗 HuggingFace | Qwen2.5-VL-3B + SDXL |

Model Features

<div align="center"> <img src="figures/architecture.jpg" width="100%" alt="GoT Architecture" /> </div>

Our GoT framework consists of two key components:

  1. Semantic-Spatial MLLM: Generates detailed reasoning chains with spatial information using Qwen2.5-VL as the backbone
  2. SSGM Diffusion Module: Leverages the semantic guidance, spatial layouts, and reference images to create high-quality visual outputs

The Semantic-Spatial Guidance Module (SSGM) combines three guidance pathways:

  • Semantic Guidance: Captures relationships and attributes
  • Spatial Guidance: Controls precise object placement
  • Reference Guidance: Provides context for editing tasks
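One way to picture the spatial guidance pathway is as a layout mask rasterized from the boxes in the reasoning chain. The sketch below is only an illustration: the function name `boxes_to_mask`, the 1000-unit coordinate range, and the binary grid are all assumptions, not the SSGM implementation.

```python
from typing import List, Sequence, Tuple

BoxT = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in layout coordinates


def boxes_to_mask(
    boxes: Sequence[BoxT], grid: int = 8, coord_range: int = 1000
) -> List[List[int]]:
    """Mark every grid cell that overlaps at least one box with 1.

    Assumption: coordinates run from 0 to coord_range on both axes.
    """
    mask = [[0] * grid for _ in range(grid)]
    for x1, y1, x2, y2 in boxes:
        # Floor the start cell and ceil the end cell so partial overlaps count.
        col_start = max(0, x1 * grid // coord_range)
        col_end = min(grid, (x2 * grid + coord_range - 1) // coord_range)
        row_start = max(0, y1 * grid // coord_range)
        row_end = min(grid, (y2 * grid + coord_range - 1) // coord_range)
        for row in range(row_start, row_end):
            for col in range(col_start, col_end):
                mask[row][col] = 1
    return mask


layout = boxes_to_mask([(100, 200, 300, 400), (500, 150, 700, 600)])
for row in layout:
    print("".join("#" if cell else "." for cell in row))
```

A coarse grid is used here only to keep the printout readable; a real pipeline would work at the resolution expected by the diffusion model.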

Results

Text-to-Image Generation

GoT achieves the highest overall score on the GenEval benchmark among the compared methods (0.64), leading on single-object generation (0.99) and counting (0.67) and tying SD-XL on colors (0.85):

<div align="center">

| Method | Architecture | Overall | Single Obj. | Two Obj. | Counting | Colors | Position | Attr. Binding |
|--------|--------------|---------|-------------|----------|----------|--------|----------|---------------|
| SD-XL | Unet+CLIP | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| SD3 | MMDIT+CLIP+T5 | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 |
| Emu3-Gen | Autoregressive | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 |
| Janus | Autoregressive | 0.61 | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 |
| JanusFlow | Autoregressive | 0.63 | 0.97 | 0.59 | 0.45 | 0.83 | 0.53 | 0.42 |
| GoT Framework | Unet+Qwen2.5-VL | 0.64 | 0.99 | 0.69 | 0.67 | 0.85 | 0.34 | 0.27 |

</div>
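The Overall column of the GenEval table can be read as the unweighted mean of the six task columns, assuming the usual GenEval convention. The sketch below recomputes it from the rounded values reported in this README; differences of up to 0.01 are expected because the per-task scores are already rounded. The helper name `mean_score` is illustrative.

```python
# Values copied from the GenEval table in this README:
# (reported overall, [single, two, counting, colors, position, attr. binding])
ROWS = {
    "SD-XL": (0.55, [0.98, 0.74, 0.39, 0.85, 0.15, 0.23]),
    "SD3": (0.62, [0.98, 0.74, 0.63, 0.67, 0.34, 0.36]),
    "Emu3-Gen": (0.54, [0.98, 0.71, 0.34, 0.81, 0.17, 0.21]),
    "Janus": (0.61, [0.97, 0.68, 0.30, 0.84, 0.46, 0.42]),
    "JanusFlow": (0.63, [0.97, 0.59, 0.45, 0.83, 0.53, 0.42]),
    "GoT Framework": (0.64, [0.99, 0.69, 0.67, 0.85, 0.34, 0.27]),
}


def mean_score(task_scores):
    """Unweighted mean of the per-task scores."""
    return sum(task_scores) / len(task_scores)


for name, (reported, tasks) in ROWS.items():
    print(f"{name:14s} reported={reported:.2f} recomputed={mean_score(tasks):.3f}")
```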

Image Editing

Our approach also scores highest among the compared methods on all four reported image editing metrics:

<div align="center">

| Method | Emu-Edit (CLIP-I) | Emu-Edit (CLIP-T) | ImagenHub (GPT-4o Eval.) | Reason-Edit (GPT-4o Eval.) |
|--------|-------------------|-------------------|--------------------------|----------------------------|
| IP2P | 0.834 | 0.219 | 0.308 | 0.286 |
| MagicBrush | 0.838 | 0.222 | 0.513 | 0.334 |
| SEED-X | 0.825 | 0.272 | 0.166 | 0.239 |
| CosXL-Edit | 0.860 | 0.274 | 0.464 | 0.325 |
| GoT Framework | 0.864 | 0.276 | 0.533 | 0.561 |

</div>

Usage

Dependencies

Installation

Clone the repository and install the required packages:

git clone git@github.com:rongyaofang/GoT.git
cd GoT
pip install -r requirements.txt

Model Weights

Place the required model weights in the ./pretrained directory as follows:

  1. GoT-6B model weights
  2. Qwen2.5-VL-3B-Instruct
  3. Stable Diffusion XL Base 1.0
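If the weights are fetched from the Hugging Face Hub, the `snapshot_download` function from the `huggingface_hub` package is one option. In the sketch below, `LucasFang/GoT-6B` is taken from the Hugging Face link in this README, while the repository IDs for Qwen2.5-VL-3B-Instruct and Stable Diffusion XL Base 1.0 are assumptions about where those checkpoints are hosted; verify them before downloading. The helper names `target_dirs` and `download_weights` are hypothetical.

```python
from pathlib import Path

# Local directory name -> Hub repository ID.
# Only the GoT-6B ID appears in this README; the other two are assumptions.
WEIGHTS = {
    "GoT-6B": "LucasFang/GoT-6B",
    "Qwen2.5-VL-3B-Instruct": "Qwen/Qwen2.5-VL-3B-Instruct",
    "stable-diffusion-xl-base-1.0": "stabilityai/stable-diffusion-xl-base-1.0",
}


def target_dirs(root: str = "pretrained") -> dict:
    """Map each Hub repository ID to the local directory it should land in."""
    return {repo_id: Path(root) / name for name, repo_id in WEIGHTS.items()}


def download_weights(root: str = "pretrained") -> None:
    """Download all three checkpoints (requires `pip install huggingface_hub`)."""
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in target_dirs(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))


# Print the plan only; call download_weights() to fetch the files.
for repo_id, local_dir in target_dirs().items():
    print(f"{repo_id} -> {local_dir}")
```

The download itself is left behind an explicit function call because the checkpoints are large.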

Your directory structure should match the following:

GoT
├── pretrained
│   ├── GoT-6B
│   ├── Qwen2.5-VL-3B-Instruct
│   └── stable-diffusion-xl-base-1.0
├── ...
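Before running inference, it can help to confirm that the layout above is in place. The following is a minimal sketch using only the standard library; the function name `missing_weight_dirs` is hypothetical, and the check looks only for the three directories, not for the files inside them.

```python
from pathlib import Path
from typing import List

# Directory names taken from the layout shown above.
EXPECTED = (
    "GoT-6B",
    "Qwen2.5-VL-3B-Instruct",
    "stable-diffusion-xl-base-1.0",
)


def missing_weight_dirs(repo_root: str = ".") -> List[str]:
    """Return the expected weight directories absent from <repo_root>/pretrained."""
    pretrained = Path(repo_root) / "pretrained"
    return [name for name in EXPECTED if not (pretrained / name).is_dir()]


missing = missing_weight_dirs()
if missing:
    print("Missing under ./pretrained:", ", ".join(missing))
else:
    print("All expected weight directories are present.")
```

Run it from the root of the cloned GoT repository, or pass the repository path as `repo_root`.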

Inference

Follow the instructions in the inference notebook.

License

This code is released under the MIT License.

Citation

If you find this work helpful, please consider citing:

@article{fang2025got,
  title={GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing},
  author={Fang, Rongyao and Duan, Chengqi and Wang, Kun and Huang, Linjiang and Li, Hao and Yan, Shilin and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Liu, Xihui and Li, Hongsheng},
  journal={arXiv preprint arXiv:2503.10639},
  year={2025}
}

Contact

If you have any questions, please raise an issue or contact us at rongyaofang@gmail.com.
