InstanceDiffusion

[CVPR 2024] Code release for "InstanceDiffusion: Instance-level Control for Image Generation"

InstanceDiffusion: Instance-level Control for Image Generation

We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bounding boxes or intricate instance segmentation masks, and combinations thereof. Compared to the previous SOTA, InstanceDiffusion achieves 2.0 times higher AP50 for box inputs and 1.7 times higher IoU for mask inputs.

InstanceDiffusion: Instance-level Control for Image Generation
XuDong Wang, Trevor Darrell, Saketh Rambhatla, Rohit Girdhar, Ishan Misra
GenAI, Meta; BAIR, UC Berkeley
CVPR 2024

[project page] [arxiv] [PDF] [bibtex]

Disclaimer

This repository represents a re-implementation of InstanceDiffusion conducted by the first author during his time at UC Berkeley. Minor performance discrepancies may exist compared to the results reported in the original paper. The goal of this repository is to replicate the original paper's findings and insights, primarily for academic and research purposes.

Updates

01/01/2025 - InstanceDiffusion has been ported to the diffusers library (thanks to Kyeongryeol Go)! Please refer to the model card and the pull request for details.
02/25/2024 - InstanceDiffusion is ported into ComfyUI. Check out some cool video demos! (thanks to Tucker Darby)
02/21/2024 - Support flash attention, memory usage can be reduced by more than half.
02/19/2024 - Add PiM evaluation for scribble-/point-based image generation
02/10/2024 - Add model evaluation on attribute binding
02/09/2024 - Add model evaluation using the MSCOCO dataset
02/05/2024 - Initial commit. Stay tuned

Installation

Requirements

Linux or macOS with Python ≥ 3.8
PyTorch ≥ 2.0 and torchvision that matches the PyTorch installation. Install them together at pytorch.org to make sure of this.
OpenCV ≥ 4.6 is needed by demo and visualization.

Conda environment setup

conda create --name instdiff python=3.8 -y
conda activate instdiff

pip install -r requirements.txt

Training Data Generation

See Preparing Datasets for InstanceDiffusion.

Method Overview

InstanceDiffusion enhances text-to-image models by providing additional instance-level control. In addition to a global text prompt, InstanceDiffusion allows paired instance-level prompts and their locations (e.g. points, boxes, scribbles or instance masks) to be specified when generating images. We add our proposed learnable UniFusion blocks to handle the additional per-instance conditioning. UniFusion fuses the instance conditioning with the backbone and modulates its features to enable instance-conditioned image generation. Additionally, we propose ScaleU blocks that improve the UNet's ability to respect instance conditioning by rescaling the skip-connection and backbone feature maps produced in the UNet. At inference, we propose a Multi-instance Sampler, which reduces information leakage across multiple instances.
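The rescaling idea behind ScaleU can be illustrated with a toy example. This is only a sketch in plain Python: it assumes a learnable per-channel parameter s that is mapped to a scale factor tanh(s) + 1, which this README does not spell out, and the function name channel_scale is hypothetical. See the paper for the actual block.

```python
import math
from typing import List


def channel_scale(feature: List[List[float]], s: List[float]) -> List[List[float]]:
    """Scale channel c of `feature` by tanh(s[c]) + 1 (assumed form).

    `feature` is a list of channels, each a flat list of activations.
    With s[c] == 0 the factor is exactly 1, so the rescaling starts out
    as an identity; the factor always stays in the open range (0, 2).
    """
    if len(feature) != len(s):
        raise ValueError("need one scale parameter per channel")
    return [
        [value * (math.tanh(s_c) + 1.0) for value in channel]
        for channel, s_c in zip(feature, s)
    ]


backbone = [[1.0, 2.0], [3.0, 4.0]]           # two channels, two activations each
unchanged = channel_scale(backbone, [0.0, 0.0])  # identity at initialization
boosted = channel_scale(backbone, [20.0, 0.0])   # first channel scaled by ~2
```

In this sketch the same function would be applied separately to the backbone features and to the skip-connection features, each with its own parameter vector.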

Please check our paper and project page for more details.

InstanceDiffusion Inference Demos (w/ Diffusers)

InstanceDiffusion has been ported to the diffusers library (thanks to Kyeongryeol Go)! You can use the following commands to run InstanceDiffusion locally. Please refer to the model card for more details.

Install

git clone -b instancediffusion https://github.com/gokyeongryeol/diffusers.git
cd diffusers && pip install -e .

Example Usage

import torch
from diffusers import StableDiffusionINSTDIFFPipeline

pipe = StableDiffusionINSTDIFFPipeline.from_pretrained(
    "kyeongry/instancediffusion_sd15",
    # variant="fp16", torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a yellow American robin, brown Maltipoo dog, a gray British Shorthair in a stream, alongside with trees and rocks"
negative_prompt = "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality"

# normalized (xmin,ymin,xmax,ymax)
boxes = [
    [0.0, 0.099609375, 0.349609375, 0.548828125],
    [0.349609375, 0.19921875, 0.6484375, 0.498046875],
    [0.6484375, 0.19921875, 0.998046875, 0.697265625],
    [0.0, 0.69921875, 1.0, 0.998046875],
]
phrases = [
    "a gray British Shorthair standing on a rock in the woods",
    "a yellow American robin standing on the rock",
    "a brown Maltipoo dog standing on the rock",
    "a close up of a small waterfall in the woods",
]

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    instdiff_phrases=phrases,
    instdiff_boxes=boxes,
    instdiff_scheduled_sampling_alpha=0.8,  # proportion of using gated-self-attention
    instdiff_scheduled_sampling_beta=0.36,  # proportion of using multi-instance sampler
    guidance_scale=7.5,
    output_type="pil",
    num_inference_steps=50,
).images[0]

image.save("./instancediffusion-sd15-layout2image-generation.jpg")

InstanceDiffusion Inference Demos (w/ CLI)

If you want to run InstanceDiffusion demos locally, we provide inference.py. Please download the pretrained InstanceDiffusion checkpoint from Hugging Face or Google Drive, along with SD1.5, place them under the pretrained folder, and then run:

python inference.py \
  --num_images 8 \
  --output OUTPUT/ \
  --input_json demos/demo_cat_dog_robin.json \
  --ckpt pretrained/instancediffusion_sd15.pth \
  --test_config configs/test_box.yaml \
  --guidance_scale 7.5 \
  --alpha 0.8 \
  --seed 0 \
  --mis 0.36 \
  --cascade_strength 0.4

The JSON file passed as input_json specifies the text prompts and location conditions for generating images; several demo JSON files are available under the demos directory. The num_images parameter sets how many images to generate. The mis setting adjusts the proportion of timesteps that use the multi-instance sampler; we recommend keeping it below 0.4. A higher mis value can decrease information leakage between instances and improve image quality, but may also slow generation. The alpha setting adjusts the fraction of timesteps that use instance-level conditions: a higher alpha gives better adherence to the location conditions, potentially at the cost of image quality. The SDXL refiner is activated if cascade_strength is larger than 0. Note: the SDXL refiner was NOT used for the quantitative evaluations in the paper, but we recently found that it can improve image generation quality.
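To make the alpha and mis fractions concrete, here is a small arithmetic sketch. It assumes, for illustration only, that each fraction is converted to a whole number of sampling steps by rounding to the nearest integer; the helper name steps_using is hypothetical and is not part of inference.py, whose exact rounding may differ.

```python
def steps_using(fraction: float, num_steps: int) -> int:
    """Number of sampling steps covered by `fraction` of the schedule.

    Assumption: the count is round(fraction * num_steps).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return round(fraction * num_steps)


num_steps = 50
print(steps_using(0.8, num_steps))   # alpha = 0.8  -> 40 steps with instance conditions
print(steps_using(0.36, num_steps))  # mis = 0.36   -> 18 steps with the multi-instance sampler
print(steps_using(0.2, num_steps))   # mis = 0.2    -> 10 steps
```

Under these assumptions, raising mis from 0.2 to 0.36 on a 50-step schedule adds 8 steps that run the multi-instance sampler, which is where the extra generation time comes from.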

Our implementation supports Flash/Math/MemEfficient attention, utilizing PyTorch's torch.backends.cuda.sdp_kernel. To disable it, simply set efficient_attention: False in the configuration .yaml file.

The bounding box should follow the format [xmin, ymin, width, height]; note that this differs from the normalized (xmin, ymin, xmax, ymax) boxes used in the diffusers example above. The mask is expected in RLE (Run-Length Encoding) format. Scribbles should be specified as [x1, y1,..., x20, y20] and can contain duplicated points. A point is denoted by [x, y].
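The two box conventions and the fixed-length scribble format can be bridged with a couple of small helpers. This is a sketch under stated assumptions: it treats the [xmin, ymin, width, height] values as pixel coordinates on a canvas of known size (the README does not say whether they are pixels or normalized), and both function names are hypothetical, not part of the repository.

```python
from typing import List, Sequence, Tuple


def xywh_to_normalized_xyxy(
    box: Sequence[float], image_width: int, image_height: int
) -> Tuple[float, float, float, float]:
    """Convert [xmin, ymin, width, height] (assumed pixels) to
    normalized (xmin, ymin, xmax, ymax)."""
    xmin, ymin, width, height = box
    if width < 0 or height < 0:
        raise ValueError("width and height must be non-negative")
    return (
        xmin / image_width,
        ymin / image_height,
        (xmin + width) / image_width,
        (ymin + height) / image_height,
    )


def pad_scribble(
    points: Sequence[Tuple[float, float]], num_points: int = 20
) -> List[float]:
    """Flatten (x, y) points into [x1, y1, ..., x20, y20].

    Shorter scribbles are padded by repeating the last point, which
    relies on duplicated points being allowed.
    """
    if not points:
        raise ValueError("a scribble needs at least one point")
    if len(points) > num_points:
        raise ValueError(f"a scribble can hold at most {num_points} points")
    padded = list(points) + [points[-1]] * (num_points - len(points))
    return [coord for point in padded for coord in point]


box = xywh_to_normalized_xyxy([0, 51, 179, 230], 512, 512)
scribble = pad_scribble([(10, 20), (30, 40), (50, 60)])
```

With a 512x512 canvas, the pixel box [0, 51, 179, 230] maps to (0.0, 0.099609375, 0.349609375, 0.548828125), which is numerically the first box in the diffusers example.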

Let's Get Everybody Turning Heads!

InstanceDiffusion supports image compositions with granularity spanning from entire instances to parts and subparts. The positioning of parts/subparts can implicitly alter the overall pose of the object.

https://github.com/frank-xwang/InstanceDiffusion/assets/58996472/1c4205a5-c3c4-4605-9fbd-c7023d4a4768

python inference.py \
  --num_images 8 \
  --output OUTPUT/ \
  --input_json demos/eagle_left.json \
  --ckpt pretrained/instancediffusion_sd15.pth \
  --test_config configs/test_box.yaml \
  --guidance_scale 7.5 \
  --alpha 0.8 \
  --seed 0 \
  --mis 0.2 \
  --cascade_strength 0.4

Image Generation Using Single Points

InstanceDiffusion supports generating images from points (one point per instance) and the corresponding instance captions.

python inference.py \
  --num_images 8 \
  --output OUTPUT/ \
  --input_json demos/demo_corgi_kitchen.json \
  --ckpt pretrained/instancediffusion_sd15.pth \
  --test_config configs/test_point.yaml \
  --guidance_scale 7.5 \
  --alpha 0.8 \
  --seed 0 \
  --mis 0.2 \
  --cascade_strength 0.4

Iterative Image Generation

https://github.com/frank-xwang/InstanceDiffusion/assets/58996472/b161455a-6b21-4607-a59d-3a6dd19edab1

InstanceDiffusion can also support iterative image generation, with minimal changes to pre-generated instances and the overall scene. Using the identical initial noise and image caption, InstanceDiffusion can selectively introduce new in
