InstanceDiffusion

[CVPR 2024] Code release for "InstanceDiffusion: Instance-level Control for Image Generation"

InstanceDiffusion: Instance-level Control for Image Generation

We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points, scribbles, bounding boxes or intricate instance segmentation masks, and combinations thereof. Compared to the previous SOTA, InstanceDiffusion achieves 2.0 times higher AP50 for box inputs and 1.7 times higher IoU for mask inputs.

<p align="center"> <img src='docs/teaser.jpg' align="center" > </p>

InstanceDiffusion: Instance-level Control for Image Generation
XuDong Wang, Trevor Darrell, Saketh Rambhatla, Rohit Girdhar, Ishan Misra
GenAI, Meta; BAIR, UC Berkeley
CVPR 2024

[project page] [arxiv] [PDF] [bibtex]

Disclaimer

This repository represents a re-implementation of InstanceDiffusion conducted by the first author during his time at UC Berkeley. Minor performance discrepancies may exist compared to the results reported in the original paper. The goal of this repository is to replicate the original paper's findings and insights, primarily for academic and research purposes.

Updates

  • 01/01/2025 - InstanceDiffusion has been ported to the diffusers library (thanks to Kyeongryeol Go)! Please refer to the model card and the pull request for details.
  • 02/25/2024 - InstanceDiffusion is ported into ComfyUI. Check out some cool video demos! (thanks to Tucker Darby)
  • 02/21/2024 - Support flash attention; memory usage can be reduced by more than half.
  • 02/19/2024 - Add PiM evaluation for scribble-/point-based image generation
  • 02/10/2024 - Add model evaluation on attribute binding
  • 02/09/2024 - Add model evaluation using the MSCOCO dataset
  • 02/05/2024 - Initial commit. Stay tuned

Installation

Requirements

  • Linux or macOS with Python ≥ 3.8
  • PyTorch ≥ 2.0 and torchvision that matches the PyTorch installation. Install them together at pytorch.org to make sure of this.
  • OpenCV ≥ 4.6 is needed by demo and visualization.

Conda environment setup

conda create --name instdiff python=3.8 -y
conda activate instdiff

pip install -r requirements.txt

Training Data Generation

See Preparing Datasets for InstanceDiffusion.

Method Overview

<p align="center"> <img src="docs/InstDiff-gif.gif" width=70%> </p> <p align="center"> <img src="docs/results.png" width=100%> </p>

InstanceDiffusion enhances text-to-image models by providing additional instance-level control. In addition to a global text prompt, InstanceDiffusion allows for paired instance-level prompts and their locations (e.g. points, boxes, scribbles or instance masks) to be specified when generating images. We add our proposed learnable UniFusion blocks to handle the additional per-instance conditioning. UniFusion fuses the instance conditioning with the backbone and modulates its features to enable instance-conditioned image generation. Additionally, we propose ScaleU blocks that improve the UNet’s ability to respect instance conditioning by rescaling the skip-connection and backbone feature maps produced in the UNet. At inference, we propose a Multi-instance Sampler, which reduces information leakage across multiple instances.

Please check our paper and project page for more details.

InstanceDiffusion Inference Demos (w/ Diffusers)

InstanceDiffusion has been ported to the diffusers library (thanks to Kyeongryeol Go)! You can use the following commands to run InstanceDiffusion locally. Please refer to the model card for more details.

Install

git clone -b instancediffusion https://github.com/gokyeongryeol/diffusers.git
cd diffusers && pip install -e .

Example Usage

import torch
from diffusers import StableDiffusionINSTDIFFPipeline

pipe = StableDiffusionINSTDIFFPipeline.from_pretrained(
    "kyeongry/instancediffusion_sd15",
    # variant="fp16", torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a yellow American robin, brown Maltipoo dog, a gray British Shorthair in a stream, alongside with trees and rocks"
negative_prompt = "longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality"

# normalized (xmin,ymin,xmax,ymax)
boxes = [
    [0.0, 0.099609375, 0.349609375, 0.548828125],
    [0.349609375, 0.19921875, 0.6484375, 0.498046875],
    [0.6484375, 0.19921875, 0.998046875, 0.697265625],
    [0.0, 0.69921875, 1.0, 0.998046875],
]
phrases = [
    "a gray British Shorthair standing on a rock in the woods",
    "a yellow American robin standing on the rock",
    "a brown Maltipoo dog standing on the rock",
    "a close up of a small waterfall in the woods",
]

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    instdiff_phrases=phrases,
    instdiff_boxes=boxes,
    instdiff_scheduled_sampling_alpha=0.8,  # proportion of using gated-self-attention
    instdiff_scheduled_sampling_beta=0.36,  # proportion of using multi-instance sampler
    guidance_scale=7.5,
    output_type="pil",
    num_inference_steps=50,
).images[0]

image.save("./instancediffusion-sd15-layout2image-generation.jpg")
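
The example above pairs each entry of `boxes` with the entry of `phrases` at the same index, so a mismatch in length or a box outside the normalized range is an easy mistake to make. The following is a minimal sketch of a sanity check that could be run before calling the pipeline. `check_layout` is a hypothetical helper written for illustration; it is not part of diffusers or this repository, and the pipeline may perform its own validation.

```python
def check_layout(boxes, phrases):
    """Check a layout before passing it to the pipeline (hypothetical helper).

    Assumes one phrase per box and normalized (xmin, ymin, xmax, ymax) boxes,
    as in the example above. Returns the number of instances.
    """
    if len(boxes) != len(phrases):
        raise ValueError(f"{len(boxes)} boxes but {len(phrases)} phrases")
    for index, box in enumerate(boxes):
        if len(box) != 4:
            raise ValueError(f"box {index} has {len(box)} values, expected 4")
        xmin, ymin, xmax, ymax = box
        # Each coordinate must lie in [0, 1] and the box must have positive area.
        if not (0.0 <= xmin < xmax <= 1.0 and 0.0 <= ymin < ymax <= 1.0):
            raise ValueError(f"box {index} is not a valid normalized box: {box}")
    return len(boxes)


if __name__ == "__main__":
    demo_boxes = [[0.0, 0.099609375, 0.349609375, 0.548828125]]
    demo_phrases = ["a gray British Shorthair standing on a rock in the woods"]
    print(check_layout(demo_boxes, demo_phrases))
```

Calling `check_layout(boxes, phrases)` right before `pipe(...)` turns a silent layout mistake into an explicit error.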

InstanceDiffusion Inference Demos (w/ CLI)

If you want to run InstanceDiffusion demos locally, we provide inference.py. Please download the pretrained InstanceDiffusion from Hugging Face or Google Drive and SD1.5, place them under the pretrained folder, and then run it with:

python inference.py \
  --num_images 8 \
  --output OUTPUT/ \
  --input_json demos/demo_cat_dog_robin.json \
  --ckpt pretrained/instancediffusion_sd15.pth \
  --test_config configs/test_box.yaml \
  --guidance_scale 7.5 \
  --alpha 0.8 \
  --seed 0 \
  --mis 0.36 \
  --cascade_strength 0.4

The JSON file input_json specifies text prompts and location conditions for generating images; several demo JSON files are available under the demos directory. The num_images parameter indicates how many images to generate. The mis setting adjusts the proportion of timesteps that use the multi-instance sampler and is recommended to be below 0.4. A higher mis value can decrease information leakage between instances and improve image quality, but may also slow the generation process. The alpha setting adjusts the fraction of timesteps that use instance-level conditions: a higher alpha gives better adherence to location conditions, at the potential cost of image quality. The SDXL refiner is activated if cascade_strength is larger than 0. Note: The SDXL-Refiner was NOT employed for quantitative evaluations in the paper, but we recently found that it can improve the image generation quality.
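
Because alpha and mis are both fractions of the sampling timesteps, their effect is easiest to judge as step counts. The sketch below does that arithmetic under two assumptions that are not taken from inference.py: a 50-step schedule (the value used in the diffusers example) and rounding to the nearest whole step. `conditioned_steps` is a hypothetical helper; which steps are actually selected is decided by the sampler.

```python
def conditioned_steps(num_steps, fraction):
    """Number of steps covered by a fraction of the schedule (hypothetical helper).

    Assumes rounding to the nearest whole step; the repository's own
    rounding rule may differ.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"fraction must be in [0, 1], got {fraction}")
    return round(num_steps * fraction)


if __name__ == "__main__":
    num_steps = 50  # assumed schedule length
    print("alpha=0.8 ->", conditioned_steps(num_steps, 0.8), "steps")  # 40 steps
    print("mis=0.36 ->", conditioned_steps(num_steps, 0.36), "steps")  # 18 steps
```

Under these assumptions, raising mis from 0.2 to 0.36 moves the multi-instance sampler from 10 to 18 of the 50 steps, which is where the extra generation time comes from.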

Our implementation supports Flash/Math/MemEfficient attention, utilizing PyTorch's torch.backends.cuda.sdp_kernel. To disable it, simply set efficient_attention: False in the configuration .yaml file.

The bounding box should follow the format [xmin, ymin, width, height]. The mask is expected in RLE (Run-Length Encoding) format. Scribbles should be specified as 20 points, [x1, y1, ..., x20, y20], and may contain duplicated points. A point is denoted by [x, y].
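
Note that this box format differs from the normalized (xmin, ymin, xmax, ymax) boxes used in the diffusers example. The sketch below shows one way to convert between the two and to pad a short scribble to 20 points. Both helpers are hypothetical and written for illustration. The conversion assumes the CLI boxes are in pixel units on a canvas of known size (512x512 here), which should be checked against the demo JSON files. Padding by repeating the last point relies only on the statement above that duplicated points are allowed.

```python
def xywh_to_normalized_xyxy(box, image_width, image_height):
    """Convert [xmin, ymin, width, height] to normalized [xmin, ymin, xmax, ymax].

    Assumes the input box is expressed in pixels (an assumption, see lead-in).
    """
    xmin, ymin, width, height = box
    return [
        xmin / image_width,
        ymin / image_height,
        (xmin + width) / image_width,
        (ymin + height) / image_height,
    ]


def pad_scribble(points, num_points=20):
    """Flatten (x, y) points to [x1, y1, ..., xN, yN], repeating the last point."""
    if not points:
        raise ValueError("a scribble needs at least one point")
    if len(points) > num_points:
        raise ValueError(f"a scribble may have at most {num_points} points")
    padded = list(points) + [points[-1]] * (num_points - len(points))
    return [coordinate for point in padded for coordinate in point]


if __name__ == "__main__":
    # Assumed 512x512 canvas.
    print(xywh_to_normalized_xyxy([0, 51, 179, 230], 512, 512))
    # -> [0.0, 0.099609375, 0.349609375, 0.548828125]
    print(len(pad_scribble([(10, 20), (30, 40), (50, 60)])))  # -> 40
```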

Let's Get Everybody Turning Heads!

InstanceDiffusion supports image compositions with granularity spanning from entire instances to parts and subparts. The positioning of parts/subparts can implicitly alter the overall pose of the object.

https://github.com/frank-xwang/InstanceDiffusion/assets/58996472/1c4205a5-c3c4-4605-9fbd-c7023d4a4768

python inference.py \
  --num_images 8 \
  --output OUTPUT/ \
  --input_json demos/eagle_left.json \
  --ckpt pretrained/instancediffusion_sd15.pth \
  --test_config configs/test_box.yaml \
  --guidance_scale 7.5 \
  --alpha 0.8 \
  --seed 0 \
  --mis 0.2 \
  --cascade_strength 0.4

Image Generation Using Single Points

InstanceDiffusion supports generating images from points (one point per instance) and the corresponding instance captions.

<p align="center"> <img src="docs/InstDiff-points.png" width=95%> </p>

python inference.py \
  --num_images 8 \
  --output OUTPUT/ \
  --input_json demos/demo_corgi_kitchen.json \
  --ckpt pretrained/instancediffusion_sd15.pth \
  --test_config configs/test_point.yaml \
  --guidance_scale 7.5 \
  --alpha 0.8 \
  --seed 0 \
  --mis 0.2 \
  --cascade_strength 0.4

Iterative Image Generation

https://github.com/frank-xwang/InstanceDiffusion/assets/58996472/b161455a-6b21-4607-a59d-3a6dd19edab1

InstanceDiffusion can also support iterative image generation, with minimal changes to pre-generated instances and the overall scene. Using the identical initial noise and image caption, InstanceDiffusion can selectively introduce new in
