
ControlAR

[ICLR 2025] ControlAR: Controllable Image Generation with Autoregressive Models


<div align ="center"> <img src="./assets/logo.jpeg" width="20%"> <h1> ControlAR </h1> <h3> Controllable Image Generation with Autoregressive Models </h3>

Zongming Li<sup>1,*</sup>, Tianheng Cheng<sup>1,*</sup>, Shoufa Chen<sup>2</sup>, Peize Sun<sup>2</sup>, Haocheng Shen<sup>3</sup>, Longjin Ran<sup>3</sup>, Xiaoxin Chen<sup>3</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1,📧</sup>

<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> The University of Hong Kong, <sup>3</sup> vivo AI Lab

<b>ICLR 2025</b>

(* equal contribution, 📧 corresponding author)


</div> <div align="center"> <img src="./assets/vis.png"> </div>

News

[2025-01-23]: Our ControlAR has been accepted by ICLR 2025 🚀 !
[2024-12-12]: We introduce a control strength factor, employ a larger control encoder (DINOv2-base), and improve text alignment and generation diversity. New model weights: depth_base.safetensors and edge_base.safetensors. edge_base.safetensors handles three types of edges: Canny, HED, and Lineart.
[2024-10-31]: The code and models have been released!
[2024-10-04]: We have released the technical report of ControlAR. Code, models, and demos are coming soon!

Highlights

  • ControlAR explores an effective yet simple conditional decoding strategy for adding spatial controls to autoregressive models, e.g., LlamaGen, from a sequence perspective.

  • ControlAR supports arbitrary-resolution image generation with autoregressive models without hand-crafted special tokens or resolution-aware prompts.

TODO

Results

We provide both quantitative and qualitative comparisons with diffusion-based methods in the technical report!

<div align="center"> <img src="./assets/comparison.png"> </div>

Models

We have released text-to-image ControlAR checkpoints for different controls and settings, including arbitrary-resolution generation.

| AR Model | Type | Control encoder | Control | Arbitrary-Resolution | Checkpoint |
| :--------| :--: | :-------------: | :-----: | :------------------: | :--------: |
| LlamaGen-XL | t2i | DINOv2-small | Canny Edge | ✅ | ckpt |
| LlamaGen-XL | t2i | DINOv2-small | Depth | ✅ | ckpt |
| LlamaGen-XL | t2i | DINOv2-small | HED Edge | ❌ | ckpt |
| LlamaGen-XL | t2i | DINOv2-small | Seg. Mask | ❌ | ckpt |
| LlamaGen-XL | t2i | DINOv2-base | Edge (Canny, HED, Lineart) | ❌ | ckpt |
| LlamaGen-XL | t2i | DINOv2-base | Depth | ❌ | ckpt |

Getting Started

Installation

conda create -n ControlAR python=3.10
conda activate ControlAR
git clone https://github.com/hustvl/ControlAR.git
cd ControlAR
pip install torch==2.1.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip3 install -U openmim 
mim install mmengine 
mim install "mmcv==2.1.0"
pip3 install "mmsegmentation>=1.0.0"
pip3 install mmdet
git clone https://github.com/open-mmlab/mmsegmentation.git
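
After installation, a quick sanity check can confirm that the packages are visible to the interpreter. The sketch below is not part of the repository. It assumes the usual import names for the packages installed above (torch, mmengine, mmcv, mmseg, mmdet) and only checks that each module can be located, without importing it.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = ["torch", "mmengine", "mmcv", "mmseg", "mmdet"]


def find_missing_modules(names):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Run it inside the ControlAR conda environment; any name it reports points to an installation step that should be repeated.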

Pretrained Checkpoints for ControlAR

|tokenizer| text encoder |LlamaGen-B|LlamaGen-L|LlamaGen-XL|
|:-------:|:------------:|:--------:|:--------:|:---------:|
|vq_ds16_t2i.pt|flan-t5-xl|c2i_B_256.pt|c2i_L_256.pt|t2i_XL_512.pt|

We recommend storing them in the following structure:

|---checkpoints
      |---t2i
            |---canny/canny_MR.safetensors
            |---hed/hed.safetensors
            |---depth/depth_MR.safetensors
            |---seg/seg_cocostuff.safetensors
            |---edge_base.safetensors
            |---depth_base.safetensors
      |---t5-ckpt
            |---flan-t5-xl
                  |---config.json
                  |---pytorch_model-00001-of-00002.bin
                  |---pytorch_model-00002-of-00002.bin
                  |---pytorch_model.bin.index.json
                  |---tokenizer.json
      |---vq
            |---vq_ds16_c2i.pt
            |---vq_ds16_t2i.pt
      |---llamagen (Only necessary for training)
            |---c2i_B_256.pt
            |---c2i_L_256.pt
            |---t2i_XL_stage2_512.pt
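
Before sampling, it can help to check that the files are where the scripts expect them. The following is a minimal sketch, not a repository utility: the function name is hypothetical, the relative paths are copied from the layout above, and the llamagen files are left out because they are only needed for training.

```python
from pathlib import Path

# Relative paths copied from the recommended layout above (inference files only).
EXPECTED_FILES = [
    "t2i/canny/canny_MR.safetensors",
    "t2i/hed/hed.safetensors",
    "t2i/depth/depth_MR.safetensors",
    "t2i/seg/seg_cocostuff.safetensors",
    "t2i/edge_base.safetensors",
    "t2i/depth_base.safetensors",
    "t5-ckpt/flan-t5-xl/config.json",
    "vq/vq_ds16_c2i.pt",
    "vq/vq_ds16_t2i.pt",
]


def missing_checkpoints(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist as files under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


for rel in missing_checkpoints("checkpoints"):
    print("missing:", rel)
```

Only the checkpoints for the controls you plan to use are needed, so trim EXPECTED_FILES accordingly.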

Demo

Coming soon...

Sample & Generation

1. Class-to-image generation

python autoregressive/sample/sample_c2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt checkpoints/c2i/canny/LlamaGen-L.pt \
--gpt-model GPT-L --seed 0 --condition-type canny

2. Text-to-image generation

Generate an image using HED edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/hed/hed.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type hed --seed 0 --condition-path condition/example/t2i/multigen/eye.png

Generate an image using segmentation mask and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/cocostuff/doll.png \
--prompt 'A stuffed animal wearing a mask and a leash, sitting on a pink blanket'

3. Text-to-image generation with adjustable control strength

Generate an image using depth map and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type depth --seed 0 --condition-path condition/example/t2i/multigen/bird.jpg \
--prompt 'A bird made of blue crystal' \
--adapter-size base \
--control-strength 0.6

Generate an image using lineart edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/edge_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type lineart --seed 0 --condition-path condition/example/t2i/multigen/girl.jpg \
--prompt 'A girl with blue hair' \
--adapter-size base \
--control-strength 0.6

(To use a different edge type, change --condition-type from lineart to canny_base or hed.)
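
To compare several control strengths on the same input, the command above can be generated programmatically. This is a sketch under stated assumptions: the helper name is hypothetical, the flags are exactly those shown above, and the strength values are illustrative because the valid range is not stated here. It only prints the commands; pass each argument list to subprocess.run to execute it.

```python
import shlex


def build_sample_command(control_strength, condition_type="lineart",
                         gpt_ckpt="checkpoints/t2i/edge_base.safetensors",
                         condition_path="condition/example/t2i/multigen/girl.jpg",
                         prompt="A girl with blue hair", seed=0):
    """Assemble the sample_t2i.py argument list for one control strength."""
    return [
        "python", "autoregressive/sample/sample_t2i.py",
        "--vq-ckpt", "checkpoints/vq/vq_ds16_t2i.pt",
        "--gpt-ckpt", gpt_ckpt,
        "--gpt-model", "GPT-XL", "--image-size", "512",
        "--condition-type", condition_type, "--seed", str(seed),
        "--condition-path", condition_path,
        "--prompt", prompt,
        "--adapter-size", "base",
        "--control-strength", str(control_strength),
    ]


# Illustrative values only; 0.6 is the one used in the examples above.
for strength in (0.4, 0.6, 0.8):
    print(shlex.join(build_sample_command(strength)))
```

Building the command as a list rather than a single string keeps the prompt intact as one argument, even when it contains spaces or quotes.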

4. Arbitrary-resolution generation

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth/depth_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type depth --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/canny/canny_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type canny --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0
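
The number of image tokens an autoregressive model has to generate grows with the output resolution. As a rough sketch, assuming the tokenizer downsamples each side by a factor of 16 (inferred from the name vq_ds16_t2i.pt, not stated in this README), the token grid can be estimated as follows. The function name is hypothetical, and rejecting sizes that are not multiples of 16 is a simplification of this sketch.

```python
def image_token_count(height, width, downsample=16):
    """Estimate the token grid for an image, assuming `downsample`x reduction per side."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    rows, cols = height // downsample, width // downsample
    return rows, cols, rows * cols


print(image_token_count(512, 512))  # (32, 32, 1024)
print(image_token_count(768, 768))  # (48, 48, 2304)
```

Under this assumption, going from 512 to 768 pixels per side more than doubles the sequence length, which is worth keeping in mind for memory and sampling time.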

Preparing Datasets

We provide the dataset details for evaluation and training. If you don't want to train ControlAR, just download the validation splits.

1. Class-to-image

  • Download ImageNet and save it to data/imagenet/data.

2. Text-to-image

  • Download ADE20K with captions (~7 GB) and save the .parquet files to data/Captioned_ADE20K/data.
  • Download COCOStuff with captions (~62 GB) and save the .parquet files to data/Captioned_COCOStuff/data.
  • Download MultiGen-20M (~1.22 TB) and save the .parquet files to data/MultiGen20M/data.
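
Given the size of these downloads, a quick check that the .parquet files landed in the expected directories can save a failed run later. The sketch below is illustrative only: the function name is hypothetical, and it assumes the files sit directly inside each directory listed above rather than in nested subfolders.

```python
from pathlib import Path

# Target directories taken from the list above.
DATASET_DIRS = [
    "data/Captioned_ADE20K/data",
    "data/Captioned_COCOStuff/data",
    "data/MultiGen20M/data",
]


def count_parquet_files(directory):
    """Return the number of .parquet files directly inside `directory` (0 if it is absent)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for _ in directory.glob("*.parquet"))


for d in DATASET_DIRS:
    print(f"{d}: {count_parquet_files(d)} parquet file(s)")
```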

3. Preprocessing datasets

To save training time, we use the tokenizer to pre-process the images together with their text prompts before training.

  • ImageNet
bash scripts/autoregressive/extract_file_imagenet.sh \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--data-path data/imagenet/data/val \
--code-path data/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--ten-crop --crop-range 1.1 --image-size 256
