ControlAR
[ICLR 2025] ControlAR: Controllable Image Generation with Autoregressive Models
Zongming Li<sup>1,*</sup>, Tianheng Cheng<sup>1,*</sup>, Shoufa Chen<sup>2</sup>, Peize Sun<sup>2</sup>, Haocheng Shen<sup>3</sup>, Longjin Ran<sup>3</sup>, Xiaoxin Chen<sup>3</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1,📧</sup>
<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> The University of Hong Kong, <sup>3</sup> vivo AI Lab
<b>ICLR 2025</b>
(* equal contribution, 📧 corresponding author)
</div>
<div align="center"> <img src="./assets/vis.png"> </div>

News
[2025-01-23]: Our ControlAR has been accepted by ICLR 2025 🚀 !
[2024-12-12]: We introduce a control strength factor, employ a larger control encoder (DINOv2-base), and improve text alignment and generation diversity. New model weights: depth_base.safetensors and edge_base.safetensors. edge_base.safetensors handles three types of edges: Canny, HED, and Lineart.
[2024-10-31]: The code and models have been released!
[2024-10-04]: We have released the technical report of ControlAR. Code, models, and demos are coming soon!
Highlights
- ControlAR explores an effective yet simple conditional decoding strategy for adding spatial controls to autoregressive models, e.g., LlamaGen, from a sequence perspective.
- ControlAR supports arbitrary-resolution image generation with autoregressive models without hand-crafted special tokens or resolution-aware prompts.
TODO
- [x] release code & models.
- [x] release demo code and HuggingFace demo: HuggingFace Spaces 🤗
Results
We provide both quantitative and qualitative comparisons with diffusion-based methods in the technical report!
<div align="center"> <img src="./assets/comparison.png"> </div>

Models
We have released text-to-image ControlAR checkpoints for several control types and settings, including arbitrary-resolution generation.
| AR Model | Type | Control encoder | Control | Arbitrary-Resolution | Checkpoint |
| :--------| :--: | :-------------: | :-----: | :------------------: | :--------: |
| LlamaGen-XL | t2i | DINOv2-small | Canny Edge | ✅ | ckpt |
| LlamaGen-XL | t2i | DINOv2-small | Depth | ✅ | ckpt |
| LlamaGen-XL | t2i | DINOv2-small | HED Edge | ❌ | ckpt |
| LlamaGen-XL | t2i | DINOv2-small | Seg. Mask | ❌ | ckpt |
| LlamaGen-XL | t2i | DINOv2-base | Edge (Canny, HED, Lineart) | ❌ | ckpt |
| LlamaGen-XL | t2i | DINOv2-base | Depth | ❌ | ckpt |
Getting Started
Installation
conda create -n ControlAR python=3.10
conda activate ControlAR
git clone https://github.com/hustvl/ControlAR.git
cd ControlAR
pip install torch==2.1.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip3 install -U openmim
mim install mmengine
mim install "mmcv==2.1.0"
pip3 install "mmsegmentation>=1.0.0"
pip3 install mmdet
git clone https://github.com/open-mmlab/mmsegmentation.git
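After running the install commands, it can help to confirm that the pinned packages actually landed in the active environment. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `check_environment` are hypothetical, and only the exact pins from the commands above (torch and mmcv) are compared. The `>=1.0.0` lower bound on mmsegmentation is not checked.

```python
from importlib import metadata

# Exact pins taken from the install commands above.
# Packages without an exact pin map to None (presence is checked, version is not).
EXPECTED = {
    "torch": "2.1.2+cu118",
    "mmcv": "2.1.0",
    "mmengine": None,
    "mmsegmentation": None,
    "mmdet": None,
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(expected=EXPECTED, lookup=installed_version):
    """Return a list of human-readable problems; an empty list means no issues."""
    problems = []
    for package, pinned in expected.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed")
        elif pinned is not None and found != pinned:
            problems.append(f"{package}: found {found}, expected {pinned}")
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, every listed package is present and the two exact pins match.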
Pretrained Checkpoints for ControlAR
|tokenizer| text encoder |LlamaGen-B|LlamaGen-L|LlamaGen-XL|
|:-------:|:------------:|:--------:|:--------:|:---------:|
|vq_ds16_t2i.pt|flan-t5-xl|c2i_B_256.pt|c2i_L_256.pt|t2i_XL_512.pt|
We recommend storing them in the following structure:
|---checkpoints
      |---t2i
            |---canny/canny_MR.safetensors
            |---hed/hed.safetensors
            |---depth/depth_MR.safetensors
            |---seg/seg_cocostuff.safetensors
            |---edge_base.safetensors
            |---depth_base.safetensors
      |---t5-ckpt
            |---flan-t5-xl
                  |---config.json
                  |---pytorch_model-00001-of-00002.bin
                  |---pytorch_model-00002-of-00002.bin
                  |---pytorch_model.bin.index.json
                  |---tokenizer.json
      |---vq
            |---vq_ds16_c2i.pt
            |---vq_ds16_t2i.pt
      |---llamagen (Only necessary for training)
            |---c2i_B_256.pt
            |---c2i_L_256.pt
            |---t2i_XL_stage2_512.pt
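A quick way to catch a misplaced download is to compare the `checkpoints` directory against the recommended layout. This is a minimal sketch, not a script from the repository. It assumes the nesting implied by the listing above (for example, that `flan-t5-xl` sits inside `t5-ckpt`), and the helper name `missing_checkpoints` is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the recommended layout above.
INFERENCE_FILES = [
    "t2i/canny/canny_MR.safetensors",
    "t2i/hed/hed.safetensors",
    "t2i/depth/depth_MR.safetensors",
    "t2i/seg/seg_cocostuff.safetensors",
    "t2i/edge_base.safetensors",
    "t2i/depth_base.safetensors",
    "t5-ckpt/flan-t5-xl/config.json",
    "vq/vq_ds16_c2i.pt",
    "vq/vq_ds16_t2i.pt",
]
# The llamagen weights are listed as only necessary for training.
TRAINING_FILES = [
    "llamagen/c2i_B_256.pt",
    "llamagen/c2i_L_256.pt",
    "llamagen/t2i_XL_stage2_512.pt",
]


def missing_checkpoints(root="checkpoints", training=False):
    """Return the relative paths from the layout that are not files under root."""
    root = Path(root)
    wanted = INFERENCE_FILES + (TRAINING_FILES if training else [])
    return [rel for rel in wanted if not (root / rel).is_file()]


for rel in missing_checkpoints():
    print(f"missing: checkpoints/{rel}")
```

Pass `training=True` to include the LlamaGen weights in the check.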
Demo
Coming soon...
Sample & Generation
1. Class-to-image generation
python autoregressive/sample/sample_c2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt checkpoints/c2i/canny/LlamaGen-L.pt \
--gpt-model GPT-L --seed 0 --condition-type canny
2. Text-to-image generation
Generate an image using HED edge and text-to-image ControlAR:
python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/hed/hed.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type hed --seed 0 --condition-path condition/example/t2i/multigen/eye.png
Generate an image using segmentation mask and text-to-image ControlAR:
python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/cocostuff/doll.png \
--prompt 'A stuffed animal wearing a mask and a leash, sitting on a pink blanket'
3. Text-to-image generation with adjustable control strength
Generate an image using depth map and text-to-image ControlAR:
python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type depth --seed 0 --condition-path condition/example/t2i/multigen/bird.jpg \
--prompt 'A bird made of blue crystal' \
--adapter-size base \
--control-strength 0.6
Generate an image using lineart edge and text-to-image ControlAR:
python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/edge_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type lineart --seed 0 --condition-path condition/example/t2i/multigen/girl.jpg \
--prompt 'A girl with blue hair' \
--adapter-size base \
--control-strength 0.6
(You can change lineart to canny_base or hed.)
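When comparing several edge types or control strengths, assembling the command in a small helper avoids copy-paste mistakes. The sketch below is illustrative only: `build_sample_t2i_command` is a hypothetical helper, and it uses only the script path and flags that appear in the commands above. The default values mirror those examples and are not documented defaults of the script.

```python
import shlex


def build_sample_t2i_command(
    gpt_ckpt,
    condition_type,
    condition_path,
    prompt,
    control_strength=0.6,
    adapter_size="base",
    image_size=512,
    seed=0,
    vq_ckpt="checkpoints/vq/vq_ds16_t2i.pt",
):
    """Assemble the argument list for autoregressive/sample/sample_t2i.py."""
    return [
        "python", "autoregressive/sample/sample_t2i.py",
        "--vq-ckpt", vq_ckpt,
        "--gpt-ckpt", gpt_ckpt,
        "--gpt-model", "GPT-XL",
        "--image-size", str(image_size),
        "--condition-type", condition_type,
        "--seed", str(seed),
        "--condition-path", condition_path,
        "--prompt", prompt,
        "--adapter-size", adapter_size,
        "--control-strength", str(control_strength),
    ]


# One command per condition type that the lineart example says can be swapped in.
for edge_type in ("lineart", "canny_base", "hed"):
    command = build_sample_t2i_command(
        gpt_ckpt="checkpoints/t2i/edge_base.safetensors",
        condition_type=edge_type,
        condition_path="condition/example/t2i/multigen/girl.jpg",
        prompt="A girl with blue hair",
    )
    print(shlex.join(command))
```

To run a command rather than print it, pass the list to `subprocess.run(command, check=True)` from the repository root.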
4. Arbitrary-resolution generation
python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth/depth_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type depth --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0
python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/canny/canny_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type canny --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0
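Changing `--image-size` changes how many image tokens the autoregressive model has to generate. The sketch below estimates that count under two assumptions that are not stated in this README: that "ds16" in `vq_ds16_t2i.pt` means the tokenizer downsamples each spatial dimension by 16, and that the image side lengths are multiples of 16. The helper `token_grid` is hypothetical.

```python
# Assumption: "ds16" in the tokenizer name means 16x spatial downsampling.
DOWNSAMPLE = 16


def token_grid(height, width, downsample=DOWNSAMPLE):
    """Return (rows, cols, total) of image tokens for a height x width image."""
    if height % downsample or width % downsample:
        raise ValueError(f"{height}x{width} is not a multiple of {downsample}")
    rows, cols = height // downsample, width // downsample
    return rows, cols, rows * cols


for size in (512, 768):
    rows, cols, total = token_grid(size, size)
    print(f"{size}x{size} -> {rows}x{cols} grid, {total} image tokens")
```

Under these assumptions, going from 512 to 768 raises the image token count from 1024 to 2304, so generation at the larger size takes correspondingly more decoding steps.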
Preparing Datasets
We provide the dataset details for evaluation and training. If you don't want to train ControlAR, just download the validation splits.
1. Class-to-image
- Download ImageNet and save it to data/imagenet/data.
2. Text-to-image
- Download ADE20K with caption (~7GB) and save the .parquet files to data/Captioned_ADE20K/data.
- Download COCOStuff with caption (~62GB) and save the .parquet files to data/Captioned_COCOStuff/data.
- Download MultiGen-20M (~1.22TB) and save the .parquet files to data/MultiGen20M/data.
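Because the downloads are large, a rough size check can reveal an incomplete or misplaced dataset before preprocessing starts. This is a minimal sketch using only the target directories named above. The helper `parquet_summary` is hypothetical, and the sizes it reports are only meant to be compared loosely with the approximate figures in the list.

```python
from pathlib import Path

# Target directories named in the download instructions above.
PARQUET_DIRS = {
    "ADE20K": "data/Captioned_ADE20K/data",
    "COCOStuff": "data/Captioned_COCOStuff/data",
    "MultiGen-20M": "data/MultiGen20M/data",
}


def parquet_summary(directory):
    """Return (file_count, total_bytes) for .parquet files under directory."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0, 0
    files = [p for p in directory.rglob("*.parquet") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


for name, directory in PARQUET_DIRS.items():
    count, total_bytes = parquet_summary(directory)
    print(f"{name}: {count} parquet files, {total_bytes / 1e9:.2f} GB in {directory}")
```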
3. Preprocessing datasets
To save training time, we use the tokenizer to pre-process the images, together with their text prompts, before training.
- ImageNet
bash scripts/autoregressive/extract_file_imagenet.sh \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--data-path data/imagenet/data/val \
--code-path data/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--ten-crop --crop-range 1.1 --image-size 256
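The `--ten-crop`, `--crop-range 1.1`, and `--image-size 256` flags suggest a ten-crop augmentation applied before tokenization. The sketch below shows what such a scheme usually looks like, under two assumptions that this README does not confirm: that the image is first brought to `int(image_size * crop_range)` pixels per side, and that ten-crop follows the common convention of four corner crops plus a center crop, each with a horizontal flip. The repository's script may differ, and `ten_crop_boxes` is a hypothetical helper.

```python
def ten_crop_boxes(source_size, crop_size):
    """Return ten (left, top, right, bottom, flipped) tuples for a square image.

    Assumed convention: four corners plus the center, first unflipped,
    then the same five boxes marked as horizontally flipped.
    """
    if crop_size > source_size:
        raise ValueError("crop_size must not exceed source_size")
    margin = source_size - crop_size
    center = margin // 2  # assumption: center offset rounds down
    offsets = [(0, 0), (margin, 0), (0, margin), (margin, margin), (center, center)]
    boxes = []
    for flipped in (False, True):
        for left, top in offsets:
            boxes.append((left, top, left + crop_size, top + crop_size, flipped))
    return boxes


image_size = 256
crop_range = 1.1
source_size = int(image_size * crop_range)  # assumed pre-crop side length

for box in ten_crop_boxes(source_size, image_size):
    print(box)
```

Under these assumptions, each source image contributes ten tokenized views, which would be consistent with the `flip_ten_crop` suffix in the code path above.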
