ControlAR

[ICLR 2025] ControlAR: Controllable Image Generation with Autoregressive Models

Generate Convert Improve

Install / Use

/learn @hustvl/ControlAR

About this skill

Quality Score

0/100

README

<div align ="center"> <img src="./assets/logo.jpeg" width="20%"> <h1> ControlAR </h1> <h3> Controllable Image Generation with Autoregressive Models </h3>

Zongming Li<sup>1,*</sup>, Tianheng Cheng<sup>1,*</sup>, Shoufa Chen<sup>2</sup>, Peize Sun<sup>2</sup>, Haocheng Shen<sup>3</sup>,Longjin Ran<sup>3</sup>, Xiaoxin Chen<sup>3</sup>, Wenyu Liu<sup>1</sup>, Xinggang Wang<sup>1,📧</sup>

<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> The University of Hong Kong <sup>3</sup> vivo AI Lab

(* equal contribution, 📧 corresponding author)

</div> <div align="center"> <img src="./assets/vis.png"> </div>

News

[2025-01-23]: Our ControlAR has been accepted by ICLR 2025 🚀 !
[2024-12-12]: We introduce a control strength factor, employ a larger control encoder(dinov2-base), and optimize text alignment capabilities along with generation diversity. New model weight: depth_base.safetensors and edge_base.safetensors. The edge_base.safetensors can handle three types of edges, including Canny, HED, and Lineart.
[2024-10-31]: The code and models have been released!
[2024-10-04]: We have released the technical report of ControlAR. Code, models, and demos are coming soon!

Highlights

ControlAR explores an effective yet simple conditional decoding strategy for adding spatial controls to autoregressive models, e.g., LlamaGen, from a sequence perspective.
ControlAR supports arbitrary-resolution image generation with autoregressive models without hand-crafted special tokens or resolution-aware prompts.

TODO

[x] release code & models.
[x] release demo code and HuggingFace demo: HuggingFace Spaces 🤗

Results

We provide both quantitative and qualitative comparisons with diffusion-based methods in the technical report!

Models

We released checkpoints of text-to-image ControlAR on different controls and settings, i.e. arbitrary-resolution generation.

| AR Model | Type | Control encoder | Control | Arbitrary-Resolution | Checkpoint | | :--------| :--: | :-------------: | :-----: | :------------------: | :--------: | | LlamaGen-XL | t2i | DINOv2-small | Canny Edge | ✅ | ckpt | | LlamaGen-XL | t2i | DINOv2-small | Depth | ✅ | ckpt | | LlamaGen-XL | t2i | DINOv2-small | HED Edge | ❌ | ckpt | | LlamaGen-XL | t2i | DINOv2-small | Seg. Mask | ❌ | ckpt | | LlamaGen-XL | t2i | DINOv2-base | Edge (Canny, Hed, Lineart) | ❌ | ckpt | | LlamaGen-XL | t2i | DINOv2-base | Depth | ❌ | ckpt |

Getting Started

Installation

conda create -n ControlAR python=3.10
git clone https://github.com/hustvl/ControlAR.git
cd ControlAR
pip install torch==2.1.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
pip3 install -U openmim 
mim install mmengine 
mim install "mmcv==2.1.0"
pip3 install "mmsegmentation>=1.0.0"
pip3 install mmdet
git clone https://github.com/open-mmlab/mmsegmentation.git

Pretrained Checkpoints for ControlAR

We recommend storing them in the following structures:

|---checkpoints
      |---t2i
            |---canny/canny_MR.safetensors
            |---hed/hed.safetensors
            |---depth/depth_MR.safetensors
            |---seg/seg_cocostuff.safetensors
            |---edge_base.safetensors
            |---depth_base.safetensors
      |---t5-ckpt
            |---flan-t5-xl
                  |---config.json
                  |---pytorch_model-00001-of-00002.bin
                  |---pytorch_model-00002-of-00002.bin
                  |---pytorch_model.bin.index.json
                  |---tokenizer.json
      |---vq
            |---vq_ds16_c2i.pt
            |---vq_ds16_t2i.pt
      |---llamagen (Only necessary for training)
            |---c2i_B_256.pt
            |---c2i_L_256.pt
            |---t2i_XL_stage2_512.pt

Demo

Coming soon...

Sample & Generation

1. Class-to-image genetation

python autoregressive/sample/sample_c2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--gpt-ckpt checkpoints/c2i/canny/LlamaGen-L.pt \
--gpt-model GPT-L --seed 0 --condition-type canny

2. Text-to-image generation

Generate an image using HED edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/hed/hed.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type hed --seed 0 --condition-path condition/example/t2i/multigen/eye.png

Generate an image using segmentation mask and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/seg/seg_cocostuff.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/cocostuff/doll.png \
--prompt 'A stuffed animal wearing a mask and a leash, sitting on a pink blanket'

3. Text-to-image generation with adjustable control strength

Generate an image using depth map and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type seg --seed 0 --condition-path condition/example/t2i/multigen/bird.jpg \
--prompt 'A bird made of blue crystal' \
--adapter-size base \
--control-strength 0.6

Generate an image using lineart edge and text-to-image ControlAR:

python autoregressive/sample/sample_t2i.py \
--vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/edge_base.safetensors \
--gpt-model GPT-XL --image-size 512 \
--condition-type lineart --seed 0 --condition-path condition/example/t2i/multigen/girl.jpg \
--prompt 'A girl with blue hair' \
--adapter-size base \
--control-strength 0.6

(you can change lineart to canny_base or hed)

4. Arbitrary-resolution generation

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/depth_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type depth --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0

python3 autoregressive/sample/sample_t2i_MR.py --vq-ckpt checkpoints/vq/vq_ds16_t2i.pt \
--gpt-ckpt checkpoints/t2i/canny_MR.safetensors --gpt-model GPT-XL --image-size 768 \
--condition-type canny --condition-path condition/example/t2i/multi_resolution/bird.jpg \
--prompt 'colorful bird' --seed 0

Preparing Datasets

We provide the dataset datails for evaluation and training. If you don't want to train ControlAR, just download the validation splits.

1. Class-to-image

Download ImageNet and save it to data/imagenet/data.

2. Text-to-image

Download ADE20K with caption(~7GB) and save the .parquet files to data/Captioned_ADE20K/data.
Download COCOStuff with caption( ~62GB) and save the .parquet files to data/Captioned_COCOStuff/data.
Download MultiGen-20M( ~1.22TB) and save the .parquet files to data/MultiGen20M/data.

3. Preprocessing datasets

To save training time, we adopt the tokenizer to pre-process the images with the text prompts.

ImageNet

bash scripts/autoregressive/extract_file_imagenet.sh \
--vq-ckpt checkpoints/vq/vq_ds16_c2i.pt \
--data-path data/imagenet/data/val \
--code-path data/imagenet/val/imagenet_code_c2i_flip_ten_crop \
--ten-crop --crop-range 1.1 --image-size 256

Related Skills

qqbot-channel

344.1k

QQ 频道管理技能。查询频道列表、子频道、成员、发帖、公告、日程等操作。使用 qqbot_channel_api 工具代理 QQ 开放平台 HTTP 接口，自动处理 Token 鉴权。当用户需要查看频道、管理子频道、查询成员、发布帖子/公告/日程时使用。

docs-writer

99.8k

`docs-writer` skill instructions As an expert technical writer and editor for the Gemini CLI project, you produce accurate, clear, and consistent documentation. When asked to write, edit, or revie

model-usage

344.1k

Use CodexBar CLI local cost usage to summarize per-model usage for Codex or Claude, including the current (most recent) model or a full model breakdown. Trigger when asked for model-level usage/cost data from codexbar, or when you need a scriptable per-model summary from codexbar cost JSON.

Design

Campus Second-Hand Trading Platform \- General Design Document (v5.0 \- React Architecture \- Complete Final Version)1\. System Overall Design 1.1. Project Overview This project aims t

hustvl

View profile

View on GitHub

GitHub Stars325

CategoryContent

Updated7d ago

Forks10

hustvl/ControlAR

Languages

Python

Security Score

95/100

Audited on Mar 25, 2026

No findings