# ARPC

[ICLR 2026] Autoregressive-based Progressive Coding for Ultra-Low Bitrate Image Compression
## ⏰Todo
- [x] Repo release
- [ ] Update paper link
- [x] Pretrained models
- [x] Inference
- [ ] Training
## 📖Abstract
Generative models have demonstrated significant results in ultra-low bitrate image compression, owing to their powerful capabilities for content generation and texture completion. Existing works, primarily based on diffusion models, still face challenges such as limited bitrate adaptability and high computational complexity for encoding and decoding. Inspired by the success of the Visual AutoRegressive model (VAR), we introduce AutoRegressive-based Progressive Coding (ARPC), a progressive image compression framework for ultra-low bitrates built on a next-scale-prediction visual autoregressive model. Based on a multi-scale residual vector quantizer, ARPC efficiently encodes an image into multi-scale discrete token maps and controls the bitrate by selecting how many scales to transmit. For decompression, ARPC leverages the prior knowledge inherent in the visual autoregressive model to predict the unreceived scales, which is naturally realized as the autoregressive generation process. To further increase the compression ratio, we treat the VAR as a probability estimator for lossless entropy coding and propose a group-masked bitwise multi-scale residual quantizer to adaptively allocate bits across scales. Extensive experiments show that ARPC achieves state-of-the-art perceptual fidelity at ultra-low bitrates and higher decompression efficiency than existing diffusion-based methods.
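As a toy illustration of the coding idea described above (each scale quantizes the residual left by coarser scales, so transmitting only a prefix of the scales already yields a coarse-to-fine reconstruction), here is a minimal pure-Python sketch. All names and the random codebooks are illustrative assumptions, not ARPC's actual GM-BMSRQ; in particular, ARPC's scales are token maps at different resolutions, which this sketch omits.

```python
import random

random.seed(0)

def quantize(vec, codebook):
    """Return the codeword nearest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((v - w) ** 2 for v, w in zip(vec, c)))

dim, n_scales, n_words = 8, 4, 16
# Each codebook keeps an all-zero codeword so a scale may "pass" on a vector,
# which guarantees the residual error never increases from scale to scale.
codebooks = [
    [[0.0] * dim] + [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_words - 1)]
    for _ in range(n_scales)
]
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(32)]  # toy latent vectors

residuals = [list(v) for v in x]
errors = []
for cb in codebooks:
    for r in residuals:
        q = quantize(r, cb)      # this scale's quantized residual (its "token")
        for i in range(dim):
            r[i] -= q[i]         # finer scales explain what is left
    errors.append(sum(v * v for r in residuals for v in r) / (len(x) * dim))

print(errors)  # non-increasing: receiving more scales improves the reconstruction
```

Stopping the loop after `k` codebooks corresponds to transmitting only the first `k` scales; in ARPC the remaining scales would then be predicted by the VAR instead of being dropped.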
## ✅Main results
Rate-distortion-perception comparison on benchmarks:

Visual results:


## ⚙️Installation

```shell
conda env create -f environment.yaml
conda activate ARPC
```
## 💡Data Preparation

### Training data
We use Coyo-700M as our training data. We first select images with a resolution greater than $1024 \times 1024$, then use an OCR model to filter out images containing too much text. We utilize the InternVL 2.0 model to re-caption all filtered images, providing more accurate and detailed annotations. Our final training dataset contains 5M highly curated images with detailed captions.
The dataset file structure is as follows:

```
<PATH_TO_DATASETS>/id.JPEG
```
We prepare a .jsonl file for training, with each JSON item containing the following fields:

```json
{
  "id": "id",
  "long_caption": "detailed caption",
  "long_caption_type": "caption-InternVL2.0",
  "text": "short caption"
}
```
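The records above can be serialized with one JSON object per line, which is all the .jsonl format requires. A minimal sketch (the file path and record values are placeholders, not the actual dataset):

```python
import json
import os
import tempfile

# Placeholder record in the .jsonl layout described above.
records = [
    {
        "id": "id",
        "long_caption": "detailed caption",
        "long_caption_type": "caption-InternVL2.0",
        "text": "short caption",
    }
]

path = os.path.join(tempfile.gettempdir(), "train_example.jsonl")

# Write: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read back: parse each line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["long_caption_type"])  # prints "caption-InternVL2.0"
```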
### Test data
In our paper, we adopt the Kodak, DIV2K validation, and CLIC2020 test sets for evaluation.
We use the BLIP model to generate the image captions.
An example with the DIV2K dataset is given in `data/DIV2K.json`.
## 🔥Train
### Stage 1

We train the image encoder and decoder with the group-masked bitwise multi-scale residual quantizer.

### Stage 2

We use Infinity-2B as the visual autoregressive model and finetune it for 20k iterations.

Download the Infinity-2B pretrained model and flan-t5-xl, and save them in `weights/`.

```shell
bash train.sh
```
## ⏭️Inference

Download the pretrained models and save them in `weights/`:

- Download the image encoder, decoder, and GM-BMSRQ: checkpoint.
- Download the visual autoregressive model: checkpoint.

```shell
python demo.py
```