# ARPC

[ICLR 2026] Autoregressive-based Progressive Coding for Ultra-Low Bitrate Image Compression
## ⏰Todo
- [x] Repo release
- [ ] Update paper link
- [x] Pretrained models
- [x] Inference
- [ ] Training
## 📖Abstract
Generative models have demonstrated significant results in ultra-low bitrate image compression, owing to their powerful capabilities for content generation and texture completion. Existing works, primarily based on diffusion models, still face challenges such as limited bitrate adaptability and high computational complexity for encoding and decoding. Inspired by the success of the Visual AutoRegressive model (VAR), we introduce AutoRegressive-based Progressive Coding (ARPC), a progressive image compression framework for ultra-low bitrates built on a next-scale-prediction visual autoregressive model. Based on a multi-scale residual vector quantizer, ARPC efficiently encodes an image into multi-scale discrete token maps and controls the bitrate by selecting how many scales to transmit. For decompression, ARPC leverages the prior knowledge inherent in the visual autoregressive model to predict the unreceived scales, which is naturally realized as the autoregressive generation process. To further increase the compression ratio, we treat the VAR as a probability estimator for lossless entropy coding and propose a group-masked bitwise multi-scale residual quantizer to adaptively allocate bits across scales. Extensive experiments show that ARPC achieves state-of-the-art perceptual fidelity at ultra-low bitrates and higher decompression efficiency than existing diffusion-based methods.
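As a toy illustration of the coding idea described above (each scale quantizes the residual left by coarser scales, so transmitting only a prefix of the scales already yields a coarse-to-fine reconstruction), here is a minimal pure-Python sketch. All names and the random codebooks are illustrative assumptions, not ARPC's actual GM-BMSRQ; in particular, ARPC's scales are token maps at different resolutions, which this sketch omits.

```python
import random

random.seed(0)

def quantize(vec, codebook):
    """Return the codeword nearest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((v - w) ** 2 for v, w in zip(vec, c)))

dim, n_scales, n_words = 8, 4, 16
# Each codebook keeps an all-zero codeword so a scale may "pass" on a vector,
# which guarantees the residual error never increases from scale to scale.
codebooks = [
    [[0.0] * dim] + [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_words - 1)]
    for _ in range(n_scales)
]
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(32)]  # toy latent vectors

residuals = [list(v) for v in x]
errors = []
for cb in codebooks:
    for r in residuals:
        q = quantize(r, cb)      # this scale's quantized residual (its "token")
        for i in range(dim):
            r[i] -= q[i]         # finer scales explain what is left
    errors.append(sum(v * v for r in residuals for v in r) / (len(x) * dim))

print(errors)  # non-increasing: receiving more scales improves the reconstruction
```

Stopping the loop after `k` codebooks corresponds to transmitting only the first `k` scales; in ARPC the remaining scales would then be predicted by the VAR instead of being dropped.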
## ✅Main results
Rate-distortion-perception comparison on benchmarks:

Visual results:


## ⚙️Installation

```shell
conda env create -f environment.yaml
conda activate ARPC
```
## 💡Data Preparation

### Training data
We use Coyo-700M as our training data. We first select images with a resolution greater than $1024 \times 1024$, then use an OCR model to filter out images containing too much text. We utilize the InternVL 2.0 model to re-caption all filtered images, providing more accurate and detailed annotations. Our final training dataset contains 5M highly curated images with detailed captions.
The dataset file structure is as follows:

```
<PATH_TO_DATASETS>/id.JPEG
```
We prepare a .jsonl file for training, with each JSON item containing the following fields:

```json
{
  "id": "id",
  "long_caption": "detailed caption",
  "long_caption_type": "caption-InternVL2.0",
  "text": "short caption"
}
```
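The records above can be serialized with one JSON object per line, which is all the .jsonl format requires. A minimal sketch (the file path and record values are placeholders, not the actual dataset):

```python
import json
import os
import tempfile

# Placeholder record in the .jsonl layout described above.
records = [
    {
        "id": "id",
        "long_caption": "detailed caption",
        "long_caption_type": "caption-InternVL2.0",
        "text": "short caption",
    }
]

path = os.path.join(tempfile.gettempdir(), "train_example.jsonl")

# Write: one JSON object per line.
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read back: parse each line independently.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["long_caption_type"])  # prints "caption-InternVL2.0"
```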
### Test data
In our paper, we adopt the Kodak, DIV2K validation, and CLIC2020 test sets for evaluation.
We use the BLIP model to generate the image captions.
An example with the DIV2K dataset is given in `data/DIV2K.json`.
## 🔥Train
### Stage 1

We train the image encoder and decoder with the group-masked bitwise multi-scale residual quantizer.

### Stage 2

We use Infinity-2B as the visual autoregressive model and finetune it for 20k iterations.

Download the Infinity-2B pretrained model and flan-t5-xl, and save them in `weights/`.

```shell
bash train.sh
```
## ⏭️Inference

Download the pretrained models and save them in `weights/`:

- Download the image encoder, decoder, and GM-BMSRQ: checkpoint.
- Download the visual autoregressive model: checkpoint.

```shell
python demo.py
```