PerCo
PyTorch implementation of PerCo (Towards Image Compression with Perfect Realism at Ultra-Low Bitrates, ICLR 2024)
Perceptual Compression (PerCo)
<img src="https://colab.research.google.com/assets/colab-badge.svg" align="center">
This repository provides a PyTorch implementation of PerCo based on:
<p align="center"> <img src="https://github.com/Nikolai10/PerCo/blob/master/res/doc/figures/Teaser_PerCo.png" width="80%" /> </p>

Different from the original work, we use Stable Diffusion v2.1 (Rombach et al., CVPR 2022) as latent diffusion model and hence refer to our work as PerCo (SD). This is to differentiate from the official work, which is based on a proprietary, not publicly available, pre-trained variant based on GLIDE (Nichol et al., ICML 2022).
Under active development.
Updates
06/16/2024
- Finetuned whole U-Net (not just linear layers)
- Slightly improved results (limited to 50k optimization steps)
- Released pre-trained models
- Ablation studies: experimented with LoRA and FSQ (no improvements achieved)
05/29/2024
- Switched back to official hyper-encoder design, resolved training instabilities
- Significantly improved results (limited to 50k optimization steps)
05/24/2024
- Initial release of this project
Visual Impressions
Visual Comparison on the Kodak dataset, for our lowest bit-rate (0.0019bpp). Column 1: ground truth. Columns 2-5: set of reconstructions that reflect the uncertainty about the original image source.
<div align="center">
<img src="./res/doc/figures/0.0019_kodim13_a river runs through a rocky forest with mountains in the background.png" width="95%" alt="0.0019_kodim13_a river runs through a rocky forest with mountains in the background.png">
</div>
Global conditioning: "a river runs through a rocky forest with mountains in the background".
<div align="center">
<img src="./res/doc/figures/0.0019_kodim22_a red barn with a pond in the background.png" width="95%" alt="0.0019_kodim22_a red barn with a pond in the background.png">
</div>
Global conditioning: "a red barn with a pond in the background".
<div align="center">
<img src="./res/doc/figures/0.0019_kodim23_two parrots standing next to each other with leaves in the background.png" width="95%" alt="0.0019_kodim23_two parrots standing next to each other with leaves in the background.png">
</div>
Global conditioning: "two parrots standing next to each other with leaves in the background".
More visual results can be found here.
Quantitative Performance
In this section we quantitatively compare the performance of PerCo (SD v2.1) to the officially reported numbers. All models were trained using a reduced set of optimization steps (50k). Note that the performance is bounded by the LDM auto-encoder, denoted as SD v2.1 auto-encoder.
We generally obtain highly competitive results in terms of perception (FID, KID), especially for the ultra-low bit-rates, but at the cost of lower image fidelity (MS-SSIM, LPIPS). Note that PerCo (official) was trained for 5 epochs (9M training samples / batch size 160 * 5 epochs = 281250 optimization steps), whereas our models were trained for 50k steps, which is roughly 18% of that budget. Also note that we have not yet considered LPIPS as an auxiliary loss, which is known to increase performance at higher bit-rates.
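The step-budget comparison in this paragraph can be checked with a few lines of arithmetic. This is only a sketch of the calculation quoted in the text; the variable names are ours and the 9M sample count is the rounded figure given here.

```python
# Step-budget arithmetic, using the rounded numbers quoted in the text.
train_samples = 9_000_000   # approximate size of the training set
batch_size = 160            # batch size reported for PerCo (official)
epochs = 5

official_steps = train_samples // batch_size * epochs
our_steps = 50_000
fraction = our_steps / official_steps

print(official_steps)        # 281250
print(f"{fraction:.1%}")     # 17.8%
```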
We will continue our experiments and hope to release more powerful variants at a later stage.
<p align="center"> <img src="./res/doc/figures/perco_reimpl.png" alt="PerCo (official) vs. PerCo (SDv2.1)" width="95%" /> </p>

Install
$ git clone https://github.com/Nikolai10/PerCo.git
Please follow our Installation Guide with Docker.
Training/ Inference/ Evaluation
Please have a look at the example notebook for more information.
We use the OpenImagesV6 training dataset by default, similar to MS-ILLM. Please familiarize yourself with the data loading mechanisms (see _openimages_v6.py) and adjust the file paths and training settings in config.py accordingly. Corrupted images must be excluded, see _INVALID_IMAGE_NAMES for more details.
We also provide a simplified Google Colab demo that integrates any tfds dataset (e.g. CLIC 2020), with no data engineering tasks involved: open tutorial.
TODOs
- [x] Compression functionality
    - [x] adopt script logic presented in MS2020
    - [x] provide decompression functionality as custom HuggingFace pipeline
    - [x] add zlib compression functionality (captions)
    - [x] add entropy coding functionality (hyper-encoder)
    - [x] use DDIM scheduler for inference (20/5 denoising steps)
- [x] Provide evaluation code/ compare quantitatively to PerCo (official)
- [x] Training pipeline
    - [x] use train_text_to_image.py as starting point
    - [x] integrate tfds to make use of Open Images v4 (1.7M images)
    - [x] integrate full OpenImagesV6 (9M images) based on NeuralCompression
    - [x] obtain captions dynamically at runtime
    - [x] adjust conditioning logic (z_l, z_g)
    - [x] optimizer AdamW
    - [x] 5 epochs, on 512x512 crops (for now: limited to 50k iterations)
    - [x] peak learning rate ~~1e-4~~ -> we use 1e-5
    - [x] weight decay 0.01
    - [x] bs = 160 (w/o LPIPS), bs = 40 (w/ LPIPS)
    - [x] linear warm-up 10k
    - [x] train hyper-encoder + finetune ~~linear~~ all layers of U-Net
    - [x] exchange traditional noise prediction objective with v-prediction
    - [x] add LPIPS loss for target rates > 0.05bpp
    - [x] add classifier-free guidance (drop text-conditioning in 10% of iterations)
    - [x] override validation logic (add validation images)
- [x] BLIP 2
    - [x] add Salesforce/blip2-opt-2.7b (and variants)
    - [x] max caption length 32 tokens
- [x] Hyper-encoder
    - [x] request hyper-encoder design from authors
    - [x] integrate improved VQ-VAE functionality (Yu et al. ICLR 2022)
    - [x] wrap into (ModelMixin, ConfigMixin) to make use of convenient loading/ saving
- [x] U-Net
    - [x] extend the kernel of the first conv layer
    - [x] initialize newly created variables randomly
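The zlib item in the list above can be illustrated with the Python standard library. This is a minimal sketch and not the repository's code: the helper names `compress_caption` and `decompress_caption` are hypothetical, and the 768x512 image size used for the bits-per-pixel figure is an assumption (the usual Kodak resolution).

```python
import zlib


def compress_caption(caption: str) -> bytes:
    """Losslessly compress a caption (hypothetical helper)."""
    return zlib.compress(caption.encode("utf-8"), level=9)


def decompress_caption(payload: bytes) -> str:
    """Invert compress_caption."""
    return zlib.decompress(payload).decode("utf-8")


caption = "a red barn with a pond in the background"
payload = compress_caption(caption)

# Cost of the caption in bits and in bits per pixel,
# assuming a 768x512 image (assumption, see lead-in).
caption_bits = 8 * len(payload)
caption_bpp = caption_bits / (768 * 512)

print(decompress_caption(payload) == caption)
print(f"{caption_bits} bits, {caption_bpp:.5f} bpp")
```

For captions this short, zlib's fixed header and checksum overhead can outweigh the savings, so the compressed payload is not guaranteed to be smaller than the raw UTF-8 bytes.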
Note:
- we have not adjusted the finetuning grid to 50 timesteps as described in the paper.
- we use Stable Diffusion v2.1 as LDM, due to its native shift from epsilon to v-prediction. In general, however, this project also supports SD 1.X variants with minor adjustments:
    from helpers import update_scheduler

    pipe = StableDiffusionPipelinePerco.from_pretrained(...)
    # add this line if you are using v-prediction
    update_scheduler(pipe)
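For readers unfamiliar with the epsilon vs. v-prediction distinction mentioned in the note, the following scalar sketch shows the standard relation for a variance-preserving schedule (alpha^2 + sigma^2 = 1). It does not use this repository's code; the function names and the numeric values are made up for illustration.

```python
import math


def v_target(x0: float, eps: float, alpha: float, sigma: float) -> float:
    """v-prediction target: v = alpha * eps - sigma * x0."""
    return alpha * eps - sigma * x0


def x0_from_v(x_t: float, v: float, alpha: float, sigma: float) -> float:
    """Recover the clean sample from a noisy sample and v."""
    return alpha * x_t - sigma * v


def eps_from_v(x_t: float, v: float, alpha: float, sigma: float) -> float:
    """Recover the noise from a noisy sample and v."""
    return sigma * x_t + alpha * v


# Illustrative values only; alpha and sigma lie on the unit circle.
phi = 0.3 * math.pi / 2
alpha, sigma = math.cos(phi), math.sin(phi)
x0, eps = 0.8, -1.2

x_t = alpha * x0 + sigma * eps      # forward diffusion step
v = v_target(x0, eps, alpha, sigma)

print(math.isclose(x0_from_v(x_t, v, alpha, sigma), x0))
print(math.isclose(eps_from_v(x_t, v, alpha, sigma), eps))
```

Both conversions rely on alpha^2 + sigma^2 = 1, which is why a model trained with one parameterization needs its scheduler configured to match.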
Pre-trained Models
Pre-trained models corresponding to 0.1250bpp, 0.0313bpp and 0.0019bpp can be downloaded here.
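To get a feeling for these bit-rates, the sketch below converts bits per pixel into a payload size in bytes. The 768x512 resolution is an assumption (the usual Kodak image size), and `payload_bytes` is a hypothetical helper, not part of this repository.

```python
def payload_bytes(bpp: float, width: int, height: int) -> float:
    """Total compressed size in bytes for a given bit-rate and image size."""
    return bpp * width * height / 8


# Assumed image size: 768x512 (Kodak).
for bpp in (0.1250, 0.0313, 0.0019):
    print(f"{bpp:.4f} bpp -> {payload_bytes(bpp, 768, 512):.0f} bytes")
# 0.1250 bpp -> 6144 bytes
# 0.0313 bpp -> 1538 bytes
# 0.0019 bpp -> 93 bytes
```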
All models were trained on a DGX H100 with the following command:
# note that prediction_type must equal config.py prediction_type
!accelerate launch --multi_gpu --num_processes=8 /tf/notebooks/PerCo/src/train_sd_perco.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" \
--validation_image "/tf/notebooks/PerCo/res/eval/kodim13.png" "/tf/notebooks/PerCo/res/eval/kodim23.png" \
--allow_tf32 \
--dataloader_num_workers=12 \
--resolution=512 --center_crop --random_flip \
--train_batch_size=20 \
--gradient_accumulation_steps=1 \
--num_train_epochs=5 \
--max_train_steps 50000 \
--validation_steps 500 \
--prediction_type="v_prediction" \
--checkpointing_steps 500 \
--learning_rate=1e-05 \
--adam_weight_decay=1e-2 \
--max_grad_norm=1 \
--lr_scheduler="constant" \
--lr_warmup_steps=10000 \
--checkpoints_total_limit=2 \
--output_dir="/tf/notebooks/PerCo/res/cmvl_2024"
If you find better hyper-parameters, please share them with the community.
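As a sanity check on the command above, the effective batch size can be derived from its flags. This sketch assumes that `--train_batch_size` is a per-process value, as in the diffusers example script this project started from; the variable names are ours.

```python
# Values copied from the launch command above.
num_processes = 8                 # --num_processes
train_batch_size = 20             # --train_batch_size (assumed per process)
gradient_accumulation_steps = 1   # --gradient_accumulation_steps
max_train_steps = 50_000          # --max_train_steps

effective_batch = num_processes * train_batch_size * gradient_accumulation_steps
samples_seen = effective_batch * max_train_steps

print(effective_batch)   # 160
print(samples_seen)      # 8000000
```

Under that assumption the effective batch size matches the bs = 160 setting listed in the TODOs, and a 50k-step run sees about 8M samples, slightly less than one pass over the 9M-image training set.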
Directions for Improvement
- Investigate scalar quantizer + hyper-decoder (similar to Agustsson et al. ICCV 2019)
- The authors only considered controlling the bit-rate via upper bound (i.e. uniform coding scheme); incorporating a powerful entropy model will likely exceed the reported performance.
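The difference between a uniform coding scheme and an entropy model can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the 8x8 grid, the codebook size of 256, the 512x512 image size and the skewed index distribution are all hypothetical, and caption bits are ignored.

```python
import math


def uniform_rate_bpp(grid_h, grid_w, codebook_size, img_h=512, img_w=512):
    """Upper bound: every index costs ceil(log2(codebook_size)) bits."""
    bits_per_index = math.ceil(math.log2(codebook_size))
    return grid_h * grid_w * bits_per_index / (img_h * img_w)


def entropy_rate_bpp(grid_h, grid_w, probs, img_h=512, img_w=512):
    """Ideal rate if indices are coded with a model matching `probs`."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return grid_h * grid_w * entropy / (img_h * img_w)


# Hypothetical configuration: 8x8 indices, 256 codebook entries.
uniform = uniform_rate_bpp(8, 8, 256)

# Hypothetical skewed distribution: one index takes half the mass.
probs = [0.5] + [0.5 / 255] * 255
ideal = entropy_rate_bpp(8, 8, probs)

print(f"uniform: {uniform:.6f} bpp")
print(f"entropy-coded (ideal): {ideal:.6f} bpp")
```

Whenever the index distribution is not uniform, the entropy-coded rate falls below the uniform bound, which is the potential gain this bullet points at.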
File Structure
docker # Docker functionality + dependencies
├── install.txt
notebooks # jupyter-notebooks
├── FilterMSCOCO.ipynb # How to obtain MS-COCO 30k
├── PerceptualCompression.ipynb # How to train and eval PerCo
res
├── cmvl_2024/ # saved model, c
