
ImageFolder

High-performance Image Tokenizers for VAR and AR


<details> <summary>Robust Latent Matters: Boosting Image Generation with Sampling Error Synthesis <div align="center">

arXiv  huggingface weights 

</div> </summary> <p align="center"> <div align=center> <img src=assets/robusttok.png/> </div> </details> <details> <summary>XQ-GAN🚀: An Open-source Image Tokenization Framework for Autoregressive Generation <div align="center">

arXiv  huggingface weights 

</div> </summary> <p align="center"> <div align=center> <img src=assets/xqgan.png/> </div> </details> <details> <summary>ImageFolder🚀: Autoregressive Image Generation with Folded Tokens <div align="center">

project page  arXiv  huggingface weights 

</div> </summary> <p align="center"> <div align=center> <img src=assets/teaser.png/> </div> </details>

Updates

  • (2025.03.13) RobustTok initial code released.
  • (2025.01.22) ImageFolder got accepted to ICLR 2025.
  • (2024.12.03) XQ-GAN initial code released. ImageFolder is compatible with XQ-GAN.
  • (2024.12.02) ImageFolder's code was officially released at the Adobe Research repo.

Features

🚨🚨🚨 New (2025.03): We now support the latent perturbation + pFID evaluation proposed in RobustTok! Refer to latent_perturbation.py.

⚠️⚠️⚠️ Important: You may want to add the perturbation after calculating the vq and commit losses, i.e., make the perturbation affect only the rec, percep, and gan losses.

# Plug-and-play perturbation to improve your tokenizer's latent robustness
import latent_perturbation as LP

# Dummy tokenizer implementation
class Tokenizer:
    def __init__(self):
        self.enc = Encoder()
        self.dec = Decoder()
        self.quant = Quantizer()
        self.codebook = self.quant.codebook

    def quantize(self, x):
        x = self.enc(x)
        x = self.quant(x)
        #-----------------------------#
        # This is all you need to add!
        # alpha: perturbation rate. beta: perturbation proportion. delta: perturbation strength.
        x = LP.add_perturb(x, z_channels=self.z_channels, codebook_norm=self.codebook_norm,
                           codebook=self.codebook, alpha=0.5, beta=0.1, delta=100)
        #-----------------------------#
        x = self.dec(x)
        return x
<p align="center"> <details> <summary>Basic features of the highly flexible quantization framework <div align="center"> </div> </summary> <div align=center> <img src=assets/table.png/> </div>

XQ-GAN is a highly flexible framework that supports combinations of several advanced quantization approaches, backbone architectures, and training recipes (semantic alignment, discriminators, and auxiliary losses). We also provide finetuning support from pre-trained weights in full, LoRA, and frozen modes.

<p align="center"> <div align=center> <img src=assets/quantizer.png width="500" /> </div> We implement a hierarchical quantization approach that first applies product quantization (PQ) and then residual quantization (RQ). The minimum unit of this design can be vector quantization (VQ), lookup-free quantization (LFQ), or binary spherical quantization (BSQ). Vanilla VQ is recovered in this framework by setting both the product branch count and the residual depth to 1. </details>
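The PQ-then-RQ hierarchy described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the repository's implementation; `vq` and `hierarchical_quantize` are hypothetical names, and setting `product=1, depth=1` reduces it to vanilla VQ:

```python
import numpy as np

def vq(x, codebook):
    # Nearest-neighbor vector quantization: x (N, d), codebook (K, d).
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[dists.argmin(axis=1)]

def hierarchical_quantize(x, codebooks, product=1, depth=1):
    # Split channels into `product` groups (PQ), then quantize each
    # group's residual `depth` times (RQ) against per-level codebooks.
    # codebooks[g][r] is the codebook for product group g, residual level r.
    groups = np.split(x, product, axis=1)
    out = []
    for g, group in enumerate(groups):
        residual = group
        quantized = np.zeros_like(group)
        for r in range(depth):
            q = vq(residual, codebooks[g][r])
            quantized = quantized + q
            residual = residual - q
        out.append(quantized)
    return np.concatenate(out, axis=1)
```

With `product=1, depth=1` the loop runs once over the whole channel vector, which is exactly vanilla VQ; increasing `depth` refines the residual, and increasing `product` quantizes channel groups independently.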

Model Zoo

We provide pre-trained tokenizers for image reconstruction on ImageNet, LAION-400M (natural images), and IMed361M (multimodal medical images) at 256x256 resolution. V: vector quantization. B: binary spherical quantization. P: product quantization. R: residual quantization. MS: multi-scale. LP: latent perturbation. Types are written as MS-{V,B}-{R}-{P}-LP.

<p align="center"> <div align=center> <img src=assets/data.png/> </div>

| Training | Type | Codebook | Latent res. | rFID | pFID | Link | Config |
| :------: | :--: | :------: | :---------: | :--: | :--: | :--: | :----: |
| ImageNet | V | 4096 | 16x16 | 0.91 | 6.98 | Huggingface | VQ-4096.yaml |
| ImageNet | V | 8192 | 16x16 | 0.81 | 7.91 | Huggingface | VQ-8192.yaml |
| ImageNet | VP+LP | 4096 | 16x16 | 1.02 | 2.28 | Huggingface | RobustTok.yaml |
| ImageNet | VP2 | 4096 | 16x16 | 0.90 | - | Huggingface | VP2-4096.yaml |
| ImageNet | VP2 | 16384 | 16x16 | 0.64 | - | Huggingface | VP2-16384.yaml |

| Training | Type | Codebook | Latent res. | rFID | pFID | Link | Config |
| :------: | :--: | :------: | :---------: | :--: | :--: | :--: | :----: |
| ImageNet | MSBR10P2 | 4096 | 1x1->11x11 | 0.86 | - | Huggingface | MSBR10P2-4096.yaml |
| ImageNet | MSBR10P2 | 16384 | 1x1->11x11 | 0.78 | - | Huggingface | MSBR10P2-16384.yaml |

| Training | Type | Codebook | Latent res. | rFID | pFID | Link | Config |
| :------: | :--: | :------: | :---------: | :--: | :--: | :--: | :----: |
| ImageNet | MSVR10P2 | 4096 | 1x1->11x11 | 0.80 | 7.23 | Huggingface | MSVR10P2-4096.yaml |
| ImageNet | MSVR10P2 | 8192 | 1x1->11x11 | 0.70 | - | Huggingface | MSVR10P2-8192.yaml |
| ImageNet | MSVR10P2 | 16384 | 1x1->11x11 | 0.67 | - | Huggingface | MSVR10P2-16384.yaml |
| IMed | MSVR10P2 | 4096 | 1x1->11x11 | - | - | Huggingface | MSVR10P2-4096.yaml |
| LAION | MSVR10P2 | 4096 | 1x1->11x11 | - | - | Huggingface | MSVR10P2-4096.yaml |


We provide pre-trained generators for class-conditioned image generation using the MSVR10P2 (ImageFolder's setting) and VP+Latent Perturbation (LP) tokenizers on ImageNet at 256x256 resolution.

| Generator Type | Tokenizer | Model Size | gFID | Link |
| :------------: | :-------: | :--------: | :--: | :--: |
| VAR | MSVR10P2 | 362M | 2.60 | Huggingface |
| RAR | VP+LP | 261M | 1.83 | Huggingface |
| RAR | VP+LP | 461M | 1.60 | Huggingface |

Installation

Install all required packages with:

conda env create -f environment.yml

Dataset

Download ImageNet2012 from the official website and arrange it as:

ImageNet2012
├── train
└── val

If you want to train or finetune on other datasets, organize them in the format that PyTorch's ImageFolder dataset can recognize:

Dataset
├── train
│   ├── Class1
│   │   ├── 1.png
│   │   └── 2.png
│   ├── Class2
│   │   ├── 1.png
│   │   └── 2.png
├── val

Training code for tokenizer

Please login to Wandb first using

wandb login

rFID will be automatically evaluated and reported on Wandb. The checkpoint with the best rFID on the val set will be saved. We provide basic configurations in the "configs" folder.

Warning❗️: You may want to change the metric used to select saved checkpoints, since rFID does not correlate closely with gFID. PSNR and SSIM are also good choices.
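If you do switch the selection metric, PSNR is straightforward to compute yourself. A minimal sketch (hypothetical helper, assuming pixel values in [0, max_val]):

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    # Peak signal-to-noise ratio between image x and reconstruction y:
    # PSNR = 10 * log10(max_val^2 / MSE).
    mse = np.mean((np.asarray(x) - np.asarray(y)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM is more involved (local luminance/contrast/structure statistics); libraries such as scikit-image provide a standard implementation.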

torchrun --nproc_per_node=8 tokenizer/tokenizer_image/xqgan_train.py --config configs/xxxx.yaml

Please modify the configuration file as needed for your specific dataset. We list some important ones here.

vq_ckpt: ckpt_best.pt                # resume
cloud_save_path: output/exp-xx       # output dir
data_path: ImageNet2012/train        # training set dir
val_data_path: ImageNet2012/val      # val set dir
enc_tun