GigaTok
[ICCV 2025] Official repo for "GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation"
<div align="center"> <!-- *ICCV 2025* --> </div> <p align="center"> <img src="assets/images/qual_grid.png" width=95%> <p> <div align="center"> <a href="https://scholar.google.com/citations?user=tTMKGSYAAAAJ&hl" target="_blank">Tianwei Xiong</a><sup>1*</sup>   <b>·</b>   <a href="https://scholar.google.com.sg/citations?user=8gm-CYYAAAAJ&hl=en" target="_blank">Jun Hao Liew</a><sup>2</sup>   <b>·</b>   <a href="https://speedinghzl.github.io/" target="_blank">Zilong Huang</a><sup>2</sup>   <b>·</b>   <a href="https://scholar.google.com.sg/citations?user=Q8iay0gAAAAJ&hl" target="_blank">Jiashi Feng</a><sup>2</sup>   <b>·</b>   <a href="https://xh-liu.github.io/" target="_blank">Xihui Liu</a><sup>1✉</sup><br> <sup>1</sup>The University of Hong Kong   <sup>2</sup>ByteDance Seed   <br> <sup>*</sup>Work partly done as an Intern at ByteDance. ✉ Corresponding author   <br> </div> <br>

🔈News
- [2025/06/26] GigaTok is accepted by ICCV 2025!
- [2025/04/14] Research paper, code, and models are released for GigaTok!
Introduction
<p align="center"> <img src="assets/images/teaser_1col_v2.png" width=75%> <p>

We introduce GigaTok, the first method for scaling visual tokenizers to 3 billion parameters. We show that the reconstruction vs. generation dilemma in tokenizer scaling is caused by increasing latent space complexity, and that it can be resolved by semantic regularization. To scale a visual tokenizer to 3B parameters, we further find:
- 1D tokenizers are more scalable than 2D tokenizers.
- It is better to prioritize decoder scaling when expanding both the encoder and decoder.
- Entropy loss helps stabilize training for billion-scale tokenizers.
🚀 In this codebase, we release
- A series of tokenizers ranging from 136M to 3B, with AR models trained on them.
- A comprehensive framework for experimental exploration of tokenizer training and evaluation, covering objectives beyond reconstruction.
Environment Setup
To set up the environment for GigaTok, follow these steps:
# A working CUDA version: 12.1
# The resulting env should correspond to TORCH_RUN_PATH in set_env_vars.sh
conda create -n gigatok python=3.9
conda activate gigatok
# Install required packages using the provided script
bash env_install.sh
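As a quick sanity check (our suggestion, not a repo script), you can confirm that the interpreter on your PATH comes from the freshly created env; inside the activated gigatok env this should report Python 3.9:

```shell
# Sanity check (not part of the repo): print the active Python version.
# Inside the activated `gigatok` env this should be 3.9.
PYVER="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
echo "active python: ${PYVER}"
```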
Download Checkpoints
All tokenizers below are trained for 256x256 images. You can also download the models from Hugging Face.
| Tokenizer | Config | Param. (Tokenizer) | rFID | LPIPS | Tokenizer Download Link | AR Model | Param. (AR) | gFID | Acc. | AR Model Download Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| S-S | VQ_SS256.yaml | 136M | 1.01 | 0.2226 | VQ_SS256_e100.pt | GPT-B | 111M | 4.05 | 62.6% | GPT_B256_e300_VQ_SS.pt |
| S-B | VQ_SB256.yaml | 232M | 0.89 | 0.2121 | VQ_SB256_e200.pt | GPT-B | 111M | 3.83 | 62.9% | GPT_B256_e300_VQ_SB.pt |
| B-L | VQ_BL256.yaml | 622M | 0.81 | 0.2059 | VQ_BL256_e200.pt | GPT-B | 111M | 3.26 | 67.6% | GPT_B256_e300_VQ_BL.pt |
| B-L (dino disc) | VQ_BL256_dino_disc.yaml | 622M | 0.51 | 0.2056 | VQ_BL256_dino_disc.pt | GPT-B | 111M | 3.33 | 67.7% | GPT_B256_e300_VQ_BL_dino_disc.pt |
| XL-XXL | VQ_XLXXL256.yaml | 2.9B | 0.79 | 0.1947 | VQ_XLXXL256_e300.pt | GPT-B | 111M | 3.15 | 72.0% | GPT_B256_e300_VQ_XLXXL.pt |
| Tokenizer | Config | Param. (Tokenizer) | rFID | LPIPS | Tokenizer Download Link |
| --- | --- | --- | --- | --- | --- |
| S-S-2d | VQ_SS256_2d.yaml | 111M | 1.22 | 0.2227 | VQ_SS256_2d_e100.pt |
| S-B-2d | VQ_SB256_2d.yaml | 197M | 0.97 | 0.2118 | VQ_SB256_2d_e200.pt |
| B-L-2d | VQ_BL256_2d.yaml | 491M | 0.86 | 0.2046 | VQ_BL256_2d_e200.pt |
Downloading Larger AR Models
| Tokenizer | Config | AR Model | Param. (AR) | gFID | Acc. | AR Model Download Link |
| --- | --- | --- | --- | --- | --- |
| B-L | VQ_BL256.yaml | GPT-XL | 775M | 2.13 | 70.6% | GPT_XL256_e300_VQ_BL.pt |
| B-L | VQ_BL256.yaml | GPT-XXL | 1.4B | 2.03 | 69.4% | GPT_XXL256_e300_VQ_BL.pt |
| XL-XXL | VQ_XLXXL256.yaml | GPT-XXL | 1.4B | 1.98 | 74.0% | GPT_XXL256_e300_VQ_XLXXL.pt |
Inference and Evaluation
Tokenizer Reconstruction
To perform tokenizer reconstruction, you need to set up the required environment variables and run the reconstruction script. Follow the instructions below:
- Set Environment Variables
Modify the set_env_vars.sh script according to the comments in it. For this reconstruction task, you only need to set the following variables: PROJECT_ROOT and TORCH_RUN_PATH.
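For illustration, the two variables might look like this (both paths are placeholders; substitute the locations on your machine):

```shell
# Hypothetical values for the two variables this task needs.
# PROJECT_ROOT: the repo checkout; TORCH_RUN_PATH: the torchrun binary
# inside the gigatok env. Both paths below are placeholders.
export PROJECT_ROOT="$HOME/GigaTok"
export TORCH_RUN_PATH="$HOME/miniconda3/envs/gigatok/bin/torchrun"
echo "PROJECT_ROOT=${PROJECT_ROOT}"
```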
# Define the required path/env related variables
. set_env_vars.sh
# Choose the tokenizer configuration
# For S-S Tokenizer (136M)
export TOK_CONFIG="configs/vq/VQ_SS256.yaml"
export VQ_CKPT=results/recheck/VQ_SS256_e100.pt
# Uncomment the following for S-B (232M)
# export TOK_CONFIG="configs/vq/VQ_SB256.yaml"
# export VQ_CKPT=results/recheck/VQ_SB256_e200.pt
# Uncomment the following for B-L (622M)
# export TOK_CONFIG="configs/vq/VQ_BL256.yaml"
# export VQ_CKPT=results/recheck/VQ_BL256_e200.pt
# Uncomment the following for B-L (dino disc) (622M)
# export TOK_CONFIG="configs/vq/VQ_BL256_dino_disc.yaml"
# export VQ_CKPT=results/ckpts/VQ_BL256_dino_disc.pt
# Uncomment the following for XL-XXL (2.9B)
# export TOK_CONFIG="configs/vq/VQ_XLXXL256.yaml"
# export VQ_CKPT=results/ckpts/VQ_XLXXL256_e300.pt
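If you switch tokenizers often, a small case statement (a convenience sketch, not a repo utility; config and checkpoint paths follow the patterns above) avoids the comment/uncomment dance:

```shell
# Convenience sketch: select a tokenizer by short name.
TOK="SS"   # one of: SS, SB, BL
case "$TOK" in
  SS) export TOK_CONFIG="configs/vq/VQ_SS256.yaml" VQ_CKPT="results/recheck/VQ_SS256_e100.pt" ;;
  SB) export TOK_CONFIG="configs/vq/VQ_SB256.yaml" VQ_CKPT="results/recheck/VQ_SB256_e200.pt" ;;
  BL) export TOK_CONFIG="configs/vq/VQ_BL256.yaml" VQ_CKPT="results/recheck/VQ_BL256_e200.pt" ;;
  *)  echo "unknown tokenizer: $TOK" >&2; exit 1 ;;
esac
echo "using ${TOK_CONFIG}"
```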
- Run the Qualitative Reconstruction Script
DATA_PATH=${PROJECT_ROOT}/tests/
# this is the output directory
SAMPLE_DIR=results/reconstructions/
gpus=1 \
PORT=11086 \
bash scripts/reconstruction.sh \
--quant-way=vq \
--data-path=${DATA_PATH} \
--image-size=256 \
--sample-dir=$SAMPLE_DIR \
--vq-ckpt=${VQ_CKPT} \
--model-config ${TOK_CONFIG} \
--qualitative \
--lpips \
--clear-cache
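Once the script finishes, a generic check (not a repo utility; the exact output layout may differ) confirms that reconstructions were written to the sample directory:

```shell
# Count files written under the sample directory.
SAMPLE_DIR="results/reconstructions/"
mkdir -p "$SAMPLE_DIR"   # created by the script in a real run; here for self-containment
N=$(find "$SAMPLE_DIR" -type f | wc -l | tr -d ' ')
echo "found ${N} files in ${SAMPLE_DIR}"
```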
For the quantitative reconstruction evaluation, see Detailed_instructions.
AR Model Inference for Class-Conditional Generation
Qualitative Sampling
# Try these classes!
# [388]='giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca'
# [90]='lorikeet'
# [323]='monarch, monarch butterfly, milkweed butterfly, Danaus plexippus'
# [84]='peacock'
# [980]='volcano'
# [977]='sandbar, sand bar'
# [978]='seashore, coast, seacoast, sea-coast'
# [979]='valley, vale'
# [972]='cl
