# VAR

[NeurIPS 2024 Best Paper Award] [GPT beats diffusion🔥] [Scaling laws in visual generation📈] Official implementation of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
VAR: a new visual generation method that elevates GPT-style models beyond diffusion🚀, with scaling laws observed📈
<div align="center"> </div> <p align="center" style="font-size: larger;"> <a href="https://arxiv.org/abs/2404.02905">Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction</a> </p> <div> <p align="center" style="font-size: larger;"> <strong>NeurIPS 2024 Best Paper</strong> </p> </div> <p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/9850df90-20b1-4f29-8592-e3526d16d755" width=95%> </p>

## News
- 2025-11: We release our Text-to-Video generation model InfinityStar, built on VAR & Infinity; please check InfinityStar⭐️.
- 2025-11: 🎉 InfinityStar is accepted as NeurIPS 2025 Oral.
- 2025-04: 🎉 Infinity is accepted as CVPR 2025 Oral.
- 2024-12: 🏆 VAR received NeurIPS 2024 Best Paper Award.
- 2024-12: 🔥 We release our Text-to-Image research based on VAR; please check Infinity.
- 2024-09: VAR is accepted as NeurIPS 2024 Oral Presentation.
- 2024-04: Visual AutoRegressive modeling is released.
## 🕹️ Try and Play with VAR!
~~We provide a demo website for you to play with VAR models and generate images interactively. Enjoy the fun of visual autoregressive modeling!~~
We provide a demo website for you to play with VAR Text-to-Image and generate images interactively. Enjoy the fun of visual autoregressive modeling!
We also provide `demo_sample.ipynb` for you to see more technical details about VAR.
## What's New?
🔥 Introducing VAR: a new paradigm in autoregressive visual generation✨:
Visual Autoregressive Modeling (VAR) redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".
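Concretely, instead of one flat raster sequence, the model predicts a series of token maps at growing resolutions, each conditioned on all smaller scales. A minimal sketch of the resulting token budget, assuming the 10-scale schedule (1×1 up to 16×16) used for 256×256 models in the paper:

```python
# Coarse-to-fine "next-scale" schedule: step k predicts a full
# pn x pn token map conditioned on all coarser scales before it.
patch_nums = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

tokens_per_scale = [pn * pn for pn in patch_nums]
total_tokens = sum(tokens_per_scale)

print(tokens_per_scale)  # [1, 4, 9, 16, 25, 36, 64, 100, 169, 256]
print(total_tokens)      # 680 tokens in the full autoregressive sequence
```

So one 256×256 image becomes a 680-token sequence of 10 scales, rather than a 256-token raster scan at a single scale.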
<p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/3e12655c-37dc-4528-b923-ec6c4cfef178" width=93%> </p>

🔥 For the first time, GPT-style autoregressive models surpass diffusion models🚀:

<p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/cc30b043-fa4e-4d01-a9b1-e50650d5675d" width=55%> </p>

🔥 Discovering power-law Scaling Laws in VAR transformers📈:

<p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/c35fb56e-896e-4e4b-9fb9-7a1c38513804" width=85%> </p> <p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/91d7b92c-8fc3-44d9-8fb4-73d6cdb8ec1e" width=85%> </p>

🔥 Zero-shot generalizability🛠️:

<p align="center"> <img src="https://github.com/FoundationVision/VAR/assets/39692511/a54a4e52-6793-4130-bae2-9e459a08e96a" width=70%> </p>

For a deep dive into our analyses, discussions, and evaluations, check out our paper.
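A power-law scaling law is a straight line on a log-log plot: test loss follows L ≈ a·C^(−α) in training compute C, and the exponent is the slope in log space. A self-contained sketch of recovering such an exponent (synthetic numbers, not the paper's measurements):

```python
import math

# Synthetic power law L = a * C**(-alpha); the real VAR fits use measured
# test loss vs. training compute -- these numbers are illustrative only.
a, alpha = 5.0, 0.3
compute = [1e18, 1e19, 1e20, 1e21]
loss = [a * c ** (-alpha) for c in compute]

# Least-squares line through (log C, log L); its slope is -alpha.
xs = [math.log(c) for c in compute]
ys = [math.log(v) for v in loss]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(round(-slope, 3))  # recovers alpha = 0.3
```

The same log-log regression, applied to measured loss vs. compute across model sizes, is how the exponents shown in the plots above are obtained.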
## VAR zoo
We provide VAR models for you to play with, which are on <a href='https://huggingface.co/FoundationVision/var'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Huggingface-FoundationVision/var-yellow'></a> or can be downloaded from the following links:
| model | reso. | FID | rel. cost | #params | HF weights🤗 |
|:----------:|:-----:|:--------:|:---------:|:-------:|:------------|
| VAR-d16 | 256 | 3.55 | 0.4 | 310M | var_d16.pth |
| VAR-d20 | 256 | 2.95 | 0.5 | 600M | var_d20.pth |
| VAR-d24 | 256 | 2.33 | 0.6 | 1.0B | var_d24.pth |
| VAR-d30 | 256 | 1.97 | 1 | 2.0B | var_d30.pth |
| VAR-d30-re | 256 | 1.80 | 1 | 2.0B | var_d30.pth |
| VAR-d36 | 512 | 2.63 | - | 2.3B | var_d36.pth |
You can load these models to generate images via the code in `demo_sample.ipynb`. Note: you need to download `vae_ch160v4096z32.pth` first.
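For scripting, the zoo table can be treated as a small lookup, e.g. to pick a checkpoint by resolution and FID. The values below are copied from the table above; the helper name is illustrative:

```python
# (name, resolution, FID, #params, weight file) copied from the zoo table.
VAR_ZOO = [
    ("VAR-d16",    256, 3.55, "310M", "var_d16.pth"),
    ("VAR-d20",    256, 2.95, "600M", "var_d20.pth"),
    ("VAR-d24",    256, 2.33, "1.0B", "var_d24.pth"),
    ("VAR-d30",    256, 1.97, "2.0B", "var_d30.pth"),
    ("VAR-d30-re", 256, 1.80, "2.0B", "var_d30.pth"),
    ("VAR-d36",    512, 2.63, "2.3B", "var_d36.pth"),
]

def best_checkpoint(resolution: int):
    """Return the zoo entry with the lowest FID at the given resolution."""
    candidates = [m for m in VAR_ZOO if m[1] == resolution]
    return min(candidates, key=lambda m: m[2])

print(best_checkpoint(256)[0])  # VAR-d30-re (FID 1.80)
```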
## Installation

- Install `torch>=2.0.0`.
- Install other pip packages via `pip3 install -r requirements.txt`.
- Prepare the ImageNet dataset.

  <details>
  <summary> assume the ImageNet is in `/path/to/imagenet`. It should be like this:</summary>

      /path/to/imagenet/:
          train/:
              n01440764:
                  many_images.JPEG ...
              n01443537:
                  many_images.JPEG ...
          val/:
              n01440764:
                  ILSVRC2012_val_00000293.JPEG ...
              n01443537:
                  ILSVRC2012_val_00000236.JPEG ...

  </details>

  NOTE: the arg `--data_path=/path/to/imagenet` should be passed to the training script.

- (Optional) install and compile `flash-attn` and `xformers` for faster attention computation. Our code will automatically use them if installed. See models/basic_var.py#L15-L30.
## Training Scripts

To train VAR-{d16, d20, d24, d30, d36-s} on ImageNet 256x256 or 512x512, you can run the following commands:
```shell
# d16, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=16 --bs=768 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1
# d20, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=20 --bs=768 --ep=250 --fp16=1 --alng=1e-3 --wpe=0.1
# d24, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=24 --bs=768 --ep=350 --tblr=8e-5 --fp16=1 --alng=1e-4 --wpe=0.01
# d30, 256x256
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=30 --bs=1024 --ep=350 --tblr=8e-5 --fp16=1 --alng=1e-5 --wpe=0.01 --twde=0.08
# d36-s, 512x512 (-s means saln=1, shared AdaLN)
torchrun --nproc_per_node=8 --nnodes=... --node_rank=... --master_addr=... --master_port=... train.py \
  --depth=36 --saln=1 --pn=512 --bs=768 --ep=350 --tblr=8e-5 --fp16=1 --alng=5e-6 --wpe=0.01 --twde=0.08
```
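A note on the flags above: assuming `--bs` is the global batch size split evenly across all ranks (an assumption; check train.py's argument handling in your checkout), the per-GPU batch works out as:

```python
# Hypothetical helper: per-GPU batch size if --bs is the *global* batch
# size (an assumption -- verify against train.py before relying on it).
def per_gpu_batch(global_bs: int, nproc_per_node: int, nnodes: int) -> int:
    world_size = nproc_per_node * nnodes
    assert global_bs % world_size == 0, "global batch must divide evenly across ranks"
    return global_bs // world_size

# e.g. the d16 run (--bs=768) launched on 4 nodes of 8 GPUs each:
print(per_gpu_batch(768, 8, 4))  # 24 samples per GPU per step
```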
A folder named `local_output` will be created to save the checkpoints and logs.
You can monitor the training process by checking the logs in `local_output/log.txt` and `local_output/stdout.txt`, or using `tensorboard --logdir=local_output/`.
If your experiment is interrupted, just rerun the command, and training will automatically resume from the last checkpoint in `local_output/ckpt*.pth` (see utils/misc.py#L344-L357).
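Sketched in simplified form, the auto-resume step amounts to picking the newest matching checkpoint in the output folder. The helper below is a hypothetical illustration, not the exact code in utils/misc.py:

```python
import glob
import os

def find_resume_checkpoint(out_dir: str = "local_output"):
    """Return the newest ckpt*.pth in out_dir, or None if starting fresh.

    Simplified sketch of the auto-resume behavior; the real logic lives in
    utils/misc.py and may select the checkpoint differently.
    """
    ckpts = glob.glob(os.path.join(out_dir, "ckpt*.pth"))
    if not ckpts:
        return None  # no checkpoint yet: train from scratch
    return max(ckpts, key=os.path.getmtime)  # most recently written one
```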
## Sampling & Zero-shot Inference
For FID evaluation, use `var.autoregressive_infer_cfg(..., cfg=1.5, top_p=0.96, top_k=900, more_smooth=False)` to sample 50,000 images (50 per class) and save them as PNG (not JPEG) files in a folder. Pack them into a `.npz` file via `create_npz_from_sample_folder(sample_folder)` in utils/misc.py#L344.
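The 50-per-class schedule is just a balanced list of class labels fed to the sampler batch by batch; a minimal sketch of building it (counts taken from the sentence above, variable names are illustrative):

```python
# Balanced label schedule for ImageNet FID: 1000 classes x 50 samples each.
NUM_CLASSES, PER_CLASS = 1000, 50
labels = [c for c in range(NUM_CLASSES) for _ in range(PER_CLASS)]

print(len(labels))  # 50000 images total
print(labels[:3])   # [0, 0, 0] -- 50 consecutive samples per class
```

Each chunk of this list becomes the `label_B` batch passed to the sampler, so every class contributes exactly 50 images to the FID statistics.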
Then use [OpenAI's FID evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/ev…) to compute FID.