Infinity $\infty$: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
<div align="center"> </div> <p align="center" style="font-size: larger;"> <a href="https://arxiv.org/abs/2412.04431">Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis</a> </p> <p align="center"> <img src="assets/show_images.jpg" width=95%> </p>

🔥 Updates!!
- Nov 7, 2025: 🔥 We release our text-to-video generation based on VAR & Infinity; please check Infinity⭐️.
- Jun 24, 2025: 🍉 Release a middle stage model of Infinity-8B generating 512x512 images.
- May 25, 2025: 🔥 Released the Infinity image tokenizer training code & settings; check the link.
- Apr 24, 2025: 🔥 Infinity is accepted as CVPR 2025 Oral.
- Feb 18, 2025: 🔥 Infinity-8B weights & code are released!
- Feb 7, 2025: 🌺 Infinity-8B Demo is released! Check demo.
- Dec 24, 2024: 🔥 Training and testing code, checkpoints, and demo released!
- Dec 12, 2024: 💻 Add Project Page
- Dec 10, 2024: 🏆 Visual AutoRegressive Modeling received NeurIPS 2024 Best Paper Award.
- Dec 5, 2024: 🤗 Paper release
🕹️ Try and Play with Infinity!
We provide a demo website for you to play with Infinity and generate images interactively. Enjoy the fun of bitwise autoregressive modeling!
We also provide interactive_infer.ipynb and interactive_infer_8b.ipynb for you to see more technical details about Infinity-2B & Infinity-8B.
📑 Open-Source Plan
- [ ] Infinity-20B Checkpoints
- [x] Infinity Image tokenizer training code & setting
- [x] Infinity-8B Checkpoints (512x512)
- [x] Infinity-8B Checkpoints (1024x1024)
- [x] Training Code
- [x] Web Demo
- [x] Inference Code
- [x] Infinity-2B Checkpoints
- [x] Visual Tokenizer Checkpoints
📖 Introduction
We present Infinity, a bitwise visual autoregressive model capable of generating high-resolution, photorealistic images. Infinity redefines the visual autoregressive model under a bitwise token prediction framework with an infinite-vocabulary tokenizer & classifier and bitwise self-correction. By theoretically scaling the tokenizer vocabulary size to infinity and concurrently scaling the transformer size, our method unleashes powerful scaling capabilities. Infinity sets a new record for autoregressive text-to-image models, outperforming top-tier diffusion models such as SD3-Medium and SDXL. Notably, Infinity surpasses SD3-Medium by improving the GenEval benchmark score from 0.62 to 0.73 and the ImageReward benchmark score from 0.87 to 0.96, achieving a 66% win rate. Without extra optimization, Infinity generates a high-quality 1024×1024 image in 0.8 seconds, 2.6× faster than SD3-Medium, making it the fastest text-to-image model.
🔥 Redefines VAR under a bitwise token prediction framework 🚀:
<p align="center"> <img src="assets/framework_row.png" width=95%> </p>

Infinite-Vocabulary Tokenizer✨: We propose a new bitwise multi-scale residual quantizer, which significantly reduces memory usage and enables training with extremely large vocabularies, e.g. $V_d = 2^{32}$ or $V_d = 2^{64}$.
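As an illustrative sketch (not the repository's implementation), a bitwise multi-scale residual quantizer can be reduced to three steps: quantize each residual dimension to one bit, scale the bits by a shared magnitude, and re-quantize whatever residual remains at the next scale. Each scale then stores only $d$ bits, so the implicit vocabulary is $2^d$:

```python
def bitwise_quantize(residual):
    """Quantize each dimension of a residual to a single bit (+1/-1)."""
    return [1.0 if r >= 0 else -1.0 for r in residual]

def multiscale_residual_quantize(feature, num_scales=3):
    """Toy multi-scale residual quantization of a d-dimensional vector.

    Each later scale quantizes what the earlier scales failed to
    explain, so the reconstruction improves scale by scale.
    """
    recon = [0.0] * len(feature)
    codes = []
    for _ in range(num_scales):
        residual = [f - r for f, r in zip(feature, recon)]
        bits = bitwise_quantize(residual)
        scale = sum(abs(r) for r in residual) / len(residual)  # shared magnitude
        recon = [r + scale * b for r, b in zip(recon, bits)]
        codes.append(bits)
    return codes, recon

feature = [0.8, -0.3, 0.05, -0.9]
codes, recon = multiscale_residual_quantize(feature)
err = sum(abs(f - r) for f, r in zip(feature, recon)) / len(feature)
print(err)  # reconstruction error shrinks as num_scales grows
```

Note that only the bits and one scalar per scale need to be stored, which is why memory stays small even when the nominal vocabulary is astronomically large.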
Infinite-Vocabulary Classifier✨: A conventional classifier predicts $2^d$ indices; IVC predicts $d$ bits instead. Slight perturbations to near-zero values in continuous features completely change the index labels, whereas bit labels change only subtly and still provide steady supervision. Moreover, with $d = 32$ and $h = 2048$, a conventional classifier requires 8.8T parameters, while IVC requires only 0.13M.
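The quoted parameter counts can be sanity-checked with a few lines of arithmetic (weights only, biases ignored; the IVC is assumed here to be $d$ independent 2-way linear heads over the hidden size $h$, which is an assumption for illustration):

```python
# Parameter-count sanity check for the numbers quoted above.
d, h = 32, 2048

conventional = (2 ** d) * h   # one logit per index in a 2^d vocabulary
ivc = d * 2 * h               # d binary heads, 2 logits each

print(f"conventional: {conventional / 1e12:.1f}T parameters")  # → 8.8T
print(f"IVC:          {ivc / 1e6:.2f}M parameters")            # → 0.13M
```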
Bitwise Self-Correction✨: Teacher-forcing training in AR introduces a severe train-test discrepancy: the transformer only refines features without recognizing and correcting mistakes, so errors are propagated and amplified, eventually corrupting the generated images. We propose Bitwise Self-Correction (BSC) to mitigate this train-test discrepancy.
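One illustrative reading of BSC (a sketch, not the repository's implementation; `flip_prob` and all names here are placeholders): during teacher forcing, randomly flip a fraction of the quantized bits at each scale and compute the next residual from the *flipped* reconstruction, while keeping the clean bits as labels. The model is then supervised to correct earlier mistakes rather than only refine clean features:

```python
import random

def self_correcting_supervision(feature, num_scales=3, flip_prob=0.1, seed=0):
    """Sketch of Bitwise Self-Correction for one d-dimensional feature."""
    rng = random.Random(seed)
    recon = [0.0] * len(feature)
    inputs, labels = [], []
    for _ in range(num_scales):
        residual = [f - r for f, r in zip(feature, recon)]
        bits = [1.0 if r >= 0 else -1.0 for r in residual]
        labels.append(bits)                                  # clean bit targets
        noisy = [-b if rng.random() < flip_prob else b for b in bits]
        inputs.append(noisy)                                 # corrupted context
        scale = sum(abs(r) for r in residual) / len(residual)
        recon = [r + scale * b for r, b in zip(recon, noisy)]  # follow the noisy path
    return inputs, labels
```

With `flip_prob=0` this degenerates to plain teacher forcing; with a nonzero flip rate, later scales see reconstructions containing simulated test-time errors.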
🔥 Scaling Vocabulary benefits Reconstruction and Generation 📈:
<p align="center"> <img src="assets/scaling_vocabulary.png" width=95%> </p>

🔥 Discovering Scaling Laws in Infinity transformers 📈:
<p align="center"> <img src="assets/scaling_models.png" width=95%> </p>

🏘 Infinity Model ZOO
We provide Infinity models for you to play with, which are on <a href='https://huggingface.co/FoundationVision/infinity'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20weights-FoundationVision/Infinity-yellow'></a> or can be downloaded from the following links:
Visual Tokenizer
- Released the Infinity image tokenizer training code & settings; check the link.
| vocabulary | stride | IN-256 rFID $\downarrow$ | IN-256 PSNR $\uparrow$ | IN-512 rFID $\downarrow$ | IN-512 PSNR $\uparrow$ | HF weights🤗 |
|:----------:|:------:|:------------------------:|:----------------------:|:------------------------:|:----------------------:|:-------------|
| $V_d=2^{16}$ | 16 | 1.22 | 20.9 | 0.31 | 22.6 | infinity_vae_d16.pth |
| $V_d=2^{24}$ | 16 | 0.75 | 22.0 | 0.30 | 23.5 | infinity_vae_d24.pth |
| $V_d=2^{32}$ | 16 | 0.61 | 22.7 | 0.23 | 24.4 | infinity_vae_d32.pth |
| $V_d=2^{64}$ | 16 | 0.33 | 24.9 | 0.15 | 26.4 | infinity_vae_d64.pth |
| $V_d=2^{32}$ | 16 | 0.75 | 21.9 | 0.32 | 23.6 | infinity_vae_d32_reg.pth |
Infinity
| model | Resolution | GenEval | DPG | HPSv2.1 | HF weights🤗 |
|:-----:|:----------:|:-------:|:---:|:-------:|:-------------|
| Infinity-2B | 1024 | 0.69 / 0.73 $^{\dagger}$ | 83.5 | 32.2 | infinity_2b_reg.pth |
| Infinity-8B | 1024 | 0.79 $^{\dagger}$ | 86.6 | - | infinity_8b_weights |
| Infinity-8B | 512 | - | - | - | infinity_8b_512x512_weights |
| Infinity-20B | 1024 | - | - | - | Coming Soon |
$^{\dagger}$: the result is tested with a prompt rewriter.
You can load these models to generate images via the codes in interactive_infer.ipynb and interactive_infer_8b.ipynb .
⚽️ Installation
- We use FlexAttention to speed up training, which requires `torch>=2.5.1`.
- Install other pip packages via `pip3 install -r requirements.txt`.
- Download weights from Hugging Face. Besides the VAE & transformer weights on <a href='https://huggingface.co/FoundationVision/infinity'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20weights-FoundationVision/Infinity-yellow'></a>, you should also download flan-t5-xl:
```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl")
```
These three lines will download flan-t5-xl to your ~/.cache/huggingface directory.
🎨 Data Preparation
The structure of the training dataset is listed below. The training dataset contains a list of JSON files named "[h_div_w_template]_[num_examples].jsonl". Here [h_div_w_template] is a float number giving the template ratio of height to width of the image, and [num_examples] is the number of examples whose $h/w$ is around h_div_w_template. dataset_t2i_iterable.py supports training with >100M examples, but the number of examples for each h/w template ratio has to be specified in the filename.
/path/to/dataset/:
[h_div_w_template1]_[num_examples].jsonl
[h_div_w_template2]_[num_examples].jsonl
[h_div_w_template3]_[num_examples].jsonl
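The filename convention above can be parsed with a small helper (hypothetical, not from the repo; the example shard name is made up):

```python
import re
from pathlib import Path

def parse_shard_name(path):
    """Parse a '[h_div_w_template]_[num_examples].jsonl' filename.

    e.g. '0.750_1200000.jsonl' -> (0.75, 1200000)
    """
    m = re.fullmatch(r"([\d.]+)_(\d+)\.jsonl", Path(path).name)
    if m is None:
        raise ValueError(f"unexpected shard filename: {path}")
    return float(m.group(1)), int(m.group(2))

print(parse_shard_name("/path/to/dataset/0.750_1200000.jsonl"))  # → (0.75, 1200000)
```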
Each "[h_div_w_template]_[num_examples].jsonl" file contains lines of dumped JSON items. Each JSON item contains the following information:
{
    "image_path": "path/to/image, required",
    "h_div_w": "float value of h_div_w for the image, required",
    "long_caption": "long caption of the image, required",
    ...
}
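A minimal sketch of producing such a shard, with one JSON object per line and the shard name encoding the h/w template and example count (all paths and captions below are illustrative placeholders):

```python
import json

examples = [
    {"image_path": "images/0001.jpg", "h_div_w": 0.75,
     "long_caption": "A red bicycle leaning against a brick wall."},
    {"image_path": "images/0002.jpg", "h_div_w": 0.74,
     "long_caption": "A bowl of ramen on a wooden table."},
]

h_div_w_template = 0.75
filename = f"{h_div_w_template:.3f}_{len(examples)}.jsonl"  # 0.750_2.jsonl

# One dumped JSON item per line, as the loader expects.
with open(filename, "w") as f:
    for item in examples:
        f.write(json.dumps(item) + "\n")
```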