# DeTok: Latent Denoising Makes Good Visual Tokenizers <br><sub>Official PyTorch Implementation</sub>
<p align="center"> <img src="assets/method.png" width="720"> </p>

This is a PyTorch/GPU implementation of the paper [Latent Denoising Makes Good Visual Tokenizers](https://arxiv.org/abs/2507.15856):
```bibtex
@article{yang2025detok,
  title={Latent Denoising Makes Good Visual Tokenizers},
  author={Jiawei Yang and Tianhong Li and Lijie Fan and Yonglong Tian and Yue Wang},
  journal={arXiv preprint arXiv:2507.15856},
  year={2025}
}
```
This repo contains:
- 🪐 A simple PyTorch implementation of the l-DeTok tokenizer and various generative models (MAR, RandARDiff, RasterARDiff, DiT, SiT, and LightningDiT)
- ⚡️ Pre-trained DeTok tokenizers and MAR models trained on ImageNet 256x256
- 🛸 Training and evaluation scripts for tokenizer and generative models
- 🎉 Hugging Face integration for easy access to pre-trained models
## Preparation

### Installation
Download the code:

```bash
git clone https://github.com/Jiawei-Yang/detok.git
cd detok
```
Create and activate the conda environment:

```bash
conda create -n detok python=3.10 -y && conda activate detok
pip install -r requirements.txt
```
### Dataset

Create a `data/` folder (`mkdir data/`), then download the ImageNet dataset. You can either:
- download it directly to `data/imagenet/`, or
- create a symbolic link:

```bash
ln -s /path/to/your/imagenet data/imagenet
```
### Download Required Files

Create the data directory and download the required files:

```bash
mkdir data/
# Download everything from Hugging Face
huggingface-cli download jjiaweiyang/l-DeTok --local-dir released_model
mv released_model/train.txt data/
mv released_model/val.txt data/
mv released_model/fid_stats data/
mv released_model/imagenet-val-prc.zip ./data/
# Unzip imagenet-val-prc.zip for precision & recall evaluation
python -m zipfile -e ./data/imagenet-val-prc.zip ./data/
```
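If you prefer to script the download, the same files can be fetched with the `huggingface_hub` Python API (same repo id as the CLI command above):

```python
# Fetch the released files with the huggingface_hub API instead of the CLI.
from huggingface_hub import snapshot_download

# Downloads the full jjiaweiyang/l-DeTok repo into released_model/
snapshot_download(repo_id="jjiaweiyang/l-DeTok", local_dir="released_model")
```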
### Data Organization

Your data directory should be organized as follows:

```
data/
├── fid_stats/                       # FID statistics
│   ├── adm_in256_stats.npz
│   └── val_fid_statistics_file.npz
├── imagenet/                        # ImageNet dataset (or symlink)
│   ├── train/
│   └── val/
├── imagenet-val-prc/                # Precision-recall data
├── train.txt                        # Training file list
└── val.txt                          # Validation file list
```
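Before launching any jobs, you can verify the layout with a quick check (paths taken from the tree above):

```python
# Check that every path the training and evaluation scripts expect is in place.
from pathlib import Path

expected = [
    "data/fid_stats/adm_in256_stats.npz",
    "data/fid_stats/val_fid_statistics_file.npz",
    "data/imagenet/train",
    "data/imagenet/val",
    "data/imagenet-val-prc",
    "data/train.txt",
    "data/val.txt",
]
for p in expected:
    status = "ok" if Path(p).exists() else "MISSING"
    print(f"{status:8s} {p}")
```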
## Models

For convenience, our pre-trained models are available on Hugging Face:
| Model               | Type      | Params | Hugging Face           |
|---------------------|-----------|--------|------------------------|
| DeTok-BB            | Tokenizer | 172M   | 🤗 detok-bb            |
| DeTok-BB-decoder_ft | Tokenizer | 172M   | 🤗 detok-bb-decoder_ft |
| MAR-Base            | Generator | 208M   | 🤗 mar-base            |
| MAR-Large           | Generator | 479M   | 🤗 mar-large           |
FID-50K with CFG:

| CFG | MAR Model                       | FID-50K | Inception Score |
|-----|---------------------------------|---------|-----------------|
| 3.9 | MAR-Base + DeTok-BB             | 1.61    | 289.7           |
| 3.9 | MAR-Base + DeTok-BB-decoder_ft  | 1.55    | 291.0           |
| 3.4 | MAR-Large + DeTok-BB            | 1.43    | 303.5           |
| 3.4 | MAR-Large + DeTok-BB-decoder_ft | 1.32    | 304.1           |
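FID evaluation uses the reference statistics downloaded into `data/fid_stats/` earlier. If you want to inspect those files (the stored array names follow the usual `.npz` reference-statistics convention, so check the actual contents rather than assuming):

```python
# Peek at the reference statistics used for FID evaluation.
import numpy as np

stats = np.load("data/fid_stats/adm_in256_stats.npz")
print(stats.files)                  # names of the stored arrays
for name in stats.files:
    print(name, stats[name].shape)  # typically a mean vector and a covariance matrix
```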
## Usage

### Demo

Run our demo using the notebook at `demo.ipynb`.
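If you want to inspect a released checkpoint outside the notebook, the sketch below downloads one and looks at its state dict. The filename comes from the release; how the `detok_BB` model is instantiated depends on the repo's model factory, so this stops short of building the model:

```python
# Download a released DeTok checkpoint and inspect it before loading into a model.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="jjiaweiyang/l-DeTok",
    filename="detok-BB-gamm3.0-m0.7-decoder_tuned.pth",
)
state = torch.load(ckpt_path, map_location="cpu")
print(type(state), list(state.keys())[:10])  # top-level structure of the checkpoint
```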
### Training

#### 1. Tokenizer Training

Train the DeTok tokenizer with denoising. The multi-node commands below expect `NODE_RANK`, `MASTER_ADDR`, and `MASTER_PORT` to be set on each node; for a single-node run, use `--nnodes=1` with `NODE_RANK=0` and `MASTER_ADDR=localhost`.
```bash
project=tokenizer_training
exp_name=detokBB-g3.0-m0.7-200ep
batch_size=32  # global batch size = batch_size x num_nodes x 8 = 1024
num_nodes=4    # adjust for your multi-node setup
YOUR_WANDB_ENTITY=""  # change to your wandb entity

torchrun --nproc_per_node=8 --nnodes=$num_nodes --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
    main_reconstruction.py \
    --project $project --exp_name $exp_name --auto_resume \
    --batch_size $batch_size --model detok_BB \
    --gamma 3.0 --mask_ratio 0.7 \
    --online_eval \
    --epochs 200 --discriminator_start_epoch 100 \
    --data_path ./data/imagenet/train \
    --entity $YOUR_WANDB_ENTITY --enable_wandb
```
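For intuition: `--gamma` scales how aggressively latents are noised and `--mask_ratio` sets the fraction of tokens randomly masked during tokenizer training; the decoder must reconstruct the clean image from the corrupted latents. The sketch below is illustrative only; the exact noise schedule and masking live in `main_reconstruction.py` and the paper:

```python
# Illustrative latent corruption, NOT the repo's exact formulation.
import torch

def corrupt_latents(z: torch.Tensor, gamma: float = 3.0, mask_ratio: float = 0.7):
    B, N, D = z.shape
    t = torch.rand(B, 1, 1)                                  # per-sample noise level
    z_noisy = (1 - t) * z + t * gamma * torch.randn_like(z)  # gamma scales noise strength
    keep = torch.rand(B, N, 1) > mask_ratio                  # drop ~70% of token positions
    return z_noisy * keep                                    # masked positions zeroed out

z = torch.randn(4, 256, 16)  # (batch, tokens, latent dim); shapes are illustrative
z_corrupted = corrupt_latents(z)
```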
Decoder fine-tuning:

```bash
project=tokenizer_training
exp_name=detokBB-g3.0-m0.7-200ep-decoder_ft-100ep
batch_size=32
num_nodes=4
pretrained_tok=work_dirs/tokenizer_training/detokBB-g3.0-m0.7-200ep/checkpoints/latest.pth
YOUR_WANDB_ENTITY=""

torchrun --nproc_per_node=8 --nnodes=$num_nodes --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
    main_reconstruction.py \
    --project $project --exp_name $exp_name --auto_resume \
    --batch_size $batch_size --model detok_BB \
    --load_from $pretrained_tok \
    --online_eval --train_decoder_only \
    --perceptual_weight 0.1 \
    --gamma 0.0 --mask_ratio 0.0 \
    --blr 5e-5 --warmup_rate 0.05 \
    --epochs 100 --discriminator_start_epoch 0 \
    --data_path ./data/imagenet/train \
    --entity $YOUR_WANDB_ENTITY --enable_wandb
```
#### 2. Generative Model Training

Train MAR-base (100 epochs):

```bash
tokenizer_project=tokenizer_training
tokenizer_exp_name=detokBB-g3.0-m0.7-200ep-decoder_ft-100ep
project=gen_model_training
exp_name=mar_base-${tokenizer_exp_name}
batch_size=32  # global batch size = batch_size x num_nodes x 8 = 1024
num_nodes=4
epochs=100
YOUR_WANDB_ENTITY=""  # change to your wandb entity

torchrun --nproc_per_node=8 --nnodes=$num_nodes --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
    main_diffusion.py \
    --project $project --exp_name $exp_name --auto_resume \
    --batch_size $batch_size --epochs $epochs --use_aligned_schedule \
    --tokenizer detok_BB --use_ema_tokenizer --collect_tokenizer_stats \
    --stats_key $tokenizer_exp_name --stats_cache_path work_dirs/stats.pkl \
    --load_tokenizer_from work_dirs/$tokenizer_project/$tokenizer_exp_name/checkpoints/latest.pth \
    --model MAR_base --no_dropout_in_mlp \
    --diffloss_d 3 --diffloss_w 1024 \
    --num_sampling_steps 100 --cfg 4.0 \
    --cfg_list 3.0 3.5 3.7 3.8 3.9 4.0 4.1 4.3 4.5 \
    --vis_freq 50 --eval_bsz 256 \
    --data_path ./data/imagenet/train \
    --entity $YOUR_WANDB_ENTITY --enable_wandb
```
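Here `--diffloss_d` and `--diffloss_w` set the depth and width of MAR's per-token diffusion-loss MLP. A shape-level stand-in (the actual class lives in the codebase; the token dimension here is a placeholder):

```python
# Shape-level stand-in for the MLP that --diffloss_d / --diffloss_w configure.
import torch.nn as nn

def diffusion_loss_mlp(token_dim: int = 16, depth: int = 3, width: int = 1024) -> nn.Sequential:
    layers = [nn.Linear(token_dim, width), nn.SiLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.SiLU()]
    layers.append(nn.Linear(width, token_dim))
    return nn.Sequential(*layers)

mlp = diffusion_loss_mlp(depth=3, width=1024)  # matches the flags above
```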
Train SiT-base (100 epochs):

```bash
tokenizer_project=tokenizer_training
tokenizer_exp_name=detokBB-g3.0-m0.7-200ep-decoder_ft-100ep
project=gen_model_training
exp_name=sit_base-${tokenizer_exp_name}
batch_size=32
num_nodes=4
epochs=100
YOUR_WANDB_ENTITY=""

torchrun --nproc_per_node=8 --nnodes=$num_nodes --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
    main_diffusion.py \
    --project $project --exp_name $exp_name --auto_resume \
    --batch_size $batch_size --epochs $epochs --use_aligned_schedule \
    --tokenizer detok_BB --use_ema_tokenizer --collect_tokenizer_stats \
    --stats_key $tokenizer_exp_name --stats_cache_path work_dirs/stats.pkl \
    --load_tokenizer_from work_dirs/$tokenizer_project/$tokenizer_exp_name/checkpoints/latest.pth \
    --model SiT_base \
    --num_sampling_steps 250 --cfg 1.6 \
    --cfg_list 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 \
    --vis_freq 50 --eval_bsz 256 \
    --data_path ./data/imagenet/train \
    --entity $YOUR_WANDB_ENTITY --enable_wandb
```
#### 3. Training with the Released DeTok

Train MAR-base for 800 epochs using the released tokenizer:

```bash
project=gen_model_training
exp_name=mar_base_800ep-detok-BB-gamm3.0-m0.7-decoder_tuned
batch_size=16  # global batch size = batch_size x num_nodes x 8 = 1024
num_nodes=8
epochs=800
YOUR_WANDB_ENTITY=""

torchrun --nproc_per_node=8 --nnodes=$num_nodes --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
    main_diffusion.py \
    --project $project --exp_name $exp_name --auto_resume \
    --batch_size $batch_size --epochs $epochs --use_aligned_schedule \
    --tokenizer detok_BB --use_ema_tokenizer --collect_tokenizer_stats \
    --stats_key detok-BB-gamm3.0-m0.7 --stats_cache_path released_model/stats.pkl \
    --load_tokenizer_from released_model/detok-BB-gamm3.0-m0.7-decoder_tuned.pth \
    --model MAR_base --no_dropout_in_mlp \
    --diffloss_d 6 --diffloss_w 1024 \
    --num_sampling_steps 100 --cfg 3.9 \
    --cfg_list 3.0 3.5 3.7 3.8 3.9 4.0 4.1 4.3 4.5 \
    --online_eval --vis_freq 80 --eval_bsz 256 \
    --data_path ./data/imagenet/train \
    --entity $YOUR_WANDB_ENTITY --enable_wandb
```
Train MAR-large for 800 epochs:

```bash
project=gen_model_training
exp_name=mar_large_800ep-detok-BB-gamm3.0-m0.7-decoder_tuned
batch_size=16
num_nodes=8
epochs=800

torchrun --nproc_per_node=8 --nnodes=$num_nodes --node_rank=${NODE_RANK} --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} \
    main_diffusion.py \
    --project $project --exp_name $exp_name --auto_resume \
    --batch_size $batch_size --epochs $epochs --use_aligned_schedule \
    --tokenizer detok_BB --use_ema_tokenizer --collect_tokenizer_stats \
    --stats_key detok-BB-gamm3.0-m0.7 --stats_cache_path released_model/stats.pkl \
    --load_tokenizer_from released_model/detok-BB-gamm3.0-m0.7-decoder_tuned.pth \
    --model MAR_large --no_dropout_in_mlp \
    --diffloss_d 8 --diffloss_w 1280 \
    --num_sampling_steps 100 --cfg 3.4 \
    --cfg_list 3.0 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 \
    --online_eval --vis_freq 80 --eval_bsz 256 \
    --data_path ./data/imagenet/train
```