<div align="center">

# VidTok <br> <sub>A Family of Versatile and State-of-the-Art Video Tokenizers</sub>

</div>
We introduce VidTok, a cutting-edge family of video tokenizers that excels in both continuous and discrete tokenization. VidTok incorporates several key advancements over existing approaches:
- ⚡️ Efficient Architecture. Separate spatial and temporal sampling reduces computational complexity without sacrificing quality.
- 🔥 Advanced Quantization. Finite Scalar Quantization (FSQ) addresses training instability and codebook collapse in discrete tokenization.
- 💥 Enhanced Training. A two-stage strategy—pre-training on low-res videos and fine-tuning on high-res—boosts efficiency. Reduced frame rates improve motion dynamics representation.
VidTok, trained on a large-scale video dataset, outperforms previous models across all metrics, including PSNR, SSIM, LPIPS, and FVD.
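To give a concrete sense of the FSQ idea mentioned above, here is a minimal sketch (an illustration of the general technique, not VidTok's actual implementation): each latent channel is bounded and rounded to a small fixed set of levels, so the implicit codebook is the product of the per-channel level counts — e.g. five channels with 8 levels each give 8^5 = 32,768 codes — and no codebook embedding needs to be learned, which is what avoids codebook collapse.

```python
import torch

def fsq_quantize(z, levels=(8, 8, 8, 8, 8)):
    """Finite Scalar Quantization sketch: round each channel of `z`
    (shape [..., len(levels)]) to one of `levels[i]` uniform values
    in [-1, 1]. With (8,8,8,8,8) this yields 32,768 implicit codes."""
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    z = torch.tanh(z) * half                # bound each channel
    z_q = torch.round(z)                    # snap to the nearest level
    z_q = z + (z_q - z).detach()            # straight-through gradient
    return z_q / half                       # rescale back to [-1, 1]
```

The straight-through trick (`z + (z_q - z).detach()`) lets gradients flow through the non-differentiable rounding step during training.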
https://github.com/user-attachments/assets/a3341037-130d-4a83-aba6-c3daeaf66932
## 🔥 News
- August, 2025: 🚀 Introduced spatial tiling for large resolutions (>256), reducing GPU memory usage to ~6 GB when encoding and decoding a 17 × 768 × 768 video.
- March, 2025: 🚀 VidTwin has been accepted by CVPR 2025, and the checkpoint was released!
- March, 2025: 🚀 VidTok v1.1 was released! We fine-tuned all causal models on long videos to support tokenization and reconstruction of videos of arbitrary length with fine temporal smoothness. The relevant checkpoints are being updated continuously.
- December, 2024: 🚀 VidTwin was released!
- December, 2024: 🚀 VidTok was released!
## 💥 Updates in VidTok v1.1
VidTok v1.1 is an update for causal models. We fine-tuned all causal models on long videos to support tokenization and reconstruction of videos of arbitrary length with fine temporal smoothness. See performance here.
### v1.1: Long Video Reconstruction
Run the following inference script to reconstruct an input video:
```bash
python scripts/inference_reconstruct.py --config CONFIG_v1_1 --ckpt CKPT_v1_1 \
  --input_video_path VIDEO_PATH --input_height 256 --input_width 256 --sample_fps 30 \
  --chunk_size CHUNK_SIZE --output_video_dir OUTPUT_DIR --read_long_video
# Set `CHUNK_SIZE` according to your GPU memory; 16 is recommended.
```
Run the following inference script to evaluate reconstruction performance:

```bash
python scripts/inference_evaluate.py --config CONFIG_v1_1 --ckpt CKPT_v1_1 \
  --data_dir DATA_DIR --input_height 256 --input_width 256 --sample_fps 30 \
  --chunk_size CHUNK_SIZE --read_long_video
# Set `CHUNK_SIZE` according to your GPU memory; 16 is recommended.
```
For easy use of the VidTok v1.1 models, refer to this script and make the following revisions:

```python
# Use VidTok v1.1 models
cfg_path = "configs/vidtok_v1_1/vidtok_kl_causal_488_4chn_v1_1.yaml"
ckpt_path = "checkpoints/vidtok_v1_1/vidtok_kl_causal_488_4chn_v1_1.ckpt"
...
model.to('cuda').eval()

# Use tiled inference to reduce memory usage
model.use_tiling = True
model.t_chunk_enc = 16
model.t_chunk_dec = model.t_chunk_enc // model.encoder.time_downsample_factor
model.use_overlap = True

# Random input: a long video
x_input = (torch.rand(1, 3, 129, 256, 256) * 2 - 1).to('cuda')
...
# The reconstruction may be padded along the time axis; trim it to the input length.
if x_recon.shape[2] != x_input.shape[2]:
    x_recon = x_recon[:, :, -x_input.shape[2]:, ...]
```
### v1.1: Long Video Fine-tuning
Follow this training guidance to fine-tune on your custom long-video data, and note that:
- Compared to VidTok v1.0, we tend to use longer sequences to fine-tune the model (for example, setting `NUM_FRAMES_1` to 33, 49, or larger).
- The resolution and sequence length of the training data should be adjusted according to GPU memory.
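As a rough rule of thumb for that last point, activation memory scales approximately with the video volume T × H × W (the true footprint also depends on the model, batch size, and precision), so longer clips can be traded against lower resolution:

```python
def rel_activation_cost(T, H, W):
    # Relative activation footprint, proportional to the video volume.
    # A back-of-the-envelope model only; real memory use varies.
    return T * H * W

base = rel_activation_cost(17, 256, 256)
# Moving from 17 to 49 frames at the same resolution costs ~2.9x more
# activations, while halving the resolution offsets a 4x longer clip.
print(rel_activation_cost(49, 256, 256) / base)        # ≈ 2.88
print(rel_activation_cost(4 * 17, 128, 128) / base)    # 1.0
```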
### v1.1: Performance
| Model | Regularizer | Causal | VCR | PSNR | SSIM | LPIPS | FVD |
|------|------|------|------|------|------|------|------|
| vidtok_kl_causal_488_16chn_v1_1 | KL-16chn | ✔️ | 4x8x8 | 35.13 | 0.941 | 0.049 | 87.4 |
| vidtok_kl_causal_41616_16chn_v1_1 | KL-16chn | ✔️ | 4x16x16 | 29.61 | 0.854 | 0.113 | 162.7 |
| vidtok_kl_causal_288_8chn_v1_1 | KL-8chn | ✔️ | 2x8x8 | 34.59 | 0.935 | 0.051 | 78.2 |
| vidtok_fsq_causal_488_32768_v1_1 | FSQ-32,768 | ✔️ | 4x8x8 | 29.39 | 0.856 | 0.114 | 168.5 |
| vidtok_fsq_causal_888_32768_v1_1 | FSQ-32,768 | ✔️ | 8x8x8 | 27.95 | 0.817 | 0.142 | 293.2 |
- These are the evaluation results of long-video reconstruction, conducted on each complete video in the MCL_JCL dataset with a sample fps of 30 and a resolution of 256x256.
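For intuition about what a VCR (video compression ratio) like 4x8x8 implies for latent sizes, here is a back-of-the-envelope sketch. It assumes the causal models map T frames to 1 + (T-1)/4 latent frames, a common convention for causal video tokenizers that keep the first frame separate; this is an illustrative assumption, not verified against the VidTok code.

```python
def latent_shape(T, H, W, dt=4, ds=8, causal=True):
    """Latent grid implied by a dt x ds x ds compression ratio (sketch).
    Assumes causal models encode the first frame on its own: 1 + (T-1)//dt."""
    t = 1 + (T - 1) // dt if causal else T // dt
    return (t, H // ds, W // ds)

# A 17-frame 256x256 clip under the causal 4x8x8 models:
print(latent_shape(17, 256, 256))    # (5, 32, 32)
# The 129-frame long-video example from the snippet above:
print(latent_shape(129, 256, 256))   # (33, 32, 32)
```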
## 🔧 Setup
- Clone this repository and navigate to the VidTok folder:

```bash
git clone https://github.com/microsoft/VidTok
cd VidTok
```

- We provide an `environment.yaml` file for setting up a Conda environment. Conda's installation instructions are available here.
```bash
# 1. Prepare the conda environment
conda env create -f environment.yaml
# 2. Activate the environment
conda activate vidtok
```
We recommend using one or more high-end GPUs for training and inference. All testing and development was done on A100 and MI300X GPUs. For convenience, we also provide prebuilt Docker images with the required dependencies, which you can use as follows:
```bash
# NVIDIA GPUs
docker run -it --gpus all --shm-size 256G --rm -v `pwd`:/workspace --workdir /workspace \
  deeptimhe/ubuntu22.04-cuda12.1-python3.10-pytorch2.5:orig-vidtok bash
# AMD GPUs
docker run -it --gpus all --shm-size 256G --rm -v `pwd`:/workspace --workdir /workspace \
  deeptimhe/ubuntu22.04-rocm6.2.4-python3.10-pytorch2.5:orig-vidtok bash
```
## 🎈 Checkpoints
Download the pre-trained models here and put them in the `checkpoints` folder, like:

```
└── checkpoints
    ├── vidtok_v1_1
    │   ├── vidtok_kl_causal_488_16chn_v1_1.ckpt
    │   └── ...
    ├── vidtok_fsq_causal_41616_262144.ckpt
    ├── vidtok_fsq_causal_488_262144.ckpt
    ├── vidtok_fsq_causal_488_32768.ckpt
    ├── vidtok_fsq_causal_488_4096.ckpt
    ├── vidtok_fsq_noncausal_41616_262144.ckpt
    ├── vidtok_fsq_noncausal_488_262144.ckpt
    ├── vidtok_kl_causal_288_8chn.ckpt
    ├── vidtok_kl_causal_41616_4chn.ckpt
    ├── vidtok_kl_causal_444_4chn.ckpt
    ├── vidtok_kl_causal_488_16chn.ckpt
    ├── vidtok_kl_causal_488_4chn.ckpt
    ├── vidtok_kl_causal_488_8chn.ckpt
    ├── vidtok_kl_noncausal_41616_16chn.ckpt
    ├── vidtok_kl_noncausal_41616_4chn.ckpt
    ├── vidtok_kl_noncausal_488_16chn.ckpt
    └── vidtok_kl_noncausal_488_4chn.ckpt
```
Each checkpoint has a corresponding config file with the same name in the `configs` folder.
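The checkpoint names above follow a consistent pattern: regularizer (`kl` or `fsq`), causality, compression ratio digits, and then either the latent channel count (KL) or the codebook size (FSQ). As a quick illustration, a hypothetical helper (not part of the repo) that decodes a name:

```python
import re

def parse_ckpt_name(name):
    """Decode names like 'vidtok_kl_causal_488_16chn' (illustrative helper).
    '488' means a 4x8x8 compression ratio; the tail is latent channels
    for KL models or the codebook size for FSQ models."""
    m = re.match(
        r"vidtok_(kl|fsq)_(causal|noncausal)_(\d)(\d+)_(\w+?)(?:_v1_1)?$", name
    )
    reg, causality, t, s, tail = m.groups()
    half = len(s) // 2  # spatial digits split evenly: '88' -> 8x8, '1616' -> 16x16
    return {
        "regularizer": reg.upper(),
        "causal": causality == "causal",
        "vcr": f"{t}x{s[:half]}x{s[half:]}",
        "detail": tail,  # e.g. '16chn' (KL) or '32768' (FSQ)
    }

print(parse_ckpt_name("vidtok_fsq_noncausal_41616_262144")["vcr"])  # 4x16x16
```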
## 🔆 Performance
| Model | Regularizer | Causal | VCR | PSNR | SSIM | LPIPS | FVD |
|------|------|------|------|------|------|------|------|
| vidtok_kl_causal_488_4chn | KL-4chn | ✔️ | 4x8x8 | 29.64 | 0.852 | 0.114 | 194.2 |
| vidtok_kl_causal_488_8chn | KL-8chn | ✔️ | 4x8x8 | 31.83 | 0.897 | 0.083 | 109.3 |
| vidtok_kl_causal_488_16chn | KL-16chn | ✔️ | 4x8x8 | 35.04 | 0.942 | 0.047 | 78.9 |
| vidtok_kl_causal_288_8chn | KL-8chn | ✔️ | 2x8x8 | 33.86 | 0.928 | 0.057 | 80.7 |
| vidtok_kl_causal_444_4chn | KL-4chn | ✔️ | 4x4x4 | 34.78 | 0.941 | 0.051 | 87.2 |
| vidtok_kl_causal_41616_4chn | KL-4chn | ✔️ | 4x16x16 | 25.05 | 0.711 | 0.228 | 549.1 |
| vidtok_kl_noncausal_488_4chn | KL-4chn | ✖️ | 4x8x8 | 30.60 | 0.876 | 0.098 | 157.9 |
| vidtok_kl_noncausal_488_16chn | KL-16chn | ✖️ | 4x8x8 | 36.13 | 0.950 | 0.044 | 60.5 |
| vidtok_kl_noncausal_41616_4chn | KL-4chn | ✖️ | 4x16x16 | 26.06 | 0.751 | 0.190 | 423.2 |
| vidtok_kl_noncausal_41616_16chn | KL-16chn | ✖️ | 4x16x16 | | | | |