
<div align="center">

VidTok <br> <sub>A Family of Versatile and State-Of-The-Art Video Tokenizers</sub>

arXiv · GitHub · HuggingFace

</div>


We introduce VidTok, a cutting-edge family of video tokenizers that excels in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches:

  • ⚡️ Efficient Architecture. Separate spatial and temporal sampling reduces computational complexity without sacrificing quality.
  • 🔥 Advanced Quantization. Finite Scalar Quantization (FSQ) addresses training instability and codebook collapse in discrete tokenization.
  • 💥 Enhanced Training. A two-stage strategy—pre-training on low-res videos and fine-tuning on high-res—boosts efficiency. Reduced frame rates improve motion dynamics representation.

VidTok, trained on a large-scale video dataset, outperforms previous models across all metrics, including PSNR, SSIM, LPIPS, and FVD.
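The FSQ mentioned above replaces a learned codebook with per-channel scalar rounding, which is what sidesteps codebook collapse. A minimal sketch of the idea (not VidTok's implementation; the level choice `[8, 8, 8, 8, 8]` is an assumption picked to match the FSQ-32,768 models, since 8^5 = 32,768, and the straight-through gradient used in training is omitted):

```python
import numpy as np

def fsq_quantize(z, levels, eps=1e-3):
    """Round each latent channel to one of `levels[i]` fixed scalar values.

    Follows the FSQ recipe: bound each channel with tanh (with a half-step
    offset when the level count is even), then round to the nearest level.
    """
    L = np.asarray(levels, dtype=np.float64)
    half_l = (L - 1) * (1 - eps) / 2
    offset = np.where(L % 2 == 0, 0.5, 0.0)
    shift = np.arctanh(offset / half_l)
    bounded = np.tanh(z + shift) * half_l - offset
    return np.round(bounded)

def fsq_index(zq, levels):
    """Map a quantized vector to its integer index in the implicit codebook."""
    L = np.asarray(levels, dtype=int)
    digits = (zq + L // 2).astype(int)                 # shift each channel to 0..L-1
    bases = np.cumprod(np.concatenate(([1], L[:-1])))  # mixed-radix place values
    return int(np.dot(digits, bases))

levels = [8, 8, 8, 8, 8]          # implicit codebook of 8**5 = 32,768 entries
z = np.random.randn(len(levels))
idx = fsq_index(fsq_quantize(z, levels), levels)  # integer in [0, 32768)
```

Because the codebook is the Cartesian product of a few scalar grids, every code is used by construction and no codebook-commitment losses are needed.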

https://github.com/user-attachments/assets/a3341037-130d-4a83-aba6-c3daeaf66932

🔥 News

  • August, 2025: 🚀 Introduced spatial tiling for large resolutions (>256), reducing GPU memory usage to ~6 GB when encoding and decoding a 17 × 768 × 768 video.
  • March, 2025: 🚀 VidTwin has been accepted by CVPR 2025, and the checkpoint was released!
  • March, 2025: 🚀 VidTok v1.1 was released! We fine-tuned all causal models on long videos to support tokenization and reconstruction of videos of arbitrary length with fine temporal smoothness. The relevant checkpoints are being updated continuously.
  • December, 2024: 🚀 VidTwin was released!
  • December, 2024: 🚀 VidTok was released!
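The 17-frame video in the tiling note above is no accident: causal video tokenizers commonly consume 4k + 1 frames and emit k + 1 latent frames at 4x temporal compression, and both the 17-frame and 129-frame examples in this README fit that pattern. A small sketch of the latent-grid arithmetic; the 4k+1 → k+1 causal mapping and the 4-channel default are assumptions for illustration, not taken from the VidTok code:

```python
def latent_shape(frames, height, width, vcr=(4, 8, 8), channels=4, causal=True):
    """Latent grid for a tokenizer with temporal x spatial compression `vcr`.

    Assumes causal models follow the common 4k+1 -> k+1 frame mapping, so the
    first frame gets its own latent; non-causal models divide evenly.
    """
    t, h, w = vcr
    lat_t = (frames - 1) // t + 1 if causal else frames // t
    return (channels, lat_t, height // h, width // w)

print(latent_shape(17, 768, 768))    # (4, 5, 96, 96)
print(latent_shape(129, 256, 256))   # (4, 33, 32, 32)
```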

💥 Updates in VidTok v1.1

VidTok v1.1 is an update for causal models. We fine-tuned all causal models on long videos to support tokenization and reconstruction of videos of arbitrary length with fine temporal smoothness. See performance here.
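The `--chunk_size` option used by the inference scripts below can be pictured as slicing the time axis and encoding each slice independently. A toy sketch of that mechanism; `toy_encode` is a stand-in for illustration, not VidTok's encoder, and the real causal model additionally carries state across chunk boundaries (and can overlap chunks) to preserve temporal smoothness:

```python
import numpy as np

def encode_in_chunks(video, encode_fn, chunk_size=16):
    """Encode an arbitrarily long video chunk-by-chunk along the time axis.

    video: (T, H, W, C); encode_fn maps a chunk of frames to its latents.
    Memory scales with chunk_size instead of the full video length.
    """
    T = video.shape[0]
    parts = [encode_fn(video[t0:t0 + chunk_size]) for t0 in range(0, T, chunk_size)]
    return np.concatenate(parts, axis=0)

# Toy "encoder": average-pool groups of 4 frames (4x temporal compression).
def toy_encode(chunk):
    T = chunk.shape[0] // 4 * 4
    return chunk[:T].reshape(-1, 4, *chunk.shape[1:]).mean(axis=1)

video = np.random.rand(128, 32, 32, 3).astype(np.float32)
latents = encode_in_chunks(video, toy_encode, chunk_size=16)  # shape (32, 32, 32, 3)
```

A larger `chunk_size` trades GPU memory for fewer chunk boundaries, which is why the scripts below let you tune it.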

v1.1: Long Video Reconstruction

Run the following inference script to reconstruct an input video:

```shell
python scripts/inference_reconstruct.py --config CONFIG_v1_1 --ckpt CKPT_v1_1 --input_video_path VIDEO_PATH --input_height 256 --input_width 256 --sample_fps 30 --chunk_size CHUNK_SIZE --output_video_dir OUTPUT_DIR --read_long_video
# Set `CHUNK_SIZE` according to your GPU memory; 16 is recommended.
```

and run the following inference script to evaluate the reconstruction performance:

```shell
python scripts/inference_evaluate.py --config CONFIG_v1_1 --ckpt CKPT_v1_1 --data_dir DATA_DIR --input_height 256 --input_width 256 --sample_fps 30 --chunk_size CHUNK_SIZE --read_long_video
# Set `CHUNK_SIZE` according to your GPU memory; 16 is recommended.
```

For easy usage of VidTok v1.1 models, refer to this script and make the following revisions:

```python
# Use VidTok v1.1 models
cfg_path = "configs/vidtok_v1_1/vidtok_kl_causal_488_4chn_v1_1.yaml"
ckpt_path = "checkpoints/vidtok_v1_1/vidtok_kl_causal_488_4chn_v1_1.ckpt"

...

model.to('cuda').eval()
# Use tiled inference to reduce GPU memory usage
model.use_tiling = True
model.t_chunk_enc = 16
model.t_chunk_dec = model.t_chunk_enc // model.encoder.time_downsample_factor
model.use_overlap = True
# Random input: a long video with values in [-1, 1]
x_input = (torch.rand(1, 3, 129, 256, 256) * 2 - 1).to('cuda')

...

# The model may pad along time; trim the reconstruction to the input length
if x_recon.shape[2] != x_input.shape[2]:
    x_recon = x_recon[:, :, -x_input.shape[2]:, ...]
```

v1.1: Long Video Fine-tuning

Follow this training guidance to fine-tune the model on your custom long-video data, noting that:

  • Compared to VidTok v1.0, we recommend fine-tuning with longer sequences (for example, setting NUM_FRAMES_1 to 33, 49, or larger).
  • The resolution and sequence length of the training data should be adjusted to fit your GPU memory.

v1.1: Performance

| Model | Regularizer | Causal | VCR | PSNR | SSIM | LPIPS | FVD |
|------|------|------|------|------|------|------|------|
| vidtok_kl_causal_488_16chn_v1_1 | KL-16chn | ✔️ | 4x8x8 | 35.13 | 0.941 | 0.049 | 87.4 |
| vidtok_kl_causal_41616_16chn_v1_1 | KL-16chn | ✔️ | 4x16x16 | 29.61 | 0.854 | 0.113 | 162.7 |
| vidtok_kl_causal_288_8chn_v1_1 | KL-8chn | ✔️ | 2x8x8 | 34.59 | 0.935 | 0.051 | 78.2 |
| vidtok_fsq_causal_488_32768_v1_1 | FSQ-32,768 | ✔️ | 4x8x8 | 29.39 | 0.856 | 0.114 | 168.5 |
| vidtok_fsq_causal_888_32768_v1_1 | FSQ-32,768 | ✔️ | 8x8x8 | 27.95 | 0.817 | 0.142 | 293.2 |

  • These are the evaluation results for long-video reconstruction, computed on each complete video in the MCL-JCV dataset with a sample fps of 30 and a resolution of 256x256.
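Of the reported metrics, PSNR is simple enough to sketch directly; SSIM, LPIPS, and FVD require dedicated implementations:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays scaled to [0, max_val]."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01, i.e. ~20 dB.
print(psnr(np.zeros((8, 8)), np.full((8, 8), 0.1)))
```

Higher PSNR, higher SSIM, lower LPIPS, and lower FVD all indicate better reconstruction quality, which is how the tables in this README should be read.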

🔧 Setup

  1. Clone this repository and navigate to the VidTok folder:

```shell
git clone https://github.com/microsoft/VidTok
cd VidTok
```

  2. We provide an environment.yaml file for setting up a Conda environment. Conda's installation instructions are available here.

```shell
# 1. Prepare conda environment
conda env create -f environment.yaml
# 2. Activate the environment
conda activate vidtok
```

We recommend using one or more high-end GPUs for training and inference. All testing and development was done on A100 and MI300X GPUs. For convenience, we also provide prebuilt Docker images with the required dependencies, which can be used as follows:

```shell
# NVIDIA GPUs
docker run -it --gpus all --shm-size 256G --rm -v `pwd`:/workspace --workdir /workspace \
    deeptimhe/ubuntu22.04-cuda12.1-python3.10-pytorch2.5:orig-vidtok bash
# AMD GPUs
docker run -it --gpus all --shm-size 256G --rm -v `pwd`:/workspace --workdir /workspace \
    deeptimhe/ubuntu22.04-rocm6.2.4-python3.10-pytorch2.5:orig-vidtok bash
```

🎈 Checkpoints

Download the pre-trained models here and put them in the `checkpoints` folder, like:

```
└── checkpoints
    ├── vidtok_v1_1
    │   ├── vidtok_kl_causal_488_16chn_v1_1.ckpt
    │   └── ...
    ├── vidtok_fsq_causal_41616_262144.ckpt
    ├── vidtok_fsq_causal_488_262144.ckpt
    ├── vidtok_fsq_causal_488_32768.ckpt
    ├── vidtok_fsq_causal_488_4096.ckpt
    ├── vidtok_fsq_noncausal_41616_262144.ckpt
    ├── vidtok_fsq_noncausal_488_262144.ckpt
    ├── vidtok_kl_causal_288_8chn.ckpt
    ├── vidtok_kl_causal_41616_4chn.ckpt
    ├── vidtok_kl_causal_444_4chn.ckpt
    ├── vidtok_kl_causal_488_16chn.ckpt
    ├── vidtok_kl_causal_488_4chn.ckpt
    ├── vidtok_kl_causal_488_8chn.ckpt
    ├── vidtok_kl_noncausal_41616_16chn.ckpt
    ├── vidtok_kl_noncausal_41616_4chn.ckpt
    ├── vidtok_kl_noncausal_488_16chn.ckpt
    └── vidtok_kl_noncausal_488_4chn.ckpt
```

Each checkpoint has a corresponding config file with the same name in configs folder.

🔆 Performance

| Model | Regularizer | Causal | VCR | PSNR | SSIM | LPIPS | FVD |
|------|------|------|------|------|------|------|------|
| vidtok_kl_causal_488_4chn | KL-4chn | ✔️ | 4x8x8 | 29.64 | 0.852 | 0.114 | 194.2 |
| vidtok_kl_causal_488_8chn | KL-8chn | ✔️ | 4x8x8 | 31.83 | 0.897 | 0.083 | 109.3 |
| vidtok_kl_causal_488_16chn | KL-16chn | ✔️ | 4x8x8 | 35.04 | 0.942 | 0.047 | 78.9 |
| vidtok_kl_causal_288_8chn | KL-8chn | ✔️ | 2x8x8 | 33.86 | 0.928 | 0.057 | 80.7 |
| vidtok_kl_causal_444_4chn | KL-4chn | ✔️ | 4x4x4 | 34.78 | 0.941 | 0.051 | 87.2 |
| vidtok_kl_causal_41616_4chn | KL-4chn | ✔️ | 4x16x16 | 25.05 | 0.711 | 0.228 | 549.1 |
| vidtok_kl_noncausal_488_4chn | KL-4chn | ✖️ | 4x8x8 | 30.60 | 0.876 | 0.098 | 157.9 |
| vidtok_kl_noncausal_488_16chn | KL-16chn | ✖️ | 4x8x8 | 36.13 | 0.950 | 0.044 | 60.5 |
| vidtok_kl_noncausal_41616_4chn | KL-4chn | ✖️ | 4x16x16 | 26.06 | 0.751 | 0.190 | 423.2 |
| vidtok_kl_noncausal_41616_16chn | KL-16chn | ✖️ | 4x16x16 | … | … | … | … |
