
<div align="center">

VidTok <br> <sub>A Family of Versatile and State-Of-The-Art Video Tokenizers</sub>

arXiv · GitHub · HuggingFace

</div>


We introduce VidTok, a cutting-edge family of video tokenizers that excels in both continuous and discrete tokenizations. VidTok incorporates several key advancements over existing approaches:

  • ⚡️ Efficient Architecture. Separate spatial and temporal sampling reduces computational complexity without sacrificing quality.
  • 🔥 Advanced Quantization. Finite Scalar Quantization (FSQ) addresses training instability and codebook collapse in discrete tokenization.
  • 💥 Enhanced Training. A two-stage strategy—pre-training on low-res videos and fine-tuning on high-res—boosts efficiency. Reduced frame rates improve motion dynamics representation.

VidTok, trained on a large-scale video dataset, outperforms previous models across all metrics, including PSNR, SSIM, LPIPS, and FVD.
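The FSQ mentioned above replaces a learned codebook with per-channel scalar rounding, which is what sidesteps codebook collapse. A minimal sketch of the idea (not VidTok's implementation; the level choice `[8, 8, 8, 8, 8]` is an assumption picked to match the FSQ-32,768 models, since 8^5 = 32,768, and the straight-through gradient used in training is omitted):

```python
import numpy as np

def fsq_quantize(z, levels, eps=1e-3):
    """Round each latent channel to one of `levels[i]` fixed scalar values.

    Follows the FSQ recipe: bound each channel with tanh (with a half-step
    offset when the level count is even), then round to the nearest level.
    """
    L = np.asarray(levels, dtype=np.float64)
    half_l = (L - 1) * (1 - eps) / 2
    offset = np.where(L % 2 == 0, 0.5, 0.0)
    shift = np.arctanh(offset / half_l)
    bounded = np.tanh(z + shift) * half_l - offset
    return np.round(bounded)

def fsq_index(zq, levels):
    """Map a quantized vector to its integer index in the implicit codebook."""
    L = np.asarray(levels, dtype=int)
    digits = (zq + L // 2).astype(int)                 # shift each channel to 0..L-1
    bases = np.cumprod(np.concatenate(([1], L[:-1])))  # mixed-radix place values
    return int(np.dot(digits, bases))

levels = [8, 8, 8, 8, 8]          # implicit codebook of 8**5 = 32,768 entries
z = np.random.randn(len(levels))
idx = fsq_index(fsq_quantize(z, levels), levels)  # integer in [0, 32768)
```

Because the codebook is the Cartesian product of a few scalar grids, every code is used by construction and no codebook-commitment losses are needed.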

https://github.com/user-attachments/assets/a3341037-130d-4a83-aba6-c3daeaf66932

🔥 News

  • August, 2025: 🚀 Introduced spatial tiling for large resolutions (>256), reducing GPU memory usage to ~6 GB when encoding and decoding a 17 × 768 × 768 video.
  • March, 2025: 🚀 VidTwin has been accepted by CVPR 2025, and the checkpoint was released!
  • March, 2025: 🚀 VidTok v1.1 was released! We fine-tuned all causal models on long videos to support tokenization and reconstruction of videos of arbitrary length with fine temporal smoothness. The relevant checkpoints are being updated continuously.
  • December, 2024: 🚀 VidTwin was released!
  • December, 2024: 🚀 VidTok was released!
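The 17-frame video in the tiling note above is no accident: causal video tokenizers commonly consume 4k + 1 frames and emit k + 1 latent frames at 4x temporal compression, and both the 17-frame and 129-frame examples in this README fit that pattern. A small sketch of the latent-grid arithmetic; the 4k+1 → k+1 causal mapping and the 4-channel default are assumptions for illustration, not taken from the VidTok code:

```python
def latent_shape(frames, height, width, vcr=(4, 8, 8), channels=4, causal=True):
    """Latent grid for a tokenizer with temporal x spatial compression `vcr`.

    Assumes causal models follow the common 4k+1 -> k+1 frame mapping, so the
    first frame gets its own latent; non-causal models divide evenly.
    """
    t, h, w = vcr
    lat_t = (frames - 1) // t + 1 if causal else frames // t
    return (channels, lat_t, height // h, width // w)

print(latent_shape(17, 768, 768))    # (4, 5, 96, 96)
print(latent_shape(129, 256, 256))   # (4, 33, 32, 32)
```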

💥 Updates in VidTok v1.1

VidTok v1.1 is an update for causal models. We fine-tuned all causal models on long videos to support tokenization and reconstruction of videos of arbitrary length with fine temporal smoothness. See performance here.
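The `--chunk_size` option used by the inference scripts below can be pictured as slicing the time axis and encoding each slice independently. A toy sketch of that mechanism; `toy_encode` is a stand-in for illustration, not VidTok's encoder, and the real causal model additionally carries state across chunk boundaries (and can overlap chunks) to preserve temporal smoothness:

```python
import numpy as np

def encode_in_chunks(video, encode_fn, chunk_size=16):
    """Encode an arbitrarily long video chunk-by-chunk along the time axis.

    video: (T, H, W, C); encode_fn maps a chunk of frames to its latents.
    Memory scales with chunk_size instead of the full video length.
    """
    T = video.shape[0]
    parts = [encode_fn(video[t0:t0 + chunk_size]) for t0 in range(0, T, chunk_size)]
    return np.concatenate(parts, axis=0)

# Toy "encoder": average-pool groups of 4 frames (4x temporal compression).
def toy_encode(chunk):
    T = chunk.shape[0] // 4 * 4
    return chunk[:T].reshape(-1, 4, *chunk.shape[1:]).mean(axis=1)

video = np.random.rand(128, 32, 32, 3).astype(np.float32)
latents = encode_in_chunks(video, toy_encode, chunk_size=16)  # shape (32, 32, 32, 3)
```

A larger `chunk_size` trades GPU memory for fewer chunk boundaries, which is why the scripts below let you tune it.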

v1.1: Long Video Reconstruction

Run the following inference script to reconstruct an input video:

```shell
python scripts/inference_reconstruct.py --config CONFIG_v1_1 --ckpt CKPT_v1_1 --input_video_path VIDEO_PATH --input_height 256 --input_width 256 --sample_fps 30 --chunk_size CHUNK_SIZE --output_video_dir OUTPUT_DIR --read_long_video
# Set `CHUNK_SIZE` according to your GPU memory; 16 is recommended.
```

and run the following inference script to evaluate the reconstruction performance:

```shell
python scripts/inference_evaluate.py --config CONFIG_v1_1 --ckpt CKPT_v1_1 --data_dir DATA_DIR --input_height 256 --input_width 256 --sample_fps 30 --chunk_size CHUNK_SIZE --read_long_video
# Set `CHUNK_SIZE` according to your GPU memory; 16 is recommended.
```

For easy usage of VidTok v1.1 models, refer to this script and make the following revisions:

```python
# Use VidTok v1.1 models
cfg_path = "configs/vidtok_v1_1/vidtok_kl_causal_488_4chn_v1_1.yaml"
ckpt_path = "checkpoints/vidtok_v1_1/vidtok_kl_causal_488_4chn_v1_1.ckpt"

...

model.to('cuda').eval()
# Use tiled inference to reduce GPU memory usage
model.use_tiling = True
model.t_chunk_enc = 16
model.t_chunk_dec = model.t_chunk_enc // model.encoder.time_downsample_factor
model.use_overlap = True
# Random input: a long video with values in [-1, 1]
x_input = (torch.rand(1, 3, 129, 256, 256) * 2 - 1).to('cuda')

...

# The model may pad along time; trim the reconstruction to the input length
if x_recon.shape[2] != x_input.shape[2]:
    x_recon = x_recon[:, :, -x_input.shape[2]:, ...]
```

v1.1: Long Video Fine-tuning

Follow this training guidance to fine-tune the model on your custom long-video data, noting that:

  • Compared to VidTok v1.0, we recommend fine-tuning with longer sequences (for example, setting NUM_FRAMES_1 to 33, 49, or larger).
  • The resolution and sequence length of the training data should be adjusted to fit your GPU memory.

v1.1: Performance

| Model | Regularizer | Causal | VCR | PSNR | SSIM | LPIPS | FVD |
|------|------|------|------|------|------|------|------|
| vidtok_kl_causal_488_16chn_v1_1 | KL-16chn | ✔️ | 4x8x8 | 35.13 | 0.941 | 0.049 | 87.4 |
| vidtok_kl_causal_41616_16chn_v1_1 | KL-16chn | ✔️ | 4x16x16 | 29.61 | 0.854 | 0.113 | 162.7 |
| vidtok_kl_causal_288_8chn_v1_1 | KL-8chn | ✔️ | 2x8x8 | 34.59 | 0.935 | 0.051 | 78.2 |
| vidtok_fsq_causal_488_32768_v1_1 | FSQ-32,768 | ✔️ | 4x8x8 | 29.39 | 0.856 | 0.114 | 168.5 |
| vidtok_fsq_causal_888_32768_v1_1 | FSQ-32,768 | ✔️ | 8x8x8 | 27.95 | 0.817 | 0.142 | 293.2 |

  • These are the evaluation results for long-video reconstruction, computed on each complete video in the MCL-JCV dataset with a sample fps of 30 and a resolution of 256x256.
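Of the reported metrics, PSNR is simple enough to sketch directly; SSIM, LPIPS, and FVD require dedicated implementations:

```python
import numpy as np

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two arrays scaled to [0, max_val]."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    return float("inf") if mse == 0.0 else 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01, i.e. ~20 dB.
print(psnr(np.zeros((8, 8)), np.full((8, 8), 0.1)))
```

Higher PSNR, higher SSIM, lower LPIPS, and lower FVD all indicate better reconstruction quality, which is how the tables in this README should be read.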

🔧 Setup

  1. Clone this repository and navigate to the VidTok folder:

```shell
git clone https://github.com/microsoft/VidTok
cd VidTok
```

  2. We provide an environment.yaml file for setting up a Conda environment. Conda's installation instructions are available here.

```shell
# 1. Prepare conda environment
conda env create -f environment.yaml
# 2. Activate the environment
conda activate vidtok
```

We recommend using one or more high-end GPUs for training and inference. All testing and development was done on A100 and MI300X GPUs. For convenience, we also provide prebuilt Docker images with the required dependencies, which can be used as follows:

```shell
# NVIDIA GPUs
docker run -it --gpus all --shm-size 256G --rm -v `pwd`:/workspace --workdir /workspace \
    deeptimhe/ubuntu22.04-cuda12.1-python3.10-pytorch2.5:orig-vidtok bash
# AMD GPUs
docker run -it --gpus all --shm-size 256G --rm -v `pwd`:/workspace --workdir /workspace \
    deeptimhe/ubuntu22.04-rocm6.2.4-python3.10-pytorch2.5:orig-vidtok bash
```

🎈 Checkpoints

Download the pre-trained models here and put them in the `checkpoints` folder, like:

```
└── checkpoints
    ├── vidtok_v1_1
    │   ├── vidtok_kl_causal_488_16chn_v1_1.ckpt
    │   └── ...
    ├── vidtok_fsq_causal_41616_262144.ckpt
    ├── vidtok_fsq_causal_488_262144.ckpt
    ├── vidtok_fsq_causal_488_32768.ckpt
    ├── vidtok_fsq_causal_488_4096.ckpt
    ├── vidtok_fsq_noncausal_41616_262144.ckpt
    ├── vidtok_fsq_noncausal_488_262144.ckpt
    ├── vidtok_kl_causal_288_8chn.ckpt
    ├── vidtok_kl_causal_41616_4chn.ckpt
    ├── vidtok_kl_causal_444_4chn.ckpt
    ├── vidtok_kl_causal_488_16chn.ckpt
    ├── vidtok_kl_causal_488_4chn.ckpt
    ├── vidtok_kl_causal_488_8chn.ckpt
    ├── vidtok_kl_noncausal_41616_16chn.ckpt
    ├── vidtok_kl_noncausal_41616_4chn.ckpt
    ├── vidtok_kl_noncausal_488_16chn.ckpt
    └── vidtok_kl_noncausal_488_4chn.ckpt
```

Each checkpoint has a corresponding config file with the same name in configs folder.

🔆 Performance

| Model | Regularizer | Causal | VCR | PSNR | SSIM | LPIPS | FVD |
|------|------|------|------|------|------|------|------|
| vidtok_kl_causal_488_4chn | KL-4chn | ✔️ | 4x8x8 | 29.64 | 0.852 | 0.114 | 194.2 |
| vidtok_kl_causal_488_8chn | KL-8chn | ✔️ | 4x8x8 | 31.83 | 0.897 | 0.083 | 109.3 |
| vidtok_kl_causal_488_16chn | KL-16chn | ✔️ | 4x8x8 | 35.04 | 0.942 | 0.047 | 78.9 |
| vidtok_kl_causal_288_8chn | KL-8chn | ✔️ | 2x8x8 | 33.86 | 0.928 | 0.057 | 80.7 |
| vidtok_kl_causal_444_4chn | KL-4chn | ✔️ | 4x4x4 | 34.78 | 0.941 | 0.051 | 87.2 |
| vidtok_kl_causal_41616_4chn | KL-4chn | ✔️ | 4x16x16 | 25.05 | 0.711 | 0.228 | 549.1 |
| vidtok_kl_noncausal_488_4chn | KL-4chn | ✖️ | 4x8x8 | 30.60 | 0.876 | 0.098 | 157.9 |
| vidtok_kl_noncausal_488_16chn | KL-16chn | ✖️ | 4x8x8 | 36.13 | 0.950 | 0.044 | 60.5 |
| vidtok_kl_noncausal_41616_4chn | KL-4chn | ✖️ | 4x16x16 | 26.06 | 0.751 | 0.190 | 423.2 |
| vidtok_kl_noncausal_41616_16chn | KL-16chn | ✖️ | 4x16x16 | … | … | … | … |
