VideoREPA (NeurIPS 2025)

Project Page | Paper

VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
NeurIPS 2025

A step towards more reliable world modeling by enhancing physics plausibility in video generation.

| VideoPhy benchmark | SA | PC |
| - | - | - |
| CogVideoX-5B | 70.0 | 32.3 |
| +REPA Loss+DINOv2 | 62.5 ⚠️ | 33.7 |
| +REPA Loss+VideoMAEv2 | 59.3 ⚠️ | 35.5 |
| +TRD Loss+VideoMAEv2 (ours) | 72.1 | 40.1 |

SA = semantic adherence, PC = physical commonsense, both measured on the VideoPhy benchmark.

📰 News

  • 🎉 Sept 2025: VideoREPA is accepted by NeurIPS 2025.
  • 💡 Feb 2026: Our work DreamWorld is available on arXiv, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm.

✅ Project Status

🎉 Accepted to NeurIPS 2025!

  • [x] Release introduction & visual results.
  • [x] Release training & inference code.
  • [x] Upload checkpoints and provide reproducing tips.
  • [x] Release evaluation code.
  • [x] Release generated videos of VideoREPA. Please refer to the Google Drive.

If you find VideoREPA useful, please consider giving us a star ⭐.

Introduction

<div align="center"> <div style="display: flex; justify-content: space-between;"> <img src="https://github.com/user-attachments/assets/7e65716b-27cd-45e1-b4df-1f4c1c7c3d33" alt="test" width="35%" /> <img src="https://github.com/user-attachments/assets/1952c95f-5453-42d9-84ec-80f49565a961" alt="test" width="35%" /> </div> </div> <p align="center"> Figure 1. Evaluation of physics understanding on the Physion benchmark. Chance performance is 50%. </p>

🔍 Physics Understanding Gap: We identify an essential gap in physics understanding between self-supervised video foundation models (VFMs) and T2V models, and propose the first method to bridge video understanding models and T2V models. VideoREPA demonstrates that “understanding helps generation” also holds in the video generation field.

Overview

<div align="center"> <div style="display: flex; justify-content: space-between;"> <img src="https://github.com/user-attachments/assets/4a55f50c-cc02-4467-8b84-4f83ed37869e" alt="test" width="70%" /> </div> </div> <p align="center"> Figure 2. Overview of VideoREPA. </p>

VideoREPA enhances physics plausibility in T2V models through Token Relation Distillation (TRD) — a loss that aligns pairwise token relations between self-supervised video encoders and diffusion transformer features.

For each token, the TRD loss aligns two kinds of relations:

  • Spatial relations within a frame
  • Temporal relations across frames

🌟 Novelty: VideoREPA is the first successful adaptation of REPA into video generation — overcoming key challenges in finetuning large pretrained video diffusion transformers and maintaining temporal consistency.
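
To make the idea concrete, here is a minimal PyTorch sketch of a TRD-style loss as described above: pairwise cosine-similarity ("relation") matrices are computed over spatiotemporal tokens from both models, and the student's matrix is regressed toward the frozen teacher's. Function and variable names are illustrative; the repository's training code is the authoritative implementation, including the projection that matches DiT tokens to the encoder's token grid and the exact distance used.

import torch
import torch.nn.functional as F

def trd_loss(dit_feats: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
    """Token Relation Distillation (illustrative sketch).

    dit_feats: (B, N, C1) diffusion-transformer tokens, assumed already
               projected to the same token grid as the video encoder.
    vfm_feats: (B, N, C2) frozen video-encoder tokens (e.g., VideoMAEv2).
    N flattens space and time, so the N x N relation matrix covers both
    spatial (within-frame) and temporal (cross-frame) token pairs.
    """
    # Pairwise cosine-similarity "relation" matrices, shape (B, N, N)
    dit_rel = F.cosine_similarity(dit_feats.unsqueeze(2), dit_feats.unsqueeze(1), dim=-1)
    vfm_rel = F.cosine_similarity(vfm_feats.unsqueeze(2), vfm_feats.unsqueeze(1), dim=-1)
    # Align the student's relations to the detached teacher's relations
    return F.smooth_l1_loss(dit_rel, vfm_rel.detach())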

Qualitative Results

<table align="center" style="width: 100%;"> <tr> <th align="center" style="width: 25%;">CogVideoX</th> <th align="center" style="width: 25%;">CogVideoX+REPA loss</th> <th align="center" style="width: 25%;">VideoREPA</th> <th align="center" style="width: 25%;">Prompt</th> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/b0f6b65d-3b0b-4665-88a9-8fc81a23c613" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/00276b57-e3ea-4f30-b0a7-6522f4dedd31" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/92b51720-f4d8-4867-8c1c-3fb88e2f5e67" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> Leather glove catching a hard baseball. </td> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/a199b5ab-0829-41de-ab72-2e17ac66f069" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/e5d61296-0b9b-4567-aa27-b986234ce870" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/5acfd53c-83f7-4e18-bb37-b5eec8dcf226" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> Maple syrup drizzling from a bottle onto pancakes. </td> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/d93283f9-9dff-41b0-8837-d93d06d06356" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/2be513e6-8a7f-4199-bb1e-c411fcda14ac" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/25dca90d-d91b-4ffe-8fc9-3c340c816d95" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> Glass shatters on the floor. </td> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/3095d887-4cfb-4726-8152-56d6aa72de40" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/2b96fcde-f371-400f-9459-ca223d237c73" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/89299f65-fb1e-4013-969f-bc1c7e715523" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> A child runs and catches a brightly colored frisbee... </td> </tr> </table>

⚙️ Quick start

Environment setup

git clone https://github.com/aHapBean/VideoREPA.git

conda create --name videorepa python=3.10
conda activate videorepa

cd VideoREPA
pip install -r requirements.txt

# Install diffusers locally (recommended)
cd ./finetune/diffusers
pip install -e .

Dataset download

Download the OpenVid dataset used in VideoREPA. We use parts 30–49 and select two subsets containing 32K and 64K videos, respectively; the corresponding CSV files are located in ./finetune/openvid/.

pip install -U huggingface_hub

# Download parts 30–49
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part3[0-9].zip"

huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part4[0-9].zip"

Then unzip into ./finetune/openvid/videos/.
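
If convenient, a small Python helper like the following (hypothetical, not part of the repository) extracts every downloaded part into the expected directory:

import glob
import os
import zipfile

dst = "./finetune/openvid/videos/"
os.makedirs(dst, exist_ok=True)
# Extract each downloaded OpenVid part archive into videos/
for part in sorted(glob.glob("./finetune/openvid/OpenVid_part*.zip")):
    with zipfile.ZipFile(part) as zf:
        zf.extractall(dst)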

Training

# Download pretrained CogVideoX checkpoints
huggingface-cli download --repo-type model zai-org/CogVideoX-2b --local-dir ./ckpt/cogvideox-2b
huggingface-cli download --repo-type model zai-org/CogVideoX-5b --local-dir ./ckpt/cogvideox-5b

# Download a pretrained vision encoder (e.g., VideoMAEv2 or V-JEPA) and put it into ./ckpt/,
# e.g., ./ckpt/VideoMAEv2/vit_b_k710_dl_from_giant.pth

# Precompute video cache (shared for 2B/5B)
cd finetune/
bash scripts/dataset_precomputing.sh

# Training (adjust GPU count in scripts)
bash scripts/multigpu_VideoREPA_2B_sft.sh
bash scripts/multigpu_VideoREPA_5B_lora.sh

Inference

Run inference with your trained VideoREPA checkpoints:

# Convert the checkpoint to diffusers format (needed for SFT only).
# Put scripts/merge.sh into the saved checkpoint-xxx/ directory and run:
bash merge.sh

# Then copy cogvideox-2b/ from ckpt/ to cogvideox-2b-infer/
# Delete the original transformer dir in cogvideox-2b-infer/
# Move the transformed transformer dir into it

# Modify model_index.json in cogvideox-2b-infer/ so that it reads:
# "transformer": [
#   "models.cogvideox_align",
#   "CogVideoXTransformer3DModelAlign"
# ],

# Inference
cd inference/
bash scripts/infer_videorepa_2b_sft.sh
# bash scripts/infer_videorepa_5b_lora.sh
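
The model_index edit above can also be scripted; a hypothetical Python helper (path assumed) would be:

import json

path = "./ckpt/cogvideox-2b-infer/model_index.json"
with open(path) as f:
    index = json.load(f)
# Point the pipeline at the VideoREPA transformer class
index["transformer"] = ["models.cogvideox_align", "CogVideoXTransformer3DModelAlign"]
with open(path, "w") as f:
    json.dump(index, f, indent=2)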

Or run inference directly with our released checkpoints. Download the weights from Hugging Face and place them as follows:

  • For VideoREPA-5B, place pytorch_lora_weights.safetensors in ./inference/

  • For VideoREPA-2B, place the transformer directory inside ./ckpt/cogvideox-2b-infer/

huggingface-cli download --repo-type model aHapBean/VideoREPA --local-dir ./
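
For the VideoREPA-5B LoRA, loading should look roughly like the following diffusers sketch. The dtype, frame count, and prompt are our assumptions; the scripts under inference/ remain the supported entry point.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the CogVideoX-5B base model, then attach the released VideoREPA LoRA
pipe = CogVideoXPipeline.from_pretrained("./ckpt/cogvideox-5b", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("./inference/", weight_name="pytorch_lora_weights.safetensors")
pipe.to("cuda")

video = pipe("Leather glove catching a hard baseball.", num_frames=49).frames[0]
export_to_video(video, "output.mp4", fps=8)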

Reproducing tips

We provide some guidance to make reproducing our results easier.

All experiments in our paper use seed = 42 by default. However, randomness exists in both video generation and VideoPhy evaluation, so identical results may not be perfectly reproducible across devices (e.g., different GPUs) even with the same seed.
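
A typical seeding helper (our assumption of standard practice, not copied from the repository) looks like this; note that even with every generator seeded, GPU kernel nondeterminism can still shift outputs slightly:

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Seed every RNG that video generation typically touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)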

To reproduce the demo videos, simply download the released VideoREPA checkpoints and run inference; similar videos can be generated.
