VideoREPA
[NeurIPS 2025] VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
VideoREPA (NeurIPS 2025)
Project Page | Paper
VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models
Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng
NeurIPS 2025
✨ A step towards more reliable world modeling by enhancing physics plausibility in video generation.
| VideoPhy | SA | PC |
|-|-|-|
| CogVideoX-5B | 70.0 | 32.3 |
| +REPA Loss+DINOv2 | 62.5 ⚠️ | 33.7 |
| +REPA Loss+VideoMAEv2 | 59.3 ⚠️ | 35.5 |
| +TRD Loss+VideoMAEv2 (ours) | 72.1 | 40.1 |
📰 News
- 🎉 Sept, 2025: VideoREPA was accepted to NeurIPS 2025.
- 💡 Feb, 2026: Our work DreamWorld, a unified framework that integrates complementary world knowledge into video generators via a Joint World Modeling Paradigm, is available on arXiv.
✅ Project Status
🎉 Accepted to NeurIPS 2025!
- [x] Release introduction & visual results
- [x] Release training & inference code
- [x] Upload checkpoints and provide reproducing tips.
- [x] Release evaluation code.
- [x] Release generated videos of VideoREPA. Please refer to the Google Drive.
If you find VideoREPA useful, please consider giving us a star ⭐.
Introduction
<div align="center"> <div style="display: flex; justify-content: space-between;"> <img src="https://github.com/user-attachments/assets/7e65716b-27cd-45e1-b4df-1f4c1c7c3d33" alt="test" width="35%" /> <img src="https://github.com/user-attachments/assets/1952c95f-5453-42d9-84ec-80f49565a961" alt="test" width="35%" /> </div> </div> <p align="center"> Figure 1. Evaluation of physics understanding on the Physion benchmark. Chance performance is 50%. </p>
🔍 Physics Understanding Gap: We identify an essential gap in physics understanding between self-supervised video foundation models (VFMs) and text-to-video (T2V) models, and propose the first method to bridge video understanding models and T2V models. VideoREPA demonstrates that "understanding helps generation" in the video generation field.
Overview
<div align="center"> <div style="display: flex; justify-content: space-between;"> <img src="https://github.com/user-attachments/assets/4a55f50c-cc02-4467-8b84-4f83ed37869e" alt="test" width="70%" /> </div> </div> <p align="center"> Figure 2. Overview of VideoREPA. </p>
VideoREPA enhances physics plausibility in T2V models through Token Relation Distillation (TRD), a loss that aligns pairwise token relations between self-supervised video encoders and diffusion transformer features.
Each token learns two kinds of relations:
- Spatial relations within a frame
- Temporal relations across frames
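As a rough illustration of what aligning pairwise token relations can look like, the sketch below builds a cosine-similarity matrix over all tokens of a clip and penalizes the gap between the diffusion-side and encoder-side matrices. It is not the repository's implementation: the function names, the `margin` parameter, and the use of NumPy are assumptions made for illustration, and steps such as projection heads or resampling features so both models yield the same number of tokens are omitted.

```python
import numpy as np


def token_relation_matrix(tokens: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of tokens in a clip.

    tokens has shape (frames, tokens_per_frame, dim). Flattening frames and
    tokens together means the result holds both within-frame (spatial) and
    cross-frame (temporal) similarities.
    """
    flat = tokens.reshape(-1, tokens.shape[-1])
    flat = flat / (np.linalg.norm(flat, axis=-1, keepdims=True) + 1e-8)
    return flat @ flat.T


def trd_loss_sketch(student: np.ndarray, teacher: np.ndarray, margin: float = 0.0) -> float:
    """Mean absolute gap between two relation matrices (hypothetical form).

    `margin` is an illustrative knob: gaps smaller than it are ignored.
    The feature dimensions may differ, because only the token-by-token
    relation matrices are compared.
    """
    gap = np.abs(token_relation_matrix(student) - token_relation_matrix(teacher))
    return float(np.maximum(gap - margin, 0.0).mean())


# Toy features: 2 frames, 3 tokens per frame, different channel widths.
rng = np.random.default_rng(0)
encoder_feats = rng.normal(size=(2, 3, 8))    # stands in for the video encoder
diffusion_feats = rng.normal(size=(2, 3, 5))  # stands in for the diffusion transformer
print(trd_loss_sketch(diffusion_feats, encoder_feats))
```

Comparing relation matrices rather than raw features is what lets the two models keep different feature widths in this sketch.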
🌟 Novelty: VideoREPA is the first successful adaptation of REPA into video generation — overcoming key challenges in finetuning large pretrained video diffusion transformers and maintaining temporal consistency.
Qualitative Results
<table align="center" style="width: 100%;"> <tr> <th align="center" style="width: 25%;">CogVideoX</th> <th align="center" style="width: 25%;">CogVideoX+REPA loss</th> <th align="center" style="width: 25%;">VideoREPA</th> <th align="center" style="width: 25%;">Prompt</th> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/b0f6b65d-3b0b-4665-88a9-8fc81a23c613" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/00276b57-e3ea-4f30-b0a7-6522f4dedd31" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/92b51720-f4d8-4867-8c1c-3fb88e2f5e67" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> Leather glove catching a hard baseball. </td> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/a199b5ab-0829-41de-ab72-2e17ac66f069" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/e5d61296-0b9b-4567-aa27-b986234ce870" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/5acfd53c-83f7-4e18-bb37-b5eec8dcf226" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> Maple syrup drizzling from a bottle onto pancakes. 
</td> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/d93283f9-9dff-41b0-8837-d93d06d06356" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/2be513e6-8a7f-4199-bb1e-c411fcda14ac" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/25dca90d-d91b-4ffe-8fc9-3c340c816d95" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> Glass shatters on the floor. </td> </tr> <tr> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/3095d887-4cfb-4726-8152-56d6aa72de40" controls autoplay loop muted></video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/2b96fcde-f371-400f-9459-ca223d237c73" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> <video width="25%" controls src="https://github.com/user-attachments/assets/89299f65-fb1e-4013-969f-bc1c7e715523" controls autoplay loop muted>Your browser does not support the video tag.</video> </td> <td align="center" style="width: 25%;"> A child runs and catches a brightly colored frisbee... </td> </tr> </table>
⚙️ Quick start
Environment setup
git clone https://github.com/aHapBean/VideoREPA.git
conda create --name videorepa python=3.10
conda activate videorepa
cd VideoREPA
pip install -r requirements.txt
# Install diffusers locally (recommended)
cd ./finetune/diffusers
pip install -e .
Dataset download
Download the OpenVid dataset used in VideoREPA. We use parts 30–49 and select two subsets from them, containing 32K and 64K videos. The corresponding CSV files are located in ./finetune/openvid/.
pip install -U huggingface_hub
# Download parts 30–49
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part3[0-9].zip"
huggingface-cli download --repo-type dataset nkp37/OpenVid-1M \
--local-dir ./finetune/openvid \
--include "OpenVid_part4[0-9].zip"
Then unzip into ./finetune/openvid/videos/.
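The unzip step can be scripted. The following is a minimal sketch using only the Python standard library; the helper name `extract_parts` is hypothetical, and it assumes the downloaded archives sit directly in ./finetune/openvid/ and can be extracted as-is. The internal folder layout of the archives is not described here, so check the result against the paths listed in the CSV files.

```python
import zipfile
from pathlib import Path


def extract_parts(zip_dir, out_dir):
    """Extract every OpenVid_part*.zip found in zip_dir into out_dir.

    Returns the archive names that were processed, in sorted order.
    """
    zip_dir, out_dir = Path(zip_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for archive in sorted(zip_dir.glob("OpenVid_part*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(out_dir)
        processed.append(archive.name)
    return processed


# Example call, run from the repository root:
# extract_parts("./finetune/openvid", "./finetune/openvid/videos")
```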
Training
# Download pretrained CogVideoX checkpoints
huggingface-cli download --repo-type model zai-org/CogVideoX-2b --local-dir ./ckpt/cogvideox-2b
huggingface-cli download --repo-type model zai-org/CogVideoX-5b --local-dir ./ckpt/cogvideox-5b
# Download a pretrained vision encoder (e.g., VideoMAEv2 or VJEPA) and place it under ./ckpt/, for example ./ckpt/VideoMAEv2/vit_b_k710_dl_from_giant.pth
# Precompute video cache (shared for 2B/5B)
cd finetune/
bash scripts/dataset_precomputing.sh
# Training (adjust GPU count in scripts)
bash scripts/multigpu_VideoREPA_2B_sft.sh
bash scripts/multigpu_VideoREPA_5B_lora.sh
Inference
Inference with VideoREPA
# Convert the checkpoint to diffusers format (SFT only)
# Copy scripts/merge.sh into the saved checkpoint-xxx/ directory and run:
bash merge.sh
# Then copy ./ckpt/cogvideox-2b/ to ./ckpt/cogvideox-2b-infer/
# Delete the original transformer dir in cogvideox-2b-infer/
# Move the converted transformer dir into it
# Edit model_index.json in cogvideox-2b-infer/ so that the transformer entry reads:
# "transformer": [
# "models.cogvideox_align",
# "CogVideoXTransformer3DModelAlign"
# ],
# Inference
cd inference/
bash scripts/infer_videorepa_2b_sft.sh
# bash scripts/infer_videorepa_5b_lora.sh
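The manual copy, delete, move, and edit steps for the SFT checkpoint can also be scripted. The sketch below is one possible way to do it, not a script shipped with the repository: the helper name `prepare_infer_dir` and its arguments are hypothetical, and it assumes the pipeline index is the `model_index.json` file that diffusers writes at the root of a pipeline directory.

```python
import json
import shutil
from pathlib import Path


def prepare_infer_dir(base_dir, merged_transformer_dir, infer_dir):
    """Build an inference directory from a base pipeline and a converted transformer."""
    base_dir = Path(base_dir)
    merged_transformer_dir = Path(merged_transformer_dir)
    infer_dir = Path(infer_dir)

    # 1. Copy the base pipeline (for example ckpt/cogvideox-2b) to a new directory.
    shutil.copytree(base_dir, infer_dir)

    # 2. Drop the original transformer and move the converted one in its place.
    shutil.rmtree(infer_dir / "transformer")
    shutil.move(str(merged_transformer_dir), str(infer_dir / "transformer"))

    # 3. Point the pipeline index at the aligned transformer class.
    index_path = infer_dir / "model_index.json"  # assumed file name
    index = json.loads(index_path.read_text())
    index["transformer"] = ["models.cogvideox_align", "CogVideoXTransformer3DModelAlign"]
    index_path.write_text(json.dumps(index, indent=2))
    return index


# Example call; keep checkpoint-xxx as whatever directory your run produced:
# prepare_infer_dir("ckpt/cogvideox-2b", "<path to converted transformer dir>", "ckpt/cogvideox-2b-infer")
```

The base directory is left untouched, so the original CogVideoX-2b checkpoint remains usable for further training runs.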
Or run inference directly with our released checkpoints. Download the weights from Hugging Face, then:
- For VideoREPA-5B, place pytorch_lora_weights.safetensors in ./inference/
- For VideoREPA-2B, place the transformer directory inside ./ckpt/cogvideox-2b-infer/
huggingface-cli download --repo-type model aHapBean/VideoREPA --local-dir ./
Reproducing tips
The following tips should make it easier to reproduce our results.
All experiments in our paper use seed = 42 by default. However, both video generation and VideoPhy evaluation involve randomness, so results may not match exactly across devices (e.g., different GPUs) even with the same seed.
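For reference, fixing the seed in a Python entry point usually looks like the generic helper below. This is a sketch, not the seeding code used by the repository's scripts: `seed_everything` is a hypothetical name, and NumPy and PyTorch are seeded only if they are installed. It makes runs repeatable on one machine, but does not remove the device-level nondeterminism described above.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the common random number generators with the same value."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(42)
```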
To reproduce demo videos, simply download the released VideoREPA checkpoints and run inference — similar videos c
