VideoScene
[CVPR 2025 Highlight] VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
<a><img src='https://img.shields.io/badge/License-MIT-blue'></a>
<a href='https://mp.weixin.qq.com/s/u6OUo5mHKPG6I3yYJPMC8Q'><img src='https://img.shields.io/badge/%E5%BE%AE%E4%BF%A1-%E4%B8%AD%E6%96%87%E4%BB%8B%E7%BB%8D-green'></a>
https://github.com/user-attachments/assets/dca733b1-b78f-49ac-ae47-5d1b1e8a689b
Building on ReconX, VideoScene takes a turbo-style leap forward, distilling the video diffusion process into a single generation step.
Installation
To get started, clone this project, create a conda virtual environment using Python 3.10+, and install the requirements:
- Clone VideoScene.
git clone https://github.com/hanyang-21/VideoScene
cd VideoScene
- Create the environment; here we show an example using conda.
conda create -y -n videoscene python=3.10
conda activate videoscene
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
- Optionally, compile the CUDA kernels for RoPE (as in CroCo v2).
# NoPoSplat relies on RoPE positional embeddings for which you can compile some cuda kernels for faster runtime.
cd src/model/encoder/backbone/croco/curope/
python setup.py build_ext --inplace
cd ../../../../../..
Acquiring Datasets
RealEstate10K and ACID
Our VideoScene uses the same training datasets as pixelSplat. Below we quote pixelSplat's detailed instructions on getting datasets.
pixelSplat was trained using versions of the RealEstate10k and ACID datasets that were split into ~100 MB chunks for use on server cluster file systems. Small subsets of the Real Estate 10k and ACID datasets in this format can be found here. To use them, simply unzip them into a newly created `datasets` folder in the project root directory.
If you would like to convert downloaded versions of the Real Estate 10k and ACID datasets to our format, you can use the scripts here. Reach out to us (pixelSplat) if you want the full versions of our processed datasets, which are about 500 GB and 160 GB for Real Estate 10k and ACID respectively.
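As a rough sketch, the unzip step above might look like the following (the archive filenames here are placeholders for whatever subsets you actually downloaded):

```shell
# Create the datasets folder in the project root, then unzip each
# downloaded subset into it (archive names below are examples only).
mkdir -p datasets
for f in re10k_subset.zip acid_subset.zip; do
  if [ -f "$f" ]; then
    unzip -q "$f" -d datasets/
  fi
done
```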
Downloading checkpoints
- Download our pretrained weights (`VideoScene/checkpoints/model.safetensors` and `VideoScene/checkpoints/prompt_embeds.pt`), and save them to `checkpoints`.
- For customized image inputs, get the NoPoSplat pretrained models, and save them to `checkpoints/noposplat`.
- For RealEstate10K datasets, get the MVSplat pretrained models, and save them to `checkpoints/mvsplat`.
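Put together, the expected checkpoint layout can be prepared like this (the exact weight filenames inside `noposplat`/`mvsplat` depend on which pretrained models you download):

```shell
# Create the checkpoint directories; the downloaded weights go here:
#   checkpoints/model.safetensors   (VideoScene weights)
#   checkpoints/prompt_embeds.pt    (VideoScene prompt embeddings)
#   checkpoints/noposplat/...       (NoPoSplat pretrained models)
#   checkpoints/mvsplat/...         (MVSplat pretrained models)
mkdir -p checkpoints/noposplat checkpoints/mvsplat
```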
Running the Code
Gradio Demo
In this demo, you can run VideoScene on your machine to generate a video with unposed input views.
- select an image pair that depicts the same scene and hit "RUN" for a video of the scene.
python -m noposplat.src.app \
checkpointing.load=checkpoints/noposplat/mixRe10kDl3dv_512x512.ckpt \
test.video=checkpoints/model.safetensors
# also "bash demo.sh"
- the generated video will be stored under `outputs/gradio`
Inference
To generate videos on the RealEstate10K dataset, we use an MVSplat pretrained model.
- run the following:
# re10k
python -m mvsplat.src.main +experiment=re10k \
checkpointing.load=checkpoints/mvsplat/re10k.ckpt \
mode=test \
dataset/view_sampler=evaluation \
dataset.view_sampler.index_path=mvsplat/assets/evaluation_index_re10k_video.json \
test.save_video=true \
test.save_image=false \
test.compute_scores=false \
test.video=checkpoints/model.safetensors
# also "bash inference.sh"
- the generated video will be stored under `outputs/test`
BibTeX
@misc{wang2025videoscenedistillingvideodiffusion,
title={VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step},
author={Hanyang Wang and Fangfu Liu and Jiawei Chi and Yueqi Duan},
year={2025},
eprint={2504.01956},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.01956},
}
Acknowledgements
This project builds on several fantastic repos: ReconX, MVSplat, NoPoSplat, CogVideo, and CogvideX-Interpolation. Many thanks to these projects for their excellent contributions!
