VideoScene
[CVPR 2025 Highlight] VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
<a><img src='https://img.shields.io/badge/License-MIT-blue'></a>
<a href='https://mp.weixin.qq.com/s/u6OUo5mHKPG6I3yYJPMC8Q'><img src='https://img.shields.io/badge/%E5%BE%AE%E4%BF%A1-%E4%B8%AD%E6%96%87%E4%BB%8B%E7%BB%8D-green'></a>
https://github.com/user-attachments/assets/dca733b1-b78f-49ac-ae47-5d1b1e8a689b
Building on ReconX, VideoScene takes a turbo-style leap forward, distilling the video diffusion process into a single generation step.
Installation
To get started, clone this project, create a conda virtual environment using Python 3.10+, and install the requirements:
- Clone VideoScene.
git clone https://github.com/hanyang-21/VideoScene
cd VideoScene
- Create the environment; here we show an example using conda.
conda create -y -n videoscene python=3.10
conda activate videoscene
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
- Optionally, compile the CUDA kernels for RoPE (as in CroCo v2).
# NoPoSplat relies on RoPE positional embeddings for which you can compile some cuda kernels for faster runtime.
cd src/model/encoder/backbone/croco/curope/
python setup.py build_ext --inplace
cd ../../../../../..
Acquiring Datasets
RealEstate10K and ACID
Our VideoScene uses the same training datasets as pixelSplat. Below we quote pixelSplat's detailed instructions on getting datasets.
pixelSplat was trained using versions of the RealEstate10k and ACID datasets that were split into ~100 MB chunks for use on server cluster file systems. Small subsets of the Real Estate 10k and ACID datasets in this format can be found here. To use them, simply unzip them into a newly created `datasets` folder in the project root directory.
If you would like to convert downloaded versions of the Real Estate 10k and ACID datasets to our format, you can use the scripts here. Reach out to us (pixelSplat) if you want the full versions of our processed datasets, which are about 500 GB and 160 GB for Real Estate 10k and ACID respectively.
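As a rough sketch, the unzip step above might look like the following (the archive filenames here are placeholders for whatever subsets you actually downloaded):

```shell
# Create the datasets folder in the project root, then unzip each
# downloaded subset into it (archive names below are examples only).
mkdir -p datasets
for f in re10k_subset.zip acid_subset.zip; do
  if [ -f "$f" ]; then
    unzip -q "$f" -d datasets/
  fi
done
```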
Downloading checkpoints
- Download our pretrained weights (`VideoScene/checkpoints/model.safetensors` and `VideoScene/checkpoints/prompt_embeds.pt`), and save them to `checkpoints`.
- For customized image inputs, get the NoPoSplat pretrained models, and save them to `checkpoints/noposplat`.
- For RealEstate10K datasets, get the MVSplat pretrained models, and save them to `checkpoints/mvsplat`.
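Put together, the expected checkpoint layout can be prepared like this (the exact weight filenames inside `noposplat`/`mvsplat` depend on which pretrained models you download):

```shell
# Create the checkpoint directories; the downloaded weights go here:
#   checkpoints/model.safetensors   (VideoScene weights)
#   checkpoints/prompt_embeds.pt    (VideoScene prompt embeddings)
#   checkpoints/noposplat/...       (NoPoSplat pretrained models)
#   checkpoints/mvsplat/...         (MVSplat pretrained models)
mkdir -p checkpoints/noposplat checkpoints/mvsplat
```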
Running the Code
Gradio Demo
In this demo, you can run VideoScene on your machine to generate a video with unposed input views.
- select an image pair that depicts the same scene and hit "RUN" for a video of the scene.
python -m noposplat.src.app \
checkpointing.load=checkpoints/noposplat/mixRe10kDl3dv_512x512.ckpt \
test.video=checkpoints/model.safetensors
# also "bash demo.sh"
- the generated video will be stored under `outputs/gradio`
Inference
To generate videos on the RealEstate10K dataset, we use an MVSplat pretrained model.
- run the following:
# re10k
python -m mvsplat.src.main +experiment=re10k \
checkpointing.load=checkpoints/mvsplat/re10k.ckpt \
mode=test \
dataset/view_sampler=evaluation \
dataset.view_sampler.index_path=mvsplat/assets/evaluation_index_re10k_video.json \
test.save_video=true \
test.save_image=false \
test.compute_scores=false \
test.video=checkpoints/model.safetensors
# also "bash inference.sh"
- the generated video will be stored under `outputs/test`
BibTeX
@misc{wang2025videoscenedistillingvideodiffusion,
title={VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step},
author={Hanyang Wang and Fangfu Liu and Jiawei Chi and Yueqi Duan},
year={2025},
eprint={2504.01956},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2504.01956},
}
Acknowledgements
This project builds on several fantastic repos: ReconX, MVSplat, NoPoSplat, CogVideo, and CogvideX-Interpolation. Many thanks to these projects for their excellent contributions!
