VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
<a href="https://arxiv.org/pdf/2601.05138"><img src='https://img.shields.io/badge/arXiv-Paper-red?style=flat&logo=arXiv&logoColor=red' alt='arxiv'></a> <a href="https://github.com/TencentARC/VerseCrafter"><img src='https://img.shields.io/badge/GitHub-Code-blue?style=flat&logo=GitHub' alt='github'></a> <a href="https://huggingface.co/TencentARC/VerseCrafter"><img src='https://img.shields.io/badge/Hugging Face-ckpts-orange?style=flat&logo=HuggingFace&logoColor=orange' alt='huggingface'></a> <a href="https://sixiaozheng.github.io/VerseCrafter_page/"><img src='https://img.shields.io/badge/Project-Page-Green' alt='GitHub'></a>
<p align="center"> <a href="https://sixiaozheng.github.io/">Sixiao Zheng</a><sup>1,2</sup> <a href="#">Minghao Yin</a><sup>3</sup> <a href="https://wbhu.github.io/">Wenbo Hu</a><sup>4†</sup> <a href="https://xiaoyu258.github.io/">Xiaoyu Li</a><sup>4</sup> <a href="https://www.linkedin.com/in/YingShanProfile">Ying Shan</a><sup>4</sup> <a href="https://yanweifu.github.io/">Yanwei Fu</a><sup>1,2†</sup> </p>
<p align="center"> <sup>1</sup>Fudan University <sup>2</sup>Shanghai Innovation Institute <sup>3</sup>HKU <sup>4</sup>ARC Lab, Tencent PCG </p>
<p align="center"> <sup>†</sup>Corresponding authors </p>
<p align="center"> <b>CVPR 2026</b> </p>

✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.
🔥 News
- [Feb 21, 2026] 🎉 VerseCrafter is accepted to CVPR 2026!
- [Jan 9, 2026] 🚀 VerseCrafter is released! We publish the arXiv preprint, inference code, and model checkpoints.
✅ TODO
- [x] Inference code
- [ ] Training code
- [ ] Data processing code
TL;DR
- Dynamic Realistic Video World Model: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
- 4D Geometric Control: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
- Frozen Video Prior + GeoAdapter: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
- VerseControl4D Dataset: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.
Installation
1. Clone the repository:

   ```bash
   git clone --recursive https://github.com/TencentARC/VerseCrafter.git
   # If you have already cloned the repo, you can update the submodules manually:
   git submodule update --init --recursive
   cd VerseCrafter
   ```

2. Create and activate the Conda environment:

   ```bash
   conda create -n versecrafter python=3.11 -y
   conda activate versecrafter

   # Install PyTorch
   conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia -y

   # Install Python dependencies
   pip install -r requirements.txt

   # Install MoGe
   pip install git+https://github.com/microsoft/MoGe.git

   # Install Grounded-SAM-2
   cd third_party/Grounded-SAM-2
   pip install -e .
   pip install --no-build-isolation -e grounding_dino

   # Install flash attention
   pip install flash-attn --no-build-isolation

   # Install pytorch3d
   cd ../../
   git clone https://github.com/facebookresearch/pytorch3d.git
   cd pytorch3d
   pip install --no-build-isolation .
   cd ../VerseCrafter
   ```
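After installation, a quick sanity check can confirm that the GPU stack and the optional extensions installed above are importable. This is a minimal sketch, not part of the repo; the `moge` module name is our assumption based on the MoGe package.

```python
# sanity_check.py -- minimal post-install check (not part of the repo)
import importlib

import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"Device: {torch.cuda.get_device_name(0)}")

# Optional extras installed above; report which ones import cleanly.
# Module names (notably "moge") are assumptions, adjust if your install differs.
for name in ("torchvision", "pytorch3d", "flash_attn", "moge"):
    try:
        mod = importlib.import_module(name)
        print(f"{name}: OK ({getattr(mod, '__version__', 'unknown version')})")
    except ImportError as err:
        print(f"{name}: MISSING ({err})")
```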
Download Checkpoints
1. Download VerseCrafter and Wan2.1 models:

   ```bash
   pip install --upgrade huggingface_hub
   mkdir -p model
   hf download --local-dir model/VerseCrafter sxzheng/VerseCrafter
   hf download --local-dir model/Wan2.1-T2V-14B Wan-AI/Wan2.1-T2V-14B
   ```

2. Download Grounded-SAM-2 and Grounding DINO checkpoints:

   ```bash
   cd third_party/Grounded-SAM-2/checkpoints
   bash download_ckpts.sh
   cd ../gdino_checkpoints
   bash download_ckpts.sh
   cd ../../../
   ```
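Optionally, you can verify the downloads before running inference. The sketch below only checks that each target directory exists and is non-empty, since the exact file lists inside may change across releases.

```python
# check_ckpts.py -- rough download check (not part of the repo)
from pathlib import Path

expected_dirs = [
    "model/VerseCrafter",
    "model/Wan2.1-T2V-14B",
    "third_party/Grounded-SAM-2/checkpoints",
    "third_party/Grounded-SAM-2/gdino_checkpoints",
]

for d in expected_dirs:
    path = Path(d)
    n_files = sum(1 for p in path.rglob("*") if p.is_file()) if path.is_dir() else 0
    status = "OK" if n_files > 0 else "MISSING / EMPTY"
    print(f"{d}: {status} ({n_files} files)")
```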
Usage
We provide two ways to use VerseCrafter:
| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| Blender Addon | Deploy API server on GPU machine, call models directly from Blender | One-stop workflow, no context switching, visual trajectory editing | Requires network access to GPU server |
| Script Pipeline | Run each step manually via command line | Works offline, full control over each step | Requires manual switching between terminal and Blender |
💡 Tip: We recommend the Blender Addon for most users. It supports proxy authentication for secure server access. If you cannot connect to a remote GPU server, use the Script Pipeline instead.
Option 1: Blender Addon (Recommended)


For detailed instructions, see README_BLENDER.md.
Prerequisites
- Blender 4.0+ (4.5+ recommended)
- A remote GPU server running the VerseCrafter API
Quick Start
1. Install the addon:

   ```bash
   cd VerseCrafter
   zip -r blender_addon.zip blender_addon/
   ```

   In Blender: Edit → Preferences → Add-ons → ↓ → Install from Disk... → Select `blender_addon.zip` → Enable "VerseCrafter Workflow"

2. Start the API server (on the GPU server):

   ```bash
   python api_server.py --port 8188 --num_gpus 8
   ```

3. Configure the connection in Blender:

   - Press `N` to open the sidebar → VerseCrafter tab
   - Set Server URL (e.g., `http://<server-ip>:8188`)
   - Click Test Connection (a command-line reachability sketch follows this list)

4. Run the workflow:

   - Step 1: Select input image, set workflow directory, enter object prompt (e.g., "person . car ."), click "Run Preprocessing"
   - Step 2: Edit camera and object trajectories visually, click "Export Trajectories"
   - Step 3: Enter video prompt, click "Generate Video"
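If you want to check from the command line that the Server URL configured in step 3 is reachable, the sketch below uses only the Python standard library. The addon's actual health-check endpoint may differ; any HTTP response here only proves that the host and port answer at all.

```python
# ping_server.py -- rough reachability check for the VerseCrafter API server
# (a sketch; the endpoint used by the addon's "Test Connection" button may differ)
import sys
import urllib.error
import urllib.request

server_url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8188"

try:
    with urllib.request.urlopen(server_url, timeout=5) as resp:
        print(f"Server reachable: HTTP {resp.status}")
except urllib.error.HTTPError as err:
    # Any HTTP response (even 404) still proves the server is up.
    print(f"Server reachable: HTTP {err.code}")
except (urllib.error.URLError, TimeoutError) as err:
    print(f"Cannot reach {server_url}: {err}")
```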
Option 2: Script Pipeline
The inference.sh script provides a complete pipeline for generating videos. You can run the steps individually or use the script as a reference.
1. Configuration
Edit inference.sh to set your input image, output directory, and prompt.
```bash
INPUT_IMAGE=demo_data/y57HgqX1uGc_0039750_0041550_0000635_0000716/0001.jpg
OUTPUT_DIR=demo_data/y57HgqX1uGc_0039750_0041550_0000635_0000716
MODEL_PATH="model/VerseCrafter"
```
2. Run the Pipeline
The pipeline consists of the following steps:
Step 1: Depth Estimation
Generate depth maps using MoGe-2.

```bash
python inference/moge-v2_infer.py -i $INPUT_IMAGE -o $OUTPUT_DIR/estimated_depth --maps
```
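To inspect what this step wrote (for example the `depth_intrinsics.npz` archive consumed by Step 3), a quick NumPy peek is enough. This sketch makes no assumption about the exact array keys; it simply lists whatever the archive contains.

```python
# peek_depth.py -- list the arrays written by the depth estimation step
import sys

import numpy as np

npz_path = sys.argv[1]  # e.g. $OUTPUT_DIR/estimated_depth/depth_intrinsics.npz
with np.load(npz_path) as data:
    for key in data.files:
        arr = data[key]
        print(f"{key}: shape={arr.shape}, dtype={arr.dtype}")
```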
Step 2: Segmentation
Segment objects using Grounded-SAM-2.
```bash
python inference/grounded_sam2_infer.py \
    --image_path "$INPUT_IMAGE" \
    --text_prompt "person . car ." \
    --output_dir "$OUTPUT_DIR/object_mask" \
    --min_area_ratio 0.003 \
    --max_area_ratio 0.2
```
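The `--min_area_ratio` and `--max_area_ratio` flags filter detections by mask area relative to the whole image. To see which exported masks would pass those thresholds, a small check like the sketch below works; it assumes the masks are saved as binary PNGs in the `masks/` directory that Step 3 reads.

```python
# mask_area.py -- report each mask's area ratio (sketch, assumes binary PNG masks)
from pathlib import Path

import numpy as np
from PIL import Image

# Same directory that Step 3 consumes via --masks_dir.
masks_dir = Path("demo_data/y57HgqX1uGc_0039750_0041550_0000635_0000716/object_mask/masks")
min_ratio, max_ratio = 0.003, 0.2

for mask_path in sorted(masks_dir.glob("*.png")):
    mask = np.array(Image.open(mask_path).convert("L")) > 0
    ratio = mask.mean()  # fraction of foreground pixels in the image
    verdict = "kept" if min_ratio <= ratio <= max_ratio else "filtered"
    print(f"{mask_path.name}: area ratio {ratio:.4f} ({verdict})")
```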
Step 3: Fit 3D Gaussian
Fit 3D Gaussians to the segmented objects.
```bash
python inference/fit_3D_gaussian.py \
    --image_path $INPUT_IMAGE \
    --npz_path $OUTPUT_DIR/estimated_depth/depth_intrinsics.npz \
    --masks_dir $OUTPUT_DIR/object_mask/masks \
    --output_dir $OUTPUT_DIR/fitted_3D_gaussian
```
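Conceptually, fitting a 3D Gaussian to an object means back-projecting its masked pixels into 3D with the estimated depth and camera intrinsics, then taking the mean and covariance of those points as the Gaussian's center and shape. The repo's `fit_3D_gaussian.py` defines the actual I/O format; the sketch below only illustrates the idea with illustrative argument names.

```python
# gaussian_from_mask.py -- conceptual sketch of fitting one 3D Gaussian
# (illustrative only; the repo's fit_3D_gaussian.py is the actual implementation)
import numpy as np

def fit_gaussian(depth: np.ndarray, K: np.ndarray, mask: np.ndarray):
    """depth: (H, W) metric depth, K: (3, 3) intrinsics, mask: (H, W) bool."""
    v, u = np.nonzero(mask)              # pixel coordinates inside the mask
    z = depth[v, u]
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    # Back-project pixels to camera-space 3D points.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=1)  # (N, 3)
    mean = points.mean(axis=0)            # Gaussian center
    cov = np.cov(points, rowvar=False)    # Gaussian shape (3x3 covariance)
    return mean, cov
```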
The table below shows an example input image and its corresponding intermediate results:
| Input Image | Depth Map | Segmentation Mask | 3D Gaussian |
|-------------|-----------|-------------------|-------------|
|  |  |  |  |
Step 4: Customize Trajectory (Manual Operation in Blender)
This step requires Blender to interactively edit the 4D control scene. We also provide a demonstration video that shows step-by-step Blender operations for this process:
Watch the Blender operation video here
1. Prepare Scripts:

   - Open `inference/blender_script/build_4d_control_scene.py` and `inference/blender_script/export_blender_custom_trajectories.py`.
   - Crucial: Update the `ROOT_DIR` variable in both scripts to the absolute path of your input directory (e.g., `/absolute/path/to/demo_data/your_folder`).

2. Build Scene:

   - Open Blender.
   - Go to the Scripting tab.
   - Open or paste the content of `build_4d_control_scene.py`.
   - Run the script to load the scene (point cloud, camera, objects).

3. Customize Trajectories:

   - Switch to the Layout tab.
   - Camera Trajectory (a scripted equivalent is sketched after this list):
     - Create a curve (e.g., `Shift+A` → Curve → Bezier).
     - Switch to Edit Mode to draw or adjust the curve.
     - Select the Camera, add a Follow Path constraint targeting the curve.
     - Check Fixed Position.
     - Set the animation duration to 81 frames.
   - 3D Gaussian (Object) Trajectory:
     - Select the object (Ellipsoid).
     - Use the same Follow Path method as the camera, or insert Keyframes (`I` key) for location/rotation/scale.

4. Export Trajectories:

   - Go back to the Scripting tab.
   - Open or paste the content of `export_blender_custom_trajectories.py`.
   - Run the script to export `custom_camera_trajectory.npz` and `custom_3D_gaussian_trajectory.json`.
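For reference, the manual camera setup in step 3 can also be scripted from Blender's Scripting tab. The sketch below reproduces the Follow Path steps with the `bpy` API; the object and curve names are assumptions for illustration, and the repo's own scripts remain the reference.

```python
# Scripted version of the manual "Follow Path" camera setup above.
# Object names ("Camera", "BezierCurve") are whatever your scene uses -- adjust as needed.
import bpy

camera = bpy.data.objects["Camera"]
curve = bpy.data.objects["BezierCurve"]

constraint = camera.constraints.new(type="FOLLOW_PATH")
constraint.target = curve
constraint.use_fixed_location = True   # equivalent to checking "Fixed Position"

# Animate the camera from the start to the end of the curve over 81 frames.
scene = bpy.context.scene
scene.frame_start, scene.frame_end = 1, 81
constraint.offset_factor = 0.0
constraint.keyframe_insert(data_path="offset_factor", frame=1)
constraint.offset_factor = 1.0
constraint.keyframe_insert(data_path="offset_factor", frame=81)
```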
This is an animation of custom trajectories in Blender:

