[SIGGRAPH 2025] Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

<a href='https://arxiv.org/abs/2501.03847'><img src='https://img.shields.io/badge/arXiv-2501.03847-b31b1b.svg'></a>   <a href='https://igl-hkust.github.io/das/'><img src='https://img.shields.io/badge/Project-Page-Green'></a>   <a href='https://huggingface.co/EXCAI/Diffusion-As-Shader'>HuggingFace Model</a>   HuggingFace Spaces


NEWS:

  • Jun 5, 2025: We released our script and Blender project for creating synthetic datasets.

  • Jun 2, 2025: We added inference code based on Wan2.1Fun 1.3B fine-tuning to the Wanfun branch.

  • Apr 2, 2025: Added functionality for complex and precise camera control for videos, based on VGGT. The --override_extrinsics hyperparameter can be adjusted to append or override camera motion in videos.

  • Apr 1, 2025: Added support for cotracker.

  • Feb 17, 2025: We uploaded a validation dataset to Google Drive, containing 4 tasks.

Quickstart

Create environment

  1. Clone the repository and create a conda environment:

    git clone https://github.com/IGL-HKUST/DiffusionAsShader.git
    cd DiffusionAsShader
    conda create -n das python=3.10
    conda activate das
    
  2. Install PyTorch; we recommend PyTorch 2.5.1 with CUDA 11.8:

    pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu118
    
  3. Make sure the submodules and requirements are installed:

    mkdir -p submodules
    git submodule update --init --recursive
    pip install -r requirements.txt
    

    If the submodules were not fetched automatically, download them manually into submodules/ by running:

    # MoGe
    git clone https://github.com/microsoft/MoGe.git submodules/MoGe
    # VGGT
    git clone https://github.com/facebookresearch/vggt.git submodules/vggt
    
  4. Manually download these checkpoints to checkpoints/:

    • SpatialTracker checkpoint: Google Drive.
    • Our Diffusion as Shader checkpoint: https://huggingface.co/EXCAI/Diffusion-As-Shader

Inference

The inference code was tested on

  • Ubuntu 20.04
  • Python 3.10
  • PyTorch 2.5.1
  • 1 NVIDIA H800 with CUDA version 11.8. (32GB GPU memory is sufficient for generating videos with our code.)

We provide an inference script for our tasks; you can run the demo.py script directly as follows. We also provide a validation dataset on Google Drive for our 4 tasks; run scripts/evaluate_DaS.sh to evaluate the performance of our model.

We also release a Gradio interface for our tasks. You can run the webui.py script directly as follows.

python webui.py --gpu <gpu_id>

Or you can run these tasks one by one as follows.

1. Motion Transfer

<table border="1"> <tr> <th>Original</th> <th>Object Replacement</th> <th>Style Transfer</th> </tr> <tr> <td><img src="assets/videos/rocket1.gif" alt="Original"></td> <td><img src="assets/videos/rocket2.gif" alt="Object Replacement"></td> <td><img src="assets/videos/rocket3.gif" alt="Style Transfer"></td> </tr> <tr> <th>Original</th> <th>Character Replacement 1</th> <th>Character Replacement 2</th> </tr> <tr> <td><img src="assets/videos/roof_girl.gif" alt="Original"></td> <td><img src="assets/videos/roof_anime.gif" alt="Character Replacement 1"></td> <td><img src="assets/videos/roof_robot.gif" alt="Character Replacement 2"></td> </tr> </table> (The above results must be generated with the assistance of FLUX.1 Kontext.)
python demo.py \
    --prompt <"prompt text"> \ # prompt text
    --checkpoint_path <model_path> \ # checkpoint path (e.g. checkpoints/Diffusion-As-Shader)
    --output_dir <output_dir> \ # output directory
    --input_path <input_path> \ # the reference video path
    --repaint <true/repaint_path> \ # path to a repainted first frame of the source video, or 'true' to repaint the first frame with FLUX
    --gpu <gpu_id> \ # the gpu id

2. Camera Control

<table border="1"> <tr> <th>Arc Right + Zoom out</th> <th>Arc Left + Zoom out</th> <th>Arc Up + Zoom out</th> </tr> <tr> <td><img src="assets/videos/panright+out.gif" alt="Pans Right + Zoom out"></td> <td><img src="assets/videos/panleft+out.gif" alt="Pans Left + Zoom out"></td> <td><img src="assets/videos/panup+out.gif" alt="Pans Up + Zoom out"></td> </tr> <tr> <th>Pans Left + Yaw Left</th> <th>Static</th> <th>Zoom out</th> </tr> <tr> <td><img src="assets/videos/car_panright.gif" alt="Pans Right"></td> <td><img src="assets/videos/car_static.gif" alt="Static"></td> <td><img src="assets/videos/car_zoomout.gif" alt="Zoom out"></td> </tr> </table>

We provide several template camera motion types; you can choose one of them. In practice, we find that also describing the camera motion in the prompt gives better results.

python demo.py \
    --prompt <"prompt text"> \ # prompt text
    --checkpoint_path <model_path> \ # checkpoint path (e.g. checkpoints/Diffusion-As-Shader)
    --output_dir <output_dir> \ # output directory
    --input_path <input_path> \ # the reference image or video path
    --camera_motion <camera_motion> \ # the camera motion type, see examples below
    --tracking_method <tracking_method> \ # the tracking method (moge, spatracker, cotracker). For image input, 'moge' is necessary.
    --override_extrinsics <override/append> \ # how to apply camera motion: "override" to replace original camera, "append" to build upon it
    --gpu <gpu_id> \ # the gpu id

Here are some tips for camera motion:

  • trans: translation motion, the camera will move in the direction of the vector (dx, dy, dz) with range [-1, 1]
    • Positive X: Move left, Negative X: Move right
    • Positive Y: Move down, Negative Y: Move up
    • Positive Z: Zoom in, Negative Z: Zoom out
    • e.g., 'trans -0.1 -0.1 -0.1' moving right, up and zooming out
    • e.g., 'trans -0.1 0.0 0.0 5 45' moving right 0.1 from frame 5 to 45
  • rot: rotation motion, the camera will rotate around the axis (x, y, z) by the angle
    • X-axis rotation: positive X: pitch down, negative X: pitch up
    • Y-axis rotation: positive Y: yaw left, negative Y: yaw right
    • Z-axis rotation: positive Z: roll counter-clockwise, negative Z: roll clockwise
    • e.g., 'rot y 25' rotating 25 degrees around y-axis (yaw left)
    • e.g., 'rot x -30 10 40' rotating -30 degrees around x-axis (pitch up) from frame 10 to 40
  • spiral: spiral motion, the camera will move in a spiral path with the given radius
    • e.g., 'spiral 2' spiral motion with radius 2
    • e.g., 'spiral 2 15 35' spiral motion with radius 2 from frame 15 to 35

Multiple transformations can be combined using semicolon (;) as separator:

  • e.g., "trans 0 0 -0.5 0 30; rot x -25 0 30; trans -0.1 0 0 30 48" This will:
    1. Zoom out (z -0.5) from frame 0 to 30
    2. Pitch up (rotate -25 degrees around x-axis) from frame 0 to 30
    3. Move right (x-0.1) from frame 30 to 48

Notes:

  • Frame range is 0-48 (49 frames in total)
  • If start_frame and end_frame are not specified, the motion will be applied to all frames (0-48)
  • Frames after end_frame will maintain the final transformation
  • For combined transformations, they are applied in sequence
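The semantics above (linear application over [start_frame, end_frame], holding the final transformation afterwards, and sequential composition of semicolon-separated parts) can be sketched as a small helper that turns a motion string into 49 per-frame 4x4 pose matrices. This is an illustrative numpy sketch under assumed sign conventions, not the repository's actual implementation (the real pose construction lives in demo.py), and it handles only trans and rot:

```python
import numpy as np

NUM_FRAMES = 49  # frames 0-48

def _rot(axis, deg):
    """4x4 homogeneous rotation about the x, y, or z axis by deg degrees."""
    r = np.radians(deg)
    c, s = np.cos(r), np.sin(r)
    m = np.eye(4)
    i, j = {"x": (1, 2), "y": (0, 2), "z": (0, 1)}[axis]
    m[i, i], m[i, j], m[j, i], m[j, j] = c, -s, s, c
    return m

def camera_poses(spec):
    """Turn e.g. 'trans 0 0 -0.5 0 30; rot x -25 0 30' into a list of
    49 per-frame 4x4 pose matrices."""
    poses = [np.eye(4) for _ in range(NUM_FRAMES)]
    for part in spec.split(";"):
        tok = part.split()
        kind = tok[0]
        if kind == "trans":
            vec = np.array([float(v) for v in tok[1:4]])
            rng = tok[4:6]
        else:  # "rot"
            axis, deg = tok[1], float(tok[2])
            rng = tok[3:5]
        start, end = (int(rng[0]), int(rng[1])) if rng else (0, NUM_FRAMES - 1)
        for f in range(NUM_FRAMES):
            # progress ramps 0 -> 1 over [start, end]; frames after `end`
            # keep t = 1, i.e. they hold the final transformation
            t = min(max((f - start) / (end - start), 0.0), 1.0)
            step = np.eye(4)
            if kind == "trans":
                step[:3, 3] = t * vec
            else:
                step = _rot(axis, t * deg)
            poses[f] = step @ poses[f]  # combined parts apply in sequence
    return poses
```

For instance, camera_poses("trans 0 0 -0.5 0 30; rot x -25 0 30; trans -0.1 0 0 30 48") applies the three transformations of the combined example above in sequence.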

3. Object Manipulation

<table border="1"> <tr> <th>Move Left</th> <th>Move Up</th> <th>Rotate</th> </tr> <tr> <td><img src="assets/videos/move left.gif" alt="Move Left"></td> <td><img src="assets/videos/move up.gif" alt="Move Up"></td> <td><img src="assets/videos/rotate.gif" alt="Rotate"></td> </tr> </table>

We provide several template object manipulation types; you can choose one of them. In practice, we find that also describing the object motion in the prompt gives better results.

python demo.py \
    --prompt <"prompt text"> \ # prompt text
    --checkpoint_path <model_path> \ # checkpoint path (e.g. checkpoints/Diffusion-As-Shader)
    --output_dir <output_dir> \ # output directory
    --input_path <input_path> \ # the reference image path
    --object_motion <object_motion> \ # the object motion type (up, down, left, right)
    --object_mask <object_mask_path> \ # the object mask path
    --tracking_method <tracking_method> \ # the tracking method (moge, spatracker). For image input, 'moge' is necessary.
    --gpu <gpu_id> \ # the gpu id

Or you can create your own object motion and camera motion as follows, replacing the related code in demo.py:

  1. object motion
    dict: Motion dictionary containing:
      - mask (torch.Tensor): Binary mask for the selected object
      - motions (torch.Tensor): Per-frame motion matrices [49, 4, 4] (49 frames, 4x4 homogeneous object motion matrices)
    
  2. camera motion
    list: CameraMotion list containing:
      - camera_motion (list): Per-frame camera pose matrices [49, 4, 4] (49 frames, 4x4 homogeneous camera pose matrices)
    
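As a concrete illustration of the two structures above, the following sketch builds a motion dictionary and a camera pose list of the described shapes. The helper names and the linear-drift motion are assumptions for illustration; the repository works with torch.Tensor (convert with torch.from_numpy):

```python
import numpy as np

NUM_FRAMES = 49  # the model generates 49 frames (0-48)

def make_object_motion(mask, direction=(0.1, 0.0, 0.0)):
    """Build a motion dictionary: a binary object mask plus per-frame
    4x4 homogeneous motion matrices of shape [49, 4, 4]."""
    motions = np.tile(np.eye(4), (NUM_FRAMES, 1, 1))
    for f in range(NUM_FRAMES):
        # accumulate the translation linearly so the object drifts over time
        motions[f, :3, 3] = (f / (NUM_FRAMES - 1)) * np.asarray(direction)
    return {"mask": mask.astype(bool), "motions": motions}

def make_camera_motion():
    """Build a per-frame camera pose list of shape [49, 4, 4] (static camera)."""
    return [np.eye(4) for _ in range(NUM_FRAMES)]
```

A non-identity camera path can be produced the same way by composing per-frame rotations and translations into each pose matrix.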
