📷 UCPE

<p align="center"> <h1 align="center">Unified Camera Positional Encoding for Controlled Video Generation</h1> <p align="center"> <p align="center"> <a href="https://chengzhag.github.io/">Cheng Zhang</a><sup>1</sup><sup>,2</sup> · <a href="https://leeby68.github.io/">Boying Li</a><sup>1</sup> · <a href="https://www.linkedin.com/in/meng-wei-66687a105/?originalSubdomain=au">Meng Wei</a><sup>1</sup> · <a href="https://yanpei.me/">Yan-Pei Cao</a><sup>3</sup> · <a href="https://www.monash.edu/mada/architecture/people/camilo-cruz-gambardella/">Camilo Cruz Gambardella</a><sup>1,2</sup> · <a href="https://research.monash.edu/en/persons/dinh-phung/">Dinh Phung</a><sup>1</sup> · <a href="https://jianfei-cai.github.io/">Jianfei Cai</a><sup>1</sup><br> <sup>1</sup>Monash University <sup>2</sup>Building 4.0 CRC <sup>3</sup>VAST </p> <h2 align="center"><a href="https://arxiv.org/abs/2512.07237">Paper</a> | <a href="https://chengzhag.github.io/publication/ucpe/">Project Page</a> | <a href="https://youtu.be/rMX7gxH8jBM">Video</a> | <a href="https://huggingface.co/datasets/chengzhag/PanShot">Hugging Face</a></h2> </p>

*Our UCPE introduces a geometry-consistent alternative to Plücker rays as one of its core contributions, enabling better generalization in Transformers. We hope it inspires future research on camera-aware architectures.*

📢 Updates

  • [2026.03.19] 🔧 Fixed a bug in Plücker encoding (thanks to @fengq1a0's issue #5).
  • [2026.02.21] 🎉 UCPE accepted to CVPR 2026!
  • [2026.02.04] 📝 PanShot dataset and curation code released (controllable camera data synthesized from PanFlow).
  • [2026.02.04] 🎯 Full training, evaluation, and visualization code released.
  • [2025.12.07] ⚡ Quick demo code released.

🚀 TLDR

🔥 Camera-controlled text-to-video generation, now with intrinsics, distortion and orientation control!

<p align="center"> <img src="images/cameras.png" alt="Camera lenses" height="120px"> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="images/orientation.png" alt="Orientation control" height="140px"> </p>

📷 UCPE combines Relative Ray Encoding, which generalizes significantly better than Plücker rays across diverse camera motions, intrinsics, and lens distortions, with Absolute Orientation Encoding for controllable pitch and roll. Together they form a unified camera representation for Transformers and achieve state-of-the-art camera-controlled video generation with just 0.5% extra parameters (35.5M on top of the 7.3B-parameter base model).
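The contrast between the two ray parameterizations can be sketched in a few lines of NumPy. Plücker rays are the standard (direction, moment) representation; the "relative" variant below simply re-expresses ray directions in a reference camera's frame to illustrate the translation-invariance idea, and is not necessarily the paper's exact formulation:

```python
import numpy as np

def pixel_rays(K, H, W):
    """Unit ray direction for every pixel, expressed in the camera frame."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    dirs = pix @ np.linalg.inv(K).T                    # back-project through intrinsics
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def plucker_rays(dirs_cam, R, c):
    """Plücker coordinates (direction, moment), moment = center x direction.
    R is the camera-to-world rotation, c the camera center in world coordinates."""
    d = dirs_cam @ R.T                                 # rotate rays into the world frame
    m = np.cross(c, d)                                 # the moment encodes translation
    return np.concatenate([d, m], axis=-1)             # (H, W, 6)

def relative_rays(dirs_cam, R, R_ref):
    """Ray directions re-expressed in a reference camera's frame: rotation-only,
    so unlike Plücker moments they are invariant to where the rig sits in space."""
    return dirs_cam @ R.T @ R_ref                      # d_ref = R_ref^T R d_cam
```

Note how `relative_rays` never touches the translation `c`, which is one way to see why such an encoding can generalize across scene scales where Plücker moments change.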

<p align="center"> <img src="images/video-ucpe.gif" alt="UCPE" style="max-height:480px; width:auto;"> </p>

πŸ› οΈ Installation

```bash
conda create -n UCPE python=3.11 -y
conda activate UCPE
conda install -c conda-forge "ffmpeg<8" libiconv libgl -y
pip install -r requirements.txt
pip install --no-build-isolation --no-cache-dir flash-attn==2.8.0.post2
pip install -e .

cd thirdparty/equilib
pip install -e .
```

We use wandb to log and visualize training. You can create an account and then log in to wandb by running:

```bash
wandb login
```
<details> <summary>The following installs tools used for evaluation and dataset processing; skip this section if you do not need them.</summary>

```bash
cd ../GeoCalib
pip install -e .
pip install -e siclib

cd ../UniK3D
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu121

cd ../Q-Align
conda create -n qalign python=3.9 -y
conda activate qalign
pip install -e .
pip install jsonlines "numpy<2" protobuf pydantic-settings

cd ../vipe
conda env create -f envs/base.yml
conda activate vipe
pip install -r envs/requirements.txt
pip install --no-build-isolation -e .
```

</details> <br>

⚡ Quick Demo

Download our finetuned weights from OneDrive and put them in the logs/ folder. Then run:

```bash
bash scripts/demo.sh
```

The generated videos will be saved in logs/6wodf04s/demo; examples are shown below:

  • demo/lens.json: Our Relative Ray Encoding not only generalizes to but also enables controllability over a wide range of camera intrinsics and lens distortions.
<p align="center"> <img src="images/video-lens.gif" alt="Lens control" style="max-height:480px; width:auto;"> </p>
  • demo/pose.json: The geometry-consistent design of Relative Ray Encoding further allows strong generalization and controllability over diverse camera motions.
<p align="center"> <img src="images/video-pose.gif" alt="Pose control" style="max-height:480px; width:auto;"> </p>
  • demo/teaser.json: Our Absolute Orientation Encoding further eliminates the ambiguity in pitch and roll present in previous T2V methods, enabling precise control over the initial camera orientation.
<p align="center"> <img src="images/video-orientation.gif" alt="Orientation control" style="max-height:480px; width:auto;"> </p>
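For intuition about what intrinsics and distortion control entails, here is the standard one-parameter radial distortion model on normalized image coordinates, a generic textbook model rather than necessarily the lens model UCPE uses:

```python
import numpy as np  # not required below, but typical when vectorizing over pixels

def distort(x, y, k1):
    """Forward radial distortion on normalized coordinates: x_d = x * (1 + k1 * r^2).
    Standard single-coefficient polynomial model."""
    r2 = x * x + y * y
    return x * (1.0 + k1 * r2), y * (1.0 + k1 * r2)

def undistort(xd, yd, k1, iters=20):
    """Invert the radial model by fixed-point iteration (converges for mild k1)."""
    x, y = xd, yd
    for _ in range(iters):
        r2 = x * x + y * y
        x = xd / (1.0 + k1 * r2)
        y = yd / (1.0 + k1 * r2)
    return x, y
```

Negative `k1` produces barrel distortion (wide-angle lenses), positive `k1` pincushion; a camera-aware model must remain consistent as this mapping bends rays away from the pinhole ideal.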

🌏 PanShot Dataset

Please download the PanShot dataset from Hugging Face to data/UCPE/PanShot-7z by:

```bash
huggingface-cli download chengzhag/PanShot --repo-type dataset --local-dir data/UCPE/PanShot-7z
```

Then extract the dataset by:

```bash
cd data/UCPE/PanShot-7z
bash extract_panshot.sh
cd ../../..
```

The extracted dataset will be saved in data/UCPE/PanShot. Then copy the other files to form the following folder structure:

```text
├── captioned-test.jsonl
├── captioned-train.jsonl
├── max_rotation-test.json
├── meta-test
├── meta-train
├── pose-test
├── pose-train
├── videos-test
└── videos-train
```
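A quick sanity check of the layout can be sketched in Python; the expected names are taken directly from the tree above, and the default root path is the one used in this README:

```python
from pathlib import Path

# Entries expected under data/UCPE/PanShot, per the tree above.
EXPECTED = [
    "captioned-test.jsonl", "captioned-train.jsonl", "max_rotation-test.json",
    "meta-test", "meta-train", "pose-test", "pose-train",
    "videos-test", "videos-train",
]

def missing_panshot_entries(root="data/UCPE/PanShot"):
    """Return the expected files/folders that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty return value means the layout is complete; otherwise the list names what still needs to be copied.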
<details> <summary>If you want to go through the dataset curation process, please follow these three steps.</summary>

CameraBench

Download the dataset from multiple sources:

```bash
cd data
huggingface-cli download --repo-type dataset syCen/CameraBench --local-dir CameraBench
cd CameraBench
huggingface-cli download --repo-type dataset syCen/Videos4CameraBnech --local-dir data/videos
wget https://huggingface.co/datasets/chancharikm/cambench_train_videos/resolve/main/videos.zip
unzip videos.zip -d videos
cd ../..
```

Process the dataset:

```bash
conda activate UCPE
python tools/process_camerabench.py  # set split = "train" and split = "test"

conda activate vipe
cd thirdparty/vipe
python thirdparty/vipe/run.py pipeline=default streams=raw_mp4_stream streams.base_path=data/UCPE/CameraBench/videos/ pipeline.output.path=data/UCPE/CameraBench/vipe/ pipeline.output.save_artifacts=true pipeline.post.depth_align_model=null

conda activate UCPE
python tools/geocalib_camerabench.py
python tools/filter_camerabench.py
```

The processed dataset will be saved in data/UCPE/CameraBench.

PanFlow

Download PanoFlow's pretrained model PanoFlow(RAFT)-wo-CFE.pth from weiyun, then put it in the models/PanoFlow folder.

Our PanShot dataset is built upon the PanFlow dataset's videos and slam_poses. Please follow their instructions to download the full videos, and download their meta and slam_poses files following Full Dataset.

Then process the dataset with:

```bash
conda activate UCPE
python tools/filter_panflow.py

conda activate qalign
python tools/score_panflow.py

conda activate UCPE
python tools/align_panflow.py  # set split = "train" and split = "test"
python tools/match_panflow.py  # set split = "train" and split = "test"
python tools/normalize_panflow.py  # set split = "train" and split = "test"
```

PanShot

Export your YouTube cookies to ~/.config/cookies.txt in Netscape format for 4K downloads. Then download and process the dataset:

```bash
conda activate UCPE
python tools/process_panshot.py  # set split = "train" and split = "test"
python tools/caption_panshot.py  # set split = "train" and split = "test"
```
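The Netscape cookie format is plain text: lines starting with `#` are comments, and each cookie is seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). A minimal sketch for validating an exported file before the download scripts consume it:

```python
def parse_netscape_cookies(text):
    """Parse Netscape-format cookie text into a list of dicts.
    Each cookie line has exactly 7 tab-separated fields."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the header comment and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"malformed cookie line: {line!r}")
        domain, subdomains, path, secure, expires, name, value = fields
        cookies.append({"domain": domain, "path": path, "name": name,
                        "value": value, "expires": int(expires)})
    return cookies
```

If this raises on your exported file, the browser extension likely did not write true Netscape format.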
</details> <br>

🏑 RealEstate10k Dataset

We use the RealEstate10k dataset for evaluation, so only poses and captions are needed. Please download the RealEstate10k poses from the official website (RealEstate10K.tgz) and unpack it to the data/RealEstate10k folder. Then download the captions from CameraCtrl (train and test).

The final folder structure should be like this:

```text
├── captions
│   ├── test.json
│   └── train.json
├── pose_files
│   ├── test
│   └── train
└── traj_normalization.txt
```
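For reference, each pose line in a RealEstate10K file consists of 19 space-separated values: a frame timestamp in microseconds, four normalized intrinsics (fx, fy, cx, cy), two unused fields, and a row-major 3x4 world-from-camera pose. A minimal parser sketch (verify the layout against the files you actually download):

```python
import numpy as np

def parse_re10k_line(line):
    """Parse one RealEstate10K pose line into (timestamp, intrinsics, 3x4 pose)."""
    vals = line.split()
    if len(vals) != 19:
        raise ValueError(f"expected 19 fields, got {len(vals)}")
    timestamp = int(vals[0])
    fx, fy, cx, cy = map(float, vals[1:5])          # normalized by image width/height
    pose = np.array(vals[7:19], dtype=np.float64).reshape(3, 4)
    return timestamp, (fx, fy, cx, cy), pose
```

Since the intrinsics are normalized, multiply fx and cx by the image width (and fy, cy by the height) to recover pixel-unit values.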

🎯 Training and Evaluation

Prepare the latent cache and train the model with:

```bash
python src/cache.py
bash scripts/train.sh
```

We used 8 A800 GPUs for training, which takes about 1 day. You'll get a WANDB_RUN_ID (e.g., 6wodf04s) after starting the training. The logs will be synced to your wandb account and the checkpoints will be saved in logs/<WANDB_RUN_ID>/checkpoints/. You can use other commented settings in scripts/train.sh for ablation studies and baselines.

For evaluation, first download the pretrained model i3d_pretrained_400.pt from common_metrics_on_video_quality, then put it in the models/FVD folder. Evaluate results with:

```bash
bash scripts/evaluate.sh
```
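For context, FVD is the Fréchet distance between Gaussian fits of I3D features extracted from real and generated videos. The statistic itself fits in a few lines (a sketch of the metric using NumPy/SciPy, not the repo's evaluation code, which also handles the I3D feature extraction):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets (N, D)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can add tiny imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical feature sets give a distance of (numerically) zero; shifting one distribution's mean grows the first term quadratically.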

Please change the `WANDB_RUN_ID` to your own run before evaluating.
