📷 UCPE

<p align="center"> <h1 align="center">Unified Camera Positional Encoding for Controlled Video Generation</h1> <p align="center"> <p align="center"> <a href="https://chengzhag.github.io/">Cheng Zhang</a><sup>1</sup><sup>,2</sup> · <a href="https://leeby68.github.io/">Boying Li</a><sup>1</sup> · <a href="https://www.linkedin.com/in/meng-wei-66687a105/?originalSubdomain=au">Meng Wei</a><sup>1</sup> · <a href="https://yanpei.me/">Yan-Pei Cao</a><sup>3</sup> · <a href="https://www.monash.edu/mada/architecture/people/camilo-cruz-gambardella/">Camilo Cruz Gambardella</a><sup>1,2</sup> · <a href="https://research.monash.edu/en/persons/dinh-phung/">Dinh Phung</a><sup>1</sup> · <a href="https://jianfei-cai.github.io/">Jianfei Cai</a><sup>1</sup><br> <sup>1</sup>Monash University <sup>2</sup>Building 4.0 CRC <sup>3</sup>VAST </p> <h2 align="center"><a href="https://arxiv.org/abs/2512.07237">Paper</a> | <a href="https://chengzhag.github.io/publication/ucpe/">Project Page</a> | <a href="https://youtu.be/rMX7gxH8jBM">Video</a> | <a href="https://huggingface.co/datasets/chengzhag/PanShot">Hugging Face</a></h2> </p>

*Our UCPE introduces a geometry-consistent alternative to Plücker rays as one of its core contributions, enabling better generalization in Transformers. We hope it inspires future research on camera-aware architectures.*

📢 Updates

  • [2026.03.19] 🔧 Fixed a bug in Plücker encoding (thanks to @fengq1a0's issue #5).
  • [2026.02.21] 🎉 UCPE accepted to CVPR 2026!
  • [2026.02.04] 📝 PanShot dataset and curation code released (controllable camera data synthesized from PanFlow).
  • [2026.02.04] 🎯 Full training, evaluation, and visualization code released.
  • [2025.12.07] ⚡ Quick demo code released.

🚀 TLDR

🔥 Camera-controlled text-to-video generation, now with intrinsics, distortion and orientation control!

<p align="center"> <img src="images/cameras.png" alt="Camera lenses" height="120px"> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <img src="images/orientation.png" alt="Orientation control" height="140px"> </p>

📷 UCPE combines Relative Ray Encoding, which generalizes significantly better than Plücker rays across diverse camera motions, intrinsics, and lens distortions, with Absolute Orientation Encoding for controllable pitch and roll. Together they form a unified camera representation for Transformers and achieve state-of-the-art camera-controlled video generation with just 0.5% extra parameters (35.5M on top of the 7.3B-parameter base model).
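The contrast between the two ray parameterizations can be sketched in a few lines of NumPy. Plücker rays are the standard (direction, moment) representation; the "relative" variant below simply re-expresses ray directions in a reference camera's frame to illustrate the translation-invariance idea, and is not necessarily the paper's exact formulation:

```python
import numpy as np

def pixel_rays(K, H, W):
    """Unit ray direction for every pixel, expressed in the camera frame."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    dirs = pix @ np.linalg.inv(K).T                    # back-project through intrinsics
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

def plucker_rays(dirs_cam, R, c):
    """Plücker coordinates (direction, moment), moment = center x direction.
    R is the camera-to-world rotation, c the camera center in world coordinates."""
    d = dirs_cam @ R.T                                 # rotate rays into the world frame
    m = np.cross(c, d)                                 # the moment encodes translation
    return np.concatenate([d, m], axis=-1)             # (H, W, 6)

def relative_rays(dirs_cam, R, R_ref):
    """Ray directions re-expressed in a reference camera's frame: rotation-only,
    so unlike Plücker moments they are invariant to where the rig sits in space."""
    return dirs_cam @ R.T @ R_ref                      # d_ref = R_ref^T R d_cam
```

Note how `relative_rays` never touches the translation `c`, which is one way to see why such an encoding can generalize across scene scales where Plücker moments change.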

<p align="center"> <img src="images/video-ucpe.gif" alt="UCPE" style="max-height:480px; width:auto;"> </p>

πŸ› οΈ Installation

```bash
conda create -n UCPE python=3.11 -y
conda activate UCPE
conda install -c conda-forge "ffmpeg<8" libiconv libgl -y
pip install -r requirements.txt
pip install --no-build-isolation --no-cache-dir flash-attn==2.8.0.post2
pip install -e .

cd thirdparty/equilib
pip install -e .
```

We use wandb to log and visualize training. You can create an account and then log in to wandb by running:

```bash
wandb login
```
<details> <summary>The following installs tools used for evaluation and dataset processing; skip this section if you do not need them.</summary>

```bash
cd ../GeoCalib
pip install -e .
pip install -e siclib

cd ../UniK3D
pip install -e . --extra-index-url https://download.pytorch.org/whl/cu121

cd ../Q-Align
conda create -n qalign python=3.9 -y
conda activate qalign
pip install -e .
pip install jsonlines "numpy<2" protobuf pydantic-settings

cd ../vipe
conda env create -f envs/base.yml
conda activate vipe
pip install -r envs/requirements.txt
pip install --no-build-isolation -e .
```

</details> <br>

⚡ Quick Demo

Download our finetuned weights from OneDrive and put them in the logs/ folder. Then run:

```bash
bash scripts/demo.sh
```

The generated videos will be saved in logs/6wodf04s/demo; examples are shown below:

  • demo/lens.json: Our Relative Ray Encoding not only generalizes to but also enables controllability over a wide range of camera intrinsics and lens distortions.
<p align="center"> <img src="images/video-lens.gif" alt="Lens control" style="max-height:480px; width:auto;"> </p>
  • demo/pose.json: The geometry-consistent design of Relative Ray Encoding further allows strong generalization and controllability over diverse camera motions.
<p align="center"> <img src="images/video-pose.gif" alt="Pose control" style="max-height:480px; width:auto;"> </p>
  • demo/teaser.json: Our Absolute Orientation Encoding further eliminates the ambiguity in pitch and roll present in previous T2V methods, enabling precise control over the initial camera orientation.
<p align="center"> <img src="images/video-orientation.gif" alt="Orientation control" style="max-height:480px; width:auto;"> </p>
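For intuition about what intrinsics and distortion control entails, here is the standard one-parameter radial distortion model on normalized image coordinates, a generic textbook model rather than necessarily the lens model UCPE uses:

```python
import numpy as np  # not required below, but typical when vectorizing over pixels

def distort(x, y, k1):
    """Forward radial distortion on normalized coordinates: x_d = x * (1 + k1 * r^2).
    Standard single-coefficient polynomial model."""
    r2 = x * x + y * y
    return x * (1.0 + k1 * r2), y * (1.0 + k1 * r2)

def undistort(xd, yd, k1, iters=20):
    """Invert the radial model by fixed-point iteration (converges for mild k1)."""
    x, y = xd, yd
    for _ in range(iters):
        r2 = x * x + y * y
        x = xd / (1.0 + k1 * r2)
        y = yd / (1.0 + k1 * r2)
    return x, y
```

Negative `k1` produces barrel distortion (wide-angle lenses), positive `k1` pincushion; a camera-aware model must remain consistent as this mapping bends rays away from the pinhole ideal.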

🌏 PanShot Dataset

Please download the PanShot dataset from Hugging Face to data/UCPE/PanShot-7z by:

```bash
huggingface-cli download chengzhag/PanShot --repo-type dataset --local-dir data/UCPE/PanShot-7z
```

Then extract the dataset by:

```bash
cd data/UCPE/PanShot-7z
bash extract_panshot.sh
cd ../../..
```

The extracted dataset will be saved in data/UCPE/PanShot. Then copy the other files to form the following folder structure:

```text
├── captioned-test.jsonl
├── captioned-train.jsonl
├── max_rotation-test.json
├── meta-test
├── meta-train
├── pose-test
├── pose-train
├── videos-test
└── videos-train
```
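A quick sanity check of the layout can be sketched in Python; the expected names are taken directly from the tree above, and the default root path is the one used in this README:

```python
from pathlib import Path

# Entries expected under data/UCPE/PanShot, per the tree above.
EXPECTED = [
    "captioned-test.jsonl", "captioned-train.jsonl", "max_rotation-test.json",
    "meta-test", "meta-train", "pose-test", "pose-train",
    "videos-test", "videos-train",
]

def missing_panshot_entries(root="data/UCPE/PanShot"):
    """Return the expected files/folders that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).exists()]
```

An empty return value means the layout is complete; otherwise the list names what still needs to be copied.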
<details> <summary>If you want to go through the dataset curation process, please follow these three steps.</summary>

CameraBench

Download the dataset from multiple sources:

```bash
cd data
huggingface-cli download --repo-type dataset syCen/CameraBench --local-dir CameraBench
cd CameraBench
huggingface-cli download --repo-type dataset syCen/Videos4CameraBnech --local-dir data/videos
wget https://huggingface.co/datasets/chancharikm/cambench_train_videos/resolve/main/videos.zip
unzip videos.zip -d videos
cd ../..
```

Process the dataset:

```bash
conda activate UCPE
python tools/process_camerabench.py  # set split = "train" and split = "test"

conda activate vipe
cd thirdparty/vipe
python thirdparty/vipe/run.py pipeline=default streams=raw_mp4_stream streams.base_path=data/UCPE/CameraBench/videos/ pipeline.output.path=data/UCPE/CameraBench/vipe/ pipeline.output.save_artifacts=true pipeline.post.depth_align_model=null

conda activate UCPE
python tools/geocalib_camerabench.py
python tools/filter_camerabench.py
```

The processed dataset will be saved in data/UCPE/CameraBench.

PanFlow

Download PanoFlow's pretrained model PanoFlow(RAFT)-wo-CFE.pth from weiyun, then put it in the models/PanoFlow folder.

Our PanShot dataset is built upon the PanFlow dataset's videos and slam_poses. Please follow their instructions to download the full videos, and download their meta and slam_poses files following Full Dataset.

Then process the dataset with:

```bash
conda activate UCPE
python tools/filter_panflow.py

conda activate qalign
python tools/score_panflow.py

conda activate UCPE
python tools/align_panflow.py  # set split = "train" and split = "test"
python tools/match_panflow.py  # set split = "train" and split = "test"
python tools/normalize_panflow.py  # set split = "train" and split = "test"
```

PanShot

Export your YouTube cookies to ~/.config/cookies.txt in Netscape format for 4K downloads. Then download and process the dataset:

```bash
conda activate UCPE
python tools/process_panshot.py  # set split = "train" and split = "test"
python tools/caption_panshot.py  # set split = "train" and split = "test"
```
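The Netscape cookie format is plain text: lines starting with `#` are comments, and each cookie is seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). A minimal sketch for validating an exported file before the download scripts consume it:

```python
def parse_netscape_cookies(text):
    """Parse Netscape-format cookie text into a list of dicts.
    Each cookie line has exactly 7 tab-separated fields."""
    cookies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the header comment and blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"malformed cookie line: {line!r}")
        domain, subdomains, path, secure, expires, name, value = fields
        cookies.append({"domain": domain, "path": path, "name": name,
                        "value": value, "expires": int(expires)})
    return cookies
```

If this raises on your exported file, the browser extension likely did not write true Netscape format.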
</details> <br>

🏑 RealEstate10k Dataset

We use the RealEstate10k dataset for evaluation, so only poses and captions are needed. Please download the RealEstate10k poses from the official website (RealEstate10K.tgz) and unpack it to the data/RealEstate10k folder. Then download the captions from CameraCtrl (train and test).

The final folder structure should be like this:

```text
├── captions
│   ├── test.json
│   └── train.json
├── pose_files
│   ├── test
│   └── train
└── traj_normalization.txt
```
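For reference, each pose line in a RealEstate10K file consists of 19 space-separated values: a frame timestamp in microseconds, four normalized intrinsics (fx, fy, cx, cy), two unused fields, and a row-major 3x4 world-from-camera pose. A minimal parser sketch (verify the layout against the files you actually download):

```python
import numpy as np

def parse_re10k_line(line):
    """Parse one RealEstate10K pose line into (timestamp, intrinsics, 3x4 pose)."""
    vals = line.split()
    if len(vals) != 19:
        raise ValueError(f"expected 19 fields, got {len(vals)}")
    timestamp = int(vals[0])
    fx, fy, cx, cy = map(float, vals[1:5])          # normalized by image width/height
    pose = np.array(vals[7:19], dtype=np.float64).reshape(3, 4)
    return timestamp, (fx, fy, cx, cy), pose
```

Since the intrinsics are normalized, multiply fx and cx by the image width (and fy, cy by the height) to recover pixel-unit values.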

🎯 Training and Evaluation

Prepare the latent cache and train the model with:

```bash
python src/cache.py
bash scripts/train.sh
```

We used 8 A800 GPUs for training, which takes about 1 day. You'll get a WANDB_RUN_ID (e.g., 6wodf04s) after starting the training. The logs will be synced to your wandb account and the checkpoints will be saved in logs/<WANDB_RUN_ID>/checkpoints/. You can use other commented settings in scripts/train.sh for ablation studies and baselines.

For evaluation, first download the pretrained model i3d_pretrained_400.pt from common_metrics_on_video_quality, then put it in the models/FVD folder. Evaluate results with:

```bash
bash scripts/evaluate.sh
```
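For context, FVD is the Fréchet distance between Gaussian fits of I3D features extracted from real and generated videos. The statistic itself fits in a few lines (a sketch of the metric using NumPy/SciPy, not the repo's evaluation code, which also handles the I3D feature extraction):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets (N, D)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # numerical noise can add tiny imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```

Identical feature sets give a distance of (numerically) zero; shifting one distribution's mean grows the first term quadratically.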

Please change the `WANDB_RUN_ID` to your own run before evaluating.
