PAGE4D
[ICLR 2026] PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception
Media Lab, MIT; Harvard Medical School
Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang†, Mengyu Wang†
(†: Jointly Supervised)
@inproceedings{zhou2025page,
  title={PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception},
  author={Zhou, Kaichen and Wang, Yuhan and Chen, Grace and Chang, Xinhai and Beaudouin, Gaspard and Zhan, Fangneng and Liang, Paul Pu and Wang, Mengyu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
Overview
PAGE-4D (ICLR 2026) extends the Visual Geometry Grounded Transformer (VGGT, CVPR 2025) to dynamic scenes. It is a feed-forward neural network that directly infers key 4D scene attributes, including camera poses, depth maps, and dense point maps, while explicitly modeling dynamic elements such as moving humans and deformable objects—all without requiring post-processing or optimization.
Quick Start
First, clone this repository to your local machine and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).
git clone https://github.com/kaichen-z/PAGE4D.git
cd PAGE4D
pip install -r requirements.txt
Now, try the model with just a few lines of code:
import torch
from page.models.vggt import VGGT
from page.utils.load_fn import load_and_preprocess_images
from page.utils.pose_enc import pose_encoding_to_extri_intri
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float16
model = VGGT()
checkpoint = torch.load("path/to/checkpoint.pt", map_location=device)  # set to your downloaded PAGE-4D checkpoint
model.load_state_dict(checkpoint["model"], strict=False)
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
images = load_and_preprocess_images(image_names).to(device)
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        predictions = model(images)
Training
Training uses launch_gra.py with gradient checkpointing for memory efficiency.
Quick start
cd training_bash
bash final_train.sh
The script runs training with automatic retries on failure and logs to logs/training_final.log.
Direct run
cd training
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=29508 launch_gra.py --config training_final
For multi-GPU training, set CUDA_VISIBLE_DEVICES and --nproc_per_node accordingly.
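For instance, a two-GPU variant of the single-GPU command above might look like this (GPU ids and port are illustrative):

```shell
# Two-GPU run; adjust GPU ids and master_port for your machine
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=29508 launch_gra.py --config training_final
```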
Configuration
Edit training/config/training_final.yaml to customize:
- Datasets: Training and validation datasets under data.train.dataset.dataset_configs and data.val.dataset.dataset_configs. Update dataset_location paths for your environment.
- Resume: Set checkpoint.resume_checkpoint_path to resume from a checkpoint.
- Experiment: exp_name controls checkpoint and log directory names.
- Debug limits: limit_train_batches and limit_val_batches cap batches per epoch; set to null for full training.
Update TRAINING_CMD and LOG_DIR in final_train.sh if your project path differs from the default.
Feature Map Visualization
We provide a detailed visualization strategy used in Figure 2 of our paper. The script eval/visualization.py extracts and visualizes the model's internal feature maps to illustrate how PAGE-4D disentangles frame-local and global cross-view information.
Usage
cd eval
python visualization.py
Configure at the bottom of visualization.py:
- directory: Output folder for saved visualizations.
- image_names: List of image paths (multi-view inputs).
- initial_num, gap: Frame indices for video sequences (e.g., rgb_{initial_num:05d}.jpg, rgb_{initial_num+gap:05d}.jpg).
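As a sketch, the configuration block at the bottom of visualization.py might be edited like this (all values are placeholders, not defaults shipped with the repo):

```python
# Illustrative values only; edit to match your data and output location
directory = "outputs/feature_vis"        # folder where heatmaps are written
image_names = [
    "data/scene/rgb_00000.jpg",
    "data/scene/rgb_00010.jpg",
]
initial_num, gap = 0, 10                 # frames rgb_{initial_num:05d}.jpg and rgb_{initial_num+gap:05d}.jpg
```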
Output
For each input image and each transformer layer, the script saves:
- {name}_frame_feature_{layer}.png: Frame-local feature heatmap.
- {name}_global_feature_{layer}.png: Global cross-view feature heatmap.
Data-Preparation
Prepare scripts
Dataset-specific preparation scripts live in training/data/datasets/prepare/. They sample frames and produce standardized directory layouts for the dataloaders.
TUM RGB-D (tum_pre.py):
- Input: Raw TUM format with rgb.txt, groundtruth.txt, depth.txt.
- Process: Associates RGB, depth, and pose by timestamp; samples 90 frames at stride 3.
- Output per sequence: rgb_90/, depth_90/, groundtruth_90.txt.
# Run from project root; update dataset_location in script if needed
python -m data.datasets.prepare.tum_pre
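The timestamp-association step can be sketched as a greedy nearest-neighbor match within a tolerance. This is a simplified stand-in for the script's actual logic; the function name and tolerance are ours:

```python
def associate(ts_a, ts_b, max_diff=0.02):
    """Greedily match each timestamp in ts_a to the closest unused timestamp
    in ts_b, keeping only pairs whose gap is below max_diff (seconds)."""
    candidates = sorted(
        (abs(a - b), a, b) for a in ts_a for b in ts_b if abs(a - b) < max_diff
    )
    matches, used_a, used_b = [], set(), set()
    for diff, a, b in candidates:       # closest pairs claimed first
        if a not in used_a and b not in used_b:
            matches.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return sorted(matches)

rgb_ts = [1.000, 1.033, 1.066]
depth_ts = [1.001, 1.035, 1.200]
print(associate(rgb_ts, depth_ts))  # [(1.0, 1.001), (1.033, 1.035)]
```

The frame at 1.066 has no depth timestamp within 20 ms, so it is dropped rather than mismatched.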
Bonn RGB-D (bonn_pre.py):
- Input: rgbd_bonn_dataset/*/rgb/*.png, depth/*.png, groundtruth.txt.
- Process: Samples frames 30–140 (110 frames) for sequences balloon2, crowd2, crowd3, person_tracking2, synchronous.
- Output per sequence: rgb_110/, depth_110/, groundtruth_110.txt.
python -m data.datasets.prepare.bonn_pre
Update the dirs path at the top of each script to your dataset location.
Validate the dataloader
dataset_validation.py checks that the dataloader works with your config and optionally saves visualizations (point clouds, depth maps, tracks).
cd training
python -m data.dataset_validation --config debug
Enable your dataset in training/data/datasets/config/debug.yaml (or the config you pass) by uncommenting the corresponding dataset entry. The script loads the dataset via Hydra, iterates the loader, and can save:
- .ply point clouds (world and camera coordinates),
- Side-by-side track visualizations,
- Depth maps,
- Track-overlay videos.
Set save_address in the script to the desired output directory.
Evaluation
We provide evaluation pipelines for monocular depth, video depth, and relative pose (camera trajectory) on dynamic scenarios. Each pipeline can be run via its run_page.sh script. Edit model_weights, datasets, and paths in the script (and in eval/eval/*/metadata.py) for your environment before running.
1. Monocular Depth (eval/eval/monodepth/)
Evaluates single-image depth estimation. Uncomment the launch_page.py block in run_page.sh to run inference first (saves depth .npy); otherwise the script runs eval_metrics.py on existing predictions (Abs Rel, Sq Rel, RMSE, δ thresholds).
# Edit model_weights, datasets in run_page.sh first
bash eval/eval/monodepth/run_page.sh
Datasets: sintel, bonn, dyncheck (edit the datasets array in the script). See metadata.py for more options.
2. Video Depth (eval/eval/video_depth/)
Evaluates depth on video sequences with sliding-window inference. Uses multi-GPU via accelerate.
# Edit model_weights, datasets in run_page.sh first
bash eval/eval/video_depth/run_page.sh
Datasets: sintel, bonn, dyncheck. Metrics: Abs Rel, Sq Rel, RMSE, Log RMSE, δ < 1.25, etc. To compute metrics after inference, uncomment and run the eval_depth.py block in the script.
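For reference, the standard depth metrics listed above can be sketched in a few lines. This is our simplified version over flat lists of paired depths; the repo's evaluation additionally applies validity masking and scale alignment:

```python
import math

def depth_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMSE, Log RMSE, and the delta < 1.25 inlier ratio
    over paired per-pixel depths (valid, positive values only)."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    log_rmse = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n)
    delta = sum(max(p / g, g / p) < 1.25 for p, g in zip(pred, gt)) / n
    return abs_rel, sq_rel, rmse, log_rmse, delta

pred = [1.0, 2.2, 2.9]
gt = [1.0, 2.0, 3.0]
```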
3. Relative Pose (eval/eval/relpose/)
Evaluates camera trajectory (pose) estimation using evo. Outputs ATE and RPE (translation, rotation).
# Edit model_weights, datasets in run_page.sh first
bash eval/eval/relpose/run_page.sh
Datasets: sintel, tum. Outputs: pred_traj.txt, pred_focal.txt, pred_intrinsics.txt, trajectory plots, *_eval_metric.txt with ATE/RPE.
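evo first aligns the estimated trajectory to the ground truth (Umeyama alignment) and then reports the RMSE of translation errors. Setting alignment aside, the core ATE computation can be sketched as follows (function name is ours):

```python
import math

def ate_rmse(pred_xyz, gt_xyz):
    """RMSE of per-frame translation errors between already-aligned
    predicted and ground-truth camera positions (x, y, z tuples)."""
    sq = [
        (px - gx) ** 2 + (py - gy) ** 2 + (pz - gz) ** 2
        for (px, py, pz), (gx, gy, gz) in zip(pred_xyz, gt_xyz)
    ]
    return math.sqrt(sum(sq) / len(sq))

pred = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
```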
Detailed Usage
Optionally, you can choose which attributes (branches) to predict, as shown below; this produces the same results as the example above. The snippet uses a batch size of 1 (a single scene), but it works for multiple scenes as well.
from page.utils.pose_enc import pose_encoding_to_extri_intri
with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        aggregated_tokens_list, ps_idx = model.aggregator(images)

    # Predict Cameras
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

    # Predict Depth Maps
    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)

    # Predict Point Maps
    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)
Checkpoint
Spatial mask during training. Fine-tuning uses a learnable spatial mask in the aggregator (SpatialMaskHead_IMP in model/page/layers/block.py). Its strength is scheduled with mask_alpha(step, mask_hold_start, mask_hold_end): the mask is fully on for early optimizer steps, then its influence is reduced smoothly (cosine decay) until it is off. Set mask_hold_start / mask_hold_end in your training config (e.g. training_final.yaml under model).
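The schedule described above can be sketched as follows. This is a reconstruction from the description, not the repo's exact code; the cosine decay shape and boundary handling are assumptions:

```python
import math

def mask_alpha(step, mask_hold_start, mask_hold_end):
    """Spatial-mask strength schedule: fully on (1.0) until mask_hold_start,
    cosine decay to 0.0 between mask_hold_start and mask_hold_end, then off.
    With mask_hold_start = mask_hold_end = 0 (the default eval setup),
    the strength is zero from step 0 onward."""
    if step >= mask_hold_end:
        return 0.0
    if step <= mask_hold_start:
        return 1.0
    t = (step - mask_hold_start) / (mask_hold_end - mask_hold_start)
    return 0.5 * (1.0 + math.cos(math.pi * t))
```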
At inference. With the default eval setup (mask_hold_start=mask_hold_end=0, e.g. --num_mask 0 in launch scripts), mask_alpha yields zero strength while step stays at 0, so cam_row_mask is all zeros and the attention module skips the Q/K bias augmentation (attention.py: attn_mask.any() is False).