[ICLR 2026] PAGE-4D: Disentangled Pose and Geometry Estimation for VGGT-4D Perception


<div align="center"> <h1>Page-4d: Disentangled pose and geometry estimation for VGGT-4d perception</h1> <a href="https://openreview.net/pdf?id=Nfmzp5PBzr" target="_blank" rel="noopener noreferrer"> <img src="https://img.shields.io/badge/Paper-VGGT" alt="Paper PDF"></a> <a href="https://arxiv.org/pdf/2510.17568"><img src="https://img.shields.io/badge/arXiv-2510.17568-b31b1b" alt="arXiv"></a> <a href="https://page4d.github.io/"><img src="https://img.shields.io/badge/Project_Page-green" alt="Project Page"></a>

Media Lab, MIT; Harvard Medical School

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang†, Mengyu Wang†

(†: Jointly Supervised)

</div>
```bibtex
@inproceedings{zhou2025page,
  title={Page-4d: Disentangled pose and geometry estimation for VGGT-4d perception},
  author={Zhou, Kaichen and Wang, Yuhan and Chen, Grace and Chang, Xinhai and Beaudouin, Gaspard and Zhan, Fangneng and Liang, Paul Pu and Wang, Mengyu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

Overview

PAGE-4D (ICLR 2026) extends the Visual Geometry Grounded Transformer (VGGT, CVPR 2025) to dynamic scenes. It is a feed-forward neural network that directly infers key 4D scene attributes, including camera poses, depth maps, and dense point maps, while explicitly modeling dynamic elements such as moving humans and deformable objects—all without requiring post-processing or optimization.

Quick Start

First, clone this repository to your local machine, and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub).

```shell
git clone https://github.com/kaichen-z/PAGE4D.git
cd PAGE4D
pip install -r requirements.txt
```

Now, try the model with just a few lines of code:

```python
import torch
from page.models.vggt import VGGT
from page.utils.load_fn import load_and_preprocess_images
from page.utils.pose_enc import pose_encoding_to_extri_intri

device = "cuda" if torch.cuda.is_available() else "cpu"
# bfloat16 on Ampere and newer GPUs, float16 otherwise
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16

model = VGGT().to(device).eval()
checkpoint = torch.load("path/to/checkpoint.pt", map_location=device)  # set to your checkpoint path
model.load_state_dict(checkpoint["model"], strict=False)

image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        predictions = model(images)
```

Training

Training uses launch_gra.py with gradient checkpointing for memory efficiency.

Quick start

```shell
cd training_bash
bash final_train.sh
```

The script runs training with automatic retries on failure and logs to logs/training_final.log.

Direct run

```shell
cd training
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node=1 --master_port=29508 launch_gra.py --config training_final
```

For multi-GPU training, set CUDA_VISIBLE_DEVICES and --nproc_per_node accordingly.

Configuration

Edit training/config/training_final.yaml to customize:

  • Datasets: Training and validation datasets under data.train.dataset.dataset_configs and data.val.dataset.dataset_configs. Update dataset_location paths for your environment.
  • Resume: Set checkpoint.resume_checkpoint_path to resume from a checkpoint.
  • Experiment: exp_name controls checkpoint and log directory names.
  • Debug limits: limit_train_batches and limit_val_batches cap batches per epoch; set to null for full training.

Update TRAINING_CMD and LOG_DIR in final_train.sh if your project path differs from the default.
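Put together, the options above correspond to a config shaped roughly like this (a sketch with assumed nesting; consult the actual `training_final.yaml` for the full field layout):

```yaml
# Sketch of training/config/training_final.yaml — nesting is assumed
exp_name: page4d_final          # controls checkpoint and log directory names
checkpoint:
  resume_checkpoint_path: null  # set to a checkpoint path to resume
limit_train_batches: null       # small integer for debugging, null for full training
limit_val_batches: null
data:
  train:
    dataset:
      dataset_configs:
        - dataset_location: /path/to/your/train/data  # update for your environment
  val:
    dataset:
      dataset_configs:
        - dataset_location: /path/to/your/val/data
```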

Feature Map Visualization

We provide a detailed visualization strategy used in Figure 2 of our paper. The script eval/visualization.py extracts and visualizes the model's internal feature maps to illustrate how PAGE-4D disentangles frame-local and global cross-view information.

Usage

```shell
cd eval
python visualization.py
```

Configure at the bottom of visualization.py:

  • directory: Output folder for saved visualizations.
  • image_names: List of image paths (multi-view inputs).
  • initial_num, gap: Frame indices for video sequences (e.g., rgb_{initial_num:05d}.jpg, rgb_{initial_num+gap:05d}.jpg).
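For example, the frame-index settings expand into file names like this (values here are illustrative, not the script's defaults):

```python
# Illustrative values for the settings at the bottom of visualization.py
initial_num, gap = 0, 10
image_names = [f"rgb_{initial_num:05d}.jpg", f"rgb_{initial_num + gap:05d}.jpg"]
print(image_names)  # → ['rgb_00000.jpg', 'rgb_00010.jpg']
```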

Output

For each input image and each transformer layer, the script saves:

  • {name}_frame_feature_{layer}.png: Frame-local feature heatmap.
  • {name}_global_feature_{layer}.png: Global cross-view feature heatmap.

Data Preparation

Prepare scripts

Dataset-specific preparation scripts live in training/data/datasets/prepare/. They sample frames and produce standardized directory layouts for the dataloaders.

TUM RGB-D (tum_pre.py):

  • Input: Raw TUM format with rgb.txt, groundtruth.txt, depth.txt.
  • Process: Associates RGB, depth, and pose by timestamp; samples 90 frames at stride 3.
  • Output per sequence: rgb_90/, depth_90/, groundtruth_90.txt.
```shell
# Run from project root; update dataset_location in script if needed
python -m data.datasets.prepare.tum_pre
```
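The timestamp-association step can be sketched as a greedy nearest-neighbor match, followed by strided sampling (a simplified stand-in for the script's actual logic; `max_diff` is an assumed tolerance):

```python
def associate(ts_a, ts_b, max_diff=0.02):
    """Greedily pair timestamps from two streams within max_diff seconds."""
    candidates = sorted((abs(a - b), a, b) for a in ts_a for b in ts_b)
    used_a, used_b, matches = set(), set(), []
    for diff, a, b in candidates:
        if diff <= max_diff and a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            matches.append((a, b))
    return sorted(matches)

rgb_ts = [0.00, 0.03, 0.06, 0.09]
depth_ts = [0.01, 0.04, 0.07]
print(associate(rgb_ts, depth_ts))  # → [(0.0, 0.01), (0.03, 0.04), (0.06, 0.07)]

# Sampling 90 frames at stride 3 then reduces the matched sequence:
frames = list(range(300))
sampled = frames[::3][:90]
```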

Bonn RGB-D (bonn_pre.py):

  • Input: rgbd_bonn_dataset/*/rgb/*.png, depth/*.png, groundtruth.txt.
  • Process: Samples frames 30–140 (110 frames) for sequences balloon2, crowd2, crowd3, person_tracking2, synchronous.
  • Output per sequence: rgb_110/, depth_110/, groundtruth_110.txt.
```shell
python -m data.datasets.prepare.bonn_pre
```

Update the dirs path at the top of each script to your dataset location.

Validate the dataloader

dataset_validation.py checks that the dataloader works with your config and optionally saves visualizations (point clouds, depth maps, tracks).

```shell
cd training
python -m data.dataset_validation --config debug
```

Enable your dataset in training/data/datasets/config/debug.yaml (or the config you pass) by uncommenting the corresponding dataset entry. The script loads the dataset via Hydra, iterates the loader, and can save:

  • .ply point clouds (world and camera coordinates),
  • Side-by-side track visualizations,
  • Depth maps,
  • Track-overlay videos.

Set save_address in the script to the desired output directory.
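A minimal ASCII `.ply` writer illustrates the point-cloud output format (a sketch only; the validation script's actual writer may differ):

```python
def write_ply(path, points):
    """Write an Nx3 list of (x, y, z) points as a minimal ASCII PLY file."""
    with open(path, "w") as f:
        f.write("ply\nformat ascii 1.0\n")
        f.write(f"element vertex {len(points)}\n")
        f.write("property float x\nproperty float y\nproperty float z\n")
        f.write("end_header\n")
        for x, y, z in points:
            f.write(f"{x} {y} {z}\n")
```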

Evaluation

We provide evaluation pipelines for monocular depth, video depth, and relative pose (camera trajectory) on dynamic scenarios. Each pipeline can be run via its run_page.sh script. Edit model_weights, datasets, and paths in the script (and in eval/eval/*/metadata.py) for your environment before running.

1. Monocular Depth (eval/eval/monodepth/)

Evaluates single-image depth estimation. Uncomment the launch_page.py block in run_page.sh to run inference first (saves depth .npy); otherwise the script runs eval_metrics.py on existing predictions (Abs Rel, Sq Rel, RMSE, δ thresholds).

```shell
# Edit model_weights, datasets in run_page.sh first
bash eval/eval/monodepth/run_page.sh
```

Datasets: sintel, bonn, dyncheck (edit the datasets array in the script). See metadata.py for more options.
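The reported metrics can be sketched in a few lines (a simplified reference implementation, not the repository's `eval_metrics.py`):

```python
import math

def depth_metrics(pred, gt):
    """Abs Rel, Sq Rel, RMSE, and the δ < 1.25 ratio over paired depth values."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in zip(pred, gt)) / n
    return abs_rel, sq_rel, rmse, delta1
```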

2. Video Depth (eval/eval/video_depth/)

Evaluates depth on video sequences with sliding-window inference. Uses multi-GPU via accelerate.

```shell
# Edit model_weights, datasets in run_page.sh first
bash eval/eval/video_depth/run_page.sh
```

Datasets: sintel, bonn, dyncheck. Metrics: Abs Rel, Sq Rel, RMSE, Log RMSE, δ < 1.25, etc. To compute metrics after inference, uncomment and run the eval_depth.py block in the script.
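Sliding-window inference splits a sequence into overlapping chunks, roughly like this (window and stride values are illustrative, not the script's defaults):

```python
def sliding_windows(num_frames, window, stride):
    """Start/end frame indices for sliding-window video inference."""
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts and starts[-1] + window < num_frames:
        # add a final window flush with the end of the sequence
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]
```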

3. Relative Pose (eval/eval/relpose/)

Evaluates camera trajectory (pose) estimation using evo. Outputs ATE and RPE (translation, rotation).

```shell
# Edit model_weights, datasets in run_page.sh first
bash eval/eval/relpose/run_page.sh
```

Datasets: sintel, tum. Outputs: pred_traj.txt, pred_focal.txt, pred_intrinsics.txt, trajectory plots, *_eval_metric.txt with ATE/RPE.
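ATE reduces to an RMSE over per-frame position errors (a sketch; evo additionally aligns the trajectories, e.g. with a Umeyama/Sim(3) fit, before computing it):

```python
import math

def ate_rmse(pred_xyz, gt_xyz):
    """RMSE of per-frame translation errors between two aligned trajectories."""
    sq_errs = [
        sum((p - g) ** 2 for p, g in zip(pp, gg))
        for pp, gg in zip(pred_xyz, gt_xyz)
    ]
    return math.sqrt(sum(sq_errs) / len(sq_errs))
```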

Detailed Usage

You can also choose which attributes (branches) to predict, as shown below. This produces the same result as the example above. The example uses a batch size of 1 (a single scene), but it extends naturally to multiple scenes.

```python
from page.utils.pose_enc import pose_encoding_to_extri_intri

with torch.no_grad():
    with torch.cuda.amp.autocast(dtype=dtype):
        images = images[None]  # add batch dimension
        aggregated_tokens_list, ps_idx = model.aggregator(images)

    # Predict cameras
    pose_enc = model.camera_head(aggregated_tokens_list)[-1]
    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)
    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])

    # Predict depth maps
    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)

    # Predict point maps
    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)
```

Checkpoint

Spatial mask during training. Fine-tuning uses a learnable spatial mask in the aggregator (SpatialMaskHead_IMP in model/page/layers/block.py). Its strength is scheduled with mask_alpha(step, mask_hold_start, mask_hold_end): the mask is fully on for early optimizer steps, then its influence is reduced smoothly (cosine decay) until it is off. Set mask_hold_start / mask_hold_end in your training config (e.g. training_final.yaml under model).

At inference. With the default eval setup (mask_hold_start=mask_hold_end=0, e.g. --num_mask 0 in launch scripts), mask_alpha yields zero strength while step stays at 0, so cam_row_mask is all zeros and the attention module skips the Q/K bias augmentation (attention.py: attn_mask.any() is False).
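The schedule and its inference-time behavior can be sketched as follows (an assumed implementation of the description above, not the repository's exact code):

```python
import math

def mask_alpha(step, mask_hold_start, mask_hold_end):
    """Mask strength schedule (sketch): full strength until mask_hold_start,
    cosine decay to zero by mask_hold_end. With start == end == 0 (the default
    eval setup), the mask strength is zero from step 0 onward."""
    if mask_hold_end <= mask_hold_start:
        return 0.0 if step >= mask_hold_end else 1.0
    if step < mask_hold_start:
        return 1.0
    if step >= mask_hold_end:
        return 0.0
    t = (step - mask_hold_start) / (mask_hold_end - mask_hold_start)
    return 0.5 * (1.0 + math.cos(math.pi * t))
```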
