StreamVGGT

[ICLR 2026] Streaming 4D Visual Geometry Transformer

<div align="center"> <h1>Streaming 4D Visual Geometry Transformer</h1> </div>

Paper | Project Page | Online Demo

Dong Zhuo<sup>*</sup>, Wenzhao Zheng<sup>*</sup>$\dagger$, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu

<sup>*</sup> Equal contribution. $\dagger$ Project leader.

StreamVGGT, a causal transformer architecture for real-time streaming 4D visual geometry perception that is compatible with LLM-targeted attention mechanisms (e.g., FlashAttention), delivers both fast inference and high-quality 4D reconstruction.

News

[2025/7/18] Demo and checkpoints released on Hugging Face; demo code is available for local launch.
[2025/7/15] Paper released on arXiv.
[2025/7/14] Code for fine-tuning VGGT is released.
[2025/7/13] Check out Point3R for another streaming 3D reconstruction work of ours!
[2025/7/13] Distillation code for VGGT is released.
[2025/7/13] Inference code with FlashAttention-2 is released.
[2025/7/13] Training/evaluation code is released.

Overview

Offline models must reprocess the entire image sequence and reconstruct the entire scene each time a new image arrives. In contrast, StreamVGGT employs temporal causal attention and leverages cached memory tokens to support efficient, incremental, on-the-fly reconstruction, enabling interactive and real-time online applications.
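The idea of temporal causal attention with a cache can be illustrated with a toy example. The sketch below is not the StreamVGGT implementation: `StreamingCausalAttention`, `offline_causal_attention`, and the identity projections (query = key = value = token) are hypothetical simplifications chosen for illustration. It only shows that attending to cached keys/values of past frames gives the same per-frame output as rebuilding the full causal context from scratch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention of every query over the given keys/values."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


class StreamingCausalAttention:
    """Hypothetical toy layer: caches keys/values of all frames seen so far."""

    def __init__(self):
        self.cached_keys = []
        self.cached_values = []

    def step(self, frame_tokens):
        # Identity projections (q = k = v = token) keep the sketch short.
        self.cached_keys.extend(frame_tokens)
        self.cached_values.extend(frame_tokens)
        # The new frame attends to the cache, which now ends with itself.
        return attend(frame_tokens, self.cached_keys, self.cached_values)


def offline_causal_attention(frames):
    """Reference: rebuild the full context from scratch for each frame."""
    outputs = []
    for t in range(len(frames)):
        context = [tok for frame in frames[: t + 1] for tok in frame]
        outputs.append(attend(frames[t], context, context))
    return outputs


# Three frames with two 2-D tokens each (made-up values).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.2, -0.3], [0.7, 0.1]],
]
layer = StreamingCausalAttention()
streamed = [layer.step(frame) for frame in frames]
reference = offline_causal_attention(frames)
```

In this toy setting each `step` call processes only the new frame's queries, while the offline reference re-collects the whole prefix for every frame; the per-frame outputs agree.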

On-the-Fly Online Reconstruction from Streaming Inputs

Installation

Clone StreamVGGT

git clone https://github.com/wzzheng/StreamVGGT.git
cd StreamVGGT

Create conda environment

conda create -n StreamVGGT python=3.11 cmake=3.14.0
conda activate StreamVGGT

Install requirements

pip install -r requirements.txt
conda install 'llvm-openmp<16'

Download Checkpoints

Please download the pretrained teacher model from here.

The StreamVGGT checkpoint is also available on both Hugging Face and Tsinghua Cloud.

Data Preparation

Training Datasets

Our training data includes 14 datasets. Please download the datasets from their official sources and refer to CUT3R for processing these datasets.

Evaluation Datasets

Please refer to MonST3R and Spann3R to prepare Sintel, Bonn, KITTI, NYU-v2, ScanNet, 7scenes and Neural-RGBD datasets.

Folder Structure

The overall folder structure should be organized as follows:

StreamVGGT
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   ├── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   ├── train/
│   │   ├── processed_arkitscenes
│   │   ├── ...
└── src/
    ├── ...
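
Before launching training or evaluation, it can help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of the repository: `EXPECTED_PATHS` and `missing_paths` are hypothetical names, and only the entries spelled out in the tree are checked (the `...` items are left alone).

```python
from pathlib import Path

# Entries copied from the folder structure above; "..." items are not checked.
EXPECTED_PATHS = [
    "ckpt/model.pt",
    "ckpt/checkpoints.pth",
    "config",
    "data/eval/7scenes",
    "data/eval/bonn",
    "data/eval/kitti",
    "data/eval/neural_rgbd",
    "data/eval/nyu-v2",
    "data/eval/scannetv2",
    "data/eval/sintel",
    "data/train",
    "src",
]


def missing_paths(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the StreamVGGT repository root.
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Folder structure looks complete.")
```

Run from the repository root, the script lists any entries that still need to be downloaded or created.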

Finetuning VGGT

We also provide the following commands to fine-tune VGGT (excluding the track head).

cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./finetune.py --config-name finetune

Training StreamVGGT

We provide the following commands for training.

cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./train.py --config-name train

Evaluation

The evaluation code follows MonST3R, CUT3R and VGGT.

cd src/

Monodepth

bash eval/monodepth/run.sh

Results will be saved in eval_results/monodepth/${data}_${model_name}/metric.json.

VideoDepth

bash eval/video_depth/run.sh

Results will be saved in eval_results/video_depth/${data}_${model_name}/result_scale.json.

Multi-view Reconstruction

bash eval/mv_recon/run.sh

Results will be saved in eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt.
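
To gather the outputs of the three evaluations above in one place, a small helper along these lines may be useful. It is only a sketch: `collect_results` and the pattern constants are hypothetical names, the glob patterns are derived from the result paths stated above, and no assumption is made about the keys inside the JSON files.

```python
import json
from pathlib import Path

# Glob patterns derived from the result paths stated above.
JSON_PATTERNS = [
    "monodepth/*/metric.json",
    "video_depth/*/result_scale.json",
]
LOG_PATTERN = "mv_recon/*/logs_all.txt"


def collect_results(results_root="eval_results"):
    """Load every JSON result and list the multi-view reconstruction logs."""
    root = Path(results_root)
    metrics = {}
    for pattern in JSON_PATTERNS:
        for path in sorted(root.glob(pattern)):
            with path.open() as handle:
                # Key each result by its path relative to the results root.
                metrics[path.relative_to(root).as_posix()] = json.load(handle)
    logs = [p.relative_to(root).as_posix() for p in sorted(root.glob(LOG_PATTERN))]
    return metrics, logs
```

Calling `collect_results()` from `src/` after the runs finish returns the parsed JSON contents keyed by relative path, plus the list of `logs_all.txt` files to inspect by hand.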

Camera Pose Estimation

Install the required dependencies:

pip install pycolmap==3.10.0 pyceres==2.3
git clone https://github.com/cvg/LightGlue.git
cd LightGlue
python -m pip install -e .
cd ..

Please refer to VGGT to prepare the CO3D dataset, then run the evaluation code:

python eval/pose_evaluation/test_co3d.py --co3d_dir /YOUR/CO3D/PATH --co3d_anno_dir /YOUR/CO3D/ANNO/PATH --seed 0

Demo

We provide a demo for StreamVGGT, based on the demo code from VGGT. You can follow the instructions below to launch it locally or try it out directly on Hugging Face.

pip install -r requirements_demo.txt
python demo_gradio.py

Note: While StreamVGGT typically reconstructs a scene in under one second, 3D point visualization may take much longer due to slower third-party rendering.

Acknowledgements

Our code is based on the following brilliant repositories:

DUSt3R MonST3R Spann3R CUT3R VGGT Point3R

Many thanks to these authors!

Citation

If you find this project helpful, please consider citing the following paper:

@article{streamVGGT,
      title={Streaming 4D Visual Geometry Transformer}, 
      author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
      journal={arXiv preprint arXiv:2507.11539},
      year={2025}
}
