StreamVGGT
[ICLR 2026] Streaming 4D Visual Geometry Transformer
Paper | Project Page | Online Demo
Streaming 4D Visual Geometry Transformer
Dong Zhuo<sup>*</sup>, Wenzhao Zheng<sup>*†</sup>, Jiahe Guo, Yuqi Wu, Jie Zhou, Jiwen Lu
<sup>*</sup> Equal contribution. <sup>†</sup> Project leader.
StreamVGGT is a causal transformer architecture for real-time streaming 4D visual geometry perception. It is compatible with LLM-targeted attention mechanisms (e.g., FlashAttention) and delivers both fast inference and high-quality 4D reconstruction.
News
- [2025/7/18] Demo and checkpoints released on Hugging Face; demo code is available for local launch.
- [2025/7/15] Paper released on arXiv.
- [2025/7/14] Code for fine-tuning VGGT released.
- [2025/7/13] Check out Point3R for another streaming 3D reconstruction work of ours!
- [2025/7/13] Distillation code for VGGT is released.
- [2025/7/13] Inference code with FlashAttention-2 is released.
- [2025/7/13] Training/evaluation code released.
Overview
Offline models must reprocess the entire image sequence and reconstruct the whole scene each time a new image arrives. StreamVGGT instead employs temporal causal attention and leverages cached memory tokens to support efficient, incremental, on-the-fly reconstruction, enabling interactive and real-time online applications.
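The difference between offline recomputation and cached streaming can be illustrated with a toy sketch. This is not the StreamVGGT implementation: it assumes single-head attention with identity projections (query = key = value = token), and the names `attend`, `StreamingAttention`, and `offline_causal` are hypothetical. It only shows that attending over a growing key/value cache gives the same per-frame output as recomputing causal attention over all frames seen so far.

```python
import math


def attend(queries, keys, values):
    """Scaled dot-product attention over plain Python lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(dim)
        ])
    return outputs


class StreamingAttention:
    """Toy frame-level causal attention with a growing key/value cache."""

    def __init__(self):
        self.cached_keys = []
        self.cached_values = []

    def step(self, frame_tokens):
        # Assumption: identity projections, so each token is its own key and value.
        self.cached_keys.extend(frame_tokens)
        self.cached_values.extend(frame_tokens)
        # Only the new frame's tokens act as queries; past frames come from the cache.
        return attend(frame_tokens, self.cached_keys, self.cached_values)


def offline_causal(frames):
    """Reference: rebuild the visible context from scratch for every frame."""
    outputs = []
    for t in range(len(frames)):
        visible = [tok for frame in frames[: t + 1] for tok in frame]
        outputs.append(attend(frames[t], visible, visible))
    return outputs


# Three frames with two 2-D tokens each (arbitrary illustrative values).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.2, 0.8], [0.9, 0.1]],
]
stream = StreamingAttention()
streamed = [stream.step(frame) for frame in frames]
reference = offline_causal(frames)
```

In this sketch the streaming path processes each frame once, while the offline reference rebuilds the full context at every step; the two produce matching outputs because a frame's tokens never attend to later frames.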
<img src="./assets/teaser_v2_01.png" alt="overview" style="width: 100%;" />
On-the-Fly Online Reconstruction from Streaming Inputs
<img src="./assets/results.png" alt="overview" style="width: 100%;" />
Installation
- Clone StreamVGGT
git clone https://github.com/wzzheng/StreamVGGT.git
cd StreamVGGT
- Create conda environment
conda create -n StreamVGGT python=3.11 cmake=3.14.0
conda activate StreamVGGT
- Install requirements
pip install -r requirements.txt
conda install 'llvm-openmp<16'
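After installation, a quick sanity check of the environment can save a failed training launch. The sketch below is an optional helper, not part of the repository: the function names are hypothetical, and the assumption that `torch` and `accelerate` are installed by `requirements.txt` is inferred from the launch commands in this README rather than confirmed.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_matches(major, minor):
    """Check whether the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


# Assumption: these packages come from requirements.txt; adjust the list as needed.
REQUIRED = ["torch", "accelerate"]

if not python_matches(3, 11):
    print("Warning: the conda environment above pins python=3.11, found",
          ".".join(str(part) for part in sys.version_info[:2]))
for name in missing_packages(REQUIRED):
    print("Missing package:", name)
```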
Download Checkpoints
Please download the pretrained teacher model from here.
The StreamVGGT checkpoint is also available on both Hugging Face and Tsinghua Cloud.
Data Preparation
Training Datasets
Our training data includes 14 datasets. Please download the datasets from their official sources and refer to CUT3R for processing these datasets.
- ARKitScenes
- BlendedMVS
- CO3Dv2
- MegaDepth
- MVS-Synth
- ScanNet++
- ScanNet
- Spring
- Hypersim
- WildRGB-D
- Waymo Open Dataset
- Virtual KITTI 2
- OmniObject3D
- PointOdyssey
Evaluation Datasets
Please refer to MonST3R and Spann3R to prepare the Sintel, Bonn, KITTI, NYU-v2, ScanNet, 7-Scenes, and Neural-RGBD datasets.
Folder Structure
The overall folder structure should be organized as follows:
StreamVGGT
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   ├── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   ├── train/
│   │   ├── processed_arkitscenes
│   │   ├── ...
└── src/
    ├── ...
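The layout above can be checked before launching training or evaluation. The following is a minimal sketch, assuming it is run from the repository root; `EXPECTED_DIRS`, `EXPECTED_FILES`, and `missing_entries` are hypothetical names, and only the entries spelled out in the tree are checked (elided `...` entries are left alone).

```python
from pathlib import Path

# Directories and files named explicitly in the folder structure above.
EXPECTED_DIRS = ["ckpt", "config", "data/eval", "data/train", "src"]
EXPECTED_FILES = ["ckpt/model.pt", "ckpt/checkpoints.pth"]


def missing_entries(root):
    """List expected directories and files that do not exist under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Report anything absent relative to the current working directory.
for entry in missing_entries("."):
    print("missing:", entry)
```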
Finetuning VGGT
We also provide the following commands to fine-tune VGGT (excluding the track head).
cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./finetune.py --config-name finetune
Training StreamVGGT
We provide the following commands for training.
cd src/
NCCL_DEBUG=TRACE TORCH_DISTRIBUTED_DEBUG=DETAIL HYDRA_FULL_ERROR=1 accelerate launch --multi_gpu --main_process_port 26902 ./train.py --config-name train
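Both launch commands hard-code `--main_process_port 26902`, which can collide with another job on a shared machine. The sketch below assembles the same command from Python so the port can be swapped; `build_launch_command`, `find_free_port`, and `DEBUG_ENV` are hypothetical helpers, not part of the repository, and only flags that appear in the commands above are used.

```python
import shlex
import socket

# Debug environment variables taken from the commands above.
DEBUG_ENV = {
    "NCCL_DEBUG": "TRACE",
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "HYDRA_FULL_ERROR": "1",
}


def find_free_port():
    """Ask the OS for a TCP port that is unused on localhost right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def build_launch_command(script, config_name, port):
    """Assemble the accelerate launch command as an argument list."""
    return [
        "accelerate", "launch", "--multi_gpu",
        "--main_process_port", str(port),
        script, "--config-name", config_name,
    ]


command = build_launch_command("./train.py", "train", 26902)
print(shlex.join(command))

# To actually launch from src/ with a free port (not executed here):
#   import os, subprocess
#   command = build_launch_command("./train.py", "train", find_free_port())
#   subprocess.run(command, env={**os.environ, **DEBUG_ENV}, check=True)
```

A port reported as free can still be claimed by another process before the launch starts, so this reduces collisions rather than ruling them out.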
Evaluation
The evaluation code follows MonST3R, CUT3R and VGGT.
cd src/
Monodepth
bash eval/monodepth/run.sh
Results will be saved in eval_results/monodepth/${data}_${model_name}/metric.json.
VideoDepth
bash eval/video_depth/run.sh
Results will be saved in eval_results/video_depth/${data}_${model_name}/result_scale.json.
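Since each run writes its JSON file to its own `${data}_${model_name}` directory, gathering them in one place makes comparison easier. This is a minimal sketch, assuming it is run from `src/` after the scripts above have finished; `collect_results` is a hypothetical helper, and the JSON contents are printed as-is because their keys are not documented here.

```python
import json
from pathlib import Path


def collect_results(root, task, filename):
    """Map each run directory name under root/task to its parsed JSON file."""
    results = {}
    for path in sorted(Path(root, task).glob(f"*/{filename}")):
        with path.open() as handle:
            results[path.parent.name] = json.load(handle)
    return results


# File names follow the result paths stated above.
TASK_FILES = [("monodepth", "metric.json"), ("video_depth", "result_scale.json")]

for task, filename in TASK_FILES:
    for run, metrics in collect_results("eval_results", task, filename).items():
        print(task, run, metrics)
```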
Multi-view Reconstruction
bash eval/mv_recon/run.sh
Results will be saved in eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt.
Camera Pose Estimation
- Install the required dependencies:
pip install pycolmap==3.10.0 pyceres==2.3
git clone https://github.com/cvg/LightGlue.git
cd LightGlue
python -m pip install -e .
cd ..
- Please refer to VGGT to prepare the co3d dataset.
- Run the evaluation code:
python eval/pose_evaluation/test_co3d.py --co3d_dir /YOUR/CO3D/PATH --co3d_anno_dir /YOUR/CO3D/ANNO/PATH --seed 0
Demo
We provide a demo for StreamVGGT, based on the demo code from VGGT. You can follow the instructions below to launch it locally or try it out directly on Hugging Face.
pip install -r requirements_demo.txt
python demo_gradio.py
Note: While StreamVGGT typically reconstructs a scene in under one second, 3D point visualization may take much longer due to slower third-party rendering.
Acknowledgements
Our code is based on the following brilliant repositories:
DUSt3R, MonST3R, Spann3R, CUT3R, VGGT, Point3R
Many thanks to these authors!
Citation
If you find this project helpful, please consider citing the following paper:
@article{streamVGGT,
title={Streaming 4D Visual Geometry Transformer},
author={Dong Zhuo and Wenzhao Zheng and Jiahe Guo and Yuqi Wu and Jie Zhou and Jiwen Lu},
journal={arXiv preprint arXiv:2507.11539},
year={2025}
}
