# STream3R

Dynamic 3D Foundation Model using Causal Transformer. [ICLR 2026]
## :fire: News
- [Mar 9, 2026] Check out the DUSt3R-based metric-scale STream3R[α] version on the alpha branch.
- [Jan 26, 2026] Accepted to ICLR 2026!
- [Sep 16, 2025] The complete training code is released!
- [Aug 22, 2025] The evaluation code is now available!
- [Aug 15, 2025] Our inference code and weights are released!
## 🔧 Installation
- Clone Repo

  ```shell
  git clone https://github.com/NIRVANALAN/STream3R
  cd STream3R
  ```

- Create Conda Environment

  ```shell
  conda create -n stream3r python=3.11 cmake=3.14.0 -y
  conda activate stream3r
  ```

- Install Python Dependencies

  Important: Install Torch based on your CUDA version. For example, for Torch 2.8.0 + CUDA 12.6:

  ```shell
  # Install Torch
  pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
  # Install other dependencies
  pip install -r requirements.txt
  # Install STream3R as a package
  pip install -e .
  ```
## :computer: Inference
You can now try STream3R with the following code. The checkpoint will be downloaded automatically from Hugging Face.
You can set the inference mode to causal for causal attention, window for sliding window attention (with a default window size of 5), or full for bidirectional attention.
```python
import os
import torch

from stream3r.models.stream3r import STream3R
from stream3r.models.components.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = STream3R.from_pretrained("yslan/STream3R").to(device)
model.eval()

example_dir = "examples/static_room"
image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
images = load_and_preprocess_images(image_names).to(device)

with torch.no_grad():
    # Use one mode "causal", "window", or "full" in a single forward pass
    predictions = model(images, mode="causal")
```
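The three modes correspond to different visibility patterns over the frame sequence. The following framework-free sketch illustrates that difference; it is a conceptual illustration, not STream3R's actual implementation (the function name `build_frame_mask` is hypothetical, and only the frame-level granularity and the default window size of 5 come from the text above):

```python
def build_frame_mask(num_frames: int, mode: str, window: int = 5):
    """Return a boolean matrix where mask[q][k] is True if frame q may attend to frame k."""
    mask = [[False] * num_frames for _ in range(num_frames)]
    for q in range(num_frames):
        for k in range(num_frames):
            if mode == "full":        # bidirectional: every frame sees every frame
                mask[q][k] = True
            elif mode == "causal":    # causal: only the current and past frames
                mask[q][k] = k <= q
            elif mode == "window":    # sliding window: last `window` frames up to q
                mask[q][k] = q - window < k <= q
            else:
                raise ValueError(f"unknown mode: {mode}")
    return mask

# For frame 6 of an 8-frame stream:
causal = build_frame_mask(8, "causal")
window = build_frame_mask(8, "window", window=5)
print(sum(causal[6]))  # 7 visible frames (0..6)
print(sum(window[6]))  # 5 visible frames (2..6)
```

Causal and window modes keep per-frame cost bounded as the stream grows, which is what makes the streaming session below practical.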
A ready-to-use script is provided at inference_stream3r.py.
We also support a KV cache variant that enables streaming input via StreamSession. StreamSession takes sequential inputs and processes them one by one, making it suitable for real-time or low-latency applications. This streaming 3D reconstruction pipeline applies to scenarios such as real-time robotics, autonomous navigation, online 3D understanding, and SLAM. Example usage is shown below:
```python
import os
import torch

from stream3r.models.stream3r import STream3R
from stream3r.stream_session import StreamSession
from stream3r.models.components.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"
model = STream3R.from_pretrained("yslan/STream3R").to(device)

example_dir = "examples/static_room"
image_names = [os.path.join(example_dir, file) for file in sorted(os.listdir(example_dir))]
images = load_and_preprocess_images(image_names).to(device)

# StreamSession supports KV cache management for both "causal" and "window" modes.
session = StreamSession(model, mode="causal")
with torch.no_grad():
    # Process images one by one to simulate streaming inference
    for i in range(images.shape[0]):
        image = images[i : i + 1]
        predictions = session.forward_stream(image)

session.clear()
```
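Conceptually, window mode keeps the KV cache bounded by evicting the oldest frame's entries once the window is full, while causal mode keeps everything. A toy sketch of that bookkeeping (the class and its internals are hypothetical; the real StreamSession manages per-layer key/value tensors, not frame IDs):

```python
from collections import deque

class ToyKVCache:
    """Toy cache keeping at most `window` frames in "window" mode; "causal" never evicts."""
    def __init__(self, mode: str = "causal", window: int = 5):
        self.mode, self.window = mode, window
        self.frames = deque()          # stand-in for cached key/value tensors

    def forward_stream(self, frame_id):
        self.frames.append(frame_id)   # cache the new frame's keys/values
        if self.mode == "window" and len(self.frames) > self.window:
            self.frames.popleft()      # evict the oldest frame's cache
        return list(self.frames)       # frames visible to the current query

    def clear(self):
        self.frames.clear()

session = ToyKVCache(mode="window", window=5)
for i in range(8):
    visible = session.forward_stream(i)
print(visible)  # [3, 4, 5, 6, 7]
```

The bounded cache is why window mode keeps memory flat over arbitrarily long streams, at the cost of discarding context outside the window.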
## :zap: Demo
You can run the demo, built on VGG-T's code, using the script app.py with the following command:

```shell
python app.py
```
## 📁 Code Structure
The repository is structured as follows:
```
STream3R/
├── stream3r/
│   ├── models/
│   │   ├── stream3r.py
│   │   ├── multiview_dust3r_module.py
│   │   └── components/
│   ├── dust3r/
│   ├── croco/
│   ├── utils/
│   └── stream_session.py
├── configs/
├── examples/
├── assets/
├── app.py
├── requirements.txt
├── setup.py
└── README.md
```
## :100: Quantitative Results
3D Reconstruction Comparison on NRGBD.
| Method | Type | Acc Mean ↓ | Acc Med. ↓ | Comp Mean ↓ | Comp Med. ↓ | NC Mean ↑ | NC Med. ↑ |
|------------|--------|------------|------------|-------------|-------------|-----------|-----------|
| VGG-T | FA | 0.073 | 0.018 | 0.077 | 0.021 | 0.910 | 0.990 |
| DUSt3R | Optim | 0.144 | 0.019 | 0.154 | 0.018 | 0.870 | 0.982 |
| MASt3R | Optim | 0.085 | 0.033 | 0.063 | 0.028 | 0.794 | 0.928 |
| MonST3R | Optim | 0.272 | 0.114 | 0.287 | 0.110 | 0.758 | 0.843 |
| Spann3R | Stream | 0.416 | 0.323 | 0.417 | 0.285 | 0.684 | 0.789 |
| CUT3R | Stream | 0.099 | 0.031 | 0.076 | 0.026 | 0.837 | 0.971 |
| StreamVGGT | Stream | 0.084 | 0.044 | 0.074 | 0.041 | 0.861 | 0.986 |
| Ours | Stream | 0.057 | 0.014 | 0.028 | 0.013 | 0.910 | 0.993 |
Read our full paper for more insights.
## ⏳ GPU Memory Usage and Runtime
We report the peak GPU memory usage (VRAM) and runtime of our full model for processing each streaming input using the StreamSession implementation. All experiments were conducted at a common resolution of 518 × 384 on a single H200 GPU. The benchmark includes both Causal for causal attention and Window for sliding window attention with a window size of 5.
*(Figure: peak VRAM and per-frame runtime in seconds for the Causal and Window modes.)*
