# [CVPR 2026] MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
<div align="center">Changli Wu<sup>1,2,†</sup>, Haodong Wang<sup>1,†</sup>, Jiayi Ji<sup>1,*</sup>, Yutian Yao<sup>5</sup>,
Chunsai Du<sup>4</sup>, Jihua Kang<sup>4</sup>, Yanwei Fu<sup>3,2</sup>, Liujuan Cao<sup>1</sup>
<sup>1</sup>Xiamen University, <sup>2</sup>Shanghai Innovation Institute, <sup>3</sup>Fudan University,
<sup>4</sup>ByteDance, <sup>5</sup>Tianjin University of Science and Technology
<sup>†</sup>Equal Contribution, <sup>*</sup>Corresponding Author
</div>

## 📢 News & Roadmap
🎉 [News] Our paper has been accepted to CVPR 2026! 🎉
This repository is the official implementation of MVGGT. All resources have been fully released. We warmly welcome everyone to try out our code, models, and the interactive demo!
- [x] Release the MVRefer Benchmark.
- [x] Release Training & Inference Code.
- [x] Release Pre-trained Models.
- [x] Release Interactive Demo Code (Local version).
## 📖 Abstract
Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views under strict latency constraints.
We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly.
We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives.
<div align="center"> <img src="https://sosppxo.github.io/mvggt.github.io/resources/figure1.png" width="80%"> <br> <em>Figure 1: Comparison of the proposed MV-3DRES task (bottom) against the traditional two-stage pipeline (top).</em> </div>

## 🚀 Method: MVGGT
We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an end-to-end framework designed for efficiency and robustness.
Figure 2: Architecture of MVGGT. It features a Frozen Reconstruction Branch (top) and a Trainable Multimodal Branch (bottom).
Note: For interactive 3D visualizations and video comparisons with other methods, please visit our Project Page.
## 🛠️ Installation
1. Clone the repository
git clone https://github.com/sosppxo/mvggt.git
cd mvggt
2. Create conda environment
Create and activate a new conda environment:
conda create -n mvggt python=3.12
conda activate mvggt
3. Install dependencies
Install the full requirements for training:
pip install -r requirements.txt
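A quick sanity check of the environment (this assumes requirements.txt pulls in PyTorch, which the steps above do not spell out):

```python
# Minimal environment check; assumes PyTorch was installed via requirements.txt.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```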
## 📂 Data Preparation
1. ScanNet dataset
Download the ScanNet dataset. The data should be organized as follows:
[data_root]/
├── scene0000_00/
│   ├── color/          # RGB images (.jpg)
│   ├── depth/          # Depth maps (.png)
│   ├── intrinsic/      # intrinsic_depth.txt
│   └── pose/           # Camera poses (.txt)
└── scans/              # Required for 2D instance masks
    └── scene0000_00/
        └── scene0000_00_2d-instance-filt/
            └── instance-filt/   # 2D instance segmentation masks (.png)
⚠️ Note: Remember to update the data_root path for both train_dataset and test_dataset in configs/data/example.yaml to point to your actual [data_root].
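To verify the layout before training, a minimal sketch like the following can help (folder names follow the tree above; data_root and the scene ID are placeholders to fill in):

```python
# Sketch: check that one ScanNet scene matches the expected layout.
# data_root and scene_id are placeholders; folder names follow the tree above.
from pathlib import Path

data_root = Path("/path/to/data_root")
scene_id = "scene0000_00"

expected = [
    data_root / scene_id / "color",
    data_root / scene_id / "depth",
    data_root / scene_id / "intrinsic",
    data_root / scene_id / "pose",
    data_root / "scans" / scene_id / f"{scene_id}_2d-instance-filt" / "instance-filt",
]
for path in expected:
    print(("OK     " if path.is_dir() else "MISSING"), path)
```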
2. ScanRefer dataset
Download ScanRefer annotations.
Put the ScanRefer folder in data/:
data/
└── ScanRefer/
    ├── ScanRefer_filtered_train.json
    ├── ScanRefer_filtered_val.json
    ├── ScanRefer_filtered_train.txt
    └── ScanRefer_filtered_val.txt
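To confirm the annotations are readable, you can inspect one entry. The fields used below (scene_id, object_id, description) are the standard ScanRefer keys, but double-check them against the files you downloaded:

```python
# Sketch: peek at the ScanRefer validation annotations.
import json

with open("data/ScanRefer/ScanRefer_filtered_val.json") as f:
    refs = json.load(f)

print(len(refs), "referring expressions")
sample = refs[0]
print(sample["scene_id"], sample["object_id"], sample["description"])
```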
3. Invalid Frame List
If you need to regenerate the invalid frame list based on your data, run:
python scripts/generate_invalid_scannet_list.py
4. Scene Frame Indices
To enable target-centric sampling and ensure the model sees the referred objects during training, we use pre-computed instance-to-frame mappings:
data/
└── scene_frame_indices/
    └── [scene_id].json
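The exact schema of these files is not documented here. As a rough illustration only, assuming each JSON maps an instance ID to the frames in which that instance is visible, target-centric sampling could look like the sketch below (verify the structure against the released files):

```python
# Hypothetical sketch of target-centric frame sampling.
# Assumed schema: {instance_id: [frame_id, ...]} -- verify against the actual JSON.
import json
import random

with open("data/scene_frame_indices/scene0000_00.json") as f:
    instance_to_frames = json.load(f)

target_instance = "12"  # example instance ID of the referred object
frames = instance_to_frames.get(target_instance, [])
sampled = random.sample(frames, k=min(8, len(frames)))  # e.g. sample up to 8 views
print(sampled)
```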
5. MVRefer Benchmark (mvrefer_val.json)
The MVRefer benchmark provides the frame selections used for evaluation:
data/
└── mvrefer_val.json
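A quick way to confirm the file is in place (the per-entry schema is not described here, so this only loads the file and counts its entries):

```python
# Sketch: load the MVRefer benchmark file and report its size.
import json

with open("data/mvrefer_val.json") as f:
    mvrefer = json.load(f)

print(type(mvrefer).__name__, "with", len(mvrefer), "entries")
```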
## 📦 Model Weights
For training and inference, you need to prepare the following weights in the ckpts/ directory:
1. Pi3 Weights
The multimodal branch is initialized from Pi3. Download and place it in:
ckpts/
└── Pi3/
2. RoBERTa Weights
The model uses RoBERTa-base as the text encoder. Download and place it in:
ckpts/
└── roberta-base/
3. Pre-trained MVGGT (For Inference)
Download the final MVGGT checkpoint from Hugging Face and update train.resume in eval_mvggt.sh.
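If you prefer scripting the downloads, huggingface_hub can place everything under ckpts/. Only roberta-base below is a real Hub ID; the Pi3 and MVGGT repo IDs are placeholders to replace with the repositories linked above:

```python
# Sketch: fetch weights into ckpts/ with huggingface_hub.
# "roberta-base" is a real Hub ID; the other two repo IDs are PLACEHOLDERS --
# replace them with the Pi3 and MVGGT repositories linked in this README.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="roberta-base", local_dir="ckpts/roberta-base")
snapshot_download(repo_id="<pi3-repo-id>", local_dir="ckpts/Pi3")            # placeholder
snapshot_download(repo_id="<mvggt-checkpoint-repo-id>", local_dir="ckpts")   # placeholder
```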
## 🚀 Training
To start training on ScanRefer:
bash train_mvggt.sh
## 🔍 Inference
Update the checkpoint path (train.resume) in eval_mvggt.sh, then run inference:
bash eval_mvggt.sh
## 🚀 Demo Deployment
Follow these steps to deploy the interactive demo locally:
1. Install demo dependencies
Install the required packages for the demo:
pip install -r requirements_demo.txt
2. Download model weights and tokenizer
- Download pre-trained model weights: download the checkpoint from Hugging Face and update `ckpt_path` in `demo_gradio.py` (line 608) to point to your checkpoint file.
- Download RoBERTa tokenizer: the demo requires the RoBERTa tokenizer. Download it with:
mkdir -p ckpts
python -c "from transformers import RobertaTokenizer; RobertaTokenizer.from_pretrained('roberta-base').save_pretrained('./ckpts/roberta-base')"
Or manually download from Hugging Face and place it in ./ckpts/roberta-base/.
3. Launch the demo
Run the Gradio demo:
python demo_gradio.py
The demo will be available at http://localhost:7860.
### Usage
- Upload multiple images or a video containing multi-view scenes
- Enter a referring expression describing the target object
- The model will generate 3D segmentation results that can be downloaded as GLB files
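If you want to inspect a downloaded result outside the browser, any glTF viewer works; as one option, here is a short trimesh sketch (trimesh is not in the listed requirements, so install it separately):

```python
# Optional sketch: inspect a downloaded segmentation result with trimesh.
# Extra dependency: pip install trimesh. The filename is an example.
import trimesh

scene = trimesh.load("mvggt_result.glb")  # GLB files load as a trimesh.Scene
for name, geometry in scene.geometry.items():
    print(name, geometry)
```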
## 📝 Citation
If you find our work useful in your research, please consider citing:
@misc{wu2026mvggt,
  author = {Changli Wu and Haodong Wang and Jiayi Ji and Yutian Yao and Chunsai Du and Jihua Kang and Yanwei Fu and Liujuan Cao},
  title  = {MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation},
  year   = {2026}
}
