Vireo
[NeurIPS 2025] Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation
This is early-stage code, not the full implementation. The latest version will be released in about two weeks. Thanks for your understanding.
Installation & Environment Setup
Clone the repository:
git clone --recursive https://github.com/anonymouse-9c53tp182bvz/Vireo.git
Follow these steps to set up your environment:
conda create -n vireo python=3.11 -y
conda activate vireo
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124  # PyTorch 2.6.0, CUDA 12.4 wheels
pip install -U openmim
mim install mmengine
# Install mmcv from source
git clone https://github.com/open-mmlab/mmcv.git
cd mmcv
pip install -r requirements/optional.txt
pip install -e . -v
pip install mmsegmentation
pip install mmdet
pip install xformers==0.0.30  # optional, for DINOv2
pip install -r requirements.txt
pip install future tensorboard
Finally, please read Initial_DA.
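Once the steps above have run, a quick import check can confirm that the main dependencies are visible from the vireo environment. The sketch below is not part of the repository: the helper name check_packages is hypothetical, and the module names (mmseg for mmsegmentation, mmdet, mmcv, mmengine, xformers) are the usual import names of the packages installed above.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: bool}, True when the top-level module can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Import names are assumptions based on the pip/mim packages listed above.
    required = ["torch", "torchvision", "mmengine", "mmcv", "mmseg", "mmdet"]
    optional = ["xformers"]  # only needed for the optional DINOv2 speed-up

    status = check_packages(required + optional)
    for name in required:
        print(f"{name:12s} {'ok' if status[name] else 'MISSING'}")
    for name in optional:
        print(f"{name:12s} {'ok' if status[name] else 'not installed (optional)'}")
```

The check only locates the modules without importing them, so it stays fast and does not need a GPU.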
Dataset Preparation
Prepare the datasets by converting them to the required formats. Run the following commands:
cd Vireo
mkdir data
# Convert GTA dataset (source domain)
python tools/convert_datasets/gta.py data/gta
# Prepare Cityscapes dataset
python tools/convert_datasets/cityscapes.py data/cityscapes
# Convert Mapillary to Cityscapes format (training data)
python tools/convert_datasets/mapillary2cityscape.py data/mapillary data/mapillary/cityscapes_trainIdLabel --train_id
# Resize Mapillary validation images to Cityscapes format
python tools/convert_datasets/mapillary_resize.py data/mapillary/validation/images data/mapillary/cityscapes_trainIdLabel/val/label data/mapillary/half/val_img data/mapillary/half/val_label
The final folder structure should look like this:
Vireo
├── ...
├── checkpoints
│ ├── dinov2_vitl14_pretrain.pth
│ ├── promptda_vitl.ckpt
│ ├── depth_anything_v2_vitl.pth
│ ├── dinov2_converted_depth.pth
│ ├── dinov2_converted_depthv2.pth
├── data
│ ├── cityscapes
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── bdd100k
│ │ ├── images
│ │ │ ├── 10k
│ │ │ │ ├── train
│ │ │ │ ├── val
│ │ ├── labels
│ │ │ ├── sem_seg
│ │ │ │ ├── masks
│ │ │ │ │ ├── train
│ │ │ │ │ ├── val
│ ├── mapillary
│ │ ├── training
│ │ ├── cityscapes_trainIdLabel
│ │ ├── half
│ │ │ ├── val_img
│ │ │ ├── val_label
│ ├── gta
│ │ ├── images
│ │ ├── labels
│ ├── adac
│ │ ├── gt
│ │ │ ├── fog
│ │ │ ├── night
│ │ │ ├── rain
│ │ │ ├── snow
│ │ ├── rgb_anon
│ │ │ ├── fog
│ │ │ ├── night
│ │ │ ├── rain
│ │ │ ├── snow
├── ...
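Before training, it can help to verify that the layout above is actually in place. The following is a minimal sketch, not a script shipped with the repository: missing_paths and EXPECTED_PATHS are hypothetical names, and the list only covers a subset of the tree shown above.

```python
from pathlib import Path

# A subset of the folder structure listed above, relative to the Vireo root.
EXPECTED_PATHS = [
    "checkpoints/dinov2_vitl14_pretrain.pth",
    "data/cityscapes/leftImg8bit/train",
    "data/cityscapes/leftImg8bit/val",
    "data/cityscapes/gtFine/train",
    "data/cityscapes/gtFine/val",
    "data/bdd100k/images/10k/val",
    "data/bdd100k/labels/sem_seg/masks/val",
    "data/mapillary/half/val_img",
    "data/mapillary/half/val_label",
    "data/gta/images",
    "data/gta/labels",
    "data/adac/gt",
    "data/adac/rgb_anon",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")  # run from the Vireo directory
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  ", rel)
    else:
        print("All expected paths are present.")
```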
Pre-trained Weights & Dataset Downloads
Download: Download the pre-trained DINOv2 weights from facebookresearch. Keep the file name unchanged and place the file in checkpoints/, as shown in the folder structure above. The DepthAnything weights are available from the DepthAnything GitHub repository.
Convert:
Convert the pre-trained weights for training or evaluation by running:
python tools/convert_models/convert_dinov2_depth.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/depth_anything_v2_vitl.pth checkpoints/dinov2_converted_depth.pth
Optional: Converting for 1024×1024 Resolution
python tools/convert_models/convert_dinov2_depth.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/depth_anything_v2_vitl.pth checkpoints/dinov2_converted_depth_1024x1024.pth --height 1024 --width 1024
Convert the pre-trained weights for Vireo V2 training or evaluation by running:
python tools/convert_models/convert_dinov2_depthv2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/promptda_vitl.ckpt checkpoints/dinov2_converted_depthv2.pth
Optional: Vireo V2 Conversion for 1024×1024 Resolution
python tools/convert_models/convert_dinov2_depthv2.py checkpoints/dinov2_vitl14_pretrain.pth checkpoints/promptda_vitl.ckpt checkpoints/dinov2_converted_depth_1024x1024.pth
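The conversion commands share one shape: script, DINOv2 checkpoint, depth checkpoint, output path, and optionally a target resolution. The sketch below assembles that argument list for the first conversion script; build_convert_cmd is a hypothetical helper, and only the --height and --width flags shown above for convert_dinov2_depth.py are used.

```python
import shlex


def build_convert_cmd(dinov2_ckpt, depth_ckpt, out_ckpt, height=None, width=None,
                      script="tools/convert_models/convert_dinov2_depth.py"):
    """Build the argument list for a weight conversion run."""
    if (height is None) != (width is None):
        raise ValueError("height and width must be given together")
    cmd = ["python", script, dinov2_ckpt, depth_ckpt, out_ckpt]
    if height is not None:
        cmd += ["--height", str(height), "--width", str(width)]
    return cmd


if __name__ == "__main__":
    cmd = build_convert_cmd(
        "checkpoints/dinov2_vitl14_pretrain.pth",
        "checkpoints/depth_anything_v2_vitl.pth",
        "checkpoints/dinov2_converted_depth_1024x1024.pth",
        height=1024,
        width=1024,
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```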
Training
Use the following commands to start training with different configurations. If you need to resume training from a checkpoint, simply append --resume to the command.
Tips: If resuming training appears to hang or shows no response for a long time, please refer to this issue for potential solutions.
- Cityscapes → BDD100K + Mapillary + ADAC (fog, night, rain, snow):
python tools/train.py configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4_citys.py
# To resume training, use:
# python tools/train.py configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4_citys.py --resume
- GTAV → BDD100K + Mapillary + Cityscapes:
python tools/train.py configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4.py
# To resume training, use:
# python tools/train.py configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4.py --resume
For the updated Vireo V2 architecture, use these commands:
- Cityscapes Configuration (Vireo V2):
python tools/train.py configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4_citys.py
# To resume training, use:
# python tools/train.py configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4_citys.py --resume
- GTAV Configuration (Vireo V2):
python tools/train.py configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4.py
# To resume training, use:
# python tools/train.py configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4.py --resume
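The four training commands in this section differ only in the source domain, the architecture version, and whether --resume is appended. A small lookup makes that explicit. This is an illustrative sketch: TRAIN_CONFIGS and build_train_cmd are hypothetical names, while the config paths are copied from the commands in this section.

```python
# (source domain, architecture version) -> config file, taken from this section.
TRAIN_CONFIGS = {
    ("cityscapes", "v1"): "configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4_citys.py",
    ("gtav", "v1"): "configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4.py",
    ("cityscapes", "v2"): "configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4_citys.py",
    ("gtav", "v2"): "configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4.py",
}


def build_train_cmd(source, version="v1", resume=False):
    """Return the tools/train.py argument list for a source domain and version."""
    key = (source.lower(), version.lower())
    if key not in TRAIN_CONFIGS:
        raise ValueError(f"unknown source/version combination: {key}")
    cmd = ["python", "tools/train.py", TRAIN_CONFIGS[key]]
    if resume:
        cmd.append("--resume")  # continue from the latest checkpoint
    return cmd


if __name__ == "__main__":
    print(" ".join(build_train_cmd("gtav", "v2", resume=True)))
```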
Evaluation
To evaluate a trained model, replace <Vireo model>.pth with your model file and run the corresponding command. The backbone checkpoint checkpoints/dinov2_converted.pth is used in all evaluations:
- Evaluation with GTAV-based Configuration:
python tools/test.py configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
python tools/test.py configs/dinov2_ov/vireo_dinov2_mask2former_ov_512x512_bs1x4_deliver.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
- Evaluation with Cityscapes-based Configuration:
python tools/test.py configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4_citys.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
python tools/test.py configs/dinov2_ov/vireo_dinov2_mask2former_ov_512x512_bs1x4_deliver.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
- Evaluation with Vireo V2 Cityscapes Configuration:
python tools/test.py configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4_citys.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
python tools/test.py configs/dinov2_ov/vireov2_dinov2_mask2former_ov_512x512_bs1x4_deliver.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
- Evaluation with Vireo V2 GTAV Configuration:
python tools/test.py configs/dinov2_domain/vireov2_dinov2_mask2former_512x512_bs1x4.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
python tools/test.py configs/dinov2_ov/vireov2_dinov2_mask2former_ov_512x512_bs1x4_deliver.py <Vireo model>.pth --backbone checkpoints/dinov2_converted.pth
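Each evaluation entry runs tools/test.py twice with the same model and backbone: once on a domain-generalization config and once on an open-vocabulary config. The sketch below generates both commands from one set of inputs so the model path is typed only once. build_test_cmds is a hypothetical helper, and the model path in the usage example is a placeholder, not a released file.

```python
def build_test_cmds(model_ckpt, backbone_ckpt, domain_config, ov_config=None):
    """Return one tools/test.py argument list per config, sharing model and backbone."""
    configs = [domain_config]
    if ov_config is not None:
        configs.append(ov_config)
    return [
        ["python", "tools/test.py", cfg, model_ckpt, "--backbone", backbone_ckpt]
        for cfg in configs
    ]


if __name__ == "__main__":
    cmds = build_test_cmds(
        model_ckpt="work_dirs/my_vireo_model.pth",  # placeholder path
        backbone_ckpt="checkpoints/dinov2_converted.pth",
        domain_config="configs/dinov2_domain/vireo_dinov2_mask2former_512x512_bs1x4.py",
        ov_config="configs/dinov2_ov/vireo_dinov2_mask2former_ov_512x512_bs1x4_deliver.py",
    )
    for cmd in cmds:
        print(" ".join(cmd))
```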
Acknowledgment
Our implementation is mainly based on the following repositories. Thanks to their authors.