MFuser
[CVPR 2025 Highlight] Official code for the paper "Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation"
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Xin Zhang, Robby T. Tan
National University of Singapore
CVPR 2025
[Project Page] [Paper]
Environment
Requirements
- The requirements can be installed with:

  ```shell
  conda create -n mfuser python=3.9 numpy=1.26.4
  conda activate mfuser
  conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
  pip install -r requirements.txt
  pip install xformers==0.0.20
  pip install mmcv-full==1.5.1
  pip install mamba_ssm==2.2.2
  pip install causal_conv1d==1.4.0
  ```
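After installing, a quick sanity check can save a debugging session later. The sketch below (not part of the repo) reports which of the packages installed above cannot be found, without importing them, so it is safe to run even on a partially set-up machine.

```python
# Post-install sanity check: list packages from the install steps that are
# not importable. Uses find_spec so nothing heavy is actually imported.
from importlib import util

PACKAGES = ["torch", "torchvision", "xformers", "mmcv", "mamba_ssm", "causal_conv1d"]

def missing_packages(names):
    """Return the subset of `names` with no importable module of that name."""
    return [n for n in names if util.find_spec(n) is None]

print("missing:", missing_packages(PACKAGES) or "none")
```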
Pre-trained VFM & VLM Models
- Please download the pre-trained VFM and VLM models and save them in the ./pretrained folder.

  | Model | Checkpoint | Link |
  |-----|-----|:-----:|
  | DINOv2 | dinov2_vitl14_pretrain.pth | download link |
  | CLIP | ViT-L-14-336px.pt | download link |
  | EVA02-CLIP | EVA02_CLIP_L_336_psz14_s6B.pt | download link |
  | SIGLIP | siglip_vitl16_384.pth | download link |
Checkpoints
- You can download MFuser model checkpoints and save them in the ./work_dirs_d folder. By default, all experiments below use DINOv2-L as the VFM.

  | Model | Pretrained | Trained on | Config | Link |
  |-----|-----|-----|-----|:-----:|
  | mfuser-clip-vit-l-city | CLIP | Cityscapes | config | download link |
  | mfuser-clip-vit-l-gta | CLIP | GTA5 | config | download link |
  | mfuser-eva02-clip-vit-l-city | EVA02-CLIP | Cityscapes | config | download link |
  | mfuser-eva02-clip-vit-l-gta | EVA02-CLIP | GTA5 | config | download link |
  | mfuser-siglip-vit-l-city | SIGLIP | Cityscapes | config | download link |
  | mfuser-siglip-vit-l-gta | SIGLIP | GTA5 | config | download link |
Datasets
- To set up the datasets, please follow the official TLDR repo.
- After downloading the datasets, edit the data roots in the dataset config files to match your environment:

  ```python
  src_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
  tgt_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
  ```

- The final folder structure should look like this:
MFuser
├── ...
├── pretrained
│ ├── dinov2_vitl14_pretrain.pth
│ ├── EVA02_CLIP_L_336_psz14_s6B.pt
│ ├── siglip_vitl16_384.pth
│ ├── ViT-L-14-336px.pt
├── data
│ ├── cityscapes
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── bdd100k
│ │ ├── images
│ │ │ ├── 10k
│ │ │ │ ├── train
│ │ │ │ ├── val
│ │ ├── labels
│ │ │ ├── sem_seg
│ │ │ │ ├── masks
│ │ │ │ │ ├── train
│ │ │ │ │ ├── val
│ ├── mapillary
│ │ ├── training
│ │ ├── cityscapes_trainIdLabel
│ │ ├── half
│ │ │ ├── val_img
│ │ │ ├── val_label
│ ├── gta
│ │ ├── images
│ │ ├── labels
├── ...
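Before launching training, it can be worth confirming the layout above is actually in place. The sketch below (not part of the repo; it checks only the Cityscapes and GTA5 paths from the tree, for brevity) lists any expected folders that are missing.

```python
# Minimal layout check: report required sub-folders (taken from the tree
# above) that do not exist under the given root.
import os

REQUIRED = [
    "pretrained",
    "data/cityscapes/leftImg8bit/train",
    "data/cityscapes/gtFine/train",
    "data/gta/images",
    "data/gta/labels",
]

def missing_dirs(root="."):
    """Return the required sub-folders that do not exist under `root`."""
    return [p for p in REQUIRED if not os.path.isdir(os.path.join(root, p))]

print("missing:", missing_dirs(".") or "none")
```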
Training
Run the training:
python train.py configs/[TRAIN_CONFIG]
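A concrete invocation might look like the sketch below. The config filename is hypothetical (substitute one of the config files referenced in the checkpoints table), and the `--work-dir` flag assumes the usual mmcv-style training script interface; the `echo` makes it a dry run, so remove it to actually launch training.

```shell
# Dry run: prints the training command. The config name is hypothetical and
# --work-dir assumes a standard mmcv-style train.py; remove "echo" to run.
CONFIG=mfuser-clip-vit-l-gta
echo python train.py "configs/${CONFIG}.py" --work-dir "work_dirs_d/${CONFIG}"
```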
Evaluation
Run the evaluation:
python test.py configs/[TEST_CONFIG] work_dirs_d/[MODEL] --eval mIoU
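To evaluate several checkpoints in one go, a small loop like the sketch below can help. The model names match the checkpoints table, but the exact config and checkpoint filenames are assumptions; the `echo` makes it a dry run, so remove it to actually evaluate.

```shell
# Dry run: prints one evaluation command per checkpoint. Filenames are
# assumptions; adjust to the configs/checkpoints you actually downloaded.
for model in mfuser-clip-vit-l-city mfuser-clip-vit-l-gta; do
  echo python test.py "configs/${model}.py" "work_dirs_d/${model}.pth" --eval mIoU
done
```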
Citation
If you find our code helpful, please cite our paper:
@inproceedings{zhang2025mamba,
  title     = {Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation},
  author    = {Zhang, Xin and Tan, Robby T.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
Acknowledgements
This project is based on the following open-source projects. We thank the authors for sharing their code.
