MFuser
[CVPR 2025 Highlight] Official code for the paper "Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation"
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation
Xin Zhang, Robby T. Tan
National University of Singapore
CVPR 2025
[Project Page] [Paper]
Environment
Requirements
- The requirements can be installed with:

  ```shell
  conda create -n mfuser python=3.9 numpy=1.26.4
  conda activate mfuser
  conda install pytorch==2.0.1 torchvision==0.15.2 pytorch-cuda=11.8 -c pytorch -c nvidia
  pip install -r requirements.txt
  pip install xformers==0.0.20
  pip install mmcv-full==1.5.1
  pip install mamba_ssm==2.2.2
  pip install causal_conv1d==1.4.0
  ```
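After installing, a quick sanity check can save a debugging session later. The sketch below (not part of the repo) reports which of the packages installed above cannot be found, without importing them, so it is safe to run even on a partially set-up machine.

```python
# Post-install sanity check: list packages from the install steps that are
# not importable. Uses find_spec so nothing heavy is actually imported.
from importlib import util

PACKAGES = ["torch", "torchvision", "xformers", "mmcv", "mamba_ssm", "causal_conv1d"]

def missing_packages(names):
    """Return the subset of `names` with no importable module of that name."""
    return [n for n in names if util.find_spec(n) is None]

print("missing:", missing_packages(PACKAGES) or "none")
```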
Pre-trained VFM & VLM Models
- Please download the pre-trained VFM and VLM models and save them in the ./pretrained folder.

  | Model | Checkpoint | Link |
  |-----|-----|:-----:|
  | DINOv2 | dinov2_vitl14_pretrain.pth | download link |
  | CLIP | ViT-L-14-336px.pt | download link |
  | EVA02-CLIP | EVA02_CLIP_L_336_psz14_s6B.pt | download link |
  | SIGLIP | siglip_vitl16_384.pth | download link |
Checkpoints
- You can download MFuser model checkpoints and save them in the ./work_dirs_d folder. By default, all experiments below use DINOv2-L as the VFM.

  | Model | Pretrained | Trained on | Config | Link |
  |-----|-----|-----|-----|:-----:|
  | mfuser-clip-vit-l-city | CLIP | Cityscapes | config | download link |
  | mfuser-clip-vit-l-gta | CLIP | GTA5 | config | download link |
  | mfuser-eva02-clip-vit-l-city | EVA02-CLIP | Cityscapes | config | download link |
  | mfuser-eva02-clip-vit-l-gta | EVA02-CLIP | GTA5 | config | download link |
  | mfuser-siglip-vit-l-city | SIGLIP | Cityscapes | config | download link |
  | mfuser-siglip-vit-l-gta | SIGLIP | GTA5 | config | download link |
Datasets
- To set up the datasets, please follow the official TLDR repo.
- After downloading the datasets, edit the data roots in the dataset config files to match your environment:

  ```python
  src_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
  tgt_dataset_dict = dict(..., data_root='[YOUR_DATA_FOLDER_ROOT]', ...)
  ```

- The final folder structure should look like this:
MFuser
├── ...
├── pretrained
│ ├── dinov2_vitl14_pretrain.pth
│ ├── EVA02_CLIP_L_336_psz14_s6B.pt
│ ├── siglip_vitl16_384.pth
│ ├── ViT-L-14-336px.pt
├── data
│ ├── cityscapes
│ │ ├── leftImg8bit
│ │ │ ├── train
│ │ │ ├── val
│ │ ├── gtFine
│ │ │ ├── train
│ │ │ ├── val
│ ├── bdd100k
│ │ ├── images
│ │ │ ├── 10k
│ │ │ │ ├── train
│ │ │ │ ├── val
│ │ ├── labels
│ │ │ ├── sem_seg
│ │ │ │ ├── masks
│ │ │ │ │ ├── train
│ │ │ │ │ ├── val
│ ├── mapillary
│ │ ├── training
│ │ ├── cityscapes_trainIdLabel
│ │ ├── half
│ │ │ ├── val_img
│ │ │ ├── val_label
│ ├── gta
│ │ ├── images
│ │ ├── labels
├── ...
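Before launching training, it can be worth confirming the layout above is actually in place. The sketch below (not part of the repo; it checks only the Cityscapes and GTA5 paths from the tree, for brevity) lists any expected folders that are missing.

```python
# Minimal layout check: report required sub-folders (taken from the tree
# above) that do not exist under the given root.
import os

REQUIRED = [
    "pretrained",
    "data/cityscapes/leftImg8bit/train",
    "data/cityscapes/gtFine/train",
    "data/gta/images",
    "data/gta/labels",
]

def missing_dirs(root="."):
    """Return the required sub-folders that do not exist under `root`."""
    return [p for p in REQUIRED if not os.path.isdir(os.path.join(root, p))]

print("missing:", missing_dirs(".") or "none")
```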
Training
Run the training:
python train.py configs/[TRAIN_CONFIG]
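A concrete invocation might look like the sketch below. The config filename is hypothetical (substitute one of the config files referenced in the checkpoints table), and the `--work-dir` flag assumes the usual mmcv-style training script interface; the `echo` makes it a dry run, so remove it to actually launch training.

```shell
# Dry run: prints the training command. The config name is hypothetical and
# --work-dir assumes a standard mmcv-style train.py; remove "echo" to run.
CONFIG=mfuser-clip-vit-l-gta
echo python train.py "configs/${CONFIG}.py" --work-dir "work_dirs_d/${CONFIG}"
```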
Evaluation
Run the evaluation:
python test.py configs/[TEST_CONFIG] work_dirs_d/[MODEL] --eval mIoU
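To evaluate several checkpoints in one go, a small loop like the sketch below can help. The model names match the checkpoints table, but the exact config and checkpoint filenames are assumptions; the `echo` makes it a dry run, so remove it to actually evaluate.

```shell
# Dry run: prints one evaluation command per checkpoint. Filenames are
# assumptions; adjust to the configs/checkpoints you actually downloaded.
for model in mfuser-clip-vit-l-city mfuser-clip-vit-l-gta; do
  echo python test.py "configs/${model}.py" "work_dirs_d/${model}.pth" --eval mIoU
done
```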
Citation
If you find our code helpful, please cite our paper:
@inproceedings{zhang2025mamba,
  title     = {Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation},
  author    = {Zhang, Xin and Tan, Robby T.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
Acknowledgements
This project is based on the following open-source projects. We thank the authors for sharing their code.
