VMamba

VMamba: Visual State Space Models. The code is based on Mamba.


<div align="center"> <h1>VMamba </h1> <h3>VMamba: Visual State Space Model</h3>

Yue Liu<sup>1</sup>, Yunjie Tian<sup>1</sup>, Yuzhong Zhao<sup>1</sup>, Hongtian Yu<sup>1</sup>, Lingxi Xie<sup>2</sup>, Yaowei Wang<sup>3</sup>, Qixiang Ye<sup>1</sup>, Yunfan Liu<sup>1</sup>

<sup>1</sup> University of Chinese Academy of Sciences, <sup>2</sup> HUAWEI Inc., <sup>3</sup> PengCheng Lab.

Paper: (arXiv 2401.10166)

</div>

🔥 Use VMamba with only one file and in the fewest steps!

conda create -n vmamba python=3.10
pip install torch==2.2 torchvision torchaudio triton pytest chardet yacs termcolor fvcore seaborn packaging ninja einops numpy==1.24.4 timm==0.4.12
pip install https://github.com/state-spaces/mamba/releases/download/v2.2.4/mamba_ssm-2.2.4+cu12torch2.2cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
python vmamba.py
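
The prebuilt `mamba_ssm` wheel in the steps above is tied to one specific environment, and its filename encodes that environment. The sketch below splits the filename into the standard wheel fields (name, version, Python tag, ABI tag, platform tag) and compares the Python tag with the running interpreter. `parse_wheel_name` is a hypothetical helper written for this illustration, and it assumes a filename without the optional build tag.

```python
import sys

WHEEL = "mamba_ssm-2.2.4+cu12torch2.2cxx11abiTRUE-cp310-cp310-linux_x86_64.whl"


def parse_wheel_name(filename):
    """Split a wheel filename (assumed to have no build tag) into its fields."""
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    # Everything after "+" is the local version label (here: CUDA/torch/ABI info).
    public, _, local = version.partition("+")
    return {
        "name": name,
        "version": public,
        "local": local,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


info = parse_wheel_name(WHEEL)
running = f"cp{sys.version_info.major}{sys.version_info.minor}"
if info["python_tag"] != running:
    print(f"wheel targets {info['python_tag']}, but this interpreter is {running}")
```

If the tags differ, pip will typically refuse the wheel as unsupported on the current platform, which is why the `conda create` step pins `python=3.10`.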

:white_check_mark: Updates

  • Sep. 25th, 2024: Update: VMamba has been accepted at NeurIPS 2024 (spotlight)!
  • June 14th, 2024: Update: we cleaned up the code to make it easier to read and added support for mamba2.
  • May 26th, 2024: Update: we released the updated weights of VMambav2, together with the new arXiv paper.
  • May 7th, 2024: Update: Important! Using torch.backends.cudnn.enabled=True in downstream tasks may be quite slow. If VMamba runs slowly on your machine, disable it in vmamba.py; otherwise, ignore this note.
  • ...

For details, see detailed_updates.md.
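
The May 7th note refers to PyTorch's global `torch.backends.cudnn.enabled` flag. A minimal sketch of switching it off is shown below; `set_cudnn` is a hypothetical wrapper, not a function from this repository, and where exactly the flag is set in vmamba.py is not shown here. The import is guarded only so the snippet also runs on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def set_cudnn(enabled):
    """Toggle cuDNN globally; return the resulting flag, or None without PyTorch."""
    if torch is None:
        return None
    torch.backends.cudnn.enabled = enabled
    return torch.backends.cudnn.enabled


# Assumption: call this once at start-up, before building the model.
result = set_cudnn(False)
```

Since the note says the slowdown depends on the machine, it is worth timing a few iterations with the flag on and off before committing to either setting.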

Abstract

Designing computationally efficient network architectures persists as an ongoing necessity in computer vision. In this paper, we transplant Mamba, a state-space language model, into VMamba, a vision backbone that works in linear time complexity. At the core of VMamba lies a stack of Visual State-Space (VSS) blocks with the 2D Selective Scan (SS2D) module. By traversing along four scanning routes, SS2D helps bridge the gap between the ordered nature of 1D selective scan and the non-sequential structure of 2D vision data, which facilitates the gathering of contextual information from various sources and perspectives. Based on the VSS blocks, we develop a family of VMamba architectures and accelerate them through a succession of architectural and implementation enhancements. Extensive experiments showcase VMamba’s promising performance across diverse visual perception tasks, highlighting its advantages in input scaling efficiency compared to existing benchmark models.
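
The four scanning routes can be illustrated on a tiny grid. The sketch below assumes the routes are row-major, column-major, and the reversals of both, and it uses an identity function in place of the selective scan, so it shows only the unfold and merge bookkeeping. `cross_scan_sketch` and `cross_merge_sketch` are hypothetical helpers on plain Python lists, not the repository's tensor implementation.

```python
def cross_scan_sketch(grid):
    """Unfold an H x W grid into four 1D sequences, one per scanning route."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge_sketch(seqs, h, w):
    """Map each route's output back to grid positions and sum the four routes."""
    row_major, col_major, row_rev, col_rev = seqs
    # Undo the reversals so indices line up with the forward routes.
    row_back = row_rev[::-1]
    col_back = col_rev[::-1]
    merged = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r = i * w + j  # position of (i, j) in row-major order
            c = j * h + i  # position of (i, j) in column-major order
            merged[i][j] = row_major[r] + row_back[r] + col_major[c] + col_back[c]
    return merged


grid = [[1, 2, 3],
        [4, 5, 6]]
routes = cross_scan_sketch(grid)
# A real model would run a selective scan on each route; identity is used here.
merged = cross_merge_sketch(routes, h=2, w=3)
```

With the identity in place of the scan, every pixel is counted once per route, so `merged` equals four times the input grid. In a real block each route would carry different context to the same pixel, because the scan reaches it from a different direction.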

Overview

  • VMamba serves as a general-purpose backbone for computer vision.
<p align="center"> <img src="assets/architecture.png" alt="architecture" width="80%"> </p>
  • 2D-Selective-Scan of VMamba
<p align="center"> <img src="assets/ss2d.png" alt="arch" width="80%"> </p>
  • VMamba has a global effective receptive field
<p align="center"> <img src="assets/erf.png" alt="erf" width="80%"> </p>
  • VMamba resembles Transformer-based methods in its activation maps
<p align="center"> <img src="assets/attn.png" alt="attn" width="80%"> </p> <p align="center"> <img src="assets/activation_map.png" alt="activation" width="80%"> </p>

Main Results

<!-- copied from assets/performance.md --> <!-- :book: --> <!-- ***The checkpoints of some of the models listed below will be released in weeks!*** -->

:book: For details see performance.md.

Classification on ImageNet-1K

| name | pretrain | resolution | acc@1 | #params | FLOPs | TP. | Train TP. | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | ImageNet-1K | 224x224 | 81.2 | 28M | 4.5G | 1244 | 987 | -- |
| Swin-S | ImageNet-1K | 224x224 | 83.2 | 50M | 8.7G | 718 | 642 | -- |
| Swin-B | ImageNet-1K | 224x224 | 83.5 | 88M | 15.4G | 458 | 496 | -- |
| VMamba-S[s2l15] | ImageNet-1K | 224x224 | 83.6 | 50M | 8.7G | 877 | 314 | config/log/ckpt |
| VMamba-B[s2l15] | ImageNet-1K | 224x224 | 83.9 | 89M | 15.4G | 646 | 247 | config/log/ckpt |
| VMamba-T[s1l8] | ImageNet-1K | 224x224 | 82.6 | 30M | 4.9G | 1686 | 571 | config/log/ckpt |

  • Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for drop_path_rate and EMA. All models are trained with EMA except for Vanilla-VMamba-T.
  • TP. (throughput) and Train TP. (training throughput) are measured on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128. Train TP. is tested with mix-resolution, excluding the time consumed by optimizers.
  • FLOPs and parameters are now counted with the head (in previous versions they were counted without the head, so the numbers have risen slightly).
  • We calculate FLOPs with the algorithm @albertgu provides, which gives larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).
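
For readers who want to reproduce a throughput figure of the kind reported in the TP. columns, the general pattern is to warm up, time a fixed number of steps, and divide the number of processed images by the elapsed time. The sketch below is a generic timing loop, not the benchmark script behind the numbers above; `measure_throughput` and its parameters are hypothetical.

```python
import time


def measure_throughput(step, batch_size, warmup=2, iters=10):
    """Return images per second for a callable that processes one batch."""
    for _ in range(warmup):
        step()  # warm-up runs are excluded from the timing
    # On a GPU, synchronize here and after the loop (e.g. torch.cuda.synchronize())
    # so that asynchronous kernels are included in the measurement.
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse timer on trivially fast steps.
    return batch_size * iters / max(elapsed, 1e-9)


calls = []
images_per_second = measure_throughput(lambda: calls.append(1), batch_size=128, iters=5)
```

In practice `step` would wrap one forward pass (or one forward and backward pass for training throughput) on a batch of 128 images.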

Object Detection on COCO

| Backbone | #params | FLOPs | Detector | bboxAP | bboxAP50 | bboxAP75 | segmAP | segmAP50 | segmAP75 | configs/logs/ckpts |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-T | 48M | 267G | MaskRCNN@1x | 42.7 | 65.2 | 46.8 | 39.3 | 62.2 | 42.2 | -- |
| Swin-S | 69M | 354G | MaskRCNN@1x | 44.8 | 66.6 | 48.9 | 40.9 | 63.4 | 44.2 | -- |
| Swin-B | 107M | 496G | MaskRCNN@1x | 46.9 | -- | -- | 42.3 | -- | -- | -- |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@1x | 48.7 | 70.0 | 53.4 | 43.7 | 67.3 | 47.0 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x | 49.2 | 71.4 | 54.0 | 44.1 | 68.3 | 47.7 | config/log/ckpt |
| VMamba-B[s2l15] | 108M | 485G | MaskRCNN@1x[bs8] | 49.2 | 70.9 | 53.9 | 43.9 | 67.7 | 47.6 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@1x | 47.3 | 69.3 | 52.0 | 42.7 | 66.4 | 45.9 | config/log/ckpt |
| Swin-T | 48M | 267G | MaskRCNN@3x | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 | -- |
| Swin-S | 69M | 354G | MaskRCNN@3x | 48.2 | 69.8 | 52.8 | 43.2 | 67.0 | 46.1 | -- |
| VMamba-S[s2l15] | 70M | 384G | MaskRCNN@3x | 49.9 | 70.9 | 54.7 | 44.2 | 68.2 | 47.7 | config/log/ckpt |
| VMamba-T[s1l8] | 50M | 271G | MaskRCNN@3x | 48.8 | 70.4 | 53.5 | 43.7 | 67.4 | 47.0 | config/log/ckpt |

  • Models in this subsection are initialized from the models trained for classification.
  • We now calculate FLOPs with the algorithm @albertgu provides, which gives larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).
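
The FLOPs note contrasts a reference selective scan with the hardware-aware algorithm. For orientation, a sequential selective scan for a single channel can be sketched as below. This is not the repository's `selective_scan_ref` and not the FLOPs formula; it assumes the usual Mamba-style recurrence `h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t`, `y_t = C_t · h_t + D * x_t`, and `selective_scan_sketch` is a hypothetical name.

```python
import math


def selective_scan_sketch(xs, deltas, A, Bs, Cs, D=0.0):
    """Sequential scan over one channel with an N-dimensional hidden state.

    xs, deltas: length-L sequences of floats (input and step size per position).
    A: length-N list (diagonal state matrix); Bs, Cs: L lists of length N,
    i.e. input-dependent ("selective") projections for every position.
    """
    h = [0.0] * len(A)
    ys = []
    for x, delta, B, C in zip(xs, deltas, Bs, Cs):
        # Discretize and update every state dimension for this position.
        h = [math.exp(delta * a) * h_k + delta * b * x
             for a, h_k, b in zip(A, h, B)]
        ys.append(sum(c * h_k for c, h_k in zip(C, h)) + D * x)
    return ys


# With A = 0, delta = 1, and B = C = 1 the scan reduces to a running sum.
ys = selective_scan_sketch(
    xs=[1.0, 2.0, 3.0],
    deltas=[1.0, 1.0, 1.0],
    A=[0.0],
    Bs=[[1.0]] * 3,
    Cs=[[1.0]] * 3,
)
```

The loop visits each position once, so the cost grows linearly with sequence length. A hardware-aware kernel computes the same recurrence with a different execution schedule, which is why a FLOPs count derived from a reference loop can differ from one that accounts for the optimized algorithm.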

Semantic Segmentation on ADE20K

| Backbone | Input| #params |
