Mamba

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu*, Tri Dao*
Paper: https://arxiv.org/abs/2312.00752

Transformers are SSMs: Generalized Models and Efficient Algorithms
Through Structured State Space Duality
Tri Dao*, Albert Gu*
Paper: https://arxiv.org/abs/2405.21060

Mamba-3: Improved Sequence Modeling using State Space Principles
Aakash Lahoti*, Kevin Y. Li*, Berlin Chen*, Caitlin Wang*, Aviv Bick, J. Zico Kolter, Tri Dao†, Albert Gu†
Paper: https://arxiv.org/abs/2603.15569
About
Mamba is a new state space model architecture showing promising performance on information-dense data such as language modeling, where previous subquadratic models fall short of Transformers. It is based on the line of progress on structured state space models, with an efficient hardware-aware design and implementation in the spirit of FlashAttention.
Installation
Install PyTorch first, then:
- [Option] pip install "causal-conv1d>=1.4.0" --no-build-isolation: an efficient implementation of a simple causal Conv1d layer used inside the Mamba block.
- pip install mamba-ssm --no-build-isolation: the core Mamba package.
- pip install "mamba-ssm[causal-conv1d]" --no-build-isolation: installs the core Mamba package and causal-conv1d together.
--no-build-isolation is required so that pip uses your existing CUDA-enabled PyTorch instead of installing torch-cpu in an isolated build environment.
NOTE: To use Mamba-3, install from source:
MAMBA_FORCE_BUILD=TRUE pip install --no-cache-dir --force-reinstall git+https://github.com/state-spaces/mamba.git --no-build-isolation
Other requirements:
- Linux
- NVIDIA GPU
- PyTorch 1.12+
- CUDA 11.6+
For AMD cards, see additional prerequisites below.
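The requirements listed above can be checked before installing with a short script. This is a minimal sketch: `meets_minimum` and `check_environment` are hypothetical helper names, not part of mamba-ssm, and the version comparison only looks at the leading numeric components.

```python
import platform
import re


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if a dotted version string is at least `minimum`.

    Only the leading numeric components are compared, so local build
    suffixes such as "+cu118" are ignored.
    """
    def parse(v: str) -> tuple:
        return tuple(int(p) for p in re.findall(r"\d+", v.split("+")[0])[:3])

    a, b = parse(version), parse(minimum)
    # Pad to equal length so that "1.12" compares equal to "1.12.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Collect the checks listed under "Other requirements"."""
    report = {"linux": platform.system() == "Linux"}
    try:
        import torch  # imported lazily so the helper works without PyTorch
    except ImportError:
        report["torch"] = False
        return report
    report["torch"] = meets_minimum(torch.__version__, "1.12")
    report["gpu"] = torch.cuda.is_available()
    cuda = torch.version.cuda  # None for CPU-only or ROCm builds
    report["cuda"] = cuda is not None and meets_minimum(cuda, "11.6")
    return report


if __name__ == "__main__":
    print(check_environment())
```

A report with every value set to True suggests the machine satisfies the listed requirements. It does not guarantee that the CUDA extensions will build.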
Usage
We expose several levels of interface with the Mamba model.
Selective SSM
Mamba is based on a selective SSM layer, which is the focus of the paper (Section 3; Algorithm 2).
Source: ops/selective_scan_interface.py.
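To make the recurrence behind the selective SSM concrete, here is a pure-Python sketch for a single channel. It assumes a diagonal A and the discretization h_t = exp(delta_t * A) * h_{t-1} + delta_t * B_t * x_t, y_t = C_t · h_t + D * x_t. The function name `selective_scan_1ch` is hypothetical. This is a sequential reference for illustration only, not the fused kernel in the repository.

```python
import math
from typing import List, Sequence


def selective_scan_1ch(
    x: Sequence[float],            # input sequence, length L
    delta: Sequence[float],        # per-step step sizes, length L
    A: Sequence[float],            # diagonal state matrix, length N
    B: Sequence[Sequence[float]],  # per-step input projections, L x N
    C: Sequence[Sequence[float]],  # per-step output projections, L x N
    D: float = 0.0,                # skip-connection weight
) -> List[float]:
    """Sequential reference scan (assumed discretization, see lead-in)."""
    n = len(A)
    h = [0.0] * n  # hidden state, one entry per state dimension
    y = []
    for t in range(len(x)):
        for i in range(n):
            a_bar = math.exp(delta[t] * A[i])  # discretized state transition
            h[i] = a_bar * h[i] + delta[t] * B[t][i] * x[t]
        y.append(sum(C[t][i] * h[i] for i in range(n)) + D * x[t])
    return y


# With A = 0 and delta = B = C = 1 the scan reduces to a running sum.
print(selective_scan_1ch([1.0, 2.0, 3.0], [1.0] * 3, [0.0], [[1.0]] * 3, [[1.0]] * 3))
```

"Selective" refers to delta, B, and C varying per time step as functions of the input. Setting delta to 0 at a step leaves the state untouched, so that input is ignored.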
Mamba Block
The main module of this repository is the Mamba architecture block wrapping the selective SSM.
Source: modules/mamba_simple.py.
Usage:
import torch
from mamba_ssm import Mamba
batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=16,   # SSM state expansion factor
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape
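The "roughly 3 * expand * d_model^2" figure in the comment above can be reproduced with a small calculation. The sketch below assumes that the count is dominated by two linear projections: an input projection from d_model to two branches of width expand * d_model, and an output projection back to d_model. That breakdown is an assumption made for illustration, the helper name is hypothetical, and the smaller convolution and SSM parameters are ignored.

```python
def approx_mamba_block_params(d_model: int, expand: int = 2) -> int:
    """Approximate parameter count of one Mamba block (projections only)."""
    d_inner = expand * d_model
    in_proj = d_model * 2 * d_inner  # assumed: two branches of width d_inner
    out_proj = d_inner * d_model     # projection back to the model dimension
    return in_proj + out_proj        # equals 3 * expand * d_model**2


print(approx_mamba_block_params(16))   # the toy block from the usage example
print(approx_mamba_block_params(768))  # a block at the smallest released width
```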
Mamba-2
The Mamba-2 block is implemented at modules/mamba2.py.
A simpler version is at modules/mamba2_simple.py.
The usage is similar to Mamba(-1):
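import torch
from mamba_ssm import Mamba2
# Assumption: dim is raised from 16 so that expand * d_model is divisible by
# Mamba2's default head dimension (64).
batch, length, dim = 2, 64, 256
x = torch.randn(batch, length, dim).to("cuda")
model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim,  # Model dimension d_model
    d_state=64,   # SSM state expansion factor, typically 64 or 128
    d_conv=4,     # Local convolution width
    expand=2,     # Block expansion factor
).to("cuda")
y = model(x)
assert y.shape == x.shape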
SSD
A minimal version of the inner SSD module (Listing 1 from the Mamba-2 paper) with conversion between "discrete" and "continuous" SSM versions is at modules/ssd_minimal.py.
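The duality that gives SSD its name can be illustrated on a toy scalar-state SSM: the same output can be computed either by a linear-time recurrence or by multiplying the input with a lower-triangular matrix whose entries are products of the decay terms. The sketch below is only an illustration of that equivalence, with hypothetical function names. It is not the chunked algorithm in modules/ssd_minimal.py.

```python
from typing import List, Sequence


def ssm_recurrent(a: Sequence[float], b: Sequence[float],
                  c: Sequence[float], x: Sequence[float]) -> List[float]:
    """Linear-time form: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, y = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        y.append(c[t] * h)
    return y


def ssm_matrix(a: Sequence[float], b: Sequence[float],
               c: Sequence[float], x: Sequence[float]) -> List[float]:
    """Quadratic form: y = M x with M[t][s] = c_t * (a_{s+1} ... a_t) * b_s."""
    y = []
    for t in range(len(x)):
        total = 0.0
        for s in range(t + 1):
            decay = 1.0
            for k in range(s + 1, t + 1):
                decay *= a[k]
            total += c[t] * decay * b[s] * x[s]
        y.append(total)
    return y


a, b, c, x = [0.5, 0.5, 0.5], [1.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 2.0, 4.0]
print(ssm_recurrent(a, b, c, x))
print(ssm_matrix(a, b, c, x))
```

Both calls print the same sequence. The matrix form resembles a masked attention computation, while the recurrent form is what makes constant-memory generation possible.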
Mamba-3
The Mamba-3 block is implemented at modules/mamba3.py.
The usage is as follows:
import torch
from mamba_ssm import Mamba3
batch, length, dim = 2, 2048, 768
x = torch.randn(batch, length, dim).to(torch.bfloat16).to("cuda")
model = Mamba3(
    # This module uses roughly 6 * d_model^2 parameters
    d_model=dim,            # Model dimension d_model
    d_state=128,            # SSM state size
    headdim=64,             # SSM headdim
    is_mimo=True,           # Use MIMO mode
    mimo_rank=4,            # MIMO rank when is_mimo=True
    chunk_size=16,          # 64/mimo_rank if x is in bf16, else 32/mimo_rank
    is_outproj_norm=False,  # Additional post SSM norm
    dtype=torch.bfloat16,
).to("cuda")
y = model(x)
assert y.shape == x.shape
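The chunk_size comment above encodes a small rule: 64 divided by mimo_rank for bf16 inputs, 32 divided by mimo_rank otherwise. A helper that applies it might look like the sketch below. The function name is hypothetical, and the divisibility check is an added assumption rather than something the source states.

```python
def suggested_chunk_size(mimo_rank: int, is_bf16: bool = True) -> int:
    """Apply the rule from the usage comment: 64/mimo_rank (bf16) or 32/mimo_rank."""
    base = 64 if is_bf16 else 32
    # Assumption: only ranks that divide the base evenly are meaningful here.
    if mimo_rank <= 0 or base % mimo_rank != 0:
        raise ValueError(f"mimo_rank={mimo_rank} does not divide {base}")
    return base // mimo_rank


print(suggested_chunk_size(4, is_bf16=True))   # matches chunk_size=16 above
print(suggested_chunk_size(4, is_bf16=False))
```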
Mamba Language Model
Finally, we provide an example of a complete language model: a deep sequence model backbone (with repeating Mamba blocks) + language model head.
Source: models/mixer_seq_simple.py.
This is an example of how to integrate Mamba into an end-to-end neural network. This example is used in the generation scripts below.
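As a rough sketch of how the language model might be used directly, the helpers below load a pretrained checkpoint and return the logits for the next token. The helper names are hypothetical. Pairing the checkpoints with the EleutherAI/gpt-neox-20b tokenizer and loading in float16 are assumptions, and the code needs a CUDA GPU with mamba-ssm and transformers installed.

```python
def load_mamba_lm(name: str = "state-spaces/mamba-130m", device: str = "cuda"):
    """Load a pretrained Mamba language model and a tokenizer."""
    # Imports are deferred so that defining this helper does not need a GPU.
    import torch
    from transformers import AutoTokenizer
    from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

    # Assumption: the released checkpoints use the GPT-NeoX tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    model = MambaLMHeadModel.from_pretrained(name, device=device, dtype=torch.float16)
    model.eval()
    return model, tokenizer


def next_token_logits(model, tokenizer, prompt: str, device: str = "cuda"):
    """Run one forward pass and return the logits at the last position."""
    import torch

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.inference_mode():
        logits = model(input_ids).logits  # shape: (batch, length, vocab)
    return logits[:, -1]
```

A typical call would be `model, tok = load_mamba_lm()` followed by `next_token_logits(model, tok, "Mamba is")`.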
Pretrained Models
Pretrained models are uploaded to Hugging Face: mamba-130m, mamba-370m, mamba-790m, mamba-1.4b, mamba-2.8b, mamba2-130m, mamba2-370m, mamba2-780m, mamba2-1.3b, mamba2-2.7b, transformerpp-2.7b, mamba2attn-2.7b, trained on 300B tokens on the Pile, as well as mamba-2.8b-slimpj (trained on 600B tokens on the SlimPajama dataset).
The models will be autodownloaded by the generation script below.
These models were trained on the Pile, and follow the standard model dimensions described by GPT-3 and followed by many open source models:
| Parameters | Layers | Model dim. |
|------------|--------|------------|
| 130M       | 24     | 768        |
| 370M       | 48     | 1024       |
| 790M       | 48     | 1536       |
| 1.4B       | 48     | 2048       |
| 2.8B       | 64     | 2560       |
(The layer count of Mamba doubles that of a Transformer with similar size, as two Mamba blocks are needed for each "layer" (MHA block + MLP block) of a Transformer.)
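The table can be sanity-checked with a back-of-the-envelope estimate: layers times the per-block figure of roughly 3 * expand * d_model^2, plus one embedding matrix. The sketch below assumes expand=2, a vocabulary of 50280 entries, and an embedding shared with the output head. These are assumptions for illustration, not values stated in this README, so the totals are only approximate.

```python
VOCAB_SIZE = 50280  # assumption: GPT-NeoX-style vocabulary, padded

# name: (layers, model dim), copied from the table above
MODEL_DIMS = {
    "130M": (24, 768),
    "370M": (48, 1024),
    "790M": (48, 1536),
    "1.4B": (48, 2048),
    "2.8B": (64, 2560),
}


def estimate_params(n_layer: int, d_model: int, expand: int = 2,
                    vocab_size: int = VOCAB_SIZE) -> int:
    """Rough total: Mamba blocks plus one (assumed tied) embedding matrix."""
    per_block = 3 * expand * d_model ** 2
    embedding = vocab_size * d_model
    return n_layer * per_block + embedding


for name, (n_layer, d_model) in MODEL_DIMS.items():
    print(f"{name}: ~{estimate_params(n_layer, d_model) / 1e6:.0f}M parameters")
```

The estimates land slightly below the nominal sizes, which is expected because the convolution, SSM, and normalization parameters are left out.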
Note: these are base models trained for only 300B tokens, without any form of downstream modification (instruction tuning, etc.). Performance is expected to be comparable to or better than other architectures trained on similar data, but not to match larger or fine-tuned models.
Evaluations
To run zero-shot evaluations of models (corresponding to Table 3 of the paper), we use the lm-evaluation-harness library.
- Install lm-evaluation-harness by pip install lm-eval==0.4.2.
- Run evaluation with (more documentation at the lm-evaluation-harness repo):
lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-130m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256
python evals/lm_harness_eval.py --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --device cuda --batch_size 64
To reproduce the results on the mamba-2.8b-slimpj model reported in the blogposts:
lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa,race,truthfulqa_mc2 --device cuda --batch_size 256
lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba-2.8b-slimpj --tasks mmlu --num_fewshot 5 --device cuda --batch_size 256
To run evaluations on Mamba-2 models, simply replace the model names:
lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba2-2.7b --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256
lm_eval --model mamba_ssm --model_args pretrained=state-spaces/transformerpp-2.7b --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256
lm_eval --model mamba_ssm --model_args pretrained=state-spaces/mamba2attn-2.7b --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256
Note that the result of each task might differ from reported values by 0.1-0.3 due to noise in the evaluation process.
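When sweeping several checkpoints, the commands above can be assembled programmatically. The sketch below uses only the flags that appear in this section. `build_lm_eval_command` is a hypothetical helper, and nothing is executed unless the resulting list is passed to `subprocess.run`.

```python
import shlex
from typing import List, Optional, Sequence

ZERO_SHOT_TASKS = [
    "lambada_openai", "hellaswag", "piqa", "arc_easy",
    "arc_challenge", "winogrande", "openbookqa",
]


def build_lm_eval_command(
    pretrained: str,
    tasks: Sequence[str] = ZERO_SHOT_TASKS,
    batch_size: int = 256,
    device: str = "cuda",
    num_fewshot: Optional[int] = None,
) -> List[str]:
    """Build an lm_eval argument list mirroring the commands in this section."""
    cmd = [
        "lm_eval", "--model", "mamba_ssm",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
    ]
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    cmd += ["--device", device, "--batch_size", str(batch_size)]
    return cmd


for model in ("state-spaces/mamba-130m", "state-spaces/mamba2-2.7b"):
    print(shlex.join(build_lm_eval_command(model)))
```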
Inference
The script benchmarks/benchmark_generation_mamba_simple.py
- autoloads a model from the Hugging Face Hub,
- generates completions of a user-specified prompt,
- benchmarks the inference speed of this generation.
Other configurable options include the top-p (nucleus sampling) probability, and the softmax temperature.
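For readers unfamiliar with these two options, the sketch below shows what temperature and top-p do to a next-token distribution. It is a generic pure-Python illustration with hypothetical function names, not the sampling code used by the benchmark script.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits / temperature; lower temperature sharpens the result."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: Sequence[float], top_p: float) -> List[float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / cumulative  # renormalize the surviving tokens
    return out


def sample(logits: Sequence[float], temperature: float = 1.0,
           top_p: float = 1.0, rng=random) -> int:
    """Draw one token index after temperature scaling and top-p filtering."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


print(top_p_filter([0.5, 0.3, 0.2], top_p=0.7))
print(sample([0.0, 10.0, 0.0], temperature=0.7, top_p=0.5))
```

With top_p=0.7, the least likely of the three tokens is dropped and the remaining two are renormalized.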
Examples
To test generation latency (e.g. batch size = 1) with different sampling strategies:
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-2.8b" --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
python benchmarks/benchmark_generation_mamba_simple.py --model-name "EleutherAI/pythia-2.8b" --prompt "My cat wrote all this CUDA code for a new language model and" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-2.8b" --prompt "My cat wrote all this CUDA code for a new language model and" --minp 0.05 --topk 0 --temperature 0.7 --repetition-penalty 1.2
To test generation throughput with random prompts (e.g. large batch size):
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba-2.8b" --batch 64
python benchmarks/benchmark_generation_mamba_simple.py --model-name "EleutherAI/pythia-2.8b" --batch 64
With Mamba-2, you just need to change the model name:
python benchmarks/benchmark_generation_mamba_simple.py --model-name "state-spaces/mamba2-2.7b" --prompt "My cat wrote al
