Molmo2
Code for the Molmo2 Vision-Language Model
This repository is for training and using Ai2's open vision-language models, Molmo2 and MolmoPoint. Molmo2 is state-of-the-art among open-source models and demonstrates exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks, as shown below. MolmoPoint is an extension with a new architecture for pointing. This README is mostly concerned with Molmo2; see the MolmoPoint documentation for how to train MolmoPoint.
<div align="center"> <img src="assets/molmo2_capabilities.png" alt="Molmo2 Capabilities" width="1200" style="margin-left: auto; margin-right: auto; display: block;"/> </div>

See our blog post or our paper for more details about Molmo2. HuggingFace models can be found here.
Table of Contents
- Setup
- Training and Evaluations
- Transformers and vLLM
- Code
Setup
Installation
We recommend using Python >= 3.11. First, install PyTorch according to the instructions specific to your operating system.
To install dependencies, run:
git clone https://github.com/allenai/molmo2.git
cd molmo2
pip install torchcodec
pip install -e .[all]
We recommend installing torchcodec separately, since it has some complex dependencies that
can break if it is installed together with the others via `pip install -e .[all]`.
Docker
We provide a container with the dependencies (but not the code) pre-installed, pull it with:
docker pull ghcr.io/allenai/molmo2:latest
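Since the container ships the dependencies but not the code, a typical invocation mounts the cloned repository and the data directory into it. The mount paths below are illustrative, not prescribed by the image:

```shell
docker run --gpus all -it \
  -v "$PWD":/workspace/molmo2 \
  -v /data/molmo:/data/molmo \
  -e MOLMO_DATA_DIR=/data/molmo \
  -w /workspace/molmo2 \
  ghcr.io/allenai/molmo2:latest /bin/bash
```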
Downloading Data
Molmo2 uses a mix of HuggingFace datasets and custom data stored in the directory specified by `MOLMO_DATA_DIR`.
For example, if you want to store the data in /data/molmo you could set
export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface
See here for more info on where the HuggingFace data is stored.
We provide a script to download most datasets:
python3 scripts/download_datasets.py all --n_proc 8
Downloads can be resumed by re-running the command if they are canceled or fail partway through.
Some datasets need to be manually downloaded, often due to licensing agreements. See the relevant classes for their locations and download instructions. These include:
- DocQA, InfoQA, and SceneText need to be downloaded from https://rrc.cvc.uab.es.
- LVBench needs to be downloaded from https://huggingface.co/datasets/zai-org/LVBench.
- MLVU and LongVideoBench have HuggingFace user agreements that must be accepted before the download scripts will work.
- The nturgbd subset of MVBench needs to be manually downloaded.
- Tracking datasets that require manual download: Ref-YT-VOS, YTVIS, ReVOS, LaSOT, Molmo2VideoTrack, and others. See
`olmo/data/academic_video_track_datasets.py` and `olmo/data/molmo2_video_track_datasets.py` for download instructions.
The download scripts will throw an error and provide instructions if those files are not found.
To download a specific dataset provide the dataset or class name as follows:
python3 scripts/download_datasets.py ChartQa --n-procs 12
You can also download by group:
# Download image academic benchmarks
python3 scripts/download_datasets.py image_academic
# Download multiple specific datasets
python3 scripts/download_datasets.py text_vqa doc_qa chart_qa
# Download video academic benchmarks
python3 scripts/download_datasets.py video_academic
# Download all video tracking datasets (MOT + SOT)
python3 scripts/download_datasets.py video_tracking --n-procs 8
Available groups: image_academic, video_academic, pixmo, image_pointing, video_pointing, video_tracking, demo.
Downloading Pretrained Models for Training from Scratch
Pretrained models can be downloaded and prepared with scripts/prepare_pretrained_model.py
For example:
python scripts/prepare_pretrained_model.py qwen3_4b_instruct
python scripts/prepare_pretrained_model.py siglip2
This will download the checkpoint, convert it into a compatible format, and save a sharded version,
for fast loading, in the location specified by the corresponding config in `olmo/model_configs.py`.
Visualizing Data
Once downloaded, datasets can be visualized by using the scripts/dataset_visualize.py script:
python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir
This script builds an HTML file showing what the data looks like after pre-processing.
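As a rough illustration of the kind of page such a script produces (the sample data, field names, and rendering here are hypothetical, not the actual script's output):

```python
import html
import os
import tempfile

def render_samples_html(samples, out_path):
    """Write a minimal HTML page with one bordered block per dataset example."""
    rows = []
    for s in samples:
        rows.append(
            "<div style='border:1px solid #ccc; margin:8px; padding:8px'>"
            f"<p><b>Q:</b> {html.escape(s['question'])}</p>"
            f"<p><b>A:</b> {html.escape(s['answer'])}</p>"
            "</div>"
        )
    page = "<html><body>" + "\n".join(rows) + "</body></html>"
    with open(out_path, "w") as f:
        f.write(page)
    return out_path

# Hypothetical pre-processed samples, stood in for real dataset records.
samples = [
    {"question": "What is the largest bar?", "answer": "2019"},
    {"question": "What color is the line?", "answer": "red"},
]
out = render_samples_html(samples, os.path.join(tempfile.gettempdir(), "viz.html"))
print(out)
```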
Environment
Generally training runs should use these flags:
HF_DATASETS_OFFLINE=1
OLMO_SHARED_FS=1
HF_ACCESS_TOKEN=YOUR_HF_KEY
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
WANDB_API_KEY=YOUR_WANDB_KEY
OMP_NUM_THREADS=8
HF_DATASETS_OFFLINE stops HuggingFace from sending many requests to the dataset hub even though the data
has already been downloaded.
OLMO_SHARED_FS tells the code to assume that, for multi-node jobs, you are saving to a shared
file system.
HF_ACCESS_TOKEN might be used to download the tokenizer, OPENAI_API_KEY might be used in some evaluations,
and WANDB_API_KEY is for wandb logging.
OMP_NUM_THREADS limits the number of OpenMP threads used by torch.
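Put together, the flags above can be exported in one block before launching a run; the placeholder values must be replaced with real keys:

```shell
export HF_DATASETS_OFFLINE=1
export OLMO_SHARED_FS=1
export HF_ACCESS_TOKEN=YOUR_HF_KEY
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
export WANDB_API_KEY=YOUR_WANDB_KEY
export OMP_NUM_THREADS=8
```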
Training and Evaluations
Molmo2 training has three stages:
- Pre-Training — Train on image captioning, NLP, and image pointing using `launch_scripts/pretrain.py`. Start from pretrained LLM + ViT weights.
- SFT — Multitask supervised fine-tuning on the full mixture (QA, pointing, tracking, video, etc.) using `launch_scripts/sft.py`. Start from a pretrained checkpoint.
- Long-Context SFT — Continue SFT with longer sequences (36k+ tokens, 384 frames) for improved video understanding. Uses the same `launch_scripts/sft.py` with an increased `--seq_len`.
Each stage produces a checkpoint that feeds into the next. We release checkpoints at each stage (see below).
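The hand-off between stages can be sketched as a sequence of launches. The `--load_path` and `--save_folder` flags and the exact sequence length below are illustrative placeholders, not necessarily the scripts' actual argument names; only `--seq_len` is confirmed above:

```shell
CKPT=/data/molmo/checkpoints

# Stage 1: pre-training on captioning, NLP, and pointing, from LLM + ViT weights
python launch_scripts/pretrain.py --save_folder "$CKPT/pretrain"

# Stage 2: multitask SFT, initialized from the pre-trained checkpoint
python launch_scripts/sft.py --load_path "$CKPT/pretrain" --save_folder "$CKPT/sft"

# Stage 3: long-context SFT with longer sequences for video understanding
python launch_scripts/sft.py --load_path "$CKPT/sft" --seq_len 36864 --save_folder "$CKPT/long_sft"
```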
Checkpoints
We release model weights after pre-training, SFT, and long-context SFT in a format compatible with this codebase. The long-context SFT checkpoints match the HuggingFace repo checkpoints but have a slightly different format. The config files are backwards-compatible with this repo but might not match exactly.
<table> <tr> <th>HF Model</th> <th>Pretrained Checkpoint</th> <th>SFT Checkpoint</th> <th>Long-Context SFT Checkpoint</th> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-4B">Molmo2-4B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B.tar">Long-Context SFT</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-8B">Molmo2-8B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B.tar">Long-Context SFT</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-O-7B">Molmo2-O-7B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-O-7B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-O-7B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis