Molmo2
Code for the Molmo2 Vision-Language Model
This repository is for training and using Ai2's open vision-language models, Molmo2 and MolmoPoint. Molmo2 is state-of-the-art among open-source models and demonstrates exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks, as shown below. MolmoPoint is an extension with a new architecture for pointing. This README is mostly concerned with Molmo2; see the MolmoPoint documentation for how to train MolmoPoint.
<div align="center"> <img src="assets/molmo2_capabilities.png" alt="Molmo2 Capabilities" width="1200" style="margin-left: auto; margin-right: auto; display: block;"/> </div>

See our blog post or our paper for more details about Molmo2. HuggingFace models can be found here.
Table of Contents
- Setup
- Training and Evaluations
- Transformers and vLLM
- Code
Setup
Installation
We recommend using Python >= 3.11. First, install PyTorch according to the instructions specific to your operating system.
To install dependencies, run:
git clone https://github.com/allenai/molmo2.git
cd molmo2
pip install torchcodec
pip install -e .[all]
We recommend installing torchcodec separately, since it has some complex dependencies that
can break if it is installed together with the others via `pip install -e .[all]`.
Docker
We provide a container with the dependencies (but not the code) pre-installed, pull it with:
docker pull ghcr.io/allenai/molmo2:latest
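Since the container ships the dependencies but not the code, a typical invocation mounts the cloned repository and the data directory into it. The mount paths below are illustrative, not prescribed by the image:

```shell
docker run --gpus all -it \
  -v "$PWD":/workspace/molmo2 \
  -v /data/molmo:/data/molmo \
  -e MOLMO_DATA_DIR=/data/molmo \
  -w /workspace/molmo2 \
  ghcr.io/allenai/molmo2:latest /bin/bash
```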
Downloading Data
Molmo2 uses a mix of HuggingFace datasets and custom data stored in the directory specified by `MOLMO_DATA_DIR`.
For example, if you want to store the data in /data/molmo you could set
export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface
See here for more info on where the HuggingFace data is stored.
We provide a script to download most datasets:
python3 scripts/download_datasets.py all --n_proc 8
Downloads can be resumed by re-running the command if they are canceled or fail partway through.
Some datasets need to be manually downloaded, often due to licensing agreements. See the relevant classes for their locations and download instructions. These include:
- DocQA, InfoQA, and SceneText need to be downloaded from https://rrc.cvc.uab.es.
- LVBench needs to be downloaded from https://huggingface.co/datasets/zai-org/LVBench.
- MLVU and LongVideoBench have HuggingFace user agreements that must be accepted before the download scripts will work.
- The nturgbd subset of MVBench needs to be manually downloaded.
- Tracking datasets that require manual download: Ref-YT-VOS, YTVIS, ReVOS, LaSOT, Molmo2VideoTrack, and others. See
`olmo/data/academic_video_track_datasets.py` and `olmo/data/molmo2_video_track_datasets.py` for download instructions.
The download scripts will throw an error and provide instructions if those files are not found.
To download a specific dataset provide the dataset or class name as follows:
python3 scripts/download_datasets.py ChartQa --n-procs 12
You can also download by group:
# Download image academic benchmarks
python3 scripts/download_datasets.py image_academic
# Download multiple specific datasets
python3 scripts/download_datasets.py text_vqa doc_qa chart_qa
# Download video academic benchmarks
python3 scripts/download_datasets.py video_academic
# Download all video tracking datasets (MOT + SOT)
python3 scripts/download_datasets.py video_tracking --n-procs 8
Available groups: image_academic, video_academic, pixmo, image_pointing, video_pointing, video_tracking, demo.
Downloading Pretrained Models for Training from Scratch
Pretrained models can be downloaded and prepared with scripts/prepare_pretrained_model.py
For example:
python scripts/prepare_pretrained_model.py qwen3_4b_instruct
python scripts/prepare_pretrained_model.py siglip2
This will download the checkpoint, convert it into a compatible format, and save a sharded version,
for fast loading, in the location specified by the corresponding config in `olmo/model_configs.py`.
Visualizing Data
Once downloaded, datasets can be visualized by using the scripts/dataset_visualize.py script:
python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir
This script builds an HTML file showing what the data looks like after pre-processing.
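As a rough illustration of the kind of page such a script produces (the sample data, field names, and rendering here are hypothetical, not the actual script's output):

```python
import html
import os
import tempfile

def render_samples_html(samples, out_path):
    """Write a minimal HTML page with one bordered block per dataset example."""
    rows = []
    for s in samples:
        rows.append(
            "<div style='border:1px solid #ccc; margin:8px; padding:8px'>"
            f"<p><b>Q:</b> {html.escape(s['question'])}</p>"
            f"<p><b>A:</b> {html.escape(s['answer'])}</p>"
            "</div>"
        )
    page = "<html><body>" + "\n".join(rows) + "</body></html>"
    with open(out_path, "w") as f:
        f.write(page)
    return out_path

# Hypothetical pre-processed samples, stood in for real dataset records.
samples = [
    {"question": "What is the largest bar?", "answer": "2019"},
    {"question": "What color is the line?", "answer": "red"},
]
out = render_samples_html(samples, os.path.join(tempfile.gettempdir(), "viz.html"))
print(out)
```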
Environment
Generally training runs should use these flags:
HF_DATASETS_OFFLINE=1
OLMO_SHARED_FS=1
HF_ACCESS_TOKEN=YOUR_HF_KEY
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
WANDB_API_KEY=YOUR_WANDB_KEY
OMP_NUM_THREADS=8
HF_DATASETS_OFFLINE stops HuggingFace from sending many requests to the dataset hub even though the data
has already been downloaded.
OLMO_SHARED_FS tells the code to assume that, for multi-node jobs, you are saving to a shared
file system.
HF_ACCESS_TOKEN might be used to download the tokenizer, OPENAI_API_KEY might be used in some evaluations,
and WANDB_API_KEY is for wandb logging.
OMP_NUM_THREADS limits the number of OpenMP threads used by torch.
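Put together, the flags above can be exported in one block before launching a run; the placeholder values must be replaced with real keys:

```shell
export HF_DATASETS_OFFLINE=1
export OLMO_SHARED_FS=1
export HF_ACCESS_TOKEN=YOUR_HF_KEY
export OPENAI_API_KEY=YOUR_OPENAI_API_KEY
export WANDB_API_KEY=YOUR_WANDB_KEY
export OMP_NUM_THREADS=8
```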
Training and Evaluations
Molmo2 training has three stages:
- Pre-Training — Train on image captioning, NLP, and image pointing using `launch_scripts/pretrain.py`. Start from pretrained LLM + ViT weights.
- SFT — Multitask supervised fine-tuning on the full mixture (QA, pointing, tracking, video, etc.) using `launch_scripts/sft.py`. Start from a pretrained checkpoint.
- Long-Context SFT — Continue SFT with longer sequences (36k+ tokens, 384 frames) for improved video understanding. Uses the same `launch_scripts/sft.py` with an increased `--seq_len`.
Each stage produces a checkpoint that feeds into the next. We release checkpoints at each stage (see below).
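The hand-off between stages can be sketched as a sequence of launches. The `--load_path` and `--save_folder` flags and the exact sequence length below are illustrative placeholders, not necessarily the scripts' actual argument names; only `--seq_len` is confirmed above:

```shell
CKPT=/data/molmo/checkpoints

# Stage 1: pre-training on captioning, NLP, and pointing, from LLM + ViT weights
python launch_scripts/pretrain.py --save_folder "$CKPT/pretrain"

# Stage 2: multitask SFT, initialized from the pre-trained checkpoint
python launch_scripts/sft.py --load_path "$CKPT/pretrain" --save_folder "$CKPT/sft"

# Stage 3: long-context SFT with longer sequences for video understanding
python launch_scripts/sft.py --load_path "$CKPT/sft" --seq_len 36864 --save_folder "$CKPT/long_sft"
```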
Checkpoints
We release model weights after pre-training, SFT, and long-context SFT in a format compatible with this codebase. The long-context SFT checkpoints match the HuggingFace repo checkpoints but have a slightly different format. The config files are backwards-compatible with this repo but might not match exactly.
<table> <tr> <th>HF Model</th> <th>Pretrained Checkpoint</th> <th>SFT Checkpoint</th> <th>Long-Context SFT Checkpoint</th> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-4B">Molmo2-4B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B.tar">Long-Context SFT</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-8B">Molmo2-8B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B.tar">Long-Context SFT</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-O-7B">Molmo2-O-7B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-O-7B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-O-7B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis