Molmo2

Code for the Molmo2 Vision-Language Model

<div align="center"> <img src="assets/Molmo2-logo.svg" alt="Molmo2 Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/> <br> <br> <h1>Molmo 2: State-of-the-art video understanding, pointing, and tracking</h1> </div> <p align="center"> <a href="https://github.com/allenai/molmo2/LICENSE"> <img alt="GitHub License" src="https://img.shields.io/github/license/allenai/OLMo"> </a> <a href="https://allenai.org/blog/molmo2"> <img alt="Blog Post" src="https://img.shields.io/badge/Molmo2-blog-F0529C"> </a> <a href="https://arxiv.org/abs/2601.10611"> <img alt="Paper URL" src="https://img.shields.io/badge/arxiv-2601.10611-blue"> </a> <a href="https://huggingface.co/collections/allenai/molmo2"> <img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-yellow"> </a> <a href="https://huggingface.co/collections/allenai/molmo2-data"> <img alt="Molmo2 Datasets" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Datasets-yellow"> </a> </p>

This repository is for training and using Ai2's open vision-language models, Molmo2 and MolmoPoint. Molmo2 is state-of-the-art among open-source models and demonstrates exceptional new capabilities in point-driven grounding across single-image, multi-image, and video tasks, as shown below. MolmoPoint is an extension with a new architecture for pointing. This README focuses on Molmo2; see MolmoPoint for how to train MolmoPoint.

<div align="center"> <img src="assets/molmo2_capabilities.png" alt="Molmo2 Capabilites" width="1200" style="margin-left:'auto' margin-right:'auto' display:'block'"/> </div>

See our blog post or our paper for more details about Molmo2. Hugging Face models can be found here.

Table of Contents

Setup

Installation

We recommend using Python >= 3.11. First install PyTorch according to the instructions specific to your operating system.

To install dependencies, run:

git clone https://github.com/allenai/molmo2.git
cd molmo2
pip install torchcodec
pip install -e .[all]

We recommend installing torchcodec separately: it has some complex dependencies that can break when it is installed together with the others via pip install -e .[all]

Docker

We provide a container with the dependencies (but not the code) pre-installed. Pull it with:

docker pull ghcr.io/allenai/molmo2:latest

Downloading Data

Molmo2 uses a mix of huggingface datasets and custom data stored in MOLMO_DATA_DIR.

For example, if you want to store the data in /data/molmo you could set

export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface
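
In code, this directory is simply read from the environment. A minimal sketch of the lookup (the helper name and fallback path are hypothetical, not the repo's actual API):

```python
import os
from pathlib import Path

def molmo_data_dir() -> Path:
    # Fall back to a home-directory default when MOLMO_DATA_DIR is unset
    # (the fallback path here is illustrative only).
    return Path(os.environ.get("MOLMO_DATA_DIR", "~/molmo_data")).expanduser()

os.environ["MOLMO_DATA_DIR"] = "/data/molmo"
data_dir = molmo_data_dir()
```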

See here for more info on where the huggingface data is stored.

We provide a script to download most datasets:

python3 scripts/download_datasets.py all --n_proc 8

Downloading can be resumed if canceled or an error occurs mid-download.
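Resumable downloading of this kind is usually implemented by skipping files that already exist with the expected size, so re-running the script only fetches what is missing. A hedged sketch of that pattern (not the script's actual logic; `fetch` stands in for the real HTTP call):

```python
from pathlib import Path

def download_if_missing(url: str, dest: Path, expected_size: int, fetch) -> bool:
    """Download `url` to `dest` unless a complete copy is already there.

    Returns True if a fetch actually happened, False if the file was skipped.
    """
    if dest.exists() and dest.stat().st_size == expected_size:
        return False  # already complete: resuming is a no-op for this file
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(fetch(url))
    return True
```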

Some datasets must be downloaded manually, often due to licensing agreements. See the relevant dataset classes for their locations and download instructions. These include:

  • DocQA, InfoQA, and SceneText need to be downloaded from https://rrc.cvc.uab.es.
  • LVBench needs to be downloaded from https://huggingface.co/datasets/zai-org/LVBench.
  • MLVU and LongVideoBench have HuggingFace user agreements that must be accepted before the download scripts will work
  • The nturgbd subset of MVBench needs to be manually downloaded.
  • Tracking datasets that require manual download: Ref-YT-VOS, YTVIS, ReVOS, LaSOT, and Molmo2VideoTrack, among others. See olmo/data/academic_video_track_datasets.py and olmo/data/molmo2_video_track_datasets.py for download instructions.

The download scripts will throw an error and provide instructions if those files are not found.

To download a specific dataset provide the dataset or class name as follows:

python3 scripts/download_datasets.py ChartQa --n-procs 12

You can also download by group:

# Download image academic benchmarks
python3 scripts/download_datasets.py image_academic

# Download multiple specific datasets
python3 scripts/download_datasets.py text_vqa doc_qa chart_qa

# Download video academic benchmarks
python3 scripts/download_datasets.py video_academic

# Download all video tracking datasets (MOT + SOT)
python3 scripts/download_datasets.py video_tracking --n-procs 8

Available groups: image_academic, video_academic, pixmo, image_pointing, video_pointing, video_tracking, demo.
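Group names like these typically just expand into lists of dataset names before the per-dataset download logic runs, with plain dataset names passed through unchanged. A hypothetical sketch of that dispatch (the group contents below are illustrative, not the real mixture in scripts/download_datasets.py):

```python
# Hypothetical group -> dataset mapping; the real lists live in
# scripts/download_datasets.py.
GROUPS = {
    "image_academic": ["chart_qa", "doc_qa", "text_vqa"],
    "video_academic": ["mlvu", "long_video_bench"],
}

def resolve_targets(args: list[str]) -> list[str]:
    """Expand group names into dataset names, passing plain names through."""
    out: list[str] = []
    for name in args:
        out.extend(GROUPS.get(name, [name]))
    return out
```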

Downloading Pretrained Models for Training from Scratch

Pretrained models can be downloaded and prepared with scripts/prepare_pretrained_model.py.

For example:

python scripts/prepare_pretrained_model.py qwen3_4b_instruct
python scripts/prepare_pretrained_model.py siglip2

This will download the checkpoint, convert it into a compatible format, and save a sharded version in the location specified by the corresponding config in olmo/model_configs.py for fast loading.
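"Convert it into a compatible format" generally amounts to remapping state-dict keys from the source model's naming scheme to the one this codebase expects. A toy illustration of that step (all key names and the prefix map below are invented, not the repo's actual schema):

```python
def remap_state_dict(state: dict, prefix_map: dict[str, str]) -> dict:
    """Rename checkpoint keys by prefix, leaving unmatched keys untouched."""
    out = {}
    for key, value in state.items():
        for old, new in prefix_map.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out

# e.g. 'model.layers.0.attn.weight' -> 'transformer.blocks.0.attn.weight'
remapped = remap_state_dict(
    {"model.layers.0.attn.weight": 1},
    {"model.layers.": "transformer.blocks."},
)
```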

Visualizing Data

Once downloaded, datasets can be visualized with the scripts/dataset_visualize.py script:

python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir

This script builds an HTML file showing what the data looks like after pre-processing.
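The general pattern behind such a visualizer is to render each preprocessed example as one HTML row. A minimal self-contained sketch (the example field names are hypothetical, not the repo's actual schema):

```python
import html

def render_examples(examples: list[dict]) -> str:
    """Build a single-page HTML preview of (image, question, answer) rows."""
    rows = []
    for ex in examples:
        rows.append(
            "<tr><td><img src='{src}' width='200'></td>"
            "<td>{q}</td><td>{a}</td></tr>".format(
                src=html.escape(ex["image"]),
                q=html.escape(ex["question"]),
                a=html.escape(ex["answer"]),
            )
        )
    return "<table>" + "".join(rows) + "</table>"
```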

Environment

Generally training runs should use these flags:

HF_DATASETS_OFFLINE=1
OLMO_SHARED_FS=1
HF_ACCESS_TOKEN=YOUR_HF_KEY
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
WANDB_API_KEY=YOUR_WANDB_KEY
OMP_NUM_THREADS=8

HF_DATASETS_OFFLINE stops Hugging Face from repeatedly querying the dataset hub even though the data has already been downloaded.

OLMO_SHARED_FS tells the code to assume, for multi-node jobs, that you are saving to a shared filesystem.

HF_ACCESS_TOKEN might be used to download the tokenizer, OPENAI_API_KEY might be used in some evaluations, and WANDB_API_KEY is for wandb logging.

OMP_NUM_THREADS sets the number of OpenMP threads used by torch.
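Flags like HF_DATASETS_OFFLINE and OLMO_SHARED_FS behave as simple "1"/"0" environment toggles. A small illustrative reader for such flags (the helper is hypothetical, mirroring how toggles like these are commonly parsed):

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret a '1'/'0'-style environment toggle such as OLMO_SHARED_FS."""
    return os.environ.get(name, "1" if default else "0") == "1"

os.environ["OLMO_SHARED_FS"] = "1"
shared_fs = env_flag("OLMO_SHARED_FS")
```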

Training and Evaluations

Molmo2 training has three stages:

  1. Pre-Training — Train on image captioning, NLP, and image pointing using launch_scripts/pretrain.py. Start from pretrained LLM + ViT weights.
  2. SFT — Multitask supervised fine-tuning on the full mixture (QA, pointing, tracking, video, etc.) using launch_scripts/sft.py. Start from a pretrained checkpoint.
  3. Long-Context SFT — Continue SFT with longer sequences (36k+ tokens, 384 frames) for improved video understanding. Uses the same launch_scripts/sft.py with increased --seq_len.

Each stage produces a checkpoint that feeds into the next. We release checkpoints at each stage (see below).
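The handoff between stages can be pictured as each launch script consuming the previous stage's output checkpoint. A schematic of that chaining (the --load_path flag, checkpoint paths, and sequence lengths below are illustrative, not the scripts' actual interface):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    script: str
    seq_len: int

# Illustrative pipeline; the real entry points are launch_scripts/pretrain.py
# and launch_scripts/sft.py, as described above.
PIPELINE = [
    Stage("pretrain", "launch_scripts/pretrain.py", 4096),
    Stage("sft", "launch_scripts/sft.py", 8192),
    Stage("long_context_sft", "launch_scripts/sft.py", 36864),
]

def plan(init_ckpt: str) -> list[str]:
    """Return shell commands that feed each stage's checkpoint into the next."""
    cmds, ckpt = [], init_ckpt
    for stage in PIPELINE:
        cmds.append(
            f"python {stage.script} --load_path {ckpt} --seq_len {stage.seq_len}"
        )
        ckpt = f"checkpoints/{stage.name}"  # hypothetical output location
    return cmds
```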

Checkpoints

We release model weights after pre-training, SFT, and long-context SFT in a format compatible with this codebase. The long-context SFT checkpoints match the Hugging Face repo checkpoints but have a slightly different format. The config files are backwards-compatible with this repo but might not match exactly.

<table> <tr> <th>HF Model</th> <th>Pretrained Checkpoint</th> <th>SFT Checkpoint</th> <th>Long-Context SFT Checkpoint</th> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-4B">Molmo2-4B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-4B.tar">Long-Context SFT</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-8B">Molmo2-8B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-8B.tar">Long-Context SFT</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo2-O-7B">Molmo2-O-7B</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-O-7B-Pretrain.tar">Pretrain</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo2-1225/Molmo2-O-7B-SFT.tar">SFT</a></td> <td><a href="https://storage.googleapis