Molmo

Code for the Molmo Vision-Language Model

<div align="center"> <img src="assets/Molmo-logo.svg" alt="Molmo Logo" width="800" style="margin: 0 auto; display: block;"/> <br> <br> <h1>Molmo: Multimodal Open Language Model</h1> </div> <p align="center"> <a href="https://github.com/allenai/mm_olmo/blob/release/LICENSE"> <img alt="GitHub License" src="https://img.shields.io/github/license/allenai/OLMo"> </a> <a href="https://molmo.allenai.org/blog"> <img alt="Blog Post" src="https://img.shields.io/badge/Molmo-blog-F0529C"> </a> <a href="https://arxiv.org/pdf/2409.17146"> <img alt="Paper URL" src="https://img.shields.io/badge/arxiv-2409.17146-blue"> </a> <a href="https://huggingface.co/collections/allenai/molmo-66f379e6fe3b8ef090a8ca19"> <img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-yellow"> </a> <a href="https://huggingface.co/collections/allenai/pixmo-674746ea613028006285687b"> <img alt="PixMo (Datasets)" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-PixMo (Datasets)-yellow"> </a> </p>

Molmo is a repository for training and using Ai2's state-of-the-art multimodal open language models.

Here is a video demo of Molmo's capabilities. Try Molmo using our public demo showcasing the Molmo-7B-D model.
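Beyond the demo, the released checkpoints can be run locally. The sketch below follows the usage pattern shown on the Hugging Face model cards (`AutoProcessor`/`AutoModelForCausalLM` with `trust_remote_code=True`); the image path and prompt are placeholders, and running it requires a GPU and the model download.

```python
# Sketch of running a released Molmo checkpoint via Hugging Face transformers,
# following the usage pattern on the model cards. The image path is a placeholder.
MODEL_ID = "allenai/Molmo-7B-D-0924"

def describe_image(image_path: str, prompt: str = "Describe this image.") -> str:
    # Heavy imports are kept inside the function so the sketch can be read or
    # imported without pulling in torch/transformers.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    processor = AutoProcessor.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )

    # Preprocess one image + prompt, then add a batch dimension.
    inputs = processor.process(images=[Image.open(image_path)], text=prompt)
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    # Decode only the newly generated tokens, not the prompt.
    generated = output[0, inputs["input_ids"].size(1):]
    return processor.tokenizer.decode(generated, skip_special_tokens=True)
```

Swapping `MODEL_ID` for any of the other released checkpoints listed below should work the same way.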

This codebase builds on the OLMo codebase, adding vision encoding and integrating generative evaluations.

Release Notes

  • [2024/12/05] 🔥 Molmo: code for modeling, training and evaluation has been released. You can find the detailed technical report here.

  • [2024/11/27] 🔥 PixMo, our new collection of datasets for pre-training and fine-tuning VLMs, has been released. PixMo consists of:

    • PixMo-Cap (pre-training, fine-tuning): highly detailed dense caption dataset (roughly 200 words on average)
    • PixMo-AskModelAnything (fine-tuning): instruction-tuning data containing human-authored image-question-answer triplets
    • PixMo-CapQA (fine-tuning): synthetic instruction-tuning data, using an LLM to build QA pairs from dense captions of images
    • PixMo-Points (fine-tuning): images paired with referring expressions and annotated points, supporting grounding and counting
    • PixMo-Point-Explanations (fine-tuning): instruction-tuning data with explanations containing in-line points referring to parts of the image
    • PixMo-Docs (fine-tuning): synthetic image-question-answer triplets about various kinds of computer-generated charts, tables, diagrams and documents. Code available here.
    • PixMo-Clocks (fine-tuning): virtual watch faces and time annotations
    • PixMo-Count (fine-tuning): diverse images with counting QA pairs

    All datasets were constructed without the use of VLMs.

<div align="center"> <img src="assets/png_version_molmo_pixmo.png" alt="Pixmo and Molmo" width="800" style="margin: 0 auto; display: block;"/> <br> <p>Datasets in PixMo (left) and the capabilities they enable in Molmo (right).</p> </div>
  • [2024/09/24] 🔥 Molmo, a new family of open VLMs, has been released. The models in the family are listed under Huggingface Models and Logs below.

Installation

We recommend using Python 3.10. First, install PyTorch following the instructions for your operating system.

To install dependencies, run:

git clone https://github.com/allenai/molmo.git
cd molmo
pip install -e .[all]

For training and evaluating MolmoE-1B, please install megablocks by running pip install git+https://github.com/Muennighoff/megablocks.git@olmoe.

Huggingface Models and Logs

The core models in the Molmo family released so far are:

<table> <tr> <th>Model</th> <th>Vision Encoder</th> <th>LLM</th> <th align="center">11-benchmark avg</th> </tr> <tr> <td><a href="https://huggingface.co/allenai/MolmoE-1B-0924">MolmoE-1B-0924</a></td> <td rowspan="4"><a href="https://huggingface.co/openai/clip-vit-large-patch14-336">OpenAI CLIP ViT-L/14@336</a></td> <td><a href="https://huggingface.co/allenai/OLMoE-1B-7B-0924">OLMoE-1B-7B-0924</a></td> <td align="center">68.6</td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo-7B-O-0924">Molmo-7B-O-0924</a></td> <td><a href="https://huggingface.co/allenai/OLMo-7B-1024-preview">OLMo-7B-1024-preview</a></td> <td align="center">74.6</td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo-7B-D-0924">Molmo-7B-D-0924</a></td> <td><a href="https://huggingface.co/Qwen/Qwen2-7B">Qwen2-7B</a></td> <td align="center">77.3</td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo-72B-0924">Molmo-72B-0924</a></td> <td><a href="https://huggingface.co/Qwen/Qwen2-72B">Qwen2-72B</a></td> <td align="center">81.2</td> </tr> </table>

W&B logs: pre-training, fine-tuning

Data Downloading and Setup

Molmo uses Hugging Face datasets for most data, so most of it is stored in the default Hugging Face cache (see here for how to change its location). Some additional data is stored separately in the path set by MOLMO_DATA_DIR.

For example, if you want to store the data in /data/molmo you could set

export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface

Data can then be downloaded with:

python3 scripts/download_data.py all --n_proc 12

Downloading the PixMo datasets requires downloading images from URLs. The download script does this automatically, but it takes time; downloading everything from scratch can take up to a day. More processes make it faster but increase the risk of being rate-limited.
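To survive transient failures and rate limiting, URL downloads like these typically retry with exponential backoff. This is an illustrative sketch of that pattern, not the actual download script; `fetch` stands in for any callable that returns bytes or raises on failure.

```python
import time

# Illustrative retry-with-backoff pattern (not the actual download script).
# `fetch` is any callable that returns bytes or raises IOError on failure.
def download_with_retries(fetch, url, max_attempts=5, base_delay=0.01):
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except IOError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky server: fails twice, then succeeds.
attempts = {"n": 0}
def flaky_fetch(url):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("503 rate limited")
    return b"image-bytes"

data = download_with_retries(flaky_fetch, "https://example.com/img.jpg")
```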

Downloading can be resumed if canceled or an error occurs mid-download.
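The core of resumability is a skip-if-present check: files already on disk are not fetched again, so rerunning after an interruption only does the remaining work. A minimal sketch of that idea (the real script is more involved):

```python
import tempfile
from pathlib import Path

# Minimal sketch of resumable downloading: skip files that already exist,
# so a rerun after an interruption only fetches what is missing.
def sync_files(wanted: dict[str, bytes], out_dir: Path) -> list[str]:
    downloaded = []
    for name, content in wanted.items():
        target = out_dir / name
        if target.exists():  # already fetched on a previous run
            continue
        target.write_bytes(content)  # stands in for the real network fetch
        downloaded.append(name)
    return downloaded

out = Path(tempfile.mkdtemp())
first = sync_files({"a.jpg": b"A", "b.jpg": b"B"}, out)   # fetches both
second = sync_files({"a.jpg": b"A", "b.jpg": b"B"}, out)  # resume: nothing left
```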

Some datasets (InfoQa and Scene-Text) require manually downloading the files. The download scripts will throw an error if those files are not found.

Downloading the Android Control dataset requires additional dependencies, since it requires parsing the original tfrecords.

To download a specific dataset, pass its name:

python3 scripts/download_data.py ChartQa --n_proc 12

Visualizing Data

Once downloaded, datasets can be visualized with the scripts/dataset_visualize.py script:

python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir
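Conceptually, a visualizer like this renders each example into a page you can eyeball. The sketch below is a hypothetical, stripped-down version of the idea (the real scripts/dataset_visualize.py is more involved): it writes (image, question, answer) examples into a simple HTML file in the output directory.

```python
import html
import tempfile
from pathlib import Path

# Hypothetical, stripped-down sketch of dataset visualization: render each
# (image, question, answer) example as a block in one HTML page.
def render_examples(examples, out_dir: Path) -> Path:
    rows = [
        f"<div><img src='{html.escape(ex['image'])}' width='300'/>"
        f"<p><b>Q:</b> {html.escape(ex['question'])}</p>"
        f"<p><b>A:</b> {html.escape(ex['answer'])}</p></div>"
        for ex in examples
    ]
    page = "<html><body>" + "\n".join(rows) + "</body></html>"
    out = out_dir / "index.html"
    out.write_text(page)
    return out

viz = render_examples(
    [{"image": "chart_0.png", "question": "What is the max value?", "answer": "42"}],
    Path(tempfile.mkdtemp()),
)
```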

Trained Models

We release model weights both after pre-training and after fine-tuning, in a format compatible with this codebase. The fine-tuned weights match those in the Hugging Face repos but are stored in a slightly different format; likewise, the config files are backwards-compatible with this repo but differ slightly in format.

<table> <tr> <th>Model</th> <th>Pretrained</th> <th>Fine-Tuned</th> </tr> <tr> <td><a href="https://huggingface.co/allenai/MolmoE-1B-0924">MolmoE-1B-0924</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/MolmoE-1B-0924-Pretrained.tar">pretrained</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/MolmoE-1B-0924.tar">fine-tuned</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo-7B-O-0924">Molmo-7B-O-0924</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-O-0924-Pretrained.tar">pretrained</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-O-0924.tar">fine-tuned</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo-7B-D-0924">Molmo-7B-D-0924</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924-Pretrained.tar">pretrained</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924.tar">fine-tuned</a></td> </tr> <tr> <td><a href="https://huggingface.co/allenai/Molmo-72B-0924">Molmo-72B-0924</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-72B-0924-Pretrained.tar">pretrained</a></td> <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-72B-0924.tar">fine-tuned</a></td> </tr> </table>

To use them, download and untar the file. Each folder contains the needed config file and model weights. For example:

wget https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924.tar
tar -xf Molmo-7B-D-0924.tar 
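The same extraction can be scripted in Python with the standard-library tarfile module. To stay self-contained, this sketch builds a tiny stand-in archive first, then extracts it the way you would extract a downloaded checkpoint tarball. (On Python 3.12+, `extractall` also accepts `filter="data"` to refuse unsafe member paths.)

```python
import tarfile
import tempfile
from pathlib import Path

# Self-contained stand-in: build a tiny archive shaped like a checkpoint
# folder (config + weights would live inside), then extract it.
work = Path(tempfile.mkdtemp())
(work / "model").mkdir()
(work / "model" / "config.yaml").write_text("placeholder: true\n")

archive = work / "Molmo-checkpoint.tar"
with tarfile.open(archive, "w") as tar:
    tar.add(work / "model", arcname="model")

# Extraction, as for a real downloaded .tar checkpoint.
dest = work / "extracted"
with tarfile.open(archive) as tar:
    tar.extractall(dest)

extracted = sorted(p.name for p in (dest / "model").iterdir())
```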

Evaluation

Evaluation is done with the launch_scripts/eval_downstream.py script. FSDP can be used to evaluate large models, or for high-resolution evaluation.
