# Molmo

Code for the Molmo Vision-Language Model
Molmo is a repository for training and using Ai2's state-of-the-art multimodal open language models.
Here is a video demo of Molmo's capabilities. Try Molmo using our public demo showcasing the Molmo-7B-D model.
This codebase is based on the OLMo codebase, with the addition of vision encoding and integrated generative evaluations.
## Release Notes
- [2024/12/05] 🔥 Molmo: code for modeling, training, and evaluation has been released. You can find the detailed technical report here.
- [2024/11/27] 🔥 PixMo, our new collection of datasets for pre-training and fine-tuning VLMs, has been released. PixMo consists of:
  - PixMo-Cap (pre-training, fine-tuning): a highly detailed dense-caption dataset (roughly 200 words per caption on average)
  - PixMo-AskModelAnything (fine-tuning): instruction-tuning data containing human-authored image-question-answer triplets
  - PixMo-CapQA (fine-tuning): synthetic instruction-tuning data, using an LLM to build QA pairs from dense captions of images
  - PixMo-Points (fine-tuning): images paired with referring expressions and annotated points, supporting grounding and counting
  - PixMo-Point-Explanations (fine-tuning): instruction-tuning data with explanations containing in-line points referring to parts of the image
  - PixMo-Docs (fine-tuning): synthetic image-question-answer triplets about various kinds of computer-generated charts, tables, diagrams, and documents. Code available here.
  - PixMo-Clocks (fine-tuning): virtual watch faces and time annotations
  - PixMo-Count (fine-tuning): diverse images with counting QA pairs

  All datasets were constructed without the use of VLMs.
- [2024/09/24] 🔥 Molmo, a new family of open VLMs, has been released. The Molmo family consists of:
  - MolmoE-1B: a mixture-of-experts model with 1B active and 7B total parameters
  - Molmo-7B-O: our most open 7B model
  - Molmo-7B-D: our best 7B model, used in the demo
  - Molmo-72B: our best 72B model
## Installation
We recommend using Python 3.10. First, install PyTorch following the instructions for your operating system.
To install dependencies, run:
```bash
git clone https://github.com/allenai/molmo.git
cd molmo
pip install -e .[all]
```
For training and evaluating MolmoE-1B, please install megablocks by running `pip install git+https://github.com/Muennighoff/megablocks.git@olmoe`.
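After installing, a quick way to confirm which dependencies are importable before launching a run is a small check like the one below (a generic sketch, not a script shipped with this repo):

```python
import importlib.util

def check_optional_deps(names):
    """Return a dict mapping package name -> whether it is importable."""
    return {n: importlib.util.find_spec(n) is not None for n in names}

# megablocks is only needed when training or evaluating MolmoE-1B.
status = check_optional_deps(["torch", "megablocks"])
for name, ok in status.items():
    print(f"{name}: {'found' if ok else 'missing -- install it first'}")
```

`importlib.util.find_spec` checks importability without actually importing the package, so the check is fast even for heavy dependencies like PyTorch.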
## Hugging Face Models and Logs
The core models in the Molmo family released so far are:
<table>
  <tr>
    <th>Model</th>
    <th>Vision Encoder</th>
    <th>LLM</th>
    <th align="center">11-benchmark avg</th>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/MolmoE-1B-0924">MolmoE-1B-0924</a></td>
    <td rowspan="4"><a href="https://huggingface.co/openai/clip-vit-large-patch14-336">OpenAI CLIP ViT-L/14@336</a></td>
    <td><a href="https://huggingface.co/allenai/OLMoE-1B-7B-0924">OLMoE-1B-7B-0924</a></td>
    <td align="center">68.6</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/Molmo-7B-O-0924">Molmo-7B-O-0924</a></td>
    <td><a href="https://huggingface.co/allenai/OLMo-7B-1024-preview">OLMo-7B-1024-preview</a></td>
    <td align="center">74.6</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/Molmo-7B-D-0924">Molmo-7B-D-0924</a></td>
    <td><a href="https://huggingface.co/Qwen/Qwen2-7B">Qwen2-7B</a></td>
    <td align="center">77.3</td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/Molmo-72B-0924">Molmo-72B-0924</a></td>
    <td><a href="https://huggingface.co/Qwen/Qwen2-72B">Qwen2-72B</a></td>
    <td align="center">81.2</td>
  </tr>
</table>

W&B logs: pre-training, fine-tuning
## Data Downloading and Setup
Molmo uses Hugging Face datasets for most data, so most data will be stored in the default Hugging Face cache. See here for how to change it. Some additional data is stored separately in the path set by `MOLMO_DATA_DIR`.
For example, if you want to store the data in /data/molmo, you could set:

```bash
export MOLMO_DATA_DIR=/data/molmo
export HF_HOME=/data/molmo/huggingface
```
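Inside Python, an environment variable like `MOLMO_DATA_DIR` is typically resolved with a fallback default. The sketch below shows the pattern; the fallback path is illustrative, not the repo's actual default:

```python
import os
from pathlib import Path

def molmo_data_dir() -> Path:
    # Fall back to ~/.cache/molmo when MOLMO_DATA_DIR is unset
    # (this fallback is an assumption for illustration).
    return Path(os.environ.get("MOLMO_DATA_DIR", str(Path.home() / ".cache" / "molmo")))

os.environ["MOLMO_DATA_DIR"] = "/data/molmo"
print(molmo_data_dir())  # /data/molmo
```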
Data can then be downloaded with:

```bash
python3 scripts/download.py all --n_proc 12
```
Downloading the PixMo datasets requires downloading images from URLs. The download script does this automatically, but it takes time: downloading everything from scratch can take up to a day. Using more processes speeds it up, but also increases the risk of getting rate-limited.
Downloading can be resumed if canceled or an error occurs mid-download.
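The resume behavior usually comes from a skip-if-exists pattern: each worker checks the target path before fetching. A self-contained sketch is below; `fetch` is a stand-in for a real HTTP request so the example stays offline, and all names are hypothetical rather than taken from the repo's scripts:

```python
import concurrent.futures
from pathlib import Path

def fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request (e.g. urllib.request.urlopen(url).read()),
    # kept offline so this sketch is self-contained.
    return f"image bytes from {url}".encode()

def download_one(url: str, out_dir: Path) -> str:
    target = out_dir / url.rsplit("/", 1)[-1]
    if target.exists():  # resume: skip anything already on disk
        return "skipped"
    target.write_bytes(fetch(url))
    return "downloaded"

def download_all(urls, out_dir: Path, n_proc: int = 4):
    out_dir.mkdir(parents=True, exist_ok=True)
    # Threads (rather than processes) are usually enough for I/O-bound downloads.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_proc) as pool:
        return list(pool.map(lambda u: download_one(u, out_dir), urls))
```

Running `download_all` a second time over the same directory skips every file that already landed, which is what makes an interrupted download resumable.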
Some datasets (InfoQa and Scene-Text) require manually downloading the files. The download scripts will throw an error if those files are not found.
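The kind of pre-flight check such a script performs can be sketched as follows (the function and filenames here are hypothetical, not the repo's actual API):

```python
from pathlib import Path

def require_manual_files(data_dir: Path, filenames) -> None:
    # Fail early with a clear list of missing manually-downloaded files,
    # instead of erroring later mid-pipeline.
    missing = [name for name in filenames if not (data_dir / name).exists()]
    if missing:
        raise FileNotFoundError(
            f"Missing files in {data_dir}: {', '.join(missing)}. "
            "Please download them manually before running the script."
        )
```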
Downloading the Android Control dataset requires additional dependencies, since the original tfrecords must be parsed.
To download a specific dataset, pass its name to the script:

```bash
python3 scripts/download_data.py ChartQa --n_proc 12
```
## Visualizing Data
Once downloaded, datasets can be visualized using the `scripts/dataset_visualize.py` script:

```bash
python3 scripts/dataset_visualize.py chart_qa /path/to/viz/dir
```
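A visualizer of this kind typically renders dataset examples into browsable files. The minimal stand-in below writes one HTML page from (image, question, answer) records; the record shape is assumed for illustration and is not the repo's actual schema:

```python
import html
from pathlib import Path

def visualize_examples(examples, viz_dir: Path) -> Path:
    # examples: iterable of dicts with "image", "question", "answer" keys
    # (an assumed record shape, not the repo's actual schema).
    viz_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for ex in examples:
        rows.append(
            f"<div><img src='{html.escape(ex['image'])}' width='320'>"
            f"<p><b>Q:</b> {html.escape(ex['question'])}</p>"
            f"<p><b>A:</b> {html.escape(ex['answer'])}</p></div>"
        )
    page = viz_dir / "index.html"
    page.write_text("<html><body>" + "\n".join(rows) + "</body></html>")
    return page
```

Escaping questions and answers with `html.escape` keeps annotation text (which may contain `<` or `&`) from breaking the generated page.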
## Trained Models
We release model weights both after pre-training and after fine-tuning, in a format compatible with this codebase. The fine-tuned weights match those in the Hugging Face repos but are stored in a slightly different format; the config files likewise differ slightly in format but remain backwards-compatible with this repo.
<table>
  <tr>
    <th>Model</th>
    <th>Pretrained</th>
    <th>Fine-Tuned</th>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/MolmoE-1B-0924">MolmoE-1B-0924</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/MolmoE-1B-0924-Pretrained.tar">pretrained</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/MolmoE-1B-0924.tar">fine-tuned</a></td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/Molmo-7B-O-0924">Molmo-7B-O-0924</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-O-0924-Pretrained.tar">pretrained</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-O-0924.tar">fine-tuned</a></td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/Molmo-7B-D-0924">Molmo-7B-D-0924</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924-Pretrained.tar">pretrained</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924.tar">fine-tuned</a></td>
  </tr>
  <tr>
    <td><a href="https://huggingface.co/allenai/Molmo-72B-0924">Molmo-72B-0924</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-72B-0924-Pretrained.tar">pretrained</a></td>
    <td><a href="https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-72B-0924.tar">fine-tuned</a></td>
  </tr>
</table>

To use them, download the file and untar it. Each folder contains the needed config file and model weights. For example:
```bash
wget https://storage.googleapis.com/oe-training-public/Molmo-0924/Molmo-7B-D-0924.tar
tar -xf Molmo-7B-D-0924.tar
```
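The same extraction can be done from Python, which also makes it easy to locate the config inside the extracted folder. The config filename pattern below is an assumption for illustration:

```python
import tarfile
from pathlib import Path

def extract_checkpoint(tar_path: Path, dest: Path) -> list:
    """Extract a checkpoint tarball and return any YAML config files found."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tf:
        tf.extractall(dest)  # on Python 3.12+, consider filter="data" for safety
    # The *.yaml pattern is assumed; check the extracted folder for the actual name.
    return sorted(dest.rglob("*.yaml"))
```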
## Evaluation
Evaluation is done with the `launch_scripts/eval_downstream.py` script.
FSDP can be used to evaluate large models, or for high-resolu
