MolmoAct
Official Repository for MolmoAct
Updates
- [2025/12/5] 🔥 Tips on zero-shot evaluation of allenai/MolmoAct-7B-D-0812 on a Franka setup with MolmoAct mid-training have been added at 5.4 Real-world.
- [2025/11/30] 🔥 Code for steering experiment of MolmoAct in SimplerEnv has been released at 5.3 Steer-SimplerEnv.
- [2025/10/24] 🔥 Code for fine-tuning and data processing have been released! Everything is fully open-source.
- [2025/08/30] 🔥 Code for replicating MolmoAct's training pipeline has been released
- [2025/08/15] 🔥 Code for MolmoAct Evaluation on SimplerEnv has been released at allenai/SimplerEnv
- [2025/08/12] 🔥 Datasets used for our pre-training and mid-training have been released
- [2025/08/12] 🔥 Models have been released
Table of Contents
- Overview
- Release Notes
  - 2.1 Datasets
  - 2.2 Models
- Installation
- Training
  - 4.1 Train Your Own MolmoAct
    - 4.1.1 Data Processing
    - 4.1.2 Fine-tuning (Post-training)
    - 4.1.3 Merge LoRA
    - 4.1.4 Inference
    - 4.1.5 Visualization
  - 4.2 Training Replication
    - 4.2.1 Pre-training
    - 4.2.2 Mid-training
    - 4.2.3 Post-training (LIBERO)
- Evaluation
  - 5.1 SimplerEnv
  - 5.2 LIBERO
  - 5.3 Steer-SimplerEnv
  - 5.4 Real-world
- License and Use
- Model and Hardware Safety
- Citation
- Contacts
1. Overview
MolmoAct is the repository for training and using Ai2's open-source Action Reasoning Model, which can reason in space.
2. Release Notes
2.1 Datasets
| Data | Description | Dataset Path |
|------|-------------|--------------|
| MolmoAct Dataset | MolmoAct dataset in LeRobot format. All contents were collected in-house by Ai2. | https://huggingface.co/datasets/allenai/MolmoAct-Dataset |
| MolmoAct Pre-training Mixture | Data mixture for MolmoAct pre-training. Contains a subset of OXE formulated as Action Reasoning data, auxiliary robot data, and web data. | https://huggingface.co/datasets/allenai/MolmoAct-Pretraining-Mixture |
| MolmoAct Mid-training Mixture | Data mixture for MolmoAct mid-training. Contains MolmoAct Dataset formulated as Action Reasoning data. | https://huggingface.co/datasets/allenai/MolmoAct-Midtraining-Mixture |
2.2 Models
| Model | Use Case | Description | Checkpoint Path |
|-------|----------|-------------|-----------------|
| MolmoAct-7B-D | Fine-tuning | Best/demo MolmoAct; adapt to real robots by fine-tuning on your datasets. | https://huggingface.co/allenai/MolmoAct-7B-D-0812 |
| MolmoAct-7B-O | Fine-tuning | Most open MolmoAct; adapt to real robots by fine-tuning on your datasets. | https://huggingface.co/allenai/MolmoAct-7B-O-0812 |
| MolmoAct-7B-D-Pretrain | Inference | Checkpoint to replicate zero-shot results on SimplerEnv (Google Robot). | https://huggingface.co/allenai/MolmoAct-7B-D-Pretrain-0812 |
| MolmoAct-7B-D-Pretrain-RT-1 | Inference | Checkpoint to replicate RT-1 fine-tuned results on SimplerEnv (Google Robot). | https://huggingface.co/allenai/MolmoAct-7B-D-Pretrain-RT-1-0812 |
3. Installation
We provide a Dockerfile for building the Docker image in which we ran all of our training experiments. We strongly recommend building the same image yourself and running training inside it.
If you would rather set up the environment manually, first install Python 3.11, then install PyTorch following the instructions for your operating system.
In either case, go to your working folder and run:
git clone https://github.com/allenai/molmoact.git
cd molmoact
pip install -e .[all]
4. Training
We provide instructions on both how to train your own MolmoAct and how to replicate all of our training stages:
4.1 Train Your Own MolmoAct
4.1.1 Data Processing
Installation for Data Processing
Command
git clone https://github.com/DepthAnything/Depth-Anything-V2.git
cd Depth-Anything-V2 &&
pip install -r requirements.txt &&
pip uninstall -y opencv-python opencv-python-headless opencv-contrib-python &&
pip install opencv-python-headless --no-cache-dir &&
pip install lerobot==0.3.3
Download Depth Anything V2 Checkpoint
Command
wget https://huggingface.co/allenai/MolmoAct-7B-D-0812/resolve/main/depth_anything_v2_vitb.pth
mv <path/to/depth_anything_v2_vitb.pth> <path/to/Depth-Anything-V2/checkpoints>
Download MolmoAct VQVAE Checkpoint
Command
wget https://huggingface.co/allenai/MolmoAct-7B-D-0812/resolve/main/vae-final.pt
To preprocess a dataset in the conventional LeRobot format into Action Reasoning Data, first run the preprocessing command:
Command
export DEPTH_CHECKPOINT_DIR="<path/to/Depth-Anything-V2/checkpoints>"
export VQVAE_MODEL_PATH="<path/to/vqvae.pt>"
python preprocess/action_reasoning_data.py \
--dataset-path <lerobot/repo_id> \
--output-path <path/to/processed_dataset> \
--depth-encoder vitb \
--line-length 5 \
--process-actions \
--action-bins 256 \
--action-chunk-size 8
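The `--action-bins` and `--action-chunk-size` flags discretize continuous action values into 256 uniform bins and group them into chunks of 8 steps. A minimal sketch of this kind of uniform binning (the exact value ranges and normalization used in `preprocess/action_reasoning_data.py` may differ):

```python
import numpy as np

def discretize_actions(actions, num_bins=256, low=-1.0, high=1.0):
    """Map continuous action values in [low, high] to integer bin ids in [0, num_bins - 1]."""
    actions = np.clip(np.asarray(actions, dtype=np.float64), low, high)
    # Normalize to [0, 1], then scale to bin indices
    return ((actions - low) / (high - low) * (num_bins - 1)).round().astype(int)

def chunk_actions(ids, chunk_size=8):
    """Group per-step action ids into fixed-size chunks, dropping any trailing remainder."""
    n = (len(ids) // chunk_size) * chunk_size
    return [ids[i:i + chunk_size] for i in range(0, n, chunk_size)]

print(discretize_actions([-1.0, 0.0, 1.0]).tolist())  # [0, 128, 255]
```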
4.1.2 Fine-tuning (Post-training)
Note that after you finish the data processing above, you should have a folder /path/to/processed_dataset containing all the data along with dataset_statistics.json. Then, replace finetune:/path/to/processed_dataset with the actual path in launch_scripts/train_multitask_model.py. To run the training, the following script is provided, which should work well on 8 A100/H100 GPUs. Adjust the global batch size to your GPU setup to avoid OOM.
WANDB_API_KEY=<your_wandb_api_key> torchrun \
--nnodes=1 --nproc-per-node=8 \
--node_rank="${RANK}" --master_addr="${ADDR}" --master_port="${PORT}" \
launch_scripts/train_multitask_model.py \
robot-finetune allenai/MolmoAct-7B-D-0812 \
--wandb.name=<name> --wandb.entity=<entity> --wandb.project=<project> \
--norm_stats_path /path/to/dataset_statistics.json \
--save_folder=checkpoints/<exp_name> \
--save_overwrite \
--duration 10000 \
--ft_embedding all \
--depth_tokens \
--global_batch_size 16 \
--lr_connector 5e-4 \
--lr_vit 5e-4 \
--lr_llm 5e-4 \
--save_interval 2000 \
--save_num_checkpoints_to_keep 5 \
--max_images 2 \
--lora_enable --lora_rank 32 --lora_alpha 16 --lora_dropout 0.0 \
--img_aug
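When adjusting `--global_batch_size`, keep it divisible by the number of GPUs (times any gradient-accumulation steps), since each GPU sees global batch divided by world size per step. A quick helper to sanity-check a configuration before launching (hypothetical helper, not part of the repo):

```python
def per_gpu_batch(global_batch_size, num_gpus, grad_accum_steps=1):
    """Return the micro-batch size each GPU sees per forward pass."""
    denom = num_gpus * grad_accum_steps
    if global_batch_size % denom != 0:
        raise ValueError(f"global batch {global_batch_size} not divisible by {denom}")
    return global_batch_size // denom

# The script above: global batch 16 on 8 GPUs -> 2 samples per GPU per step
print(per_gpu_batch(16, 8))  # 2
```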
Note that during fine-tuning, we disable high-resolution crops by default, downsizing all training images to at most 378x378, since none of our training stages enables this feature. For more details on these flags, please refer to section 4.2 Training Replication.
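The downsizing described above amounts to scaling the longer image side down to at most 378 pixels while preserving aspect ratio; a minimal sketch of that shape computation (the actual resize logic in the training code may differ):

```python
def downsized_shape(width, height, max_side=378):
    """Compute the target size so the longer side fits within max_side."""
    if max(width, height) <= max_side:
        return width, height  # already small enough, keep as-is
    scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)

print(downsized_shape(756, 378))  # (378, 189)
```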
4.1.3 Merge LoRA
If you performed LoRA fine-tuning instead of full-parameter fine-tuning (which is what we did for most of our post-training experiments), you need to merge the adapters into the original model weights. When training with LoRA, our checkpointer will save sharded checkpoints and LoRA adapters (named with stepXXX and `s
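Conceptually, merging folds each adapter back into its base weight as W' = W + (alpha / rank) * B @ A. A numpy sketch of that update (the repo's merge script additionally handles sharded checkpoints, which this omits; the tiny rank-4 shapes here are for illustration only, whereas the fine-tuning command above uses rank 32 / alpha 16):

```python
import numpy as np

def merge_lora(W, A, B, rank, alpha):
    """Fold a LoRA adapter into the base weight: W' = W + (alpha / rank) * B @ A."""
    return W + (alpha / rank) * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))  # base weight (out_dim x in_dim)
B = rng.normal(size=(16, 4))   # low-rank factor B (out_dim x rank)
A = rng.normal(size=(4, 16))   # low-rank factor A (rank x in_dim)

merged = merge_lora(W, A, B, rank=4, alpha=16)
print(merged.shape)  # (16, 16)
```

After merging, the adapter matrices can be discarded and the model served like any full checkpoint.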
