VideoMolmo

Official code of the paper "VideoMolmo: Spatio-Temporal Grounding meets Pointing"

VideoMolmo: Spatio-Temporal Grounding meets Pointing

<div align="left" style="margin:24px 0;"> <img src="https://user-images.githubusercontent.com/74038190/212284115-f47cd8ff-2ffb-4b04-b5bf-4d1c14c0247f.gif" width="100%" height="4"/> </div> <a href="https://mbzuai-oryx.github.io/VideoMolmo/"><img src="https://img.shields.io/badge/Project-Website-87CEEB?style=flat-square" alt="Website"></a> <a href="https://arxiv.org/abs/2506.05336"><img src="https://img.shields.io/badge/arXiv-Paper-brightgreen?style=flat-square" alt="arXiv"></a> <a href="https://huggingface.co/datasets/ghazishazan/VPoS"><img src="https://img.shields.io/badge/🤗_Dataset-Access-green" alt="dataset"></a> <a href="https://huggingface.co/ghazishazan/VideoMolmo"><img src="https://img.shields.io/badge/HuggingFace-Model-F9D371" alt="model"></a> <a href="https://colab.research.google.com/drive/1gqg5kBP9MYkdarEry7QS5rJFQYOG7DiF?usp=sharing"><img src="https://img.shields.io/badge/Run-Colab-orange?style=flat-square&logo=google-colab" alt="Colab"></a> <a href="https://github.com/khufia">Ghazi Shazan Ahmad</a>*, <a href="https://scholar.google.com/citations?user=JcWO9OUAAAAJ&hl=en">Ahmed Heakl</a>*, <a href="https://hananshafi.github.io/">Hanan Gani</a>, <a href="https://amshaker.github.io/">Abdelrahman Shaker</a>, <a href="https://zhiqiangshen.com/">Zhiqiang Shen</a>, <a href="https://scholar.google.es/citations?user=zvaeYnUAAAAJ&hl=en">Fahad Shahbaz Khan</a>, <a href="https://salman-h-khan.github.io/">Salman Khan</a> MBZUAI · Linköping University · ANU *Equal Technical Contributions

🆕 Latest Updates

📢 June 2025: Colab Notebook with our bidirectional inference method is released!
📢 May 2025: Paper and inference code are released!

📊 Overview

VideoMolmo is a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module that uses an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition (first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation) not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability.
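The bidirectional fusion idea in the overview can be illustrated with a small sketch. This is not the repository's implementation: the function names, the IoU threshold of 0.5, and the fusion rule (take the union when the two masks agree, otherwise fall back to the forward mask) are assumptions chosen for illustration. The inputs are two binary masks for the same frame, one propagated forward from an earlier frame and one propagated backward from a later frame.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection-over-union of two binary masks of equal shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(a, b).sum() / union)


def fuse_masks(forward, backward, iou_threshold=0.5):
    """Fuse a forward- and a backward-propagated mask (hypothetical rule).

    If the masks agree (IoU >= threshold), return their union;
    otherwise keep the forward mask unchanged.
    """
    forward = forward.astype(bool)
    backward = backward.astype(bool)
    if mask_iou(forward, backward) >= iou_threshold:
        return np.logical_or(forward, backward)
    return forward
```

Other fusion rules (intersection, or picking the mask closer in time to a frame with a predicted point) fit the same interface; the rule above is only one plausible choice.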

🏆 Highlights

Key contributions of VideoMolmo:

We introduce VideoMolmo, an LMM that accepts natural-language queries and produces point-level predictions for target objects across entire video sequences, ensuring temporal consistency.
We further introduce a temporal module that leverages past temporal context, and propose a novel temporal mask fusion pipeline for enhanced temporal coherence.
To achieve fine-grained spatio-temporal pointing, we introduce a comprehensive dataset of 72k video-caption pairs and 100k object points.
To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also assess our model on Referring Video Object Segmentation (Ref-VOS) and Reasoning VOS tasks.
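To make the point-level predictions mentioned above concrete, the sketch below parses point tags in the style emitted by the base Molmo model, for example `<point x="61.5" y="40.4" alt="person">person</point>`, where coordinates are percentages of the image width and height. That VideoMolmo emits exactly this format is an assumption, not something this README states, and `parse_points` is a hypothetical helper rather than a function from the repository.

```python
import re

# Matches x="..", y=".." pairs as well as numbered x1="..", y1=".." pairs.
_COORD = re.compile(r'x(\d*)="([\d.]+)"\s+y\1="([\d.]+)"')


def parse_points(text, width, height):
    """Return (x, y) pixel coordinates found in Molmo-style point tags.

    Assumes the tag coordinates are percentages (0-100) of the frame size.
    """
    points = []
    for _, x, y in _COORD.findall(text):
        points.append((float(x) / 100.0 * width, float(y) / 100.0 * height))
    return points
```

For a 640x480 frame, `parse_points('<point x="50.0" y="25.0" alt="cat">cat</point>', 640, 480)` returns a single point at the horizontal centre, a quarter of the way down the frame.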

🧠 Architecture

VideoMolmo consists of four end-to-end trainable components, (1) a visual encoder, (2) a temporal module, (3) a visual projector, and (4) a decoder-only large language model (LLM), followed by a post-processing module.
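As a rough illustration of what the temporal module does (conditioning the current frame on preceding frames through attention), here is a minimal NumPy sketch. It is a simplification under stated assumptions: a single attention head, no learned projections, and a plain residual connection. The function name `condition_on_past` and the tensor shapes are hypothetical and are not taken from the VideoMolmo code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def condition_on_past(current, past):
    """Add context from preceding frames to the current frame's features.

    current: (N, D) patch features of the current frame (queries).
    past:    (T, N, D) patch features of T preceding frames (keys/values).
    Returns an (N, D) array: current features plus attended past context.
    """
    d = current.shape[-1]
    context = past.reshape(-1, d)                # (T*N, D)
    scores = current @ context.T / np.sqrt(d)    # (N, T*N)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return current + weights @ context           # residual connection
```

A trained module would apply learned query, key, and value projections before the attention step; they are omitted here to keep the mechanism visible.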

🏗️ Benchmark and Annotation Pipeline

We propose a semi-automatic annotation pipeline for creating a grounded conversation generation (GCG) dataset for videos.

📈 Results

|1| VideoMolmo demonstrates robust generalization and fine-grained spatio-temporal grounding across diverse out-of-distribution scenarios from our proposed benchmark, for instance, correctly pointing to traffic lights (2nd row) in challenging driving scenes despite never encountering such scenarios during training.

|2| Quantitative results showing VideoMolmo with an average improvement of 4.1% over SoTA (VideoGLaMM) and 4.8% over our baseline (Molmo+SAM2).

🔧 Running VideoMolmo

Environment setup

(1) Set up the environment and PyTorch

git clone https://github.com/mbzuai-oryx/VideoMolmo
cd VideoMolmo/VideoMolmo
conda create -n .videomolmo python=3.10 -y
conda activate .videomolmo
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu121

(2) Set up Molmo

git clone https://github.com/allenai/molmo.git
cd molmo && pip install -e ".[all]" && cd .. # setup molmo requirements (quoted so that zsh does not expand the brackets)
pip install -r requirements.txt

(3) Set up SAM2

python setup.py build_ext --inplace # build sam2
mkdir -p sam2_checkpoints
wget https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt -O sam2_checkpoints/sam2.1_hiera_large.pt
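After the three setup steps, a quick sanity check can catch a wrong PyTorch build or a missing checkpoint before running inference. The script below is a sketch using only the standard library; it is not shipped with the repository, and the helper name `check_packages` is hypothetical. The pinned versions and the checkpoint path are copied from the commands above.

```python
from importlib import metadata
from pathlib import Path

# Versions pinned by the pip command in step (1).
PINNED = {"torch": "2.5.1", "torchvision": "0.20.1"}
# Checkpoint location used by the wget command in step (3).
CHECKPOINT = Path("sam2_checkpoints/sam2.1_hiera_large.pt")


def check_packages(pinned):
    """Map each package name to 'ok', 'missing', or 'mismatch (<installed>)'."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build suffixes such as '+cu121' when comparing.
        base = installed.split("+")[0]
        report[name] = "ok" if base == wanted else f"mismatch ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_packages(PINNED).items():
        print(f"{name}: {status}")
    print(f"checkpoint present: {CHECKPOINT.is_file()}")
```

Run it from the directory in which the `wget` command was executed, since the checkpoint path is relative.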

🔄 Inference

To run inference on the provided sample video:

python infer.py \
  --video_path ../examples/video_sample1 \
  --prompt "point to the person in red shirt" \
  --save_path "results"

Your video should be a folder with all the frames. Sample structure:

video_sample1/
├── frame_0001.jpg
├── frame_0002.jpg
├── frame_0003.jpg
└── ...
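Before passing a folder to `--video_path`, it can help to confirm that it contains frames and that they sort in the intended order. The helper below is a hypothetical sketch, not part of the repository. The set of accepted extensions is an assumption (only `.jpg` appears in the sample), and sorting by file name matches temporal order only when frame numbers are zero-padded as in the example.

```python
from pathlib import Path

# Assumed set of frame extensions; only .jpg is shown in the sample folder.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_frames(video_dir):
    """Return the image files in a frame folder, sorted by file name."""
    folder = Path(video_dir)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a folder of frames")
    frames = sorted(
        p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not frames:
        raise ValueError(f"no image frames found in {folder}")
    return frames
```

For example, `len(list_frames("../examples/video_sample1"))` gives the number of frames that would be processed under these assumptions.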

The output includes a segmentation mask for each frame and a JSON Lines file (points.jsonl) containing the point coordinates.

results/
├── video_sample1/
│   ├── frame_0001.jpg
│   ├── frame_0002.jpg
│   ├── frame_0003.jpg
│   ├── points.jsonl
│   └── ...
└── ...
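The `points.jsonl` file can be loaded with the standard library, since JSON Lines stores one JSON value per line. The reader below is a generic sketch; the keys inside each record are not documented in this README, so the code makes no assumption about them, and `read_jsonl` is a hypothetical helper name.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list, one entry per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

A typical call would be `records = read_jsonl("results/video_sample1/points.jsonl")`, followed by printing `records[0]` to inspect the actual schema.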

Training and Evaluation 🚀

To be released soon! Stay tuned for updates.

Todos

[ ] Release training and evaluation scripts.
[ ] Add support for additional datasets.
[ ] Release dataset creation pipeline.

Citation 📜

@misc{ahmad2025videomolmospatiotemporalgroundingmeets,
      title={VideoMolmo: Spatio-Temporal Grounding Meets Pointing},
      author={Ghazi Shazan Ahmad and Ahmed Heakl and Hanan Gani and Abdelrahman Shaker and Zhiqiang Shen and Ranjay Krishna and Fahad Shahbaz Khan and Salman Khan},
      year={2025},
      eprint={2506.05336},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.05336},
}
