
Magma

[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents


<div align="center"> <h2>🤖 Magma: A Foundation Model for Multimodal AI Agents</h2>

Jianwei Yang<sup>*</sup><sup>1</sup>  Reuben Tan<sup>1</sup>  Qianhui Wu<sup>1</sup>  Ruijie Zheng<sup>2</sup>  Baolin Peng<sup>1</sup>  Yongyuan Liang<sup>2</sup>

Yu Gu<sup>1</sup>  Mu Cai<sup>3</sup>  Seonghyeon Ye<sup>4</sup>  Joel Jang<sup>5</sup>  Yuquan Deng<sup>5</sup>  Lars Liden<sup>1</sup>  Jianfeng Gao<sup>1</sup>

<sup>1</sup> Microsoft Research; <sup>2</sup> University of Maryland; <sup>3</sup> University of Wisconsin-Madison
<sup>4</sup> KAIST; <sup>5</sup> University of Washington

<sup>*</sup> Project lead &nbsp; First authors &nbsp; Second authors &nbsp; Leadership

<h3 style="color:#b22222;"> CVPR 2025 </h3> <h4> <a href="https://www.arxiv.org/pdf/2502.13130">📄 arXiv Paper</a> &nbsp; <a href="https://microsoft.github.io/Magma/">🌐 Project Page</a> &nbsp; <a href="https://huggingface.co/microsoft/Magma-8B">🤗 Hugging Face Model</a> &nbsp; <a href="https://ai.azure.com/explore/models/microsoft-magma-8b/version/1/registry/HuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47">☁️ Azure AI Foundry</a> &nbsp; <a href="https://www.youtube.com/watch?v=SbfzvUU5yM8">📺 Video</a> </h4> <!-- <h3> <a href="https://huggingface.co/spaces/microsoft/Magma-UI">🤗 Gradio UI Agent</a> <a href="https://huggingface.co/spaces/microsoft/Magma-Gaming">🤗 Gradio Gaming Agent</a> </h3> --> </div> <div align="center"> <p2>The Path Towards Multimodal AI Agents</p2> <img src="assets/images/magma_teaser.png?raw=true" width="100%"> </div>

:sparkles: Highlights

  • Digital and Physical Worlds: Magma is the first foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!
  • Versatile Capabilities: as a single model, Magma not only possesses generic image and video understanding ability, but can also generate goal-driven visual plans and actions, making it versatile for different agentic tasks!
  • State-of-the-art Performance: Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotic manipulation, and generic image and video understanding, in particular spatial understanding and reasoning!
  • Scalable Pretraining Strategy: Magma is designed to learn scalably from unlabeled videos in the wild in addition to existing agentic data, giving it strong generalization ability and making it suitable for real-world applications!

:fire: News

  • [2025.04.29] Mind2Web and AITW with SoM prompting annotations are released on Hugging Face! We used them for Magma downstream finetuning and reported the results in our table.
  • [2025.04.12] 🔥 We released the pretraining videos with visual traces on Hugging Face: Magma-Video-ToM.
  • [2025.04.06] Open X-Embodiment pretraining data with visual traces can be downloaded from Magma-OXE-ToM.
  • [2025.03.16] We released the demo code for generating SoM and ToM for instructional videos (i.e., Alg. 2 in our paper) in SoM/ToM Generation.
  • [2025.03.09] 🔥 We released the Magma training code and an example for training Magma-8B on the Magma-820K dataset. Check out Model Training!
  • [2025.03.06] We released a new demo for showing robot planning capabilities. Run python agents/robot_traj/app.py to start the demo!
  • [2025.02.28] We released two demos, Magma-UI and Magma-Gaming on Hugging Face. Check out our model's action grounding and planning capabilities!
  • [2025.02.26] ⭐ Exciting News! Magma got accepted by CVPR 2025!
  • [2025.02.25] 🎉 Big News! We are releasing the Magma model on Hugging Face and Azure AI Foundry!
  • [2025.02.23] We released the Magma Inference code!
  • [2025.02.20] Magma has reached the top spot on Hacker News!
  • [2025.02.19] We will be releasing our code, model, and UI navigation demo at the MSR Forum next Tuesday, 02.25!
  • [2025.02.18] Our Flagship Project Magma at MSR is released on arXiv!

:bookmark_tabs: Todos

We will be releasing all the following contents:

  • [x] Model inference code
  • [x] Add UI and Gaming agent Demos
  • [x] Model checkpoint
  • [x] Training code
  • [x] Open-XE pretraining data with traces
  • [x] Video pretraining data with traces
  • [ ] SeeClick and Vision2UI pretraining data with SoM
  • [ ] UI/Libero finetuning script
  • [ ] Video finetuning script

What is Magma?

<div align="center"> <img src="assets/images/magma_intro_fig.png?raw=true" width="50%"> </div>

Magma is a foundation model for multimodal AI agents. As the bedrock of multimodal agentic models, it should possess strong capabilities to perceive the multimodal world AND take goal-driven actions precisely (see the figure above). With this in mind, we strive for the following goals:

  • Verbal and spatial-temporal intelligence: Magma should have both strong verbal and spatial-temporal intelligence to understand images and videos, ground its actions in its observations, and further translate external goals into action plans and executions.
  • Digital and physical worlds: Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotic manipulation), but rather work across both worlds, just as humans do.

With this in mind, we developed a new pretraining dataset, which mostly consists of unlabeled videos in the wild plus existing annotated agentic data, and a new pretraining framework, which unifies the training of all three modalities (text, image, and action), to train a new foundation model for multimodal AI agents, named Magma.

How do we pretrain Magma?

<div align="center"> <img src="assets/images/magma_pt_v3.png?raw=true" width="100%"> </div>

We pursue the goal through two dimensions:

  • Large-scale heterogeneous training data: we curate a large amount of data in the wild, including existing multimodal understanding data, UI navigation data, robotics manipulation data, and unlabeled videos. We also propose a new, scalable, and cost-effective pipeline for collecting unlabeled videos in the wild. To obtain useful action supervision from raw videos and robotics trajectories, we meticulously remove the camera motion in the videos and then transform the remaining motion into "action" supervision for model training. This provides unique signals for the model to learn cross-modal connections and long-horizon action prediction and planning.

  • Universal pretraining objectives: text and action outputs are inherently different, leaving a large gap between them, while visual tokens are continuous. We propose a universal pretraining framework that unifies the training of all three modalities, and we show that this is crucial for the model to learn cross-modal connections. More specifically, we propose Set-of-Mark and Trace-of-Mark as auxiliary tasks for pretraining, serving as a bridge between the different output modalities. In this way, we build strong alignment between the text and action modalities, and also between the image and action modalities.
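As a rough illustration, the two auxiliary tasks can be sketched in a few lines of numpy. This is our own simplification, not the released pipeline: `set_of_marks` and `trace_of_marks` are hypothetical names, the actual system uses proposal models and point trackers, and real camera-motion removal (Alg. 2 in the paper) is more involved than the median-displacement estimate used here.

```python
import numpy as np

def set_of_marks(boxes):
    """Assign numeric marks to candidate regions (Set-of-Mark sketch).

    boxes: list of (x0, y0, x1, y1) pixel boxes, e.g. from a UI element
    proposer. Returns {mark_id: (cx, cy)} center points to overlay.
    """
    return {i + 1: ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
            for i, (x0, y0, x1, y1) in enumerate(boxes)}

def trace_of_marks(tracks):
    """Turn raw point tracks into camera-motion-free traces (Trace-of-Mark sketch).

    tracks: float array of shape (num_marks, num_frames, 2) holding (x, y)
    positions from a point tracker. Per-frame camera motion is approximated
    by the median displacement over all marks and subtracted, so the
    remaining traces reflect object/hand motion only.
    """
    tracks = np.asarray(tracks, dtype=float)
    disp = np.diff(tracks, axis=1)                    # (marks, frames-1, 2)
    camera = np.median(disp, axis=0, keepdims=True)   # robust global-motion estimate
    object_disp = disp - camera                       # remove camera motion
    rebuilt = tracks[:, :1] + np.cumsum(object_disp, axis=1)
    return np.concatenate([tracks[:, :1], rebuilt], axis=1)
```

With a camera panning right by one pixel per frame, a static point keeps a constant trace after compensation, while a point that also moves keeps only its own motion.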

Installation

  1. Clone this repo to your local machine:

```bash
git clone https://github.com/microsoft/Magma
cd Magma
```

  2. Install the dependencies:

```bash
conda create -n magma python=3.10 -y
conda activate magma
pip install --upgrade pip
pip install -e .
```

  3. Install packages for training:

```bash
pip install -e ".[train]"
```

  4. Install p
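Once installed, inference looks roughly like the sketch below, adapted from the Hugging Face model card for microsoft/Magma-8B. The `build_prompt` helper is hypothetical (the model card uses the processor's chat template) and the exact processor arguments may differ between versions; the model-loading part is guarded behind an environment variable so the file can be read or run without downloading the 8B checkpoint.

```python
import os

def build_prompt(instruction: str) -> str:
    """Hypothetical helper: wrap a user instruction in a simple chat-style
    prompt with an image placeholder. The real template comes from the
    processor's tokenizer; this is only illustrative."""
    return f"<|user|>\n<image>\n{instruction}<|end|>\n<|assistant|>\n"

if os.environ.get("RUN_MAGMA_DEMO"):
    # Guarded: only runs when explicitly requested, since it downloads
    # the 8B checkpoint and needs a GPU.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Magma-8B"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
    ).to("cuda")

    image = Image.open("screenshot.png").convert("RGB")  # any local screenshot
    inputs = processor(
        images=[image],
        texts=build_prompt("What action should I take next?"),
        return_tensors="pt",
    ).to("cuda")
    out = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(out[0], skip_special_tokens=True))
```

Consult the model card for the authoritative prompt format and processor call before relying on this.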