<kbd> <img src="assets/multinet_logo_square copy.png" alt="Multinet Logo" style="height:200px; border-radius:50%;"> <h1 align="center" style="display: inline-block; vertical-align: middle; margin-left: 20px;">MultiNet: A Generalist Benchmark for the Next Generation of Multimodal Models</h1> </kbd> <a href="https://multinet.ai/"><img src="https://img.shields.io/badge/Website-blue?style=flat-square&logo=googlechrome" alt="Website"></a> <a href="https://multinet.ai/static/pages/Multinetv1.html"><img src="https://img.shields.io/badge/Multinet%20v1.0-Release-blue?style=flat-square&logo=Blogger" alt="Multinet v1.0 release"></a> <a href="https://arxiv.org/abs/2505.05540"><img src="https://img.shields.io/badge/Multinet%20v0.2%20paper-arXiv-B31B1B?style=flat-square&logo=arXiv" alt="Multinet v0.2 paper"></a> <a href="https://arxiv.org/abs/2411.05821"><img src="https://img.shields.io/badge/Multinet%20v0.1%20paper-arXiv-B31B1B?style=flat-square&logo=arXiv" alt="Multinet v0.1 paper"></a> <a href="https://github.com/ManifoldRG/MultiNet/tree/main/src/modules"><img src="https://img.shields.io/badge/GenESIS%E2%A0%80%E2%A0%80%E2%A0%80%E2%A0%80%E2%A0%80%E2%A0%80%E2%A0%80-blueviolet?style=flat-square&logo=github" alt="GenESIS framework"></a> <a href="https://discord.gg/Rk4gAq5aYr"><img src="https://img.shields.io/badge/Contribute%E2%A0%80%E2%A0%80%E2%A0%80%E2%A0%80%E2%A0%80-7289DA?style=flat-square&logo=discord" alt="Contribute"></a>

MultiNet is a collaborative initiative with contributions from leading research teams at institutions like:

Need to Run Evaluations on a Production Multimodal, Computer Use, or Robotics AI System? We can help!

📢 Updates

🌟 2025-10-13: Multinet v1.0 - We release our most comprehensive benchmark yet, evaluating a SoTA VLM, VLA, and generalist model on a wide variety of multimodal understanding and action datasets. Read more here
🏅 2025-06-10: Paper accepted at ICML 2025! Our paper detailing the Open-Source contributions of Multinet that benefit the AI community has been accepted at the CodeML Workshop at ICML 2025! Read our paper here.
🏆 2025-05-22: Multinet v0.2 - We systematically profile state-of-the-art VLAs and VLMs to understand how they perform in procedurally generated OOD game environments! Read more about our release here
🎉 2024-11-08: We release the first version of MultiNet where we profiled SoTA VLMs and VLAs on real-world robotics tasks - Multinet v0.1! Check our release page for more details.
🚀 2024-03-22: Introducing Multinet! A new generalist benchmark to evaluate Vision-Language & Action models. Learn more here

🔍 Overview

This repo provides the following:

Ability to profile VLMs, VLAs, and generalist models on our generalist evaluation framework, with comprehensive coverage of open-source physical commonsense reasoning, image classification, visual question answering, control/action (RL, Robotics), gameplay, and function calling tasks.
Ability to translate control data of various formats and from various sources to a unified TensorFlow Dataset format.
Ability to evaluate the performance of SoTA VLMs and VLAs such as GPT-5, Pi0, and Magma in a zero-shot setting on a wide variety of tasks, detailed here.
GenESIS, a general framework for mapping VLMs to other modality classes, with particular emphasis on action spaces. The framework lets you adapt a wide range of models to multiple types of tasks or datasets, so evaluations scale with less engineering effort. In MultiNet v1.0, GenESIS is used to evaluate GPT-5 on the OpenX, Overcooked, PIQA, ODINW, and SQA3D datasets.
Sample datasets and clear guidelines to test your model locally and submit for official benchmark evaluation; leaderboard results are generated by the MultiNet team.
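
To illustrate what mapping a VLM onto an action space involves, here is a minimal sketch of one step: turning free-form model text into an index in a discrete action space. It is not the GenESIS API. The function name `parse_discrete_action` and the `ACTIONS` list are hypothetical and chosen only for this example.

```python
import re
from typing import Optional, Sequence

# Hypothetical discrete action space, used only for illustration.
ACTIONS = ["up", "down", "left", "right", "stay", "interact"]


def parse_discrete_action(model_output: str, action_names: Sequence[str]) -> Optional[int]:
    """Map free-form model text to an index in a discrete action space.

    Returns None when no single valid action can be recovered, so the caller
    can score the prediction as invalid instead of guessing.
    """
    text = model_output.strip().lower()

    # Case 1: the model answered with an integer index, e.g. "Action: 2".
    match = re.search(r"-?\d+", text)
    if match:
        index = int(match.group())
        return index if 0 <= index < len(action_names) else None

    # Case 2: the model answered with an action name, e.g. "move left".
    # Accept the answer only if exactly one action name appears in the text.
    hits = [i for i, name in enumerate(action_names) if name.lower() in text]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    for reply in ["Action: 2", "move LEFT", "9", "left or right"]:
        print(repr(reply), "->", parse_discrete_action(reply, ACTIONS))
```

Returning `None` for out-of-range or ambiguous replies keeps invalid outputs visible in the metrics rather than silently mapping them to a default action.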

Also related to the MultiNet effort is <a href="https://github.com/eihli/mugato"><img src="https://img.shields.io/badge/%CE%BCGATO%E2%A0%80%E2%A0%80-dimgray?style=flat-square&logo=github" alt="μGATO on GitHub" style="vertical-align: middle;"></a> - a small, simple, open-source implementation of what is described in DeepMind's GATO paper. This project marks our initial step towards building a multimodal generalist action model.

🚀 Getting Started

To set up the environment for Multinet:

conda create -n multinet python=3.10
conda activate multinet
git clone https://github.com/ManifoldRG/MultiNet.git
cd MultiNet/src
pip install -r requirements.txt

To download the datasets in v1:

cd MultiNet/src/v1
python centralized_downloader.py --download <name of dataset you would like to download>
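
If you need several datasets, the command above can be wrapped in a small loop. The sketch below assumes it is run from MultiNet/src/v1 and uses only the `--download` flag shown above. The helper names and the dataset names in the list are placeholders; substitute the names the downloader actually accepts.

```python
import subprocess
import sys
from typing import List, Sequence


def build_download_command(dataset_name: str) -> List[str]:
    """Build the downloader invocation for one dataset (same flag as above)."""
    return [sys.executable, "centralized_downloader.py", "--download", dataset_name]


def download_all(dataset_names: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Run (or, with dry_run=True, only print) one download command per dataset."""
    commands = [build_download_command(name) for name in dataset_names]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failed download.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder names: replace with real dataset names before running.
    download_all(["first_dataset", "second_dataset"], dry_run=True)
```

Keeping `dry_run=True` as the default lets you inspect the commands before any download starts.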

To translate one file/shard of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

cd MultiNet/src/v1
python centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>

To translate multiple files/shards of your desired control dataset (downloaded using the downloader script in this repo) to the TFDS format:

Note: Adjust how translate_multiple.py in MultiNet/src/control_translation traverses the files so that it matches your local file structure.

cd MultiNet/src/v1
python wrapper_centralized_processor.py --input_dir <path to the downloaded dataset> --output_dir <directory where you would like to store the translated file>
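
The note above asks you to adapt the file traversal to your local layout. As a rough sketch of what such a traversal can look like, the code below walks an input directory recursively and pairs each shard with an output location that mirrors its relative path. It is not the repository's code: the function names and the `_translated` suffix are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterator, List, Tuple


def iter_shards(input_dir: Path, pattern: str = "*") -> Iterator[Path]:
    """Yield every regular file under input_dir that matches pattern, sorted."""
    for path in sorted(input_dir.rglob(pattern)):
        if path.is_file():
            yield path


def plan_translation(
    input_dir: Path, output_dir: Path, pattern: str = "*"
) -> List[Tuple[Path, Path]]:
    """Pair each shard with an output path that mirrors its relative location."""
    plan = []
    for shard in iter_shards(input_dir, pattern):
        relative = shard.relative_to(input_dir)
        # Hypothetical naming scheme: <stem>_translated under the same subfolder.
        target = output_dir / relative.parent / f"{relative.stem}_translated"
        plan.append((shard, target))
    return plan


if __name__ == "__main__":
    for source, target in plan_translation(Path("downloads"), Path("translated")):
        print(source, "->", target)
```

Building the full plan before translating anything makes it easy to confirm that the traversal finds the shards you expect.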

To evaluate models on MultiNet datasets:

We provide comprehensive evaluation guides for different models:

Magma Model Evaluation: For detailed instructions on evaluating Magma on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, BFCL, and OpenX datasets, see the Magma Evaluation Guide.

Pi0 Base Model Evaluation: For detailed instructions on evaluating Pi0 Base on ODINW, PIQA, SQA3D, RoboVQA, BFCL, Overcooked, and OpenX datasets, see the Pi0 Evaluation Guide.

GPT Model Evaluation (GenESIS Framework): For detailed instructions on evaluating GPT-5 using the GenESIS framework on ODINW, PIQA, SQA3D, RoboVQA, Overcooked, and OpenX datasets, see the GenESIS Evaluation Guide.

📊 Process for Submission to the MultiNet Benchmark

We provide a submission toolkit and comprehensive instructions to benchmark your model on MultiNet datasets:

Standardized Interface: Create model adapters that inherit from the base ModelAdapter class
Dockerized Evaluation: Reproducible evaluations in isolated containers
Various Task Types: Support for datasets that span VQA, action prediction, function calling, and more

Quick Start:

Create your model adapter(s) by inheriting from ModelAdapter in src/eval_harness/model_adapter.py
Test your model adapter using the scripts/eval_harness/evaluate.py entrypoint, which loads sample data in a standard format
Configure harness_dataset_config.txt and Dockerfile with your adapter settings
Run ./build_and_run_eval_container.sh DATASET_NAME to test containerized evaluation
Open a PR with your code

Official benchmark runs are executed by the MultiNet team using your submitted Dockerfile and adapters. Local runs operate on the provided sample datasets to validate your setup.
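
As a rough picture of the adapter pattern described in the Quick Start, the sketch below defines a stand-in base class and a toy adapter. The stand-in `ModelAdapter` here is hypothetical: the real base class lives in src/eval_harness/model_adapter.py, and its method names and signatures may differ from the `initialize` and `predict` methods assumed in this example. Consult the Model Submission Guide for the actual interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ModelAdapter(ABC):
    """Hypothetical stand-in for the base class in src/eval_harness/model_adapter.py."""

    @abstractmethod
    def initialize(self) -> None:
        """Load weights, clients, or other resources (assumed method name)."""

    @abstractmethod
    def predict(self, sample: Dict[str, Any]) -> Any:
        """Return the model's prediction for one sample (assumed method name)."""


class ConstantAnswerAdapter(ModelAdapter):
    """Toy adapter that returns a fixed answer; useful only for wiring tests."""

    def __init__(self, answer: str = "0") -> None:
        self.answer = answer
        self.ready = False

    def initialize(self) -> None:
        self.ready = True

    def predict(self, sample: Dict[str, Any]) -> str:
        if not self.ready:
            raise RuntimeError("call initialize() before predict()")
        return self.answer


def run_on_samples(adapter: ModelAdapter, samples: List[Dict[str, Any]]) -> List[Any]:
    """Initialize the adapter once, then collect one prediction per sample."""
    adapter.initialize()
    return [adapter.predict(sample) for sample in samples]


if __name__ == "__main__":
    print(run_on_samples(ConstantAnswerAdapter("1"), [{"question": "example"}]))
```

A trivial adapter like this can help confirm that the data loading and scoring path works before you plug in a real model.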

For complete instructions, see the Model Submission Guide.

If you're experiencing any issues, open a GitHub issue or contact pranav@metarch.ai directly.

📜 Citation

If you use MultiNet in your research, please cite our work:


ICML CodeML Paper Submission - An Open-Source Software Toolkit & Benchmark Suite for the Evaluation and Adaptation of Multimodal Action Models

@misc{guruprasad2025opensourcesoftwaretoolkit,
      title={An Open-Source Software Toolkit & Benchmark Sui
