MMIE

[ICLR'25 Oral] MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

<p align="center"><b>MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models</b></p>

<p align="center"> <a href="https://mmie-bench.github.io">[📖 Project]</a> <a href="https://arxiv.org/abs/2410.10139">[📄 Paper]</a> <a href="https://github.com/Lillianwei-h/MMIE">[💻 Code]</a> <a href="https://huggingface.co/datasets/MMIE/MMIE">[📝 Dataset]</a> <a href="https://huggingface.co/MMIE/MMIE-Score">[🤖 Evaluation Model]</a> <a href="https://huggingface.co/spaces/MMIE/Leaderboard">[🏆 Leaderboard]</a> </p>

</div>

🌟 Overview

We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark, designed for Large Vision-Language Models (LVLMs). MMIE provides a robust framework to assess the interleaved comprehension and generation capabilities of LVLMs across diverse domains, supported by reliable automated metrics.

📚 Setup

We host the MMIE dataset on HuggingFace. Request access on the dataset page first; requests are approved automatically. Then download all the files in the repository and unzip images.tar.gz to get the images. We also provide overview.json, which is an example of the dataset format.
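The unpacking step can be sketched as follows, assuming the repository files are already on disk. The function name `extract_images` and the paths in the usage note are illustrative and not part of MMIE.

```python
import tarfile
from pathlib import Path


def extract_images(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the archived member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python version supports it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names
```

For example, `extract_images("MMIE/images.tar.gz", "MMIE")` would unpack the archive next to the other downloaded files, if they were saved under a local `MMIE` directory (an assumed location).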

📦 Model Evaluation

Setup

Dataset Preparation

The data to be evaluated should use the following format:

[
    {
        "id": "",
        "question": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ],
        "answer": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ],
        "model": "gt",
        "gt_answer": [
            {
                "text": "...",
                "image": LOCAL_PATH_TO_THE_IMAGE or null
            },
            ...
        ]
    },
    ...
]

Currently, gt_answer is only used for Multi-step Reasoning tasks, but the data format still requires it. For other tasks, you can set "gt_answer": [{"text": null, "image": null}].
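As a rough sketch, a record in this format could be assembled in Python and serialized with the standard `json` module, which writes Python `None` as JSON `null`. The helper `make_entry`, the example id, the texts, the image path, and the `model` value are all hypothetical; only the field names come from the format above.

```python
import json


def make_entry(entry_id, question, answer, model, gt_answer=None):
    """Build one to-eval record.

    question, answer, and gt_answer are lists of {"text": ..., "image": ...}
    dicts, where "image" is a local path or None.
    """
    if gt_answer is None:
        # Placeholder for tasks that do not use gt_answer.
        gt_answer = [{"text": None, "image": None}]
    return {
        "id": entry_id,
        "question": question,
        "answer": answer,
        "model": model,
        "gt_answer": gt_answer,
    }


entry = make_entry(
    entry_id="example-0",  # hypothetical id
    question=[{"text": "Describe the image.", "image": "images/0.png"}],
    answer=[{"text": "A cat on a sofa.", "image": None}],
    model="my-model",  # hypothetical value
)

# The top-level structure is a list of records.
serialized = json.dumps([entry], indent=4)
```

Writing `serialized` to the input file then yields a list with one record whose `gt_answer` is the placeholder.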

Make sure the file structure is:

INPUT_DIR
    |INPUT_FILE(data.json)
    |images
        |0.png
        |1.png
        |...
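A quick pre-flight check of this layout might look like the sketch below. The function name `check_layout` is illustrative, and the check covers only the two top-level items shown in the tree; it does not validate the image paths inside the JSON.

```python
from pathlib import Path


def check_layout(input_dir, input_file="data.json"):
    """Return a list of problems found in the expected INPUT_DIR layout."""
    root = Path(input_dir)
    problems = []
    if not (root / input_file).is_file():
        problems.append(f"missing input file: {input_file}")
    if not (root / "images").is_dir():
        problems.append("missing images directory")
    return problems
```

An empty list means both the input file and the images directory were found.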

Installation

Clone the code from this repo

git clone https://github.com/Lillianwei-h/MMIE
cd MMIE

Build environment

conda create -n MMIE python=3.11
conda activate MMIE
pip install -r requirements.txt
pip install flash_attn

Model Preparation

You can request access to our MMIE-Score model on HuggingFace. Refer to the InternVL 2.0 documentation for more details.

Run

python main.py --input_dir INPUT_DIR --input_file INPUT_FILE

By default, the output file is written to ./eval_outputs/eval_result.json. Use the --output_dir and --output_file arguments to write it elsewhere.
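To launch the evaluation from a script, the command line above can be assembled programmatically. This is a minimal sketch that uses only the four arguments named in this section; `build_command` and `run_evaluation` are hypothetical helpers, and the values are passed through unchanged.

```python
import subprocess
import sys


def build_command(input_dir, input_file, output_dir=None, output_file=None):
    """Assemble the main.py invocation as an argument list."""
    cmd = [sys.executable, "main.py",
           "--input_dir", input_dir,
           "--input_file", input_file]
    # The output arguments are optional; omit them to keep the defaults.
    if output_dir is not None:
        cmd += ["--output_dir", output_dir]
    if output_file is not None:
        cmd += ["--output_file", output_file]
    return cmd


def run_evaluation(repo_dir, **kwargs):
    """Run main.py from the cloned repository directory; raise if it fails."""
    return subprocess.run(build_command(**kwargs), cwd=repo_dir, check=True)
```

For instance, `run_evaluation("MMIE", input_dir="INPUT_DIR", input_file="data.json")` would run the evaluator with the default output location, assuming the repository was cloned into `MMIE`.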

📝 Citation

If you find our benchmark useful in your research, please consider citing us:

@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}
