MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"


<a href='https://mini-gemini.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='http://103.170.5.190:7860/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a> <a href='https://huggingface.co/spaces/wcy1122/MGM'><img src='https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg'></a> <a href='https://arxiv.org/pdf/2403.18814.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/collections/YanweiLi/mgm-6603c50b9b43d044171d0854'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/collections/YanweiLi/mgm-data-660463ea895a01d8f367624e'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>

The framework supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B with simultaneous image understanding, reasoning, and generation. This repo is built on LLaVA.

Release

  • [05/03] 🔥 We now support LLaMA3-based models! Feel free to try them here.
  • [04/15] 🔥 The Hugging Face demo is available. It hosts the 13B-HD version; feel free to try it.
  • [03/28] 🔥 Mini-Gemini is coming! We release the paper, demo, code, models, and data!

Demo

We provide some selected examples in this section. More examples can be found on our project page. Feel free to try our online demo!

<div align=center> <img width="100%" src="images/teaser.png"/> </div>

Install

Please follow the instructions below to install the required packages.

NOTE: If you want to use the 2B version, please make sure to install the latest version of Transformers (>=4.38.0).

  1. Clone this repository:
```bash
git clone https://github.com/dvlab-research/MGM.git
```
  2. Install the package:
```bash
conda create -n mgm python=3.10 -y
conda activate mgm
cd MGM
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```
  3. Install additional packages for training:
```bash
pip install ninja
pip install flash-attn --no-build-isolation
```
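Since the 2B (Gemma-based) version requires Transformers >=4.38.0, a quick sanity check of the installed version can save a confusing load-time error. A minimal sketch (the `meets_minimum` helper is illustrative, not part of the repo; it assumes plain dotted version strings):

```python
# Hedged sketch: check that the installed Transformers version satisfies the
# 2B model's requirement (>=4.38.0). Compares versions numerically, not as
# strings, so "4.9.0" does not look newer than "4.38.0".
def meets_minimum(installed: str, required: str = "4.38.0") -> bool:
    """Return True if `installed` >= `required` (simple dotted versions only)."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

# Typical usage (assumption: transformers is installed):
#   from importlib.metadata import version
#   assert meets_minimum(version("transformers"))
print(meets_minimum("4.38.2"))  # True
print(meets_minimum("4.37.0"))  # False
```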

Model

The framework is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining conducts patch-level mining between high-resolution regions and low-resolution visual queries; and an LLM marries text with images for simultaneous comprehension and generation.

<div align=center> <img width="98%" src="images/pipeline.png"/> </div>
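The patch info mining step described above can be sketched as a cross-attention in which each low-resolution visual query attends over the high-resolution patch candidates of its region. The sketch below is illustrative only (NumPy, fixed toy shapes, no learned projections), not the repo's actual implementation:

```python
import numpy as np

# Hedged sketch of patch info mining: each low-resolution query (one row of
# lr_queries) attends over its M high-resolution candidates and fuses the
# mined information back via a residual connection.
def patch_info_mining(lr_queries, hr_candidates):
    """lr_queries: (N, d); hr_candidates: (N, M, d) -> enriched (N, d)."""
    d = lr_queries.shape[-1]
    # Scaled dot-product scores between each query and its M candidates.
    scores = np.einsum("nd,nmd->nm", lr_queries, hr_candidates) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over M
    mined = np.einsum("nm,nmd->nd", weights, hr_candidates)
    return lr_queries + mined                          # residual fusion

rng = np.random.default_rng(0)
out = patch_info_mining(rng.normal(size=(4, 8)), rng.normal(size=(4, 16, 8)))
print(out.shape)  # (4, 8)
```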

We provide all our fully finetuned models on Stage 1 and 2 data:

| Model | LR | HR | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
|----------|----|----|----------|----------------|-----------------|---------------------|----------|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B | 336 | 768 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B-HD | 672 | 1536 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B-HD | 672 | 1536 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B-HD | 672 | 1536 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B-HD | 672 | 1536 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B-HD | 672 | 1536 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |

Here are the pretrained weights on Stage 1 data only:

| Model | LR | HR | Base LLM | Vision Encoder | Pretrain Data | Finetuning schedule | Download |
|----------|----|----|----------|----------------|---------------|---------------------|----------|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Pretrain | 1e | ckpt |

Preparation

Dataset

We provide the processed data for model training. For model pretraining, please download the following training image data and organize it as:

-> means put the data in the local folder.

  • LLaVA Images -> data/MGM-Pretrain/images, data/MGM-Finetune/llava/LLaVA-Pretrain/images
  • ALLaVA Caption -> data/MGM-Pretrain/ALLaVA-4V

For model finetuning, please download the following instruction data and organize it as:

For model evaluation, please follow this link for preparation. We use some extra benchmarks for evaluation; please download the following benchmark data and organize it as:

  • MMMU -> data/MGM-Eval/MMMU
  • MMB -> data/MGM-Eval/MMB
  • MathVista -> data/MGM-Eval/MathVista

Please put the pretraining data, finetuning data, and evaluation data in the MGM-Pretrain, MGM-Finetune, and MGM-Eval subfolders following Structure.
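The local layout implied by the dataset notes above can be created up front; this is a sketch based only on the paths listed in this section (the repo's Structure section may specify additional subfolders):

```shell
# Hedged sketch: create the data folders named in this section.
mkdir -p data/MGM-Pretrain/images \
         data/MGM-Pretrain/ALLaVA-4V \
         data/MGM-Finetune/llava/LLaVA-Pretrain/images \
         data/MGM-Eval/MMMU \
         data/MGM-Eval/MMB \
         data/MGM-Eval/MathVista
```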

For meta info, please download the following files and organize them as in Structure.

| Data file name | Size |
| --- | ---: |
| mgm_pretrain.json | 1.68 G |
| mgm_instruction.json | 1.79 G |
| mgm_generation_pure_text.json | 0.04 G |

IMPORTANT: mgm_generation_pure_text.json is a generation-related subset. DO NOT merge it into mgm_instruction.json, which already contains it. You may merge this file with your customized LLM/VLM SFT dataset to enable the reasoning generation ability.
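Merging the generation subset with a customized SFT dataset amounts to concatenating the two JSON record lists. A minimal sketch, assuming both files are JSON arrays of conversation records (the file names and helper below are illustrative):

```python
import json
import os
import tempfile

# Hedged sketch: append the records of the generation subset to a customized
# SFT dataset. Remember: do NOT use mgm_instruction.json as the second file,
# since it already contains these samples.
def merge_sft(custom_path, generation_path, out_path):
    """Concatenate two JSON-array datasets and write the result."""
    with open(custom_path) as f:
        records = json.load(f)
    with open(generation_path) as f:
        records += json.load(f)
    with open(out_path, "w") as f:
        json.dump(records, f)
    return len(records)

# Tiny demo with stand-in files.
tmp = tempfile.mkdtemp()
for name, data in [("custom.json", [{"id": "a"}]), ("gen.json", [{"id": "b"}])]:
    with open(os.path.join(tmp, name), "w") as f:
        json.dump(data, f)
n = merge_sft(os.path.join(tmp, "custom.json"),
              os.path.join(tmp, "gen.json"),
              os.path.join(tmp, "merged.json"))
print(n)  # 2
```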

Pretrained Weights

We recommend downloading the pretrained weights from the following links: CLIP-Vit-L-336, OpenCLIP-ConvNeXt-L, Gemma-2b-it, Vicuna-7b-v1.5, Vicuna-13b-v1.5, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/
