# MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
<a href='https://mini-gemini.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='http://103.170.5.190:7860/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a> <a href='https://huggingface.co/spaces/wcy1122/MGM'><img src='https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg'></a> <a href='https://arxiv.org/pdf/2403.18814.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/collections/YanweiLi/mgm-6603c50b9b43d044171d0854'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/collections/YanweiLi/mgm-data-660463ea895a01d8f367624e'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>
The framework supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B with image understanding, reasoning, and generation simultaneously. We build this repo based on LLaVA.
## Release
- [05/03] 🔥 We support LLaMA3-based models! Feel free to try them here.
- [04/15] 🔥 The Hugging Face demo is available. It hosts the 13B-HD version; feel free to try it.
- [03/28] 🔥 Mini-Gemini is coming! We release the paper, demo, code, models, and data!
## Demo
We provide some selected examples in this section. More examples can be found on our project page. Feel free to try our online demo!
<div align=center> <img width="100%" src="images/teaser.png"/> </div>

## Install
Please follow the instructions below to install the required packages.

NOTE: If you want to use the 2B version, please make sure to install the latest version of Transformers (>= 4.38.0).
- Clone this repository

```bash
git clone https://github.com/dvlab-research/MGM.git
```

- Install packages

```bash
conda create -n mgm python=3.10 -y
conda activate mgm
cd MGM
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Install additional packages for training

```bash
pip install ninja
pip install flash-attn --no-build-isolation
```
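Since the 2B (Gemma-based) model requires Transformers >= 4.38.0, a quick environment sanity check can save a confusing failure later. This is a minimal convenience sketch, not a script shipped with the repo:

```python
# Hypothetical convenience check (not part of the MGM repo): verify that the
# installed Transformers version is new enough for the Gemma-based 2B model.

def version_tuple(v: str) -> tuple:
    """Parse a dotted version string like '4.38.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def supports_gemma(installed: str, minimum: str = "4.38.0") -> bool:
    """Return True if `installed` satisfies the >= 4.38.0 requirement."""
    return version_tuple(installed) >= version_tuple(minimum)

if __name__ == "__main__":
    try:
        import transformers
        ok = supports_gemma(transformers.__version__)
        print(f"transformers {transformers.__version__}: "
              f"{'OK' if ok else 'too old for MGM-2B'}")
    except ImportError:
        print("transformers is not installed")
```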
## Model
The framework is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining conducts patch-level mining between high-resolution regions and low-resolution visual queries; the LLM then marries text with images for both comprehension and generation at the same time.
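The patch info mining step can be pictured as each low-resolution visual query pooling information from its high-resolution candidate region via attention. The sketch below is an illustrative simplification with made-up names and shapes, not the repo's actual implementation:

```python
import numpy as np

def patch_info_mining(lr_queries, hr_keys, hr_values):
    """Illustrative sketch: each low-resolution visual query attends to the
    high-resolution candidates in its region and pools an enriched embedding.
    Shapes: lr_queries (N, D); hr_keys/hr_values (N, M, D), where each of the
    N low-res patches has M high-res candidates. All names are assumptions."""
    d = lr_queries.shape[-1]
    # scaled dot-product scores between each query and its M candidates
    scores = np.einsum("nd,nmd->nm", lr_queries, hr_keys) / np.sqrt(d)
    # softmax over the candidate axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of high-res values, added back to the low-res query
    mined = np.einsum("nm,nmd->nd", weights, hr_values)
    return lr_queries + mined

# toy example: 4 low-res patches, 9 high-res candidates each, dim 8
q = np.random.randn(4, 8)
out = patch_info_mining(q, np.random.randn(4, 9, 8), np.random.randn(4, 9, 8))
print(out.shape)  # (4, 8)
```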
<div align=center> <img width="98%" src="images/pipeline.png"/> </div>

We provide all our fully finetuned models on Stage 1 and Stage 2 data:
| Model | LR | HR | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
|----------|----------|----------|----------|----------------|---------------|--------------------|------------------|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B | 336 | 768 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B-HD | 672 | 1536 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B-HD | 672 | 1536 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B-HD | 672 | 1536 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B-HD | 672 | 1536 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B-HD | 672 | 1536 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
Here are the pretrained weights on Stage 1 data only:

| Model | LR | HR | Base LLM | Vision Encoder | Pretrain Data | Finetuning schedule | Download |
|----------|----------|----------|----------|----------------|---------------|--------------------|------------------|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Pretrain | 1e | ckpt |
## Preparation

### Dataset
We provide the processed data for model training. For model pretraining, please download the following image-based training data and organize it as below (`->` means put the data in the local folder):

- LLaVA Images -> `data/MGM-Pretrain/images`, `data/MGM-Finetune/llava/LLaVA-Pretrain/images`
- ALLaVA Caption -> `data/MGM-Pretrain/ALLaVA-4V`
For model finetuning, please download the following instruction data and organize it as:

- COCO train2017 -> `data/MGM-Finetune/coco`
- GQA -> `data/MGM-Finetune/gqa`
- OCR-VQA (we save all files as `.jpg`) -> `data/MGM-Finetune/ocr_vqa`
- TextVQA (not included for training) -> `data/MGM-Finetune/textvqa`
- VisualGenome part1, VisualGenome part2 -> `data/MGM-Finetune/vg`
- ShareGPT4V-100K -> `data/MGM-Finetune/sam`, `share_textvqa`, `wikiart`, `web-celebrity`, `web-landmark`
- LAION GPT4V -> `data/MGM-Finetune/gpt4v-dataset`
- ALLaVA Instruction -> `data/MGM-Pretrain/ALLaVA-4V`
- DocVQA -> `data/MGM-Finetune/docvqa`
- ChartQA -> `data/MGM-Finetune/chartqa`
- DVQA -> `data/MGM-Finetune/dvqa`
- AI2D -> `data/MGM-Finetune/ai2d`
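With this many datasets to download, a quick check of the expected folder layout before launching training can catch a missed download early. This is a hypothetical helper, not shipped with the repo:

```python
from pathlib import Path

# Hypothetical helper (not part of the MGM repo): report which of the
# expected finetuning sub-folders are missing under the data root.
EXPECTED = [
    "coco", "gqa", "ocr_vqa", "textvqa", "vg", "sam",
    "gpt4v-dataset", "docvqa", "chartqa", "dvqa", "ai2d",
]

def missing_dirs(root: str, expected=EXPECTED) -> list:
    """Return the expected sub-folders that do not exist under `root`."""
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]

if __name__ == "__main__":
    print(missing_dirs("data/MGM-Finetune"))
```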
For model evaluation, please follow this link for preparation. We use some extra benchmarks for evaluation; please download the corresponding image-based data and organize it in the same way.
Please put the pretraining data, finetuning data, and evaluation data in the `MGM-Pretrain`, `MGM-Finetune`, and `MGM-Eval` subsets, following Structure.
For meta info, please download the following files and organize them as in Structure.
| Data file name | Size |
| --- | ---: |
| mgm_pretrain.json | 1.68 G |
| mgm_instruction.json | 1.79 G |
| mgm_generation_pure_text.json | 0.04 G |
IMPORTANT: `mgm_generation_pure_text.json` is a generation-related subset. DO NOT merge it with `mgm_instruction.json`, as it is already included there. You may merge this file with your customized LLM/VLM SFT dataset to enable the reasoning-generation ability.
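Merging the generation subset into a custom SFT dataset amounts to concatenating two JSON lists. The sketch below is a minimal illustration assuming both files hold a top-level JSON list of samples; the function name and file paths are made up for the example:

```python
import json

# Hedged sketch: combine the generation-related subset with your own SFT
# data (never with mgm_instruction.json, which already contains it).
def merge_sft(generation_path: str, custom_path: str, out_path: str) -> int:
    """Concatenate two JSON-list SFT files and return the merged sample count.
    Assumes both files hold a top-level JSON list of samples."""
    with open(generation_path) as f:
        generation = json.load(f)
    with open(custom_path) as f:
        custom = json.load(f)
    merged = custom + generation
    with open(out_path, "w") as f:
        json.dump(merged, f)
    return len(merged)
```

Usage would look like `merge_sft("mgm_generation_pure_text.json", "my_sft.json", "merged_sft.json")`, where `my_sft.json` is your own dataset.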
## Pretrained Weights
We recommend downloading the pretrained weights from the following links: CLIP-Vit-L-336, OpenCLIP-ConvNeXt-L, Gemma-2b-it, Vicuna-7b-v1.5, Vicuna-13b-v1.5, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/
