# MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
<a href='https://mini-gemini.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='http://103.170.5.190:7860/'><img src='https://img.shields.io/badge/Project-Demo-violet'></a> <a href='https://huggingface.co/spaces/wcy1122/MGM'><img src='https://img.shields.io/badge/🤗-Open%20In%20Spaces-blue.svg'></a> <a href='https://arxiv.org/pdf/2403.18814.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/collections/YanweiLi/mgm-6603c50b9b43d044171d0854'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a> <a href='https://huggingface.co/collections/YanweiLi/mgm-data-660463ea895a01d8f367624e'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Data-green'></a>
The framework supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B with image understanding, reasoning, and generation simultaneously. We build this repo based on LLaVA.
## Release
- [05/03] 🔥 We support LLaMA3-based models! Feel free to try them here.
- [04/15] 🔥 The Hugging Face demo is available. It hosts the 13B-HD version; feel free to try it.
- [03/28] 🔥 Mini-Gemini is coming! We release the paper, demo, code, models, and data!
## Demo
We provide some selected examples in this section. More examples can be found on our project page. Feel free to try our online demo!
<div align=center> <img width="100%" src="images/teaser.png"/> </div>

## Install
Please follow the instructions below to install the required packages.

NOTE: If you want to use the 2B version, please make sure to install the latest version of Transformers (>= 4.38.0).
- Clone this repository

```bash
git clone https://github.com/dvlab-research/MGM.git
```

- Install packages

```bash
conda create -n mgm python=3.10 -y
conda activate mgm
cd MGM
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Install additional packages for training

```bash
pip install ninja
pip install flash-attn --no-build-isolation
```
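Since the 2B (Gemma-based) model requires Transformers >= 4.38.0, a quick environment sanity check can save a confusing failure later. This is a minimal convenience sketch, not a script shipped with the repo:

```python
# Hypothetical convenience check (not part of the MGM repo): verify that the
# installed Transformers version is new enough for the Gemma-based 2B model.

def version_tuple(v: str) -> tuple:
    """Parse a dotted version string like '4.38.0' into a comparable tuple."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def supports_gemma(installed: str, minimum: str = "4.38.0") -> bool:
    """Return True if `installed` satisfies the >= 4.38.0 requirement."""
    return version_tuple(installed) >= version_tuple(minimum)

if __name__ == "__main__":
    try:
        import transformers
        ok = supports_gemma(transformers.__version__)
        print(f"transformers {transformers.__version__}: "
              f"{'OK' if ok else 'too old for MGM-2B'}")
    except ImportError:
        print("transformers is not installed")
```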
## Model
The framework is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining conducts patch-level mining between high-resolution regions and low-resolution visual queries; the LLM then marries text with images for both comprehension and generation at the same time.
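The patch info mining step can be pictured as each low-resolution visual query pooling information from its high-resolution candidate region via attention. The sketch below is an illustrative simplification with made-up names and shapes, not the repo's actual implementation:

```python
import numpy as np

def patch_info_mining(lr_queries, hr_keys, hr_values):
    """Illustrative sketch: each low-resolution visual query attends to the
    high-resolution candidates in its region and pools an enriched embedding.
    Shapes: lr_queries (N, D); hr_keys/hr_values (N, M, D), where each of the
    N low-res patches has M high-res candidates. All names are assumptions."""
    d = lr_queries.shape[-1]
    # scaled dot-product scores between each query and its M candidates
    scores = np.einsum("nd,nmd->nm", lr_queries, hr_keys) / np.sqrt(d)
    # softmax over the candidate axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of high-res values, added back to the low-res query
    mined = np.einsum("nm,nmd->nd", weights, hr_values)
    return lr_queries + mined

# toy example: 4 low-res patches, 9 high-res candidates each, dim 8
q = np.random.randn(4, 8)
out = patch_info_mining(q, np.random.randn(4, 9, 8), np.random.randn(4, 9, 8))
print(out.shape)  # (4, 8)
```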
<div align=center> <img width="98%" src="images/pipeline.png"/> </div>

We provide all our fully finetuned models on Stage 1 and Stage 2 data:
| Model | LR | HR | Base LLM | Vision Encoder | Finetuning Data | Finetuning schedule | Download |
|----------|----------|----------|----------|----------------|---------------|--------------------|------------------|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B | 336 | 768 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-7B-HD | 672 | 1536 | Vicuna-7B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-13B-HD | 672 | 1536 | Vicuna-13B-v1.5 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8B-HD | 672 | 1536 | LLaMA-3-8B-Instruct | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-8x7B-HD | 672 | 1536 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
| MGM-34B-HD | 672 | 1536 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Instruct | full_ft-1e | ckpt |
Here are the pretrained weights on Stage 1 data only:

| Model | LR | HR | Base LLM | Vision Encoder | Pretrain Data | Finetuning schedule | Download |
|----------|----------|----------|----------|----------------|---------------|--------------------|------------------|
| MGM-2B | 336 | 768 | Gemma-2B | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-7B | 336 | 768 | Vicuna-7B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-13B | 336 | 768 | Vicuna-13B-v1.5 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-8x7B | 336 | 768 | Mixtral-8x7B-Instruct-v0.1 | CLIP-L | MGM-Pretrain | 1e | ckpt |
| MGM-34B | 336 | 768 | Nous-Hermes-2-Yi-34B | CLIP-L | MGM-Pretrain | 1e | ckpt |
## Preparation

### Dataset
We provide the processed data for model training. For model pretraining, please download the following image-based training data and organize it as below (`->` means put the data in the local folder):

- LLaVA Images -> `data/MGM-Pretrain/images`, `data/MGM-Finetune/llava/LLaVA-Pretrain/images`
- ALLaVA Caption -> `data/MGM-Pretrain/ALLaVA-4V`
For model finetuning, please download the following instruction data and organize it as:

- COCO train2017 -> `data/MGM-Finetune/coco`
- GQA -> `data/MGM-Finetune/gqa`
- OCR-VQA (we save all files as `.jpg`) -> `data/MGM-Finetune/ocr_vqa`
- TextVQA (not included for training) -> `data/MGM-Finetune/textvqa`
- VisualGenome part1, VisualGenome part2 -> `data/MGM-Finetune/vg`
- ShareGPT4V-100K -> `data/MGM-Finetune/sam`, `share_textvqa`, `wikiart`, `web-celebrity`, `web-landmark`
- LAION GPT4V -> `data/MGM-Finetune/gpt4v-dataset`
- ALLaVA Instruction -> `data/MGM-Pretrain/ALLaVA-4V`
- DocVQA -> `data/MGM-Finetune/docvqa`
- ChartQA -> `data/MGM-Finetune/chartqa`
- DVQA -> `data/MGM-Finetune/dvqa`
- AI2D -> `data/MGM-Finetune/ai2d`
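With this many datasets to download, a quick check of the expected folder layout before launching training can catch a missed download early. This is a hypothetical helper, not shipped with the repo:

```python
from pathlib import Path

# Hypothetical helper (not part of the MGM repo): report which of the
# expected finetuning sub-folders are missing under the data root.
EXPECTED = [
    "coco", "gqa", "ocr_vqa", "textvqa", "vg", "sam",
    "gpt4v-dataset", "docvqa", "chartqa", "dvqa", "ai2d",
]

def missing_dirs(root: str, expected=EXPECTED) -> list:
    """Return the expected sub-folders that do not exist under `root`."""
    base = Path(root)
    return [name for name in expected if not (base / name).is_dir()]

if __name__ == "__main__":
    print(missing_dirs("data/MGM-Finetune"))
```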
For model evaluation, please follow this link for preparation. We use some extra benchmarks for evaluation; please download the corresponding image-based data and organize it in the same way.
Please put the pretraining data, finetuning data, and evaluation data in the `MGM-Pretrain`, `MGM-Finetune`, and `MGM-Eval` subsets, following Structure.
For meta info, please download the following files and organize them as in Structure.
| Data file name | Size |
| --- | ---: |
| mgm_pretrain.json | 1.68 G |
| mgm_instruction.json | 1.79 G |
| mgm_generation_pure_text.json | 0.04 G |
IMPORTANT: `mgm_generation_pure_text.json` is a generation-related subset. DO NOT merge it with `mgm_instruction.json`, as it is already included there. You may merge this file with your customized LLM/VLM SFT dataset to enable the reasoning-generation ability.
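Merging the generation subset into a custom SFT dataset amounts to concatenating two JSON lists. The sketch below is a minimal illustration assuming both files hold a top-level JSON list of samples; the function name and file paths are made up for the example:

```python
import json

# Hedged sketch: combine the generation-related subset with your own SFT
# data (never with mgm_instruction.json, which already contains it).
def merge_sft(generation_path: str, custom_path: str, out_path: str) -> int:
    """Concatenate two JSON-list SFT files and return the merged sample count.
    Assumes both files hold a top-level JSON list of samples."""
    with open(generation_path) as f:
        generation = json.load(f)
    with open(custom_path) as f:
        custom = json.load(f)
    merged = custom + generation
    with open(out_path, "w") as f:
        json.dump(merged, f)
    return len(merged)
```

Usage would look like `merge_sft("mgm_generation_pure_text.json", "my_sft.json", "merged_sft.json")`, where `my_sft.json` is your own dataset.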
## Pretrained Weights
We recommend downloading the pretrained weights from the following links: CLIP-Vit-L-336, OpenCLIP-ConvNeXt-L, Gemma-2b-it, Vicuna-7b-v1.5, Vicuna-13b-v1.5, [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/
