# CoBSAT

Implementation and dataset for the paper "Can MLLMs Perform Text-to-Image In-Context Learning?"
**Abstract:** The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have primarily concentrated on image-to-text ICL, while Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies including fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at <a href="https://github.com/UW-Madison-Lee-Lab/CoBSAT">this link</a>.
<img width="903" alt="image" src="imgs/t2i_icl.jpg">

## News 🚀
- [07/10/24] Our paper is accepted by COLM 2024!
- [02/29/24] Our dataset is available on 🤗huggingface!
- [02/02/24] Our paper is available on <a href="https://arxiv.org/abs/2402.01293">arxiv</a>!
## Step 1: Set Up Environment

To set up the environment for benchmarking MLLMs, please follow the steps below. These instructions are for Linux.
1. Clone this repository and rename it as `cobsat`.

   ```bash
   git clone --recurse-submodules https://github.com/UW-Madison-Lee-Lab/CoBSAT
   mv CoBSAT cobsat
   cd cobsat
   ```
2. Install packages.

   <details><summary> Linux </summary>

   ```bash
   # create the environment that works for most of the cases
   conda create -n cobsat python=3.8.18
   conda activate cobsat
   pip install torch==2.1.2 torchvision==0.16.2
   pip install -r conda_env/default_requirements.txt

   # create the environment for llava to work
   conda create -n llava python=3.10.13
   conda activate llava
   pip install --upgrade pip  # enable PEP 660 support
   pip install git+https://github.com/yzeng58/LLaVA/@a61aae093656922fe16ec2152b031dd1de72fe92
   pip install -r conda_env/llava_requirements.txt

   # create the environment for gemini to work
   conda env create -f conda_env/gemini.yml

   # create the environment for llava16 (LLaVA-NeXT) to work
   conda env create -f conda_env/llava16.yml
   ```

   </details>
   <details><summary> Mac </summary> TBA </details>
   <details><summary> Windows </summary> TBA </details>
3. Create `environment.py` in the `cobsat` directory. Note that most of these variables need to be configured on your own; only `root_dir` works as-is.

   ```python
   # Configure the environment variables for the project
   import os
   root_dir = os.path.dirname(os.path.abspath(__file__))
   SEED_PROJECT_ROOT = f'{root_dir}/models/SEED'

   ###############
   # NEED UPDATE #
   ###############
   TRANSFORMER_CACHE = '/data/yzeng58/.cache/huggingface/hub'

   #########################
   # NEED UPDATE IF NEEDED #
   #########################
   # GPT-4V
   OPENAI_API_KEY = {
       'key1': f'{your_openai_key_1}',
       'key2': f'{your_openai_key_2}',
   }
   # Gemini
   GEMINI_API_KEY = {
       'key1': f'{your_gemini_key_1}',
       'key2': f'{your_gemini_key_2}',
   }
   # Claude
   CLAUDE_API_KEY = {
       'key1': f'{your_claude_key_1}',
       'key2': f'{your_claude_key_2}',
   }
   # Emu for Image Generation
   EMU_IMAGE_PATH = '/data/yzeng58/cobsat/models/Emu/Emu1/model_weights/Emu/pretrain'
   # Emu-Instruct
   EMU_INSTRUCT_PATH = '/data/yzeng58/cobsat/models/Emu/Emu1/model_weights/Emu/Emu-instruct.pt'
   # Emu-Generation
   EMU_TEXT_PATH = '/data/yzeng58/cobsat/models/Emu/Emu1/model_weights/Emu/Emu-pretrain.pt'

   # WANDB Logging https://wandb.ai/site
   WANDB_ENTITY = 'lee-lab-uw-madison'
   WANDB_PROJECT = 'cobsat'
   ```
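As a quick sanity check, here is a minimal sketch of how the API key dictionaries above might be consumed. The `key_rotator` helper and the example key values are hypothetical and not part of the repository; the repo's actual code may handle keys differently.

```python
from itertools import cycle

# Hypothetical example values; in the repo these live in environment.py
OPENAI_API_KEY = {
    'key1': 'sk-example-1',
    'key2': 'sk-example-2',
}

def key_rotator(key_dict):
    """Cycle through the configured API keys, e.g. to spread requests
    across keys when one hits a rate limit."""
    return cycle(key_dict.values())

rotator = key_rotator(OPENAI_API_KEY)
first, second, third = next(rotator), next(rotator), next(rotator)
# `third` wraps around to the first key again
```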
## Step 2: Download Dataset

<img width="903" alt="image" src="imgs/dataset_overview.jpg">

To use our dataset, please follow the steps below.
1. Download the images and their corresponding descriptions of our dataset.

   ```bash
   wget "https://huggingface.co/datasets/yzeng58/CoBSAT/resolve/main/datasets.zip"
   ```
2. Uncompress the `datasets.zip` file via `unzip datasets.zip` and move the `datasets` folder into your `cobsat` folder.
At this point, the structure of your `cobsat` folder should look like this.

```
.
├── ...
├── datasets                # download the dataset in this step
├── load_models
│   ├── call_emu.py
│   ├── call_emu2.py
│   ├── call_gill.py
│   ├── call_gpt.py
│   ├── call_llava.py       # LLaVA-1.5
│   ├── call_llava16.py     # LLaVA-NeXT
│   ├── call_qwen.py
│   ├── call_seed.py
│   ├── call_gemini.py
│   ├── call_claude.py
│   ├── call_your_model.py  # [optional] create python file to load the model you want to evaluate
│   └── ...
├── models
│   ├── SEED
│   ├── gill
│   ├── Emu
│   │   └── Emu1
│   ├── LLaVA
│   ├── Qwen-VL
│   ├── Gemini
│   ├── Claude
│   ├── OwnModel            # [optional] input your own model folder
│   └── ...
├── ...
├── environment.py          # follow the instruction above to create this file
├── load_model.py           # [optional] add your own model
└── ...
```
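To verify the unzipped dataset landed in the right place, a small sketch like the following can check that the expected top-level folders exist. The `missing_dataset_dirs` helper is illustrative and not part of the repository.

```python
import os
import tempfile
from pathlib import Path

def missing_dataset_dirs(root, expected=('datasets', 'load_models', 'models')):
    """Return the expected top-level folders that are absent under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]

# Example: a checkout where the datasets folder has not been moved in yet
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'load_models'))
os.makedirs(os.path.join(tmp, 'models'))
print(missing_dataset_dirs(tmp))  # -> ['datasets']
```

Run it from the `cobsat` directory (e.g. `missing_dataset_dirs('.')`); an empty list means the layout matches the tree above.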
## Step 3: Select MLLMs

We have implemented several state-of-the-art models for your convenience. Additionally, we offer guidelines for integrating your own MLLMs.

### Supported Models
- [x] SEED-LLaMA
  - Image Generation
  - Text Generation
  - Fine-Tuning
- [x] GILL
  - Image Generation
  - Text Generation
- [x] Emu
  - Image Generation
  - Text Generation
- [x] Emu2
  - Image Generation
  - Text Generation
- [x] GPT-4V
  - Text Generation
- [x] LLaVA-1.5
  - Text Generation
- [x] LLaVA-1.6/LLaVA-NeXT
  - Text Generation
- [x] Qwen-VL
  - Text Generation
  - Fine-Tuning
- [x] Gemini
  - Text Generation
  - Image Generation
- [x] Claude
  - Text Generation
### Feature Your Own Model

Throughout this section, the placeholder "OwnModel" can be substituted with the name of your specific model, such as "gpt4v".
1. Create your own model folder `OwnModel/` in `models/` if needed. Check this for examples.

2. Create python file `call_OwnModel.py` in `load_models/` to load your own model.

   <details><summary> <code>call_OwnModel.py</code> template </summary>

   Your `call_OwnModel.py` script should include at least the following essential functions:

   * `load_OwnModel`: utilized for loading the model to avoid repeated loading during inference or fine-tuning. In certain cases, this function may not be necessary. For example, OpenAI provides API access for GPT-4V, enabling inference without the need to explicitly load the model.
   * `call_OwnModel`: employs the model to perform inference tasks.
   ```python
   # Template
   def load_OwnModel(
       device = 'cuda',
       seed = 123,
   ):
       ...
       return model, others
   ```

   You have the flexibility to define the input parameters and the format of the return values according to your needs.
   ```python
   # Template
   def call_OwnModel(
       text_inputs = ["Angry", "Cry", "Fly"],
       image_inputs = [
           "datasets/action_dog/angry_dog.jpg",
           "datasets/action_dog/cry_dog.jpg",
       ],
       seed = 123,
       gen_mode = 'text',
       instruction = [
           "I will provide you with a few examples with text and images. Complete the example with the description of the next image. The description should be clear with main object, and include details such as color, texture, background, style, an
   ```
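Once `call_OwnModel.py` exists, the benchmark needs a way to route a model name to its `call_*` function. A hypothetical sketch of such a dispatch is below; the registry, function names, and error handling here are illustrative only, so check `load_model.py` in the repo for the actual wiring.

```python
# Hypothetical stand-in for load_models/call_OwnModel.py's call_OwnModel
def call_own_model(text_inputs, image_inputs, **kwargs):
    return f"generated description for {len(text_inputs)} prompts"

# Hypothetical registry mapping model names to their call functions
MODEL_REGISTRY = {
    'OwnModel': call_own_model,
    # 'gpt4v': call_gpt, 'llava': call_llava, ...
}

def load_model(model_name):
    """Look up the call function registered under `model_name`."""
    try:
        return MODEL_REGISTRY[model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {model_name}") from None

call_fn = load_model('OwnModel')
out = call_fn(["Angry", "Cry", "Fly"], [])
```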
