
CoBSAT

Implementation and dataset for the paper "Can MLLMs Perform Text-to-Image In-Context Learning?"

Install / Use

/learn @UW-Madison-Lee-Lab/CoBSAT

README

<h1 align="center"> <p>Can MLLMs Perform Multimodal In-Context Learning for Text-to-Image Generation?</p></h1> <h4 align="center"> <p> <a href="https://yzeng58.github.io/" target="_blank">Yuchen Zeng</a><sup>*1</sup>, <a href="https://wonjunn.github.io/" target="_blank">Wonjun Kang</a><sup>*2</sup>, <a href="https://bryce-chen.github.io/" target="_blank">Yicong Chen</a><sup>1</sup>, <a href="http://cvml.ajou.ac.kr/wiki/index.php/Professor" target="_blank">Hyung Il Koo</a><sup>2</sup>, <a href="https://kangwooklee.com/aboutme/" target="_blank">Kangwook Lee</a><sup>1</sup> </p> <p> <sup>1</sup>UW-Madison, <sup>2</sup> FuriosaAI </p> </h4> <p align="center"> <a href="https://github.com/UW-Madison-Lee-Lab/CoBSAT/releases"> <img alt="GitHub release" src="https://img.shields.io/github/release/UW-Madison-Lee-Lab/CoBSAT.svg"> </a> <a href="https://arxiv.org/abs/2402.01293"> <img alt="GitHub release" src="https://img.shields.io/badge/arXiv-2402.01293-b31b1b.svg"> </a> <a href="https://huggingface.co/datasets/yzeng58/CoBSAT"> <img alt="Hugging Face" src="https://img.shields.io/badge/dataset-CoBSAT-orange"> </a> </p>

Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have primarily concentrated on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies like fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements. Our code and dataset are available at <a href="https://github.com/UW-Madison-Lee-Lab/CoBSAT">this link</a>.

<img width="903" alt="image" src="imgs/t2i_icl.jpg">

News 🚀

  • [07/10/24] Our paper is accepted by COLM 2024!
  • [02/29/24] Our dataset is available on 🤗huggingface!
  • [02/02/24] Our paper is available on <a href="https://arxiv.org/abs/2402.01293">arXiv</a>!

Contents

Step 1: Set Up Environment

To set up the environment for benchmarking MLLMs, follow the steps below. These instructions are for Linux.

  1. Clone this repository and rename it to cobsat

    git clone --recurse-submodules https://github.com/UW-Madison-Lee-Lab/CoBSAT
    mv CoBSAT cobsat
    cd cobsat
    
  2. Install Packages

    <details><summary> Linux </summary>
    # create the environment that works for most of the cases
    conda create -n cobsat python=3.8.18
    conda activate cobsat
    pip install torch==2.1.2 torchvision==0.16.2 
    pip install -r conda_env/default_requirements.txt
    
    # create the environment for llava to work 
    conda create -n llava python=3.10.13
    conda activate llava
    pip install --upgrade pip  # enable PEP 660 support
    pip install git+https://github.com/yzeng58/LLaVA/@a61aae093656922fe16ec2152b031dd1de72fe92
    pip install -r conda_env/llava_requirements.txt
    
    # create the environment for gemini to work 
    conda env create -f conda_env/gemini.yml
    
    # create the environment for llava16 (LLaVA-NeXT) to work 
    conda env create -f conda_env/llava16.yml
    
    </details> <details><summary> Mac </summary> TBA </details> <details><summary> Windows </summary> TBA </details>
  3. Create environment.py in the cobsat directory. Note that every variable except root_dir needs to be configured by you; root_dir is derived automatically from the file's location.

    # Configure the environment variables for the project
    
    import os
    root_dir = os.path.dirname(os.path.abspath(__file__))
    
    SEED_PROJECT_ROOT = f'{root_dir}/models/SEED'
    
    ###############
    # NEED UPDATE #
    ###############
    TRANSFORMER_CACHE = '/data/yzeng58/.cache/huggingface/hub' 
    
    #########################
    # NEED UPDATE IF NEEDED #
    #########################
    # GPT-4V
    OPENAI_API_KEY = { 
      'key1': f'{your_openai_key_1}',
      'key2': f'{your_openai_key_2}',
    }
    # Gemini
    GEMINI_API_KEY = {
      'key1': f'{your_gemini_key_1}',
      'key2': f'{your_gemini_key_2}',
    }
    # Claude
    CLAUDE_API_KEY = {
      'key1': f'{your_claude_key_1}',
      'key2': f'{your_claude_key_2}',
    }
    # Emu for Image Generation
    EMU_IMAGE_PATH = '/data/yzeng58/cobsat/models/Emu/Emu1/model_weights/Emu/pretrain' 
    # Emu-Instruct
    EMU_INSTRUCT_PATH = '/data/yzeng58/cobsat/models/Emu/Emu1/model_weights/Emu/Emu-instruct.pt' 
    # Emu-Generation
    EMU_TEXT_PATH = '/data/yzeng58/cobsat/models/Emu/Emu1/model_weights/Emu/Emu-pretrain.pt'
    # WANDB Logging https://wandb.ai/site
    WANDB_ENTITY = 'lee-lab-uw-madison'
    WANDB_PROJECT = 'cobsat'
    
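Each API-key dict above can hold more than one key. As an illustration of how such a dict could be consumed (this helper is not part of the repository; names and placeholder values are assumptions), a round-robin over the configured keys looks like:

```python
from itertools import cycle

# Placeholder values mirroring the OPENAI_API_KEY dict in environment.py;
# in practice these would be your real API keys.
OPENAI_API_KEY = {
    'key1': 'sk-...a',
    'key2': 'sk-...b',
}

# cycle() walks the keys in insertion order and wraps around indefinitely.
_key_cycle = cycle(OPENAI_API_KEY.values())

def next_openai_key():
    """Return the next configured key, e.g. to spread requests across keys."""
    return next(_key_cycle)
```

Rotating keys this way is one simple strategy for spreading request volume when a single key hits rate limits.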

Step 2: Download Dataset

<img width="903" alt="image" src="imgs/dataset_overview.jpg">

To use our dataset, follow the steps below.

  1. Download the images and their corresponding descriptions of our dataset.

    wget "https://huggingface.co/datasets/yzeng58/CoBSAT/resolve/main/datasets.zip"
    
  2. Uncompress the datasets.zip file via unzip datasets.zip and move the datasets folder into your cobsat folder.

At this point, the structure of your cobsat folder should look like this.

.
├── ...          
├── datasets                # download the dataset in this step
├── load_models
│   ├── call_emu.py
│   ├── call_emu2.py
│   ├── call_gill.py
│   ├── call_gpt.py
│   ├── call_llava.py       # LLaVA-1.5
│   ├── call_llava16.py     # LLaVA-NeXT 
│   ├── call_qwen.py
│   ├── call_seed.py
│   ├── call_gemini.py
│   ├── call_claude.py
│   ├── call_your_model.py  # [optional] create python file to load the model you want to evaluate
│   └── ... 
├── models                  
│   ├── SEED                
│   ├── gill                
│   ├── Emu                 
│   │   └── Emu1 
│   ├── LLaVA               
│   ├── Qwen-VL    
│   ├── Gemini
│   ├── Claude   
│   ├── OwnModel            # [optional] input your own model folder
│   └── ...
├── ...
├── environment.py          # follow the instruction above to create this file
├── load_model.py           # [optional] add your own model                
└── ...

Step 3: Select MLLMs

We have implemented several state-of-the-art models for your convenience. Additionally, we offer guidelines for integrating your own MLLMs.

Supported Models

  • [x] SEED-LLaMA
    • Image Generation
    • Text Generation
    • Fine-Tuning
  • [x] GILL
    • Image Generation
    • Text Generation
  • [x] Emu
    • Image Generation
    • Text Generation
  • [x] Emu2
    • Image Generation
    • Text Generation


Feature Your Own Model

Throughout this section, the placeholder "OwnModel" can be substituted with the name of your specific model, such as "gpt4v".

  1. Create your own model folder OwnModel/ in models/ if needed. Check this for examples.

  2. Create python file call_OwnModel.py in load_models/ to load your own model.

    <details><summary> <code>call_OwnModel.py</code> template </summary> Your `call_OwnModel.py` script should include at least the following essential functions:
    • load_OwnModel: Utilized for loading the model to avoid repeated loading during inference or fine-tuning. In certain cases, this function may not be necessary. For example, OpenAI provides API access for GPT-4V, enabling inference without the need to explicitly load the model.
    • call_OwnModel: Employs the model to perform inference tasks.
    # Template
    def load_OwnModel(
        device = 'cuda',
        seed = 123,
    ):
        ...
        return model, others
    

    You have the flexibility to define the input parameters and the format of the return values according to your needs.

    # Template
    def call_OwnModel(
        text_inputs = ["Angry", "Cry", "Fly"],
        image_inputs = [
            "datasets/action_dog/angry_dog.jpg",
            "datasets/action_dog/cry_dog.jpg",
        ],
        seed = 123,
        gen_mode = 'text',
        instruction = [
            "I will provide you with a few examples with text and images. Complete the example with the description of the next image. The description should be clear with main object, and include details such as color, texture, background, style, an
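Since the call_OwnModel template above is cut off in this listing, here is a minimal self-contained sketch of the two-function contract it describes. The model object, return shapes, and gen_mode values are illustrative assumptions, not the repository's actual API:

```python
def load_OwnModel(device='cuda', seed=123):
    # Load weights once so inference or fine-tuning does not re-initialize
    # the model on every call. A dict stands in for a real model object here.
    model = {'device': device, 'seed': seed}
    return model

def call_OwnModel(model, text_inputs=None, image_inputs=None,
                  seed=123, gen_mode='text'):
    # gen_mode selects the output modality: 'text' yields a description of the
    # next image, 'image' yields a path to a generated image file.
    text_inputs = text_inputs or []
    if gen_mode == 'text':
        return {'description': ' + '.join(text_inputs)}
    return {'image_path': 'output.jpg'}
```

The key point of the contract is the split: load once via load_OwnModel, then call call_OwnModel repeatedly with in-context examples (text_inputs and image_inputs) and the desired gen_mode.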
    