[2024-ACL]: TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
TextBind: Multi-turn Interleaved Multimodal Instruction-following
<span id='content'/>
Content:
- <a href='#introduction'>1. Introduction</a>
- <a href='#running_textbind'>2. Build Our Demo Locally</a>
- <a href='#install_environment'>2.1. Environment Installation</a>
- <a href='#prepare_vision_model'>2.2. Prepare Vision Model</a>
- <a href='#prepare_textbind_weights'>2.3. Prepare TextBind Weights</a>
- <a href='#running_demo'>2.4. Running Demo</a>
- <a href='#train_textbind'>3. Train Your Own Models Using Our TextBind Recipe</a>
- <a href='#data_preparation'>3.1. Data Preparation</a>
- <a href='#prepare_blip2_qformer'>3.2. Prepare BLIP-2 Q-Former</a>
- <a href='#training_configurations'>3.3. Training Configurations</a>
- <a href='#training_textbind'>3.4. Training TextBind</a>
- <a href='#license'>Usage and License Notices</a>
- <a href='#citation'>Citation</a>
<span id='introduction'/>
1. Introduction: <a href='#content'>[Back to Top]</a>
<p align="center" width="100%"> <img src="./introduction.png" style="min-width: 300px; display: block; margin: auto;"> </p>
Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability in tackling various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated for multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
<span id='running_textbind'/>
2. Build Our Demo Locally: <a href='#content'>[Back to Top]</a>
<span id='install_environment'/>2.1. Install Environment:
Install the PyTorch package with the correct CUDA version, for example:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
Then install the remaining required packages by running:
pip install -r requirements.txt
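As a quick sanity check (a minimal sketch, not part of the original setup), you can confirm that the installed PyTorch build can actually see your GPU before proceeding:
import torch

# Print the installed version and whether a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())  # expect True on a working CUDA 11.7 setup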
<span id='prepare_vision_model'/>
2.2. Prepare Vision Model:
Following BLIP-2, we use EVA-CLIP as the vision model. You can run the following commands to prepare it:
import torch
from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 model and keep only its EVA-CLIP vision tower.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
vision_model = model.vision_model

# Save the vision tower as a standalone checkpoint for later stages.
vision_model.save_pretrained("checkpoint/blip2_vision_model")
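To verify the export worked (a hedged sketch using the standard transformers API), the saved directory can be loaded back as a standalone vision model:
from transformers import Blip2VisionModel

# Reload the exported EVA-CLIP vision tower from the standalone checkpoint.
vision_model = Blip2VisionModel.from_pretrained("checkpoint/blip2_vision_model")
print(vision_model.config.hidden_size)  # basic smoke test that config and weights load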
<span id='prepare_textbind_weights'/>
2.3. Prepare TextBind Weights:
|Base Language Model|Huggingface Weights Address|Maximum Sequence Length|
|:-------------:|:-------------:|:-------------:|
|Llama-2-7b-chat-hf|SihengLi/TextBind|768|
Then put the downloaded checkpoints under the ./checkpoint/ directory.
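If you prefer to fetch the weights programmatically, one option is the huggingface_hub client (a sketch, not part of the original instructions; the repo id is taken from the table above):
from huggingface_hub import snapshot_download

# Download the released TextBind checkpoints into ./checkpoint/.
snapshot_download(repo_id="SihengLi/TextBind", local_dir="./checkpoint")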
<span id='running_demo'/>2.4. Running Demo:
Please set the checkpoint paths in scripts/run_demo.sh as
CHECKPOINT=./checkpoint/second_stage_model.pt
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl
SD_BASE=stabilityai/stable-diffusion-xl-base-1.0
SD_REFINER=stabilityai/stable-diffusion-xl-refiner-1.0
Then you can run the demo locally as
bash scripts/run_demo.sh
<span id='train_textbind'/>
3. Train Your Own Models Using Our TextBind Recipe: <a href='#content'>[Back to Top]</a>
Prerequisites: Before training the model, make sure the environment is properly installed and the vision model has been prepared. You can refer to <a href='#install_environment'>[Here]</a> for more information.
<span id='data_preparation'/>3.1. Data Preparation:
Disclaimer: To ensure the reproducibility of our results, we have released our training dataset. The dataset must be used for research purposes only.
|Training Stage|Dataset Address|
|:-------------:|:-------------:|
|Multimodal Alignment|CC3M+CC12M+SBU|
|Multimodal Instruction Following|TextBind|
After downloading, put the downloaded file under the ./data/ directory.
For our TextBind data, you need to download the images manually using the url_list provided in the downloaded file and rename the downloaded images according to the image_list; a sketch of this step follows the directory layout below.
The data directory should look like:
.
└── ./data/
└── /cc_sbu/
└── /cc_sbu_dataset/
└── {00000..01254}.tar
└── /textbind/
├── train.json
└── /images/
├── 490272.png
├── 862235.png
└── ...
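Here is a minimal sketch of the manual image download step. It assumes train.json exposes url_list and image_list fields of equal length; these field names are our reading of the instructions above, so adjust them to match the released file:
import json
import requests

# Hypothetical sketch: fetch each image by URL and save it under the expected name.
with open("data/textbind/train.json") as f:
    data = json.load(f)

for url, name in zip(data["url_list"], data["image_list"]):
    response = requests.get(url, timeout=30)
    if response.ok:
        with open(f"data/textbind/images/{name}", "wb") as image_file:
            image_file.write(response.content)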
<span id='prepare_blip2_qformer'/>
3.2. Prepare BLIP-2 Q-Former:
The BLIP-2 Q-Former is used to initialize our Q-Former. Run:
import torch
from transformers import Blip2ForConditionalGeneration

# Load the full BLIP-2 model, then keep only the Q-Former weights and query tokens.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl")
state_dict = model.state_dict()
state_dict = {key: value for key, value in state_dict.items() if key.split(".")[0] in ["query_tokens", "qformer"]}

# Save the filtered weights for Q-Former initialization.
torch.save(state_dict, "checkpoint/blip2_qformer.pt")
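A quick way to confirm the extraction worked (a sketch, not from the original recipe) is to reload the file and inspect the top-level parameter groups:
import torch

# The saved file should contain only the query tokens and Q-Former parameters.
state_dict = torch.load("checkpoint/blip2_qformer.pt")
print(sorted({key.split(".")[0] for key in state_dict}))  # expect ['qformer', 'query_tokens']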
<span id='training_configurations'/>
3.3. Training Configurations:
The table below shows the training hyperparameters used in our experiments. The hyperparameters were selected under the constraints of our computational resources, i.e., 8 x A100 (40G) GPUs.
|Training Stage|Language Model|Epochs|Batch Size|Learning Rate|Training Modules|
|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
|Multimodal Alignment|Llama-2-7b-chat-hf|2|256|1e-4|Q-Former, Linear|
|Multimodal Instruction Following|Llama-2-7b-chat-hf|3|64|1e-5|Q-Former, Linear, LLM|
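The global batch sizes above must be split across the 8 GPUs, typically with gradient accumulation. The arithmetic below is only an illustration; the per-GPU micro-batch is our assumption, not a released setting:
# Hypothetical breakdown for the multimodal alignment stage.
gpus = 8
global_batch = 256  # from the table above
micro_batch = 4     # per-GPU micro-batch: an assumption, tune for 40G A100s
grad_accum_steps = global_batch // (gpus * micro_batch)
print(grad_accum_steps)  # 8 accumulation steps reproduce the global batch of 256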
<span id='training_textbind'/>3.4. Training TextBind:
For the multimodal alignment stage, please set the paths in scripts/run_first_stage.sh as
TRAIN_DATA_PATH=${your_first_stage_data_path}
CHECKPOINT=./checkpoint/blip2_qformer.pt
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl
then run the following command:
bash scripts/run_first_stage.sh
For the multimodal instruction-tuning stage, please set the paths in scripts/run_second_stage.sh as
TRAIN_DATA_PATH=${your_second_stage_data_path}
CHECKPOINT=${your_first_stage_model_path}
VISION_MODEL=./checkpoint/blip2_vision_model
LANGUAGE_MODEL=meta-llama/Llama-2-7b-chat-hf
PROCESSOR=Salesforce/blip2-flan-t5-xxl
then run the following command:
bash scripts/run_second_stage.sh
<span id='license'/>
Usage and License Notices:
TextBind is intended and licensed for research use only. The dataset is released under CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes. The delta weights are also released under CC BY-NC 4.0 (allowing only non-commercial use).
<span id='citation'/>
Citation:
If you find TextBind useful in your research or applications, please cite it using the following BibTeX:
@article{li2023textbind,
title={TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild},
author={Li, Huayang and Li, Siheng and Cai, Deng and Wang, Longyue and Liu, Lemao and Watanabe, Taro and Yang, Yujiu and Shi, Shuming},
year={2023}
}