# MemVP

Official code of "Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning" (ICML 2024)
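For readers new to the method: the title refers to injecting visual features into the feed-forward network's key-value "memory" rather than into the input token sequence. Below is a minimal PyTorch sketch of that idea; the class, names, and shapes are illustrative assumptions for exposition, not the actual implementation in the `memvp` package.

```python
import torch
import torch.nn as nn


class MemorySpaceFFN(nn.Module):
    """Toy FFN whose key/value memory is extended with visual prompts.

    Illustrative only: names and shapes are assumptions, not MemVP's code.
    """

    def __init__(self, d_model: int, d_ff: int, d_visual: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)    # rows act like "keys"
        self.fc2 = nn.Linear(d_ff, d_model)    # columns act like "values"
        # Project visual features into extra key/value memory slots.
        self.to_keys = nn.Linear(d_visual, d_model)
        self.to_values = nn.Linear(d_visual, d_model)

    def forward(self, x: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); visual: (batch, n_prompts, d_visual)
        k = self.to_keys(visual)               # extra keys from the image
        v = self.to_values(visual)             # extra values from the image
        h = torch.relu(self.fc1(x))            # ordinary FFN path
        # Activations against the visual keys select the visual values.
        a = torch.relu(torch.einsum("bsd,bpd->bsp", x, k))
        return self.fc2(h) + torch.einsum("bsp,bpd->bsd", a, v)
```

Because the visual information enters through the FFN weights' memory space instead of extra input tokens, the sequence length (and hence attention cost) stays unchanged.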
<p align="left"> <a href="https://arxiv.org/abs/2405.05615" alt="arXiv"> <img src="https://img.shields.io/badge/arXiv-2405.05615-b31b1b.svg?style=flat" /></a> </p> <p align="center"> <img src="./figs/fig1.png" width="700"> </p>

## Environment
```shell
conda create -n memvp python=3.10
conda activate memvp
pip install -r requirements.txt
pip install -e .
```
## TODO
- [x] Code of experiments on LLaMA.
- [ ] Code of experiments on BART and T5.
## Preparation
- For ScienceQA, please refer to the official repo.
- For the LLaMA weights, please refer to the official request form or the unofficial HuggingFace repos LLaMA-7B and LLaMA-13B.

Organize the data and weights as follows:
```
<your path>/
|-- memvp/
|-- scripts/
|-- train.py
|-- eval.py
|-- ......
|-- data/
|   |-- problem.json
|   |-- pid_splits.json
|   |-- captions.json
|   |-- images/
|       |-- train    # ScienceQA train images
|       |-- val      # ScienceQA val images
|       |-- test     # ScienceQA test images
|-- weights/
    |-- tokenizer.model
    |-- 7B/
    |   |-- params.json
    |   |-- consolidated.00.pth
    |-- 13B/
        |-- params.json
        |-- consolidated.00.pth
        |-- consolidated.01.pth
```
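Before launching fine-tuning, it can help to sanity-check that the files are where the scripts expect them. A small hypothetical helper, assuming the directory tree shown above (adjust the root path to `<your path>`):

```python
from pathlib import Path

# Expected paths relative to the project root, per the tree above.
EXPECTED = [
    "data/problem.json",
    "data/pid_splits.json",
    "data/captions.json",
    "data/images/train",
    "data/images/val",
    "data/images/test",
    "weights/tokenizer.model",
    "weights/7B/params.json",
    "weights/7B/consolidated.00.pth",
]


def missing_paths(root):
    """Return the expected paths (relative to root) that do not exist."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", *missing, sep="\n  - ")
    else:
        print("All expected paths found.")
```

For a 13B run, the `weights/13B/` entries would be checked analogously.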
## Fine-Tuning & Inference

```shell
# LLaMA-7B
bash scripts/finetuning_sqa_7b.sh
bash scripts/eval_sqa_7b.sh

# LLaMA-13B
bash scripts/finetuning_sqa_13b.sh
bash scripts/eval_sqa_13b.sh
```
Fine-tuning takes around 40 minutes for LLaMA-7B and around 1 hour for LLaMA-13B on 8x A800 (80GB) GPUs.
## Acknowledgements
## Citation

```bibtex
@article{jie2024memvp,
  title={Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning},
  author={Jie, Shibo and Tang, Yehui and Ding, Ning and Deng, Zhi-Hong and Han, Kai and Wang, Yunhe},
  journal={arXiv preprint arXiv:2405.05615},
  year={2024}
}
```
