VAST
[NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
[NIPS2023] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
<div align="center"><img src="img/radar_compare_alldata_vast.png" width="75%" height="75%"></div>
<div align="center"><img src="img/VAST-model.jpg"></div>

Building Environment
VAST is implemented in PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The remaining required packages are listed in preinstall.sh.
conda create -n vast python=3.9
conda activate vast
sh preinstall.sh
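After installation, you can sanity-check that PyTorch sees CUDA with a one-liner (not part of the original scripts):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"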
Download the basic encoders' pretrained checkpoints
Make a directory named pretrained_weights under the main working directory.
1. Download the EVA-CLIP weight:
wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt
2. Download the BEATs weight from https://github.com/microsoft/unilm/tree/master/beats
3. Download the BERT weight:
from transformers import BertModel, BertTokenizer

# Download bert-base-uncased and save it under pretrained_weights/bert/
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')
The resulting pretrained_weights directory should be organized as follows:
├── pretrained_weights
│ ├── beats
│ │ └── BEATs_iter3_plus_AS2M.pt
│ ├── bert
│ │ └── bert-base-uncased
│ ├── clip
│ │ └── EVA01_CLIP_g_14_psz14_s11B.pt
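A quick way to confirm the checkpoints are in place (file names follow the tree above):
ls pretrained_weights/clip pretrained_weights/beats pretrained_weights/bert/bert-base-uncased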
Download VAST models and captioners (for labeling your own data)
Make a directory named output under the main working directory.
1. Download the VAST model (optional, for finetuning):
[Google Drive Link] [Baidu Cloud Link]
2. Download the vision captioner (optional, for labeling images/videos):
[Google Drive Link] [Baidu Cloud Link]
3. Download the audio captioner (optional, for labeling audio):
[Google Drive Link] [Baidu Cloud Link]
The resulting output directory should be organized as follows:
├── output
│ ├── vast
│ │ ├── pretrain_vast
│ │ ├── vision_captioner
│ │ └── audio_captioner
Download VAST-27M annotations for pretraining
[Google Drive Link] [Baidu Cloud Link]
The raw videos can be downloaded from YouTube.
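The repository does not ship a downloader; one common option (not part of this repo, shown only as a suggestion) is yt-dlp, using the YouTube IDs from the annotation file:
yt-dlp -o "raw_videos/%(id)s.%(ext)s" <YOUTUBE_VIDEO_ID>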
Download downstream dataset annotations for finetuning
Make a directory named datasets under the main working directory.
[Google Drive Link] [Baidu Cloud Link]
The resulting datasets directory should be organized as follows:
├── datasets
│ ├── annotations
│ │ ├── msrvtt
│ │ ├── ...
│ │ └── msvd
│ ├── srcdata
│ │ ├── msrvtt
│ │ ├── ...
│ │ └── msvd
The srcdata (images/videos/audio) needs to be collected by yourself.
Finetune Model
- Finetune on retrieval tasks:
sh scripts/vast/finetune_ret.sh
- Finetune on captioning tasks:
sh scripts/vast/finetune_cap.sh
- Finetune on QA tasks:
sh scripts/vast/finetune_qa.sh
Pretrain Model
sh scripts/pretrain_vast.sh
Test your finetuned model
For example, if the command for finetuning the retrieval model is as follows:
python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \
If you want to test the model, just add the following two lines to the command:
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
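The end of the retrieval command above would then become (the checkpoint path is a placeholder):
--output_dir $output_dir/downstream/retrieval-msrvtt \
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt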
Labeling your own data using VAST's captioners
You need to prepare: 1) a folder containing all videos/images or audio files, and 2) a meta.json composed of [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...]. Then write the config file.
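If you only have a folder of media files, a minimal sketch for generating meta.json could look like this (the folder name and the convention that video_id is the file name without its extension are assumptions):
import json, os

media_dir = 'your_media_folder'  # folder containing all videos/images or audio files (assumed path)
video_ids = [os.path.splitext(name)[0] for name in sorted(os.listdir(media_dir))]
with open('meta.json', 'w') as f:
    json.dump([{'video_id': vid} for vid in video_ids], f)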
sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh
Common command-line options that can override the values in the config files:
--train_vision_sample_num
--test_vision_sample_num
--train_audio_sample_num
--test_audio_sample_num
--train_task
--test_task
--learning_rate
--train_batch_size
--test_batch_size
--train_epoch
--train_steps
--checkpointing
--frozen_vision
--valid_freq
--beam_size
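For example, appending the following to the run.py command overrides the corresponding values in the config file (the numbers are only illustrative):
--train_batch_size 64 \
--train_epoch 5 \
--test_vision_sample_num 8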
Citation
If you find this code useful for your research, please consider citing:
@article{chen2024vast,
  title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}