VAST
[NIPS2023] Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
[NIPS2023] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
<div align="center"><img src="img/radar_compare_alldata_vast.png" width="75%" height="75%"></div>
<div align="center"><img src="img/VAST-model.jpg"></div>

Building Environment
VAST is implemented in PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The remaining required packages are listed in preinstall.sh.
conda create -n vast python=3.9
conda activate vast
sh preinstall.sh
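After installation, you can sanity-check that PyTorch sees CUDA with a one-liner (not part of the original scripts):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"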
Download the basic encoders' pretrained checkpoints
Make a directory named pretrained_weights under the main working directory.
1. Download the EVA-CLIP weight:
wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt
2. Download the BEATs weight from https://github.com/microsoft/unilm/tree/master/beats
3. Download the BERT weight:
from transformers import BertModel, BertTokenizer

# Download bert-base-uncased and save it under pretrained_weights/bert/
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')
The resulting pretrained_weights directory should be organized as follows:
├── pretrained_weights
│ ├── beats
│ │ └── BEATs_iter3_plus_AS2M.pt
│ ├── bert
│ │ └── bert-base-uncased
│ ├── clip
│ │ └── EVA01_CLIP_g_14_psz14_s11B.pt
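A quick way to confirm the checkpoints are in place (file names follow the tree above):
ls pretrained_weights/clip pretrained_weights/beats pretrained_weights/bert/bert-base-uncased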
Download VAST models and captioners (for labeling your own data)
Make a directory named output under the main working directory.
1. Download the VAST model (optional, for finetuning):
[Google Drive Link] [Baidu Cloud Link]
2. Download the vision captioner (optional, for labeling images/videos):
[Google Drive Link] [Baidu Cloud Link]
3. Download the audio captioner (optional, for labeling audio):
[Google Drive Link] [Baidu Cloud Link]
The resulting output directory should be organized as follows:
├── output
│ ├── vast
│ │ ├── pretrain_vast
│ │ ├── vision_captioner
│ │ └── audio_captioner
Download VAST-27M annotations for pretraining
[Google Drive Link] [Baidu Cloud Link]
The raw videos can be downloaded from YouTube.
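The repository does not ship a downloader; one common option (not part of this repo, shown only as a suggestion) is yt-dlp, using the YouTube IDs from the annotation file:
yt-dlp -o "raw_videos/%(id)s.%(ext)s" <YOUTUBE_VIDEO_ID>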
Download downstream dataset annotations for finetuning
Make a directory named datasets under the main working directory.
[Google Drive Link] [Baidu Cloud Link]
The resulting datasets directory should be organized as follows:
├── datasets
│ ├── annotations
│ │ ├── msrvtt
│ │ ├── ...
│ │ └── msvd
│ ├── srcdata
│ │ ├── msrvtt
│ │ ├── ...
│ │ └── msvd
The srcdata (images/videos/audio) needs to be collected by yourself.
Finetune Model
- Finetune on retrieval tasks:
sh scripts/vast/finetune_ret.sh
- Finetune on captioning tasks:
sh scripts/vast/finetune_cap.sh
- Finetune on QA tasks:
sh scripts/vast/finetune_qa.sh
Pretrain Model
sh scripts/pretrain_vast.sh
Test your finetuned model
For example, if the command for finetuning the retrieval model is as follows:
python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \
If you want to test the model, just add the following two lines to the command:
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt
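The end of the retrieval command above would then become (the checkpoint path is a placeholder):
--output_dir $output_dir/downstream/retrieval-msrvtt \
--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt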
Labeling your own data using VAST's captioners
You need to prepare: 1) a folder containing all videos/images or audio files, and 2) a meta.json composed of [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...]. Then write the config file.
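If you only have a folder of media files, a minimal sketch for generating meta.json could look like this (the folder name and the convention that video_id is the file name without its extension are assumptions):
import json, os

media_dir = 'your_media_folder'  # folder containing all videos/images or audio files (assumed path)
video_ids = [os.path.splitext(name)[0] for name in sorted(os.listdir(media_dir))]
with open('meta.json', 'w') as f:
    json.dump([{'video_id': vid} for vid in video_ids], f)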
sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh
Common command-line options that can override the values in the config files:
--train_vision_sample_num
--test_vision_sample_num
--train_audio_sample_num
--test_audio_sample_num
--train_task
--test_task
--learning_rate
--train_batch_size
--test_batch_size
--train_epoch
--train_steps
--checkpointing
--frozen_vision
--valid_freq
--beam_size
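For example, appending the following to the run.py command overrides the corresponding values in the config file (the numbers are only illustrative):
--train_batch_size 64 \
--train_epoch 5 \
--test_vision_sample_num 8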
Citation
If you find this code useful for your research, please consider citing:
@article{chen2024vast,
  title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}