
[NIPS2023] VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

<div align="center"><img src="img/radar_compare_alldata_vast.png" width="75%" height="75%"></div>


<div align="center"><img src="img/VAST-model.jpg"></div>

Building Environment

VAST is implemented in PyTorch. We use Python 3.9 and CUDA 11.7; other versions may also be compatible. The remaining required packages are listed in preinstall.sh.

conda create -n vast python=3.9
conda activate vast
sh preinstall.sh

Download the basic encoders' pretrained checkpoints

Make a directory named pretrained_weights under the main working directory.

1. Download the EVA-CLIP weights:

wget -P pretrained_weights/clip/ https://huggingface.co/QuanSun/EVA-CLIP/resolve/main/EVA01_CLIP_g_14_psz14_s11B.pt

2. Download the BEATs weights from https://github.com/microsoft/unilm/tree/master/beats

3. Download the BERT weights:

# Download bert-base-uncased from the Hugging Face hub and save the model
# and tokenizer under pretrained_weights/bert/
from transformers import BertModel, BertTokenizer
bert = BertModel.from_pretrained('bert-base-uncased')
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert.save_pretrained('pretrained_weights/bert/bert-base-uncased')
bert_tokenizer.save_pretrained('pretrained_weights/bert/bert-base-uncased')

The prepared pretrained_weights directory should look as follows:

    ├── pretrained_weights
    │   ├── beats
    │   │   └── BEATs_iter3_plus_AS2M.pt
    │   ├── bert
    │   │   └── bert-base-uncased
    │   ├── clip
    │   │   └── EVA01_CLIP_g_14_psz14_s11B.pt
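As an optional sanity check, here is a minimal sketch (run from the main working directory; the paths simply mirror the layout above) that verifies the checkpoints are in place:

import os

# Expected checkpoint locations, mirroring the layout shown above
expected = [
    'pretrained_weights/clip/EVA01_CLIP_g_14_psz14_s11B.pt',
    'pretrained_weights/beats/BEATs_iter3_plus_AS2M.pt',
    'pretrained_weights/bert/bert-base-uncased',
]
for path in expected:
    print(('ok     ' if os.path.exists(path) else 'MISSING'), path)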

Download VAST models and captioners (for labeling your own data)

Make a directory named output under the main working directory.

1. Download the VAST model (optional, for finetuning)

[Google Drive Link] [Baidu Cloud Link]

2. Download the vision captioner (optional, for labeling images/videos)

[Google Drive Link] [Baidu Cloud Link]

3. Download the audio captioner (optional, for labeling audio)

[Google Drive Link] [Baidu Cloud Link]

The prepared output directory should look as follows:

    ├── output
    │   ├── vast
    │   │   ├── pretrain_vast
    │   │   ├── vision_captioner
    │   │   └── audio_captioner

Download VAST-27M annotations for pretraining

[Google Drive Link] [Baidu Cloud Link]

Raw videos can be downloaded from YouTube.
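For example, here is a minimal, hypothetical sketch using the yt-dlp Python API (yt-dlp is not part of preinstall.sh, so install it separately); it assumes the annotation file is a JSON list of entries with a 'video_id' field whose prefix before the last underscore is the YouTube ID, as in '09WssDay9FE_1'. The file name annotations_vast27m.json is a placeholder for whichever annotation file you downloaded.

import json
from yt_dlp import YoutubeDL  # pip install yt-dlp

# Placeholder path: point this at the downloaded VAST-27M annotation file
with open('annotations_vast27m.json') as f:
    clips = json.load(f)

# Strip the clip suffix (e.g. '09WssDay9FE_1' -> '09WssDay9FE') to recover the YouTube ID
youtube_ids = {c['video_id'].rsplit('_', 1)[0] for c in clips}

opts = {'format': 'mp4', 'outtmpl': 'raw_videos/%(id)s.%(ext)s'}
with YoutubeDL(opts) as ydl:
    ydl.download([f'https://www.youtube.com/watch?v={vid}' for vid in youtube_ids])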

Download downstream dataset annotations for finetuning

Make a directory named datasets under the main working directory.

[Google Drive Link] [Baidu Cloud Link]

The prepared datasets directory should look as follows:

    ├── datasets
    │   ├── annotations
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd
    │   ├── srcdata
    │   │   ├── msrvtt
    │   │   ├── ...
    │   │   └── msvd

srcdata (images/videos/audio) must be collected by yourself.

Finetune Model

  • finetune retrieval tasks:
    sh scripts/vast/finetune_ret.sh
  • finetune captioning tasks:
    sh scripts/vast/finetune_cap.sh
  • finetune QA tasks:
    sh scripts/vast/finetune_qa.sh

Pretrain Model

sh scripts/pretrain_vast.sh

Test your finetuned model

For example, if the command for finetuning the retrieval model is as follows:

python3 -m torch.distributed.launch \
--nnodes 1 \
--node_rank 0 \
--nproc_per_node 8 \
--master_port 9834 \
./run.py \
--learning_rate 2e-5 \
--checkpointing true \
--first_eval true \
--save_best true \
--config ./config/vast/finetune_cfg/retrieval-msrvtt.json \
--pretrain_dir $output_dir \
--output_dir $output_dir/downstream/retrieval-msrvtt \

If you want to test the model, just add the following two lines to the command:

--mode 'testing' \
--checkpoint /PATH/TO/SAVED_CHECKPOINT.pt

Labeling your own data using VAST's captioners

You need to prepare:

1) a folder containing all the videos/images or audio files;

2) a meta.json composed of [{'video_id':'09WssDay9FE_1'},{'video_id':'09WssDay9FE_2'},...];

and then write the config file.
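For instance, here is a minimal sketch for generating meta.json; the folder name my_videos/ is hypothetical, and it assumes each video_id is simply the media file name without its extension, as in the example above. Write the file wherever your config expects it.

import json
from pathlib import Path

media_dir = Path('my_videos')  # hypothetical folder holding your videos/images/audio files

# Assume video_id is the file name without its extension
meta = [{'video_id': p.stem} for p in sorted(media_dir.iterdir()) if p.is_file()]

with open('meta.json', 'w') as f:
    json.dump(meta, f)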

sh scripts/vast/vision_captioner.sh
sh scripts/vast/audio_captioner.sh

Common controllable options that can be passed on the command line to override the config files:

  • --train_vision_sample_num
  • --test_vision_sample_num
  • --train_audio_sample_num
  • --test_audio_sample_num
  • --train_task
  • --test_task
  • --learning_rate
  • --train_batch_size
  • --test_batch_size
  • --train_epoch
  • --train_steps
  • --checkpointing
  • --frozen_vision
  • --valid_freq
  • --beam_size

Citation

If you find this code useful for your research, please consider citing:

@article{chen2024vast,
  title={Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset},
  author={Chen, Sihan and Li, Handong and Wang, Qunbo and Zhao, Zijia and Sun, Mingzhen and Zhu, Xinxin and Liu, Jing},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}