VALOR

[TPAMI2024] Code and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset


<div align="center"><img src="img/img_radar.png" width="75%" height="75%"></div>
  • This is the official repository of VALOR, which provides training & testing code and pretraining checkpoints.
  • The VALOR-32K dataset (annotations) can be downloaded from BaiduDiskLink. Raw videos can be downloaded from BaiduDiskLink.
  • The VALOR-1M dataset (annotations) can be downloaded from BaiduDiskLink. Raw videos can be downloaded from YouTube.
  • The paper, with audio files embedded in the PDF, can be found on the project page.
  • We have proposed a stronger vision-audio-subtitle-text omni-modality foundation model (VAST): Paper, GitHub page.
  • We have proposed a new, strong video-language pretraining model (COSA): Paper, Code.


<div align="center"><img src="img/img_model.png"></div>

Building Environment

  • VALOR is implemented in PyTorch. We use pytorch-1.9.0 and cuda-11.1; other versions may also be compatible.
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
  • Build apex:
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  • Install the remaining required packages:
sh preinstall.sh
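After installation, you can sanity-check that the installed PyTorch matches the versions assumed above. This is a minimal sketch, not part of the VALOR codebase; the `version_matches` helper is a hypothetical name introduced here for illustration.

```python
# Sanity-check sketch: verify torch 1.9.0 / CUDA 11.1 as assumed by this README.
def version_matches(installed: str, expected: str) -> bool:
    """True if an installed version string matches the expected release,
    ignoring build suffixes (e.g. '1.9.0+cu111' matches '1.9.0')."""
    return installed.split("+")[0] == expected

if __name__ == "__main__":
    import torch  # imported lazily so the helper works without torch installed

    ok_torch = version_matches(torch.__version__, "1.9.0")
    ok_cuda = torch.version.cuda == "11.1"
    print(f"torch {torch.__version__} (expected 1.9.0): {'OK' if ok_torch else 'mismatch'}")
    print(f"CUDA {torch.version.cuda} (expected 11.1): {'OK' if ok_cuda else 'mismatch'}")
```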

Download Checkpoints

  • Download pretrained_weights (BERT, CLIP, VideoSwin) and put the pretrained_weights dir under the main path (VALOR/pretrained_weights).
  • VALOR models.

| Model   | Pretrained Ckpt | Finetuned Ckpt on MSRVTT-Retrieval | Finetuned Ckpt on MSRVTT-Caption |
|---------|-----------------|------------------------------------|----------------------------------|
| VALOR-B | VALOR-base      | VALOR_base_msr_ret.pt              | VALOR_base_msr_cap.pt            |
| VALOR-L | VALOR-large     | VALOR_large_msr_ret.pt             | VALOR_large_msr_cap.pt           |

Put VALOR-base and VALOR-large under the output dir. (VALOR/output/VALOR-base, VALOR/output/VALOR-large)
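Before launching any jobs, you can verify that the layout described above is in place. A minimal sketch, assuming the paths stated in this README; the `missing_paths` helper is hypothetical, not a repository function.

```python
# Verify the checkpoint layout this README expects:
# VALOR/pretrained_weights, VALOR/output/VALOR-base, VALOR/output/VALOR-large.
from pathlib import Path

EXPECTED_DIRS = [
    "pretrained_weights",
    "output/VALOR-base",
    "output/VALOR-large",
]

def missing_paths(root: str, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `root`."""
    root_dir = Path(root)
    return [p for p in expected if not (root_dir / p).is_dir()]

if __name__ == "__main__":
    for p in missing_paths("."):
        print("missing:", p)
```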

Prepare Datasets

VALOR is pretrained and tested on multiple vision-language, audio-language, and audiovisual-language datasets, e.g.:

PRETRAIN: VALOR-1M, WebVid-2.5M, CC-3M (VALOR-base)
TEST: VALOR-32K, MSRVTT, MSVD, DiDeMo, LSMDC, ActivityNet, VATEX, AudioCaps, ClothoV1, TGIF-Frame, MSCOCO, VQAV2...

We take MSRVTT as an example to show the data processing procedure; other datasets follow a similar workflow.

  • Make dir VALOR/datasets/MSRVTT.
  • Download raw videos from the website and put them in MSRVTT/raw_videos.
  • Extract video frames (.jpg) and audio files (.wav) using utils/extract_frame_and_wav_multiprocess.py. (Note: VALOR uses these offline-extracted frames and audio for training and testing because of their fast I/O speed. You may switch to reading raw videos via the decord library, which requires changing the VideoMapper and AudioMapper classes in data/data.py.)
  • Prepare id_files (standardsplit_train_id.json, standardsplit_test_id.json, 1KAsplit_train_id.json, 1KAsplit_test_id.json). The format is List(Str): ['video0', 'video1', ...]. The former two are for video captioning and video QA, while the latter two are for video retrieval.
  • Prepare txt_mapper.json. txt_mapper files map video IDs to their descriptions. Format
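The file-writing steps above can be sketched as follows. The id_files format (a JSON list of video-ID strings) is stated in this README; the txt_mapper.json schema is truncated here, so the videoID-to-caption mapping below is an assumption for illustration only, to be verified against data/data.py.

```python
# Illustrative sketch of preparing the MSRVTT annotation files.
import json
from pathlib import Path

def write_id_file(path: str, video_ids: list) -> None:
    """id_files format (per README): a JSON list of strings, e.g. ["video0", "video1"]."""
    Path(path).write_text(json.dumps(video_ids))

def write_txt_mapper(path: str, id_to_caption: dict) -> None:
    """ASSUMED schema: {"video0": "a caption", ...} -- not confirmed by the README."""
    Path(path).write_text(json.dumps(id_to_caption))

if __name__ == "__main__":
    write_id_file("standardsplit_train_id.json", ["video0", "video1"])
    write_txt_mapper("txt_mapper.json", {"video0": "a man is singing"})
```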