# VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners (NeurIPS 2022)

PyTorch code for [Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners](https://arxiv.org/abs/2205.10747).

<img src="vidIL.gif" width="1200">

## Download Datasets & Checkpoints
- Download the dataset annotations zip from Box or Google Drive, then unzip the downloaded datasets under `shared_datasets/`. The resulting `shared_datasets/` folder structure is expected to be:

  ```
  shared_datasets
  ├── README.md
  ├── MSRVTT_caption
  ├── MSRVTT_qa
  ...
  ```

  Then, please refer to the Dataset Instruction for downloading and processing raw videos.
- Download the BLIP checkpoints:

  ```bash
  bash download_blip_checkpoints.sh
  ```
- Download the Input & Output Examples zip from Box or Google Drive. Unzip the folders under `output_example/`; the resulting `output_example/` folder structure is expected to be:

  ```
  output_example
  ├── msrvtt
  ├── msvd_test
  ├── vlep_test
  └── README.md
  ```

- [Update 6/17] GPT-3 results for Video Captioning, Video Question Answering, and VLEP can be downloaded here.
## Set Up Environment

- Launch the Docker environment:
  1. Set the variables `CKPT` and `DATASETS` as commented in `run_docker_vidil.sh`.
  2. Run the Docker image:

     ```bash
     bash run_docker_vidil.sh
     ```

- Set up GPU devices: within the Docker image, set the following environment variables to configure GPU devices:

  ```bash
  export N_GPU=<num of gpus>
  export CUDA_VISIBLE_DEVICES=<0,1,2...>
  ```
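The two variables should agree with each other. As a minimal sketch (the helper name `gpu_env_summary` is hypothetical, not part of the repo), this is the consistency the pipeline scripts assume:

```python
import os

def gpu_env_summary(environ=os.environ):
    """Hypothetical helper: read the two variables the pipeline scripts
    expect and check that they agree with each other."""
    n_gpu = int(environ.get("N_GPU", "0"))
    devices = [d for d in environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if d]
    if n_gpu != len(devices):
        raise ValueError(
            f"N_GPU={n_gpu} but CUDA_VISIBLE_DEVICES lists {len(devices)} devices"
        )
    return {"n_gpu": n_gpu, "devices": devices}

# Example: two visible GPUs
print(gpu_env_summary({"N_GPU": "2", "CUDA_VISIBLE_DEVICES": "0,1"}))
```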
## Generate Video Representation & GPT-3 Prompt

- [Update 6/15] Quick start with generated video representations: frame captions and visual tokens for five datasets can be downloaded here if you don't want to go through the entire pipeline. You can copy the json files following the data structure described below.

The following scripts run the entire pipeline, which (1) generates frame captions; (2) generates visual tokens; and (3) generates few-shot prompts ready for GPT-3. The output folder has the following structure:
```
{dataset_split}
├── frame_caption
│   ├── config.yaml                         # config for frame captioning
│   ├── video_text_Cap.json                 # frame captions w/o filtering
│   └── video_text_CapFilt.json             # frame captions w/ filtering
├── input_prompts
│   ├── {output_name}.jsonl                 # few-shot prompts for GPT-3
│   ├── {output_name}__idx_2_videoid.json   # line idx to video id
│   ├── {output_name}__chosen_samples.json  # chosen examples in the support set
│   ...
└── visual_tokenization_{encoder_name}
    ├── config.yaml                         # config for visual tokenization
    └── visual_tokens.json                  # raw visual tokens of each frame
```
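A quick way to sanity-check a generated output folder against this layout is a small walk over its sub-directories. This is a hypothetical helper (`check_pipeline_output` is not part of the repo), sketched only from the structure shown above:

```python
from pathlib import Path
import tempfile

EXPECTED = ["frame_caption", "input_prompts"]  # visual_tokenization_* is checked by prefix

def check_pipeline_output(root):
    """Hypothetical sanity check: confirm a {dataset_split} output folder
    contains the sub-directories described in the README tree above.
    Returns the list of missing entries (empty means complete)."""
    root = Path(root)
    missing = [d for d in EXPECTED if not (root / d).is_dir()]
    has_vis_tok = any(
        p.name.startswith("visual_tokenization_") for p in root.iterdir() if p.is_dir()
    )
    if not has_vis_tok:
        missing.append("visual_tokenization_<encoder_name>")
    return missing

# Demonstrate on a fake output directory
with tempfile.TemporaryDirectory() as tmp:
    for d in ["frame_caption", "input_prompts", "visual_tokenization_clip"]:
        (Path(tmp) / d).mkdir()
    print(check_pipeline_output(tmp))  # → []
```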
All scripts should be run from the `/src` directory, i.e., the root directory after launching the Docker image. The following are examples of running the pipeline with in-context example selection for several datasets. For additional notes on running pipeline scripts, please refer to the Pipeline Instruction.
### Standalone Pipeline for Frame Captioning and Visual Tokenization

Since we need to sample a few-shot support set from the training sets, the first time the pipeline is run for each dataset, we need to perform frame captioning and visual tokenization on the training set.

For `<dataset>` in `["msrvtt","youcook2","vatex","msvd","vlep"]`:

```bash
bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh <dataset> train <output_root>
```

An example of the frame caption and visual token directories can be found at `output_example/msrvtt/frame_caption` and `output_example/msrvtt/visual_tokenization_clip`.
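The per-dataset loop above can be expanded into the exact shell commands it produces. A minimal sketch (`build_commands` is a hypothetical helper; the script path is taken from this README):

```python
# Hypothetical sketch: expand the "for <dataset> in [...]" instruction above
# into the concrete shell commands it corresponds to.
DATASETS = ["msrvtt", "youcook2", "vatex", "msvd", "vlep"]
SCRIPT = "pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh"

def build_commands(output_root, datasets=DATASETS):
    """Return one command string per dataset, always on the train split."""
    return [f"bash {SCRIPT} {d} train {output_root}" for d in datasets]

for cmd in build_commands("output/"):
    print(cmd)
```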
### Video Captioning

For `<dataset>` in `["msrvtt","youcook2","vatex"]`:

1. Run the standalone frame captioning and visual tokenization pipeline for the chosen `<dataset>`.
2. Run the pipeline for generating video captioning prompts for `<dataset>`, with `<split>` in `["train","val","test"]`:
   - w/o ASR:

     ```bash
     bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection.sh <dataset> <split> <output_root> 10 42 5 caption
     ```

   - w/ ASR:

     ```bash
     bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection_with_asr.sh <dataset> <split> <output_root> 10 42 5 caption_asr
     ```

An example of the output prompt jsonl can be found at `output_example/msrvtt/input_prompts/temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5.jsonl`.
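Each line of the prompt `.jsonl` can be mapped back to its source video via the `{output_name}__idx_2_videoid.json` file described in the folder structure above. A minimal sketch, assuming only that mapping (the `attach_video_ids` helper and the record fields are hypothetical, not the repo's schema):

```python
def attach_video_ids(prompt_lines, idx_2_videoid):
    """Hypothetical sketch: pair each prompt line (index i in the .jsonl)
    with its video id via the {output_name}__idx_2_videoid.json mapping.
    Keys in that mapping are assumed to be string line indices."""
    return [
        {"video_id": idx_2_videoid[str(i)], "prompt": line}
        for i, line in enumerate(prompt_lines)
    ]

# Toy example with a two-line prompt file
mapping = {"0": "video101", "1": "video207"}
pairs = attach_video_ids(["prompt A", "prompt B"], mapping)
print(pairs[1]["video_id"])  # → video207
```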
### Video Question Answering

For `<dataset>` in `["msrvtt","msvd"]`:

1. Run the standalone frame captioning and visual tokenization pipeline for the chosen `<dataset>`.
2. Run the pipeline for generating video question answering prompts for `<dataset>`, with `<split>` in `["train","val","test"]`:

   ```bash
   bash pipeline/scripts/generate_gpt3_query_pipeline_qa_with_in_context_selection.sh <dataset> <split> <output_root> 5 42 5 question
   ```

An example of the output prompt jsonl can be found at `output_example/msvd_test/input_prompts/temp_0.0_gpt3_queries_msvd_qa_clip_shot_5_seed_42.jsonl`.
### Video-Language Event Prediction (VLEP)

1. Run the standalone frame captioning and visual tokenization pipeline for `vlep`.
2. Run the pipeline for generating VLEP prompts:

   ```bash
   bash pipeline/scripts/generate_gpt3_query_pipeline_vlep_with_random_context_asr_multichoice.sh <dataset> <split> <output_root> 10 42
   ```

An example of the output prompt jsonl can be found at `output_example/vlep_test/input_prompts/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice.jsonl`.
### Semi-Supervised Text-Video Retrieval

In the semi-supervised setting, we first generate pseudo labels on the training set, and then train BLIP on the pseudo-labeled dataset for retrieval.

1. Generate the pseudo-labeled training set annotation jsonl. Suppose the raw GPT-3 responses are stored at `<gpt3_response_dir>` and the input prompt directory is at `<input_prompts_dir>`; run:

   ```bash
   python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
   python utils_gpt3/gpt3_response_to_jsonl.py --dataset <dataset_name> --gpt3_processed_dir <processed_response_dir> --output_dir <pseudo_label_ann_output_dir>
   ```

   Examples of `<gpt3_response_dir>`, `<input_prompts_dir>`, `<processed_response_dir>`, and `<pseudo_label_ann_output_dir>` can be found at `output_example/msrvtt/gpt3_response`, `output_example/msrvtt/input_prompts`, `output_example/msrvtt/processed_response_dir`, and `output_example/msrvtt/pseudo_label_ann`.

2. Finetune pretrained BLIP on the pseudo-labeled data. For `<dataset>` in `["msrvtt","vatex"]`, set the field named `train_ann_jsonl` in `configs/train_blip_video_retrieval_<dataset>_pseudo.yaml` to the path of the output jsonl from step one in `<pseudo_label_ann_output_dir>`. Then run:

   ```bash
   bash scripts/train_caption_video.sh train_blip_video_retrieval_<dataset>_pseudo
   ```
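The pseudo-label conversion in step one boils down to writing one annotation record per video, with the GPT-3 generated caption serving as the pseudo label. A minimal sketch of that idea — the function name and the `{"video", "caption"}` field names are assumptions for illustration, not the exact schema produced by `utils_gpt3/gpt3_response_to_jsonl.py`:

```python
import json

def responses_to_pseudo_labels(processed_responses):
    """Hypothetical sketch: turn a {video_id: generated_caption} dict into
    jsonl text, one record per video, caption used as the pseudo label.
    Field names are illustrative assumptions."""
    lines = []
    for video_id, caption in processed_responses.items():
        lines.append(json.dumps({"video": video_id, "caption": caption.strip()}))
    return "\n".join(lines)

jsonl = responses_to_pseudo_labels({"video1": " a man cooking pasta "})
print(jsonl)
```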
## Evaluation

Scripts for evaluating generation results from GPT-3:

- Video Captioning: please refer to the example written in the script for more details about the required inputs.

  ```bash
  bash scripts/evaluation/eval_caption_from_gpt3_response.sh
  ```

- Question Answering: please refer to the example written in the script for more details about the required inputs.

  ```bash
  bash scripts/evaluation/eval_qa_from_gpt3_response.sh
  ```

- VLEP:
  1. Get the processed GPT-3 responses. Examples of `<gpt3_response_dir>`, `<input_prompts_dir>`, and `<processed_response_dir>` can be found at `output_example/vlep_test/gpt3_response`, `output_example/vlep_test/input_prompts`, and `output_example/vlep_test/gpt3_response_processed`.

     ```bash
     python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
     ```

  2. Run the following script to generate the output in the official format for CodaLab submission. An example of the output jsonl can be found at `output_example/vlep_test/evaluation/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice_eval.jsonl`.

     ```bash
     python eval_vlep.py --gpt3_processed_response <processed_response_json> --output_path <output_jsonl_path>
     ```
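Conceptually, the conversion in step two maps each example's processed multichoice answer to one submission record. The sketch below is a hypothetical illustration — the function name and the `example_id`/`pred_ans_idx` field names are assumptions, not necessarily the exact fields `eval_vlep.py` emits; check the example output jsonl above for the real format:

```python
def to_vlep_submission(processed):
    """Hypothetical sketch: map {example_id: chosen_option_index} from the
    processed GPT-3 responses to one submission record per example.
    Field names are illustrative assumptions."""
    return [
        {"example_id": ex_id, "pred_ans_idx": int(answer_idx)}
        for ex_id, answer_idx in processed.items()
    ]

rows = to_vlep_submission({"vlep_00012": "1"})
print(rows[0])
```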
Citation
@article{wang2022language,
title={Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners},
author={Wang, Zhenhailong and Li, Manling and Xu, Ruochen and Zhou, Luowei and Lei, Jie and Lin, Xudong and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Hoiem, Derek and others},
journal={arXiv preprint arXiv:2205.10747},
year={2022}
}
## Acknowledgement
The implementation of VidIL relies on resources from BLIP, ALPRO, transformers. We thank the original authors for their open-sourced code and encourage users to cite their works when applicable.
