# VidIL: Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners (NeurIPS 2022)

PyTorch code for [Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners](https://arxiv.org/abs/2205.10747).

<img src="vidIL.gif" width="1200">

## Download Datasets & Checkpoints
- Download the dataset annotations zip from Box or Google Drive, then unzip the downloaded datasets under `shared_datasets/`. The resulting `shared_datasets/` folder structure is expected to be:

  ```
  shared_datasets
  ├── README.md
  ├── MSRVTT_caption
  ├── MSRVTT_qa
  ...
  ```

  Then, please refer to the Dataset Instruction for downloading and processing raw videos.
- Download the BLIP checkpoints:

  ```bash
  bash download_blip_checkpoints.sh
  ```
- Download the Input & Output Examples zip from Box or Google Drive. Unzip the folders under `output_example/`; the resulting `output_example/` folder structure is expected to be:

  ```
  output_example
  ├── msrvtt
  ├── msvd_test
  ├── vlep_test
  └── README.md
  ```

- [Update 6/17] GPT-3 results for Video Captioning, Video Question Answering, and VLEP can be downloaded here.
## Set Up Environment

- Launch the Docker environment:
  1. Set the variables `CKPT` and `DATASETS` as commented in `run_docker_vidil.sh`.
  2. Run the Docker image:

     ```bash
     bash run_docker_vidil.sh
     ```

- Set up GPU devices: within the Docker image, set the following environment variables to configure GPU devices:

  ```bash
  export N_GPU=<num of gpus>
  export CUDA_VISIBLE_DEVICES=<0,1,2...>
  ```
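The two variables should agree with each other. As a minimal sketch (the helper name `gpu_env_summary` is hypothetical, not part of the repo), this is the consistency the pipeline scripts assume:

```python
import os

def gpu_env_summary(environ=os.environ):
    """Hypothetical helper: read the two variables the pipeline scripts
    expect and check that they agree with each other."""
    n_gpu = int(environ.get("N_GPU", "0"))
    devices = [d for d in environ.get("CUDA_VISIBLE_DEVICES", "").split(",") if d]
    if n_gpu != len(devices):
        raise ValueError(
            f"N_GPU={n_gpu} but CUDA_VISIBLE_DEVICES lists {len(devices)} devices"
        )
    return {"n_gpu": n_gpu, "devices": devices}

# Example: two visible GPUs
print(gpu_env_summary({"N_GPU": "2", "CUDA_VISIBLE_DEVICES": "0,1"}))
```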
## Generate Video Representation & GPT-3 Prompt

- [Update 6/15] Quick start with generated video representations: frame captions and visual tokens for five datasets can be downloaded here if you don't want to go through the entire pipeline. You can copy the json files following the data structure described below.

The following scripts run the entire pipeline, which (1) generates frame captions; (2) generates visual tokens; and (3) generates few-shot prompts ready for GPT-3. The output folder has the following structure:
```
{dataset_split}
├── frame_caption
│   ├── config.yaml                         # config for frame captioning
│   ├── video_text_Cap.json                 # frame captions w/o filtering
│   └── video_text_CapFilt.json             # frame captions w/ filtering
├── input_prompts
│   ├── {output_name}.jsonl                 # few-shot prompts for GPT-3
│   ├── {output_name}__idx_2_videoid.json   # line idx to video id
│   ├── {output_name}__chosen_samples.json  # chosen examples in the support set
│   ...
└── visual_tokenization_{encoder_name}
    ├── config.yaml                         # config for visual tokenization
    └── visual_tokens.json                  # raw visual tokens of each frame
```
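A quick way to sanity-check a generated output folder against this layout is a small walk over its sub-directories. This is a hypothetical helper (`check_pipeline_output` is not part of the repo), sketched only from the structure shown above:

```python
from pathlib import Path
import tempfile

EXPECTED = ["frame_caption", "input_prompts"]  # visual_tokenization_* is checked by prefix

def check_pipeline_output(root):
    """Hypothetical sanity check: confirm a {dataset_split} output folder
    contains the sub-directories described in the README tree above.
    Returns the list of missing entries (empty means complete)."""
    root = Path(root)
    missing = [d for d in EXPECTED if not (root / d).is_dir()]
    has_vis_tok = any(
        p.name.startswith("visual_tokenization_") for p in root.iterdir() if p.is_dir()
    )
    if not has_vis_tok:
        missing.append("visual_tokenization_<encoder_name>")
    return missing

# Demonstrate on a fake output directory
with tempfile.TemporaryDirectory() as tmp:
    for d in ["frame_caption", "input_prompts", "visual_tokenization_clip"]:
        (Path(tmp) / d).mkdir()
    print(check_pipeline_output(tmp))  # → []
```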
All scripts should be run from the `/src` directory, i.e., the root directory after launching the Docker image. The following are examples of running the pipeline with in-context example selection for several datasets. For additional notes on running pipeline scripts, please refer to the Pipeline Instruction.
### Standalone Pipeline for Frame Captioning and Visual Tokenization

Since we need to sample a few-shot support set from the training sets, the first time the pipeline is run for each dataset, we need to perform frame captioning and visual tokenization on the training set.

For `<dataset>` in `["msrvtt","youcook2","vatex","msvd","vlep"]`:

```bash
bash pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh <dataset> train <output_root>
```

An example of the frame caption and visual token directories can be found at `output_example/msrvtt/frame_caption` and `output_example/msrvtt/visual_tokenization_clip`.
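The per-dataset loop above can be expanded into the exact shell commands it produces. A minimal sketch (`build_commands` is a hypothetical helper; the script path is taken from this README):

```python
# Hypothetical sketch: expand the "for <dataset> in [...]" instruction above
# into the concrete shell commands it corresponds to.
DATASETS = ["msrvtt", "youcook2", "vatex", "msvd", "vlep"]
SCRIPT = "pipeline/scripts/run_frame_captioning_and_visual_tokenization.sh"

def build_commands(output_root, datasets=DATASETS):
    """Return one command string per dataset, always on the train split."""
    return [f"bash {SCRIPT} {d} train {output_root}" for d in datasets]

for cmd in build_commands("output/"):
    print(cmd)
```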
### Video Captioning

For `<dataset>` in `["msrvtt","youcook2","vatex"]`:

1. Run the standalone frame captioning and visual tokenization pipeline for the chosen `<dataset>`.
2. Run the pipeline for generating video captioning prompts for `<dataset>`, with `<split>` in `["train","val","test"]`:
   - w/o ASR:

     ```bash
     bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection.sh <dataset> <split> <output_root> 10 42 5 caption
     ```

   - w/ ASR:

     ```bash
     bash pipeline/scripts/generate_gpt3_query_pipeline_caption_with_in_context_selection_with_asr.sh <dataset> <split> <output_root> 10 42 5 caption_asr
     ```

An example of the output prompt jsonl can be found at `output_example/msrvtt/input_prompts/temp_0.0_msrvtt_caption_with_in_context_selection_clip_shot_10_seed_42_N_5.jsonl`.
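Each line of the prompt `.jsonl` can be mapped back to its source video via the `{output_name}__idx_2_videoid.json` file described in the folder structure above. A minimal sketch, assuming only that mapping (the `attach_video_ids` helper and the record fields are hypothetical, not the repo's schema):

```python
def attach_video_ids(prompt_lines, idx_2_videoid):
    """Hypothetical sketch: pair each prompt line (index i in the .jsonl)
    with its video id via the {output_name}__idx_2_videoid.json mapping.
    Keys in that mapping are assumed to be string line indices."""
    return [
        {"video_id": idx_2_videoid[str(i)], "prompt": line}
        for i, line in enumerate(prompt_lines)
    ]

# Toy example with a two-line prompt file
mapping = {"0": "video101", "1": "video207"}
pairs = attach_video_ids(["prompt A", "prompt B"], mapping)
print(pairs[1]["video_id"])  # → video207
```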
### Video Question Answering

For `<dataset>` in `["msrvtt","msvd"]`:

1. Run the standalone frame captioning and visual tokenization pipeline for the chosen `<dataset>`.
2. Run the pipeline for generating video question answering prompts for `<dataset>`, with `<split>` in `["train","val","test"]`:

   ```bash
   bash pipeline/scripts/generate_gpt3_query_pipeline_qa_with_in_context_selection.sh <dataset> <split> <output_root> 5 42 5 question
   ```

An example of the output prompt jsonl can be found at `output_example/msvd_test/input_prompts/temp_0.0_gpt3_queries_msvd_qa_clip_shot_5_seed_42.jsonl`.
### Video-Language Event Prediction (VLEP)

1. Run the standalone frame captioning and visual tokenization pipeline for `vlep`.
2. Run the pipeline for generating VLEP prompts:

   ```bash
   bash pipeline/scripts/generate_gpt3_query_pipeline_vlep_with_random_context_asr_multichoice.sh <dataset> <split> <output_root> 10 42
   ```

An example of the output prompt jsonl can be found at `output_example/vlep_test/input_prompts/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice.jsonl`.
### Semi-Supervised Text-Video Retrieval

In the semi-supervised setting, we first generate pseudo labels on the training set, and then train BLIP on the pseudo-labeled dataset for retrieval.

1. Generate the pseudo-labeled training set annotation jsonl. Suppose the raw GPT-3 responses are stored at `<gpt3_response_dir>` and the input prompt directory is at `<input_prompts_dir>`; run:

   ```bash
   python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
   python utils_gpt3/gpt3_response_to_jsonl.py --dataset <dataset_name> --gpt3_processed_dir <processed_response_dir> --output_dir <pseudo_label_ann_output_dir>
   ```

   Examples of `<gpt3_response_dir>`, `<input_prompts_dir>`, `<processed_response_dir>`, and `<pseudo_label_ann_output_dir>` can be found at `output_example/msrvtt/gpt3_response`, `output_example/msrvtt/input_prompts`, `output_example/msrvtt/processed_response_dir`, and `output_example/msrvtt/pseudo_label_ann`.

2. Finetune pretrained BLIP on the pseudo-labeled data. For `<dataset>` in `["msrvtt","vatex"]`, set the field named `train_ann_jsonl` in `configs/train_blip_video_retrieval_<dataset>_pseudo.yaml` to the path of the output jsonl from step one in `<pseudo_label_ann_output_dir>`. Then run:

   ```bash
   bash scripts/train_caption_video.sh train_blip_video_retrieval_<dataset>_pseudo
   ```
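The pseudo-label conversion in step one boils down to writing one annotation record per video, with the GPT-3 generated caption serving as the pseudo label. A minimal sketch of that idea — the function name and the `{"video", "caption"}` field names are assumptions for illustration, not the exact schema produced by `utils_gpt3/gpt3_response_to_jsonl.py`:

```python
import json

def responses_to_pseudo_labels(processed_responses):
    """Hypothetical sketch: turn a {video_id: generated_caption} dict into
    jsonl text, one record per video, caption used as the pseudo label.
    Field names are illustrative assumptions."""
    lines = []
    for video_id, caption in processed_responses.items():
        lines.append(json.dumps({"video": video_id, "caption": caption.strip()}))
    return "\n".join(lines)

jsonl = responses_to_pseudo_labels({"video1": " a man cooking pasta "})
print(jsonl)
```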
## Evaluation

Scripts for evaluating generation results from GPT-3:

- Video Captioning: please refer to the example written in the script for more details about the required inputs.

  ```bash
  bash scripts/evaluation/eval_caption_from_gpt3_response.sh
  ```

- Question Answering: please refer to the example written in the script for more details about the required inputs.

  ```bash
  bash scripts/evaluation/eval_qa_from_gpt3_response.sh
  ```

- VLEP:
  1. Get the processed GPT-3 responses. Examples of `<gpt3_response_dir>`, `<input_prompts_dir>`, and `<processed_response_dir>` can be found at `output_example/vlep_test/gpt3_response`, `output_example/vlep_test/input_prompts`, and `output_example/vlep_test/gpt3_response_processed`.

     ```bash
     python utils_gpt3/process_gpt3_response.py --gpt3_response_dir <gpt3_response_dir> --input_prompts_dir <input_prompts_dir> --output_dir <processed_response_dir>
     ```

  2. Run the following script to generate the output in the official format for CodaLab submission. An example of the output jsonl can be found at `output_example/vlep_test/evaluation/temp_0.0_vlep_test_clip_shot_10_seed_42_multichoice_eval.jsonl`.

     ```bash
     python eval_vlep.py --gpt3_processed_response <processed_response_json> --output_path <output_jsonl_path>
     ```
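Conceptually, the conversion in step two maps each example's processed multichoice answer to one submission record. The sketch below is a hypothetical illustration — the function name and the `example_id`/`pred_ans_idx` field names are assumptions, not necessarily the exact fields `eval_vlep.py` emits; check the example output jsonl above for the real format:

```python
def to_vlep_submission(processed):
    """Hypothetical sketch: map {example_id: chosen_option_index} from the
    processed GPT-3 responses to one submission record per example.
    Field names are illustrative assumptions."""
    return [
        {"example_id": ex_id, "pred_ans_idx": int(answer_idx)}
        for ex_id, answer_idx in processed.items()
    ]

rows = to_vlep_submission({"vlep_00012": "1"})
print(rows[0])
```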
Citation
@article{wang2022language,
title={Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners},
author={Wang, Zhenhailong and Li, Manling and Xu, Ruochen and Zhou, Luowei and Lei, Jie and Lin, Xudong and Wang, Shuohang and Yang, Ziyi and Zhu, Chenguang and Hoiem, Derek and others},
journal={arXiv preprint arXiv:2205.10747},
year={2022}
}
## Acknowledgement
The implementation of VidIL relies on resources from BLIP, ALPRO, transformers. We thank the original authors for their open-sourced code and encourage users to cite their works when applicable.
