SwinBERT

<img src="docs/swinbert-overview.png" width="650">

This is our research code for the CVPR 2022 paper: SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning.

We present SwinBERT, an end-to-end transformer-based model for video captioning. SwinBERT takes video frame patches directly as inputs, and outputs a natural language description. In this repository, we provide our research code for training and testing SwinBERT for video captioning.

News

  • 05/05/2022: Initial release

Released items

  • [x] Training and evaluation code
  • [x] Inference code
  • [x] Models and training logs
  • [x] TSV dataset annotations
  • [ ] Tutorial for Frame-based TSV generation

Model Card

  • We release our best-performing checkpoints for each dataset (corresponding to Table 1 in our paper). For clarity, we report performance on both validation and test splits below.

  • We also report our results on private test splits, where the scores are obtained from the VALUE Leaderboard Evaluation Server.

| Dataset  | Checkpoint | CIDEr (val split) | CIDEr (test split) | CIDEr (private test split) |
| -------- | :--------: | :---------------: | :----------------: | :------------------------: |
| VATEX    | URL        | 84.4              | 73.0               | 74.35                      |
| MSRVTT   | URL        | 55.1              | 53.8               | N/A                        |
| MSVD     | URL        | 160               | 120.6              | N/A                        |
| TVC      | URL        | 57.0              | N/A                | 49.74                      |
| YouCook2 | URL        | 109               | N/A                | 101.39                     |

  • We also release our 32-frame model below.

| Dataset  | Checkpoint | CIDEr (val split) | CIDEr (test split) | CIDEr (private test split) |
| -------- | :--------: | :---------------: | :----------------: | :------------------------: |
| VATEX    | URL        | 82.1              | 71.6               | 73.06                      |
| MSRVTT   | URL        | 55.1              | 53.8               | N/A                        |
| MSVD     | URL        | 147.6             | 109.4              | N/A                        |
| TVC      | URL        | 53.8              | N/A                | 47.6                       |
| YouCook2 | URL        | 104.8             | N/A                | 97.69                      |

  • Note: All results are based on a single model; no CIDEr optimization was used in our experiments.

Requirements

We provide a Docker image for easier reproduction. Please install Docker (with NVIDIA container support) before proceeding.

We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended. Our scripts require the user to have docker group membership so that docker commands can be run without sudo.
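As a quick sanity check for the docker-group requirement above, a small shell snippet (assuming a standard Linux setup with `id` and `grep` available) can report whether the current user is in the `docker` group:

```shell
# Check whether the current user belongs to the "docker" group,
# which lets docker commands run without sudo.
if id -nG | grep -qw docker; then
    echo "docker group membership: OK"
else
    echo "not in docker group; run 'sudo usermod -aG docker \$USER' and re-login"
fi
```

Note that group changes only take effect after logging out and back in.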

Download

  1. Create folders that store pretrained models, datasets, and predictions.

    export REPO_DIR=$PWD
    mkdir -p $REPO_DIR/models  # pre-trained models
    mkdir -p $REPO_DIR/datasets  # datasets
    mkdir -p $REPO_DIR/predictions  # prediction outputs
    
  2. Download pretrained models.

    Our pre-trained models can be downloaded with the following command.

    cd $REPO_DIR
    bash scripts/download_models.sh
    

    The script will download our models that are trained for VATEX, MSRVTT, MSVD, TVC and YouCook2, respectively. It will also download our training logs and output predictions.

    The resulting data structure should follow the hierarchy below.

    ${REPO_DIR}  
    |-- models  
    |   |-- table1
    |   |   |-- vatex
    |   |   |   |-- best-checkpoint
    |   |   |   |   |-- model.bin
    |   |   |   |   |-- optmizer_state.bin
    |   |   |   |   |-- pred.*
    |   |   |   |-- tokenizer
    |   |   |   |   |-- added_tokens.json
    |   |   |   |   |-- special_tokens_map.json
    |   |   |   |   |-- vocab.txt
    |   |   |   |-- log
    |   |   |   |   |-- log.txt
    |   |   |   |   |-- args.json
    |   |   |-- msrvtt
    |   |   |-- msvd
    |   |   |-- tvc
    |   |   |-- youcook2
    |   |-- 32frm
    |   |   |-- vatex
    |   |   |   |-- best-checkpoint
    |   |   |   |   |-- model.bin
    |   |   |   |   |-- optmizer_state.bin
    |   |   |   |   |-- pred.*
    |   |   |   |-- tokenizer
    |   |   |   |   |-- added_tokens.json
    |   |   |   |   |-- special_tokens_map.json
    |   |   |   |   |-- vocab.txt
    |   |   |   |-- log
    |   |   |   |   |-- log.txt
    |   |   |   |   |-- args.json
    |   |   |-- msrvtt
    |   |   |-- msvd
    |   |   |-- tvc
    |   |   |-- youcook2
    |-- docs 
    |-- src
    |-- scripts 
    |-- README.md 
    |-- ... 
    |-- ... 
    
  3. Download pretrained Video Swin Transformers.

    To run our code smoothly, please visit Video Swin Transformer to download the pre-trained weights.

    Download swin_base_patch244_window877_kinetics*_22k.pth and place the files under the ${REPO_DIR}/models/video_swin_transformer directory. The data structure should follow the hierarchy below.

    ${REPO_DIR}  
    |-- models  
    |   |-- video_swin_transformer
    |   |   |-- swin_base_patch244_window877_kinetics600_22k.pth
    |   |   |-- swin_base_patch244_window877_kinetics400_22k.pth
    |   |-- table1
    |   |-- 32frm
    |-- docs 
    |-- src
    |-- scripts 
    |-- README.md 
    |-- ... 
    |-- ... 
    
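The placement step above can be scripted. This is a minimal sketch, assuming the two `.pth` files were downloaded into the current directory; any file not present is simply skipped:

```shell
export REPO_DIR=$PWD
# Create the directory the code expects (path from the hierarchy above).
mkdir -p "$REPO_DIR/models/video_swin_transformer"

# Move the downloaded Video Swin checkpoints into place.
for ckpt in swin_base_patch244_window877_kinetics600_22k.pth \
            swin_base_patch244_window877_kinetics400_22k.pth; do
    if [ -f "$ckpt" ]; then
        mv "$ckpt" "$REPO_DIR/models/video_swin_transformer/"
    fi
done
```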
  4. Download the prediction files that were evaluated on the VALUE Leaderboard Evaluation Server

    The prediction files can be downloaded with the following command.

    cd $REPO_DIR
    bash scripts/download_value_preds.sh
    

    You can submit the prediction files to the VALUE Leaderboard to reproduce our results.

  5. Download datasets for training and evaluation

    In this project, we provide our pre-parsed annotation files in TSV format. To download the files, please use the following command.

    cd $REPO_DIR
    bash scripts/download_annotations.sh
    

    Following prior studies, we use the standard train/val/test splits for each dataset. Here, we just reorganize the data format in TSV files to better fit our codebase.

    Due to copyright issues, we cannot release the raw videos. We suggest downloading the original raw videos from the official dataset websites. Please place the downloaded videos under the raw_videos or videos folder of each dataset.

    The datasets directory structure should follow the hierarchy below.

    ${REPO_DIR}  
    |-- datasets  
    |   |-- VATEX  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- raw_videos  <<< please place the downloaded videos under this folder 
    |   |   |   |-- val_all
    |   |   |   |   |-- *.mp4
    |   |   |   |-- holdout_test
    |   |   |   |   |-- test
    |   |   |   |   |   |-- *.mp4
    |   |-- MSRVTT-v2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- videos <<< please place the downloaded videos under this folder 
    |   |   |   |-- *.mp4 
    |   |-- MSVD  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- videos <<< please place the downloaded videos under this folder 
    |   |   |   |-- *.avi 
    |   |-- TVC  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- videos <<< please place the downloaded videos under this folder 
    |   |   |   |-- bbt_new
    |   |   |   |-- castle
    |   |   |   |-- friends
    |   |   |   |-- grey
    |   |   |   |-- house
    |   |   |   |-- met 
    |   |-- YouCook2  
    |   |   |-- *.yaml 
    |   |   |-- *.tsv  
    |   |   |-- training <<< please place the downloaded training videos under this folder 
    |   |   |   |-- *.mp4 
    |   |   |-- validation <<< please place the downloaded validation videos under this folder 
    |   |   |   |-- *.mp4 
    |   |   |-- testing <<< please place the downloaded testing videos under this folder 
    |   |   |   |-- *.mp4 
    |-- docs
    |-- src
    |-- scripts
    |-- models 
    |-- README.md 
    |-- ... 
    |-- ... 
    
    

    We also provide example scripts to reproduce our annotation TSV files. You can find them at the locations below.

    ${REPO_DIR}  
    |-- prepro  
    |   |-- tsv_preproc_vatex.py
    |   |-- tsv_preproc_msrvtt.py
    |   |-- tsv_preproc_msvd.py
    |   |-- tsv_preproc_tvc.py
    |   |-- tsv_preproc_youcook2.py
    |-- docs
    |-- src
    |-- scripts
    |-- README.md 
    |-- ... 
    |-- ... 
    
    
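After placing the raw videos, a quick per-folder count helps confirm the layout. This is a hedged sketch: the folder names are taken from the hierarchy above, and a count of 0 simply means that dataset has not been downloaded yet.

```shell
export REPO_DIR=$PWD
# Count video files under each dataset's video folder.
for dir in datasets/VATEX/raw_videos datasets/MSRVTT-v2/videos \
           datasets/MSVD/videos datasets/TVC/videos \
           datasets/YouCook2/training datasets/YouCook2/validation \
           datasets/YouCook2/testing; do
    count=$(find "$REPO_DIR/$dir" -type f \
                 \( -name '*.mp4' -o -name '*.avi' \) 2>/dev/null | wc -l)
    echo "$dir: $count video(s)"
done
```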

Before Running Code: Launch Docker Container

We provide a Docker image for easier reproduction. Please launch the Docker container before running our code.

export REPO_DIR=$PWD
DATASETS=$REPO_DIR'/datasets/'
MODELS=$REPO_DIR'/models/'
OUTPUT_DI