I2Transformer

[IEEE TIP 2022] This is the PyTorch code for our paper "I2Transformer: Intra- and Inter-relation Embedding Transformer for TV Show Captioning".

I<sup>2</sup>Transformer: Intra- and Inter-relation Embedding Transformer for TV Show Captioning

This package contains the accompanying code for the following paper:

Tu, Yunbin, et al. "I<sup>2</sup>Transformer: Intra- and Inter-relation Embedding Transformer for TV Show Captioning." Published as a regular paper in IEEE TIP.

The training procedure is as follows:

1. Prepare feature files

Download tvc_feature_release.tar.gz (23GB); the extraction code is tvcf. After downloading the file, extract it to the data directory:

tar -xf path/to/tvc_feature_release.tar.gz -C data

You should now see video_feature under the data/tvc_feature_release directory. It contains video features (ResNet, I3D, ResNet+I3D). Please note that this code uses only the ResNet+I3D features.
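To confirm that the archive was extracted to the expected location, a small check like the following can help. This is a minimal sketch and not part of the repository: the function name find_feature_dir is hypothetical, and it only assumes the directory layout described above (data/tvc_feature_release/video_feature). It makes no assumption about the file names inside.

```python
import os


def find_feature_dir(data_root="data"):
    """Return the path to the extracted video features, or None if missing.

    Hypothetical helper; the path follows the layout described in this README.
    """
    feature_dir = os.path.join(data_root, "tvc_feature_release", "video_feature")
    return feature_dir if os.path.isdir(feature_dir) else None


if __name__ == "__main__":
    path = find_feature_dir()
    if path is None:
        print("video_feature not found; re-check the tar extraction step")
    else:
        # List whatever the archive contains; no file names are assumed here.
        for name in sorted(os.listdir(path)):
            print(name)
```

Run it from the project root so that the relative data path resolves correctly.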

2. Install dependencies:

  • Ubuntu 16.04
  • Python 2.7
  • PyTorch 1.1.0
  • nltk
  • easydict
  • tqdm
  • h5py
  • tensorboardX
  • An RTX 2080Ti
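Before training, it can be useful to check that the Python packages listed above are importable in the current environment. The sketch below is an illustration only and not part of the repository: missing_modules is a hypothetical helper, and the list contains the import names of the packages above ("torch" is the import name of PyTorch). It does not check versions.

```python
import importlib

# Import names of the Python packages listed in this README.
REQUIRED_MODULES = ["torch", "nltk", "easydict", "tqdm", "h5py", "tensorboardX"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing: " + (", ".join(missing) if missing else "none"))
```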

3. Add project root to PYTHONPATH

source setup.sh

Note that you need to do this each time you start a new session.

4. Clone the Microsoft COCO caption evaluation code, which is used to evaluate captions, and place it in the directory 'I2Transformer/standalone_eval/'

git clone https://github.com/tylin/coco-caption.git

5. Build Vocabulary

You can skip this step because the vocabulary file is already provided.

bash baselines/multimodal_transformer/scripts/build_vocab.sh

Running this command builds the vocabulary file cache/tvc_word2idx.json from the TVC train set.
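For readers unfamiliar with this step, the idea of a word-to-index vocabulary can be sketched as follows. This is not the repository's build_vocab.sh logic: the special tokens, the whitespace tokenization, the min_count parameter, and the toy captions are all assumptions made for illustration, and the real script may differ on each point.

```python
import json
from collections import Counter

# Hypothetical special tokens; the repository's actual tokens may differ.
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]


def build_word2idx(captions, min_count=1):
    """Map each word seen at least min_count times to an integer index."""
    counter = Counter()
    for caption in captions:
        # Simple lowercase whitespace tokenization (an assumption).
        counter.update(caption.lower().split())
    word2idx = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    # Sort by descending frequency, then alphabetically, so the mapping
    # is deterministic across runs.
    for word, count in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count and word not in word2idx:
            word2idx[word] = len(word2idx)
    return word2idx


if __name__ == "__main__":
    # Made-up captions, used only to show the output shape.
    toy_captions = ["a man opens the door", "the man sits"]
    print(json.dumps(build_word2idx(toy_captions), indent=2))
```

Words below the frequency threshold would be mapped to the unknown token at encoding time.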

6. I<sup>2</sup>Transformer training

bash baselines/multimodal_transformer/scripts/train.sh video_sub resnet_i3d

This code loads all the data (~30GB) into RAM to speed up training. Use --no_core_driver to disable this behavior.

Training with the above config stops at around epoch 22, which takes about 7 hours on a single 2080Ti GPU. You should get ~47.2 CIDEr and ~11.4 BLEU@4 on the val set. The resulting model and config are saved in a directory matching: baselines/multimodal_transformer/results/video_sub-res-*
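Because the result directory name ends in a wildcard, it can be convenient to look up the most recent run programmatically. The following is a minimal sketch, not part of the repository: latest_result_dir is a hypothetical helper that picks the most recently modified directory matching the pattern above. It assumes that modification time is a reasonable proxy for "latest run".

```python
import glob
import os


def latest_result_dir(results_root="baselines/multimodal_transformer/results",
                      pattern="video_sub-res-*"):
    """Return the most recently modified matching directory, or None."""
    candidates = [p for p in glob.glob(os.path.join(results_root, pattern))
                  if os.path.isdir(p)]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


if __name__ == "__main__":
    result = latest_result_dir()
    print(result if result else "no result directory found")
```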

7. I<sup>2</sup>Transformer inference

After training, you can run inference with the saved model on the val or test_public set:

bash baselines/multimodal_transformer/scripts/translate.sh MODEL_DIR_NAME SPLIT_NAME

MODEL_DIR_NAME is the name of the directory containing the saved model, e.g., video_sub-res-*. SPLIT_NAME can be val or test_public.
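If inference is launched from a Python script, the two arguments can be validated before calling the shell script. This is a sketch under stated assumptions, not part of the repository: translate_command and run_translate are hypothetical helpers, and the only facts taken from this README are the script path and the two allowed split names.

```python
import subprocess

VALID_SPLITS = ("val", "test_public")
TRANSLATE_SCRIPT = "baselines/multimodal_transformer/scripts/translate.sh"


def translate_command(model_dir_name, split_name):
    """Build the argument list for translate.sh after checking the split."""
    if split_name not in VALID_SPLITS:
        raise ValueError("SPLIT_NAME must be one of: " + ", ".join(VALID_SPLITS))
    return ["bash", TRANSLATE_SCRIPT, model_dir_name, split_name]


def run_translate(model_dir_name, split_name):
    """Run translate.sh and raise CalledProcessError on a non-zero exit."""
    subprocess.check_call(translate_command(model_dir_name, split_name))


if __name__ == "__main__":
    # "MODEL_DIR_NAME" is a placeholder; replace it with your result directory.
    print(" ".join(translate_command("MODEL_DIR_NAME", "val")))
```

Passing the command as a list avoids shell quoting problems if the directory name contains unusual characters.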

8. Our results

The generated captions and evaluation scores on the val and test_public sets are in the directory 'our_results'.

Citing

If you find this work helpful for your research, please consider citing:

@article{tu2022i2transformer,
  title={I2Transformer: Intra- and Inter-relation Embedding Transformer for TV Show Captioning},
  author={Tu, Yunbin and Li, Liang and Su, Li and Gao, Shengxiang and Yan, Chenggang and Zha, Zheng-Jun and Yu, Zhengtao and Huang, Qingming},
  journal={IEEE Transactions on Image Processing},
  year={2022},
  publisher={IEEE}
}

Contact

My email is tuyunbin1995@foxmail.com

Any discussions and suggestions are welcome!

Acknowledgement

This work and code are inspired by TVCaption. Thanks for their solid work!
