# Multimodal Intent Discovery from Livestream Videos

PyTorch code for the Findings of NAACL 2022 paper "Multimodal Intent Discovery from Livestream Videos".
## Requirements

This code has been tested with `torch==1.9.0` and `transformers==4.3.2`. In addition, `moviepy` is required for splicing videos.
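Assuming a pip-based environment, the pinned versions can be installed with, e.g., `pip install torch==1.9.0 transformers==4.3.2 moviepy`.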
## Data
We release two datasets with this paper:

- **Behance Intent Discovery Dataset**: a dataset of ~20K sentences with manual annotations for tool and creative intents (see paper), each accompanied by a timestamp for the livestream video it was taken from. The files are available in the `./data/bid/` folder. Use `./scripts/download_videos.py` to download the videos and splice them at the timestamps present in the dataset (a minimal splicing sketch is shown after this list). We follow the HERO paper for extracting video representations; see the HERO repository for the extraction code.
- **Behance Livestreams Corpus**: a larger unlabelled corpus containing nearly 8K full-length videos and their respective transcripts (download scripts coming soon).
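For reference, splicing with `moviepy` looks roughly like the sketch below. This is a minimal illustration only: the file names, timestamps, and the `splice_clip` helper are hypothetical, and the repository's `./scripts/download_videos.py` implements the actual download-and-splice pipeline.

```python
# Minimal splicing sketch using moviepy 1.x; file names and timestamps are hypothetical.
from moviepy.editor import VideoFileClip

def splice_clip(video_path: str, start_sec: float, end_sec: float, out_path: str) -> None:
    """Cut the [start_sec, end_sec] segment out of a downloaded livestream video."""
    with VideoFileClip(video_path) as clip:
        clip.subclip(start_sec, end_sec).write_videofile(out_path)

# E.g., a sentence annotated with a 12.5s-37.0s span in its source video:
splice_clip("livestream_0001.mp4", 12.5, 37.0, "livestream_0001_clip_000.mp4")
```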
## Models
The scripts for training the models presented in the paper are available under `./model/`.
To train the unimodal RoBERTa model on the Behance Intent Discovery dataset, run:

```bash
bash behance_unimodal.sh <GPU_ID>
```
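For example, `bash behance_unimodal.sh 0` runs training on GPU 0.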
To train the multimodal late fusion RoBERTa model on the Behance Intent Discovery dataset, run:

```bash
bash behance_late_fusion.sh <feature_type> <path_to_feature_directory> <GPU_ID>
```
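For example, a hypothetical invocation with HERO-style features stored under `./features/hero` would be `bash behance_late_fusion.sh hero ./features/hero 0`; substitute the feature type and directory for the features you extracted.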
Docker containers for training the HERO + Late Fusion and ClipBERT + Late Fusion models are coming soon.
## Acknowledgement

The code in this repository has been adapted from the BOND and HERO codebases.
