# MRNet
Source code for our ACM MM 2024 paper *Maskable Retentive Network for Video Moment Retrieval*.
**Task example:** The goal of both MR tasks, NLMR (natural language moment retrieval) and SLMR (spoken language moment retrieval), is to predict the temporal boundaries $(\tau_{start}, \tau_{end})$ of the target moment described by a given query $q$ (text or audio modality).
<p align="center"> <img src="./assets/task_new1.png" width="70%"> </p>

Two important characteristics:
1) Temporal association between video clips: the temporal correlation between two video clips weakens as they get farther apart;
2) Redundant background interference: the background contains much redundant information that can interfere with recognizing the current event, and this redundancy is even worse in long videos.
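Predicted boundaries in MR tasks are conventionally scored against the ground truth with temporal IoU (a generic metric, not code from this repo; the repo's evaluation scripts may differ in detail):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) moments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A prediction overlapping the ground truth by 5s out of a 15s span:
print(temporal_iou((0.0, 10.0), (5.0, 15.0)))  # -> 0.3333333333333333
```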
## Approach
The architecture of the Maskable Retentive Network (MRNet). We employ modality-specific attention modes: we set Unlimited Attention for language-related attention regions to maximize cross-modal mutual guidance, and perform a new Maskable Retention on the video-to-video branch $\mathcal{A}(v\to v)$ for enhanced video sequence modeling.
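To illustrate the intuition (not the paper's exact formulation), retention-style attention weights decay exponentially with clip distance, $D_{nm} = \gamma^{n-m}$ for $n \ge m$, matching characteristic 1) above; "maskable" additionally drops background clips from attention, matching characteristic 2). A minimal causal sketch, where the `masked` indices stand in for detected background clips (the detection mechanism itself is not shown):

```python
def maskable_decay(num_clips, gamma=0.9, masked=()):
    """Lower-triangular decay mask D[n][m] = gamma**(n-m), with rows and
    columns of background clip indices zeroed out (the "maskable" part)."""
    D = [[gamma ** (n - m) if n >= m else 0.0 for m in range(num_clips)]
         for n in range(num_clips)]
    for i in masked:                      # remove background clips entirely
        for j in range(num_clips):
            D[i][j] = 0.0
            D[j][i] = 0.0
    return D

# 4 clips, clip 2 treated as background:
D = maskable_decay(4, gamma=0.5, masked=(2,))
```

Distant pairs like `D[3][0]` get a small weight (`0.125` here), while the background clip contributes nothing to any position.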
<div align="center"> <img src="./assets/main_model.png" alt="Approach" width="800" height="210"> </div>

## Download and prepare the datasets
1. Download the datasets (Optional).

   - The video features provided by 2D-TAN:
     - ActivityNet Captions C3D feature
     - TACoS C3D feature
   - The video I3D feature of the Charades-STA dataset from LGI:

     ```bash
     wget http://cvlab.postech.ac.kr/research/LGI/charades_data.tar.gz
     tar zxvf charades_data.tar.gz
     mv charades data
     rm charades_data.tar.gz
     ```

   - The Audio Captions: ActivityNet Speech Dataset, i.e., the original audio proposed by VGCL.

2. For convenience, the extracted input data features can be downloaded directly from baiduyun, passcode: 5bwp.
3. Text and audio feature extraction (Optional).

   ```bash
   cd preprocess
   python text_encode.py
   python audio_encode.py
   ```
4. Set your own dataset paths in `ret/config/paths_catalog.py`.
5. Or prepare the files in the following structure (Optional).
MRNet
├── configs
├── dataset
├── ret
├── data
│ ├── activitynet
│ │ ├── *text features
│ │ ├── *audio features
│ │ └── *video c3d features
│ ├── charades
│ │ ├── *text features
│ │ └── *video i3d features
│ └── tacos
│ ├── *text features
│ └── *video c3d features
├── train_net.py
├── test_net.py
└── ···
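Before training, a quick sanity check that the layout above is in place can save a failed run. This is a hypothetical helper (not part of the repo), checking only the top-level entries shown in the tree:

```python
import os

REQUIRED = [
    "data/activitynet",
    "data/charades",
    "data/tacos",
    "train_net.py",
    "test_net.py",
]

def missing_paths(root, required=REQUIRED):
    """Return the required paths that do not exist under root."""
    return [p for p in required if not os.path.exists(os.path.join(root, p))]

print(missing_paths("."))  # an empty list means the layout is complete
```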
## Dependencies

```bash
pip install yacs h5py terminaltables tqdm librosa transformers
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
```
## Training

### ActivityNet

```bash
python train_net.py --config-file checkpoints/best/activity/config.yml
```
### TACoS

`cd ret/modeling/ret_model`, then copy the code in file `ret_model_tacos.py` to file `ret_model.py`.

```bash
python train_net.py --config-file checkpoints/best/tacos/config.yml
```
### Charades

```bash
python train_net.py --config-file checkpoints/best/charades/config.yml
```
## Testing

### ActivityNet

- Download the model weight file from Google Drive to the `checkpoints/best/activity` folder.

```bash
python test_net.py --config-file checkpoints/best/activity/config.yml --ckpt checkpoints/best/activity/pool_model_14.pth
```
### TACoS

- Download the model weight file from Google Drive to the `checkpoints/best/tacos` folder.
- `cd ret/modeling/ret_model`, then copy the code in file `ret_model_tacos.py` to file `ret_model.py`.

```bash
python test_net.py --config-file checkpoints/best/tacos/config.yml --ckpt checkpoints/best/tacos/pool_model_110e.pth
```
### Charades

The weights have been lost 🤦‍♂️. For reproduction or evaluation, please train the model yourself.
## LICENSE

The annotation files and many parts of the implementation are borrowed from MMN. Our code is under the MIT license.
