CGDETR
Official PyTorch repository for CG-DETR "Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding"
CG-DETR : Calibrating the Query-Dependency of Video Representation via Correlation-guided Attention for Video Temporal Grounding
Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding
WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo <br> Sungkyunkwan University
Arxiv
🔖 Abstract
Recent endeavors in video temporal grounding enforce strong cross-modal interactions through attention mechanisms to overcome the modality gap between video and text query. However, previous works treat all video clips equally regardless of their semantic relevance with the text query in attention modules. In this paper, our goal is to provide clues for query-associated video clips within the cross-modal encoding process. With our Correlation-Guided Detection Transformer (CG-DETR), we explore the appropriate clip-wise degree of cross-modal interactions and how to exploit such degrees for prediction. First, we design an adaptive cross-attention layer with dummy tokens. Dummy tokens conditioned by text query take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet, not all word tokens equally inherit the text query's correlation to video clips. Thus, we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., moment and sentence level, and inferring the clip-word correlation. Lastly, we use a moment-adaptive saliency detector to exploit each video clip's degrees of text engagement. We validate the superiority of CG-DETR with the state-of-the-art results on various benchmarks for both moment retrieval and highlight detection.
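As a rough illustration of the dummy-token idea described in the abstract, the sketch below computes the attention one video clip pays to the words of a query when extra dummy keys share the same softmax. It is a toy, pure-Python example and not the repository's implementation: the function names and all scores are hypothetical, the dummy scores are hard-coded rather than learned and conditioned on the query, and it assumes the dummy part of the attention is simply dropped afterwards.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_attention_with_dummies(word_scores, dummy_scores):
    """Attention one clip pays to the words when dummy tokens share the softmax.

    The softmax is taken over dummy and word keys together, then the dummy
    part is dropped, so the returned weights can sum to less than 1.
    """
    weights = softmax(list(dummy_scores) + list(word_scores))
    return weights[len(dummy_scores):]


# Hypothetical clip-to-key similarity scores (not taken from the paper).
dummy_scores = [1.0, 1.0]
relevant_clip = word_attention_with_dummies([3.0, 2.5, 2.0], dummy_scores)
irrelevant_clip = word_attention_with_dummies([-1.0, -1.5, -2.0], dummy_scores)

print(f"text attention mass, relevant clip:   {sum(relevant_clip):.3f}")
print(f"text attention mass, irrelevant clip: {sum(irrelevant_clip):.3f}")
```

With these made-up scores, the clip that matches the query keeps most of its attention on the words, while the unrelated clip gives most of its attention to the dummy tokens and is therefore barely represented by the text.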
📢 To be updated
Todo
- [x] : Upload instruction for dataset download
- [x] : Update model zoo
- [x] : Upload implementation
📑 Datasets
<b>QVHighlights</b> : Download official feature files for QVHighlights dataset from moment_detr_features.tar.gz (8GB).
tar -xf path/to/moment_detr_features.tar.gz
If inaccessible, then download from
<b> QVHighlight</b> 9.34GB. <br>
For other datasets, we provide extracted features:
<b> Charades-STA</b> 33.18GB. (Including SF+C and VGG features) <br> <b> TACoS </b> 290.7MB. <br> <b> TVSum </b> 69.1MB. <br> <b> Youtube </b> 191.7MB. <br>
After downloading, either prepare the data directory as below or change 'feat_root' in TVSum shell files under 'cg_detr/scripts/*/'.
.
├── CGDETR
│   ├── cg_detr
│   ├── data
│   ├── results
│   ├── run_on_video
│   ├── standalone_eval
│   └── utils
└── features
    ├── qvhighlight
    ├── charades
    ├── tacos
    ├── tvsum
    └── youtube_uni
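Before launching training, it may help to confirm that the feature folders are where the scripts expect them. The helper below is a hypothetical sketch and not part of the repository: it only checks that the sub-directories named in the tree above exist, and it assumes `features` sits next to the `CGDETR` folder.

```python
from pathlib import Path

# Sub-directory names copied from the tree above.
EXPECTED_FEATURE_DIRS = ["qvhighlight", "charades", "tacos", "tvsum", "youtube_uni"]


def missing_feature_dirs(feat_root):
    """Return the expected dataset folders that are absent under feat_root."""
    root = Path(feat_root)
    return [name for name in EXPECTED_FEATURE_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # "../features" assumes the script runs from inside the CGDETR folder.
    missing = missing_feature_dirs("../features")
    if missing:
        print("Missing feature folders:", ", ".join(missing))
    else:
        print("All feature folders found.")
```

If you only downloaded some of the datasets, the folders reported as missing are expected. Only the ones you plan to train on need to be present.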
🛠️ Installation
Python version 3.7 is required.
- Clone this repository.
git clone https://github.com/wjun0830/CGDETR.git
- Download the packages we used for training.
pip install -r requirements.txt
🚀 Training
We provide training scripts for all datasets in the cg_detr/scripts/ directory.
QVHighlights Training
Training can be executed by running the shell below:
bash cg_detr/scripts/train.sh
The best validation accuracy is reached at the last epoch.
Charades-STA
For training, run the shell below:
bash cg_detr/scripts/charades_sta/train.sh
bash cg_detr/scripts/charades_sta/train_vgg.sh
TACoS
For training, run the shell below:
bash cg_detr/scripts/tacos/train.sh
TVSum
For training, run the shell below:
bash cg_detr/scripts/tvsum/train_tvsum.sh
Best results are stored in 'results_[domain_name]/best_metric.jsonl'.
Youtube-hl
For training, run the shell below:
bash cg_detr/scripts/youtube_uni/train.sh
Best results are stored in 'results_[domain_name]/best_metric.jsonl'.
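To inspect the stored results for TVSum or Youtube-hl, the JSON Lines file can be read with a few lines of Python. This is a minimal sketch, not a script shipped with the repository: the helper names are hypothetical, and no assumption is made about which keys each record contains, so records are printed as-is.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def print_best_metrics(results_dir):
    """Print every record found in <results_dir>/best_metric.jsonl."""
    for record in read_jsonl(Path(results_dir) / "best_metric.jsonl"):
        print(json.dumps(record, indent=2, sort_keys=True))
```

Call `print_best_metrics` with the actual results folder of the domain you trained on, i.e. the directory that the `results_[domain_name]` pattern above resolves to.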
QVHighlights w/ Pretraining
Training can be executed by running the shell below:
bash cg_detr/scripts/train.sh --num_dummies 45 --num_prompts 1 --total_prompts 10 --max_q_l 75 --resume pt_checkpoints/model_e0009.ckpt --seed 2018
The pretrained checkpoint 'model_e0009.ckpt' is available here.
👀 QVHighlights Evaluation and Codalab Submission
Once the model is trained, hl_val_submission.jsonl and hl_test_submission.jsonl can be generated by running inference.sh.
Compress them into a single .zip file and submit the results.
bash cg_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'val'
bash cg_detr/scripts/inference.sh results/{direc}/model_best.ckpt 'test'
where {direc} is the directory under results/ that contains the saved checkpoint.
For more details, check standalone_eval/README.md.
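The compression step can also be done from Python. The snippet below is a sketch under assumptions: it supposes both prediction files were written to the same folder, and the archive name `submission.zip` and the function name are hypothetical. Check standalone_eval/README.md for the exact layout the evaluation server expects.

```python
import zipfile
from pathlib import Path

SUBMISSION_FILES = ["hl_val_submission.jsonl", "hl_test_submission.jsonl"]


def make_submission_zip(pred_dir, zip_path="submission.zip"):
    """Bundle both prediction files at the top level of one .zip archive."""
    pred_dir = Path(pred_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in SUBMISSION_FILES:
            # arcname drops the directory part so files sit at the archive root
            archive.write(pred_dir / name, arcname=name)
    return Path(zip_path)
```

The function raises `FileNotFoundError` if either file is missing, which is a cheap way to notice that inference was run for only one of the two splits.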
📹 Others (Custom video inference / training)
- Running predictions on customized datasets is also available.
Note that only the CLIP-only trained model is available for custom video inference. <br>
You can either <br>
1) Prepare your custom video and text query under 'run_on_video/example', or <br> 2) Modify the YouTube video URL and custom text query in 'run_on_video/run.py' <br> (youtube_url : video link URL, [vid_st_sec, vid_ec_sec] : start and end time of the video (specify less than 150 frames), desired_query : text query). <br> Then, run the following commands:
pip install ffmpeg-python ftfy regex
PYTHONPATH=$PYTHONPATH:. python run_on_video/run.py
- For instructions for training on custom datasets, check here.
📦 Model Zoo
Dataset | Model file
-- | --
QVHighlights | checkpoints
Charades (Slowfast + CLIP) | checkpoints
Charades (VGG) | checkpoints
TACoS | checkpoints
TVSum | checkpoints
Youtube-HL | checkpoints
QVHighlights w/ PT (47.97 mAP) | checkpoints
QVHighlights only CLIP | checkpoints
📖 BibTeX
If you find the repository or the paper useful, please use the following entry for citation.
@article{moon2023correlation,
title={Correlation-guided Query-Dependency Calibration in Video Representation Learning for Temporal Grounding},
author={Moon, WonJun and Hyun, Sangeek and Lee, SuBeen and Heo, Jae-Pil},
journal={arXiv preprint arXiv:2311.08835},
year={2023}
}
☎️ Contributors and Contact
If there are any questions, feel free to contact the authors: WonJun Moon (wjun0830@gmail.com), Sangeek Hyun (hse1032@gmail.com), and SuBeen Lee (leesb7426@gmail.com)
☑️ LICENSE
The annotation files and many parts of the implementation are borrowed from Moment-DETR and QD-DETR. Our code is released under the MIT license.
