MGSV
[ICCV 2025] This repo is the official implementation of "Music Grounding by Short Video"
📄 Abstract
Adding proper background music helps complete a short video to be shared. Previous work tackles the task by video-to-music retrieval (V2MR), aiming to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed <u>M</u>usic <u>G</u>rounding by <u>S</u>hort <u>V</u>ideo (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53k short videos associated with 35k different music moments from 4k unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also set MaDe as a strong baseline.
👀 Introduction
This repository contains the official implementation of our paper, including training and evaluation scripts for the MGSV task.
🔧 Dependencies and Installation
We use Anaconda to set up a deep-learning workspace with PyTorch support. Run the following commands to install the required packages:
```shell
# git clone this repository
git clone https://github.com/xxayt/MGSV.git
cd MGSV

# create a new anaconda env
conda create -n MGSV_env python=3.8
conda activate MGSV_env

# install torch and dependencies
pip install -r requirements.txt
```
📦 Data
📥 Data download
Please refer to the Hugging Face guides for downloading the MGSV-EC dataset.
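Once the CSV splits are downloaded, they can be inspected with pandas. A minimal sketch (the column names printed depend on the dataset's actual schema, which is not reproduced here):

```python
import pandas as pd

def load_split(csv_path: str) -> pd.DataFrame:
    """Load one annotation split (train/val/test) of MGSV-EC and report its size."""
    df = pd.read_csv(csv_path)
    print(f"{csv_path}: {len(df)} rows, columns = {list(df.columns)}")
    return df
```

For example, `load_split("dataset/MGSV-EC/train_data.csv")` after organizing the files as described below.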
📥 Encoder Download (Optional)
- **AST Encoder**: Download the AST model `audioset_0.4593.pth` from Dropbox. This model follows the implementation in the AST repo and can be used for audio feature extraction. You can also explore the AST repo for hands-on usage.
- **CLIP Encoder**: Download the CLIP model `ViT-B-32.pt` from this link. This model follows the implementation in the CLIP repo, specifically `clip.py`, for visual feature extraction.
🗂️ Files organization
After downloading the dataset and encoder models, organize the files as follows:
```
.
├── dataset
│   └── MGSV-EC
│       ├── train_data.csv
│       ├── val_data.csv
│       └── test_data.csv
├── features
│   └── Kuai_feature
│       ├── ast_feature2p5/
│       └── vit_feature1/
├── model
│   ├── ...
│   └── pretrained_models
│       ├── audioset_0.4593.pth
│       └── ViT-B-32.pt
└── README.md
```
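Before training, it can save time to verify that everything landed where the scripts expect it. A small sketch that checks the layout above (paths mirror the tree; adjust `root` if your checkout lives elsewhere):

```python
import os

# Expected files and directories, relative to the repository root,
# mirroring the layout shown above.
EXPECTED = [
    "dataset/MGSV-EC/train_data.csv",
    "dataset/MGSV-EC/val_data.csv",
    "dataset/MGSV-EC/test_data.csv",
    "features/Kuai_feature/ast_feature2p5",
    "features/Kuai_feature/vit_feature1",
    "model/pretrained_models/audioset_0.4593.pth",
    "model/pretrained_models/ViT-B-32.pt",
]

def check_layout(root: str = ".") -> list:
    """Return the expected paths that are missing under root."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]

if __name__ == "__main__":
    missing = check_layout(".")
    if missing:
        print("Missing:", *missing, sep="\n  ")
    else:
        print("All expected files are in place.")
```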
🚀 How to Run
Training
We provide a demo training script. To train MaDe on a specified GPU, use the following command:
```shell
bash scripts/train_kuai_all_feature.sh
```
Make sure to modify the data path and save path in the script, and set the GPU to use. Training can be done on a single GPU.
Evaluation
To evaluate the model on the test set, use the following command:
```shell
bash scripts/test_kuai_all_feature.sh
```
Ensure that you update `--load_uni_model_path` in the script with the checkpoint path obtained from the training phase.
🤝 Acknowledgement
This implementation relies on resources from AST, DETR, Moment-DETR, CLIP4Clip, X-Pool and UT-CMVMR. We thank the original authors for their excellent contributions and for making their work publicly available.
✏️ Citation
If you find this work useful, please consider citing our paper:
```bibtex
@inproceedings{xin2025mgsv,
  title={Music Grounding by Short Video},
  author={Xin, Zijie and Wang, Minquan and Liu, Jingyu and Chen, Quan and Ma, Ye and Jiang, Peng and Li, Xirong},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2025}
}
```
📜 License
The MGSV-EC dataset is released under the CC BY-NC-ND 4.0 license (see DATA-LICENSE). All code is released under the MIT license (see LICENSE). For commercial licensing or any use beyond research, please contact the authors.
📥 Raw Videos / Music Tracks Access
The raw video and music files are not publicly available due to copyright and privacy constraints.
Researchers interested in obtaining the full media content can contact Kuaishou Technology at: wangminquan@kuaishou.com.
📬 Contact for Issues
For any questions about this project (e.g., corrupted files or loading errors), please reach out at: xinzijie@ruc.edu.cn
