Lighthouse
[EMNLP2024 Demo], [ICASSP 2025], [ICASSP 2026] A user-friendly library for reproducible video moment retrieval and highlight detection. It also supports audio moment retrieval.
Lighthouse is a user-friendly library for reproducible video moment retrieval and highlight detection (MR-HD). It supports seven models, four video/audio feature types, and six datasets for reproducible MR-HD, MR, and HD. In addition, we provide an inference API and a Gradio demo so that developers can easily use state-of-the-art MR-HD approaches. Lighthouse also supports audio moment retrieval (AMR), a task that identifies relevant moments in an audio input based on a given text query.
News
- [2026/01/18] Our work "CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries" has been accepted at ICASSP 2026.
- [2025/11/20] Version 1.2 has been released. It adds support for CASTELLA, a new AMR dataset introduced in our work "CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries".
- [2025/06/04] Version 1.1 has been released. It includes API changes, an AMR Gradio demo, and Hugging Face wrappers for audio moment retrieval and the Clotho dataset.
- [2024/12/24] Our work "Language-based audio moment retrieval" has been accepted at ICASSP 2025.
- [2024/10/22] Version 1.0 has been released.
- [2024/10/6] Our paper has been accepted at EMNLP 2024 (system demonstration track).
- [2024/09/25] Our work "Language-based audio moment retrieval" has been released. Lighthouse supports AMR.
- [2024/08/22] Our demo paper is available on arXiv. Any comments are welcome: Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection.
Installation
Install ffmpeg first. If you are an Ubuntu user, run:
apt install ffmpeg
Then, install pytorch, torchvision, and torchaudio according to your GPU environment. Note that the inference API also works on CPU. We tested the code with Python 3.9 and CUDA 11.8:
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118
Finally, run the following to install the dependency libraries:
pip install 'git+https://github.com/line/lighthouse.git'
Inference API (Available for both CPU/GPU mode)
Lighthouse supports the following inference API:
import torch
from lighthouse.models import CGDETRPredictor
# use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
# slowfast_path is necessary if you use clip_slowfast features
query = 'A man is speaking in front of the camera'
model = CGDETRPredictor('/path/to/weight.ckpt', device=device,
feature_name='clip_slowfast', slowfast_path='SLOWFAST_8x8_R50.pkl')
# encode video features
video = model.encode_video('api_example/RoripwjYFp8_60.0_210.0.mp4')
# moment retrieval & highlight detection
prediction = model.predict(query, video)
print(prediction)
"""
pred_relevant_windows: [[start, end, score], ...,]
pred_saliency_scores: [score, ...]
{'query': 'A man is speaking in front of the camera',
'pred_relevant_windows': [[117.1296, 149.4698, 0.9993],
[-0.1683, 5.4323, 0.9631],
[13.3151, 23.42, 0.8129],
...],
'pred_saliency_scores': [-10.868017196655273,
-12.097496032714844,
-12.483806610107422,
...]}
"""
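The prediction dict shown above can be post-processed with plain Python. The sketch below (our own helper, not part of the lighthouse API) picks the highest-scoring window and clamps its start time to zero, since predicted starts can be slightly negative (e.g. -0.1683 above):

```python
# Example prediction dict, shaped like the output above (values copied from it).
prediction = {
    'query': 'A man is speaking in front of the camera',
    'pred_relevant_windows': [
        [117.1296, 149.4698, 0.9993],
        [-0.1683, 5.4323, 0.9631],
        [13.3151, 23.42, 0.8129],
    ],
    'pred_saliency_scores': [-10.87, -12.10, -12.48],
}

def top_moment(pred, min_score=0.5):
    """Return the highest-scoring (start, end) window above min_score, or None."""
    windows = [w for w in pred['pred_relevant_windows'] if w[2] >= min_score]
    if not windows:
        return None
    start, end, _ = max(windows, key=lambda w: w[2])
    # Clamp: model outputs can fall slightly before 0.0.
    return max(start, 0.0), end

print(top_moment(prediction))  # -> (117.1296, 149.4698)
```

The score threshold `min_score` is a choice on our side; tune it to trade precision against recall for your application.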
Lighthouse also supports the AMR inference API:
import torch
from lighthouse.models import QDDETRPredictor
device = "cuda" if torch.cuda.is_available() else "cpu"
model = QDDETRPredictor('/path/to/weight.ckpt', device=device, feature_name='clap')
audio = model.encode_audio('api_example/1a-ODBWMUAE.wav')
query = 'Water cascades down from a waterfall.'
prediction = model.predict(query, audio)
print(prediction)
Run python api_example/demo.py (MR-HD) or python api_example/amr_demo.py (AMR) to reproduce the results. It automatically downloads pre-trained weights.
To use other models, download their pre-trained weights. The clip_slowfast features additionally require the slowfast pre-trained weights, and the clip_slowfast_pann features require both the slowfast and panns weights.
Limitation: The maximum video duration is 150s due to the current benchmark datasets.
For CPU users, set feature_name='clip', because extracting CLIP+Slowfast or CLIP+Slowfast+PANNs features is very slow without a GPU.
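The guidance above can be folded into a small helper. This is a sketch (the helper name is ours, not part of the lighthouse API), assuming you only want to switch between the 'clip_slowfast' and 'clip' feature sets:

```python
def pick_feature_name(has_gpu: bool) -> str:
    # CLIP+Slowfast (and CLIP+Slowfast+PANNs) extraction is very slow on CPU,
    # so fall back to plain CLIP features when no GPU is available.
    return 'clip_slowfast' if has_gpu else 'clip'

print(pick_feature_name(False))  # -> clip
```

In practice you would call it as `pick_feature_name(torch.cuda.is_available())` and pass the result as `feature_name` to the predictor.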
Gradio demo
Run python gradio_demo/demo.py, upload a video, input a text query, and click the blue button. For the AMR demo, run python gradio_demo/amr_demo.py.
MR-HD demo

AMR demo

Supported models, datasets, and features
Models
Moment retrieval & highlight detection
- [x] : Moment-DETR (Lei et al. NeurIPS21)
- [x] : QD-DETR (Moon et al. CVPR23)
- [x] : EaTR (Jang et al. ICCV23)
- [x] : CG-DETR (Moon et al. arXiv24)
- [x] : UVCOM (Xiao et al. CVPR24)
- [x] : TR-DETR (Sun et al. AAAI24)
- [x] : TaskWeave (Jin et al. CVPR24)
- [ ] : R2-Tuning (Liu et al. ECCV24)
Datasets
Moment retrieval & highlight detection
- [x] : QVHighlights (Lei et al. NeurIPS21)
- [x] : QVHighlights w/ Audio Features (Lei et al. NeurIPS21)
- [x] : QVHighlights ASR Pretraining (Lei et al. NeurIPS21)
Moment retrieval
- [x] : ActivityNet Captions (Krishna et al. ICCV17)
- [x] : Charades-STA (Gao et al. ICCV17)
- [x] : TaCoS (Regneri et al. TACL13)
Highlight detection
- [x] : TVSum (Song et al. CVPR15)
Audio moment retrieval
- [x] : Clotho-Moment
- [x] : CASTELLA
Features
- [x] : ResNet+GloVe
- [x] : CLIP
- [x] : CLIP+Slowfast
- [x] : CLIP+Slowfast+PANNs (Audio) for QVHighlights
- [x] : I3D+CLIP (Text) for TVSum
Reproduce the experiments
Pre-trained weights
Pre-trained weights can be downloaded from here. Download and unzip them in the home directory. AMR models trained on CASTELLA and Clotho-Moment are available here.
Datasets
Due to copyright issues, we distribute only the feature files here.
Download and place them under ./features directory.
To extract features from videos, we use HERO_Video_Feature_Extractor.
For AMR, download features from here.
The whole directory should look like this:
lighthouse/
├── api_example
├── configs
├── data
├── features # Download the features and place them here
│ ├── ActivityNet
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ └── slowfast
│ ├── Charades
│ │ ├── clip
│ │ ├── clip_text
│ │ ├── resnet
│ │ ├─
