SparseSync
Source code for "Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors." (Spotlight at the BMVC 2022)
Install / Use
Audio-visual Synchronisation with Trainable Selectors
Our paper was accepted for a spotlight presentation at BMVC 2022. Please use this BibTeX if you would like to cite our work:
@InProceedings{sparse2022iashin,
title={Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors},
author={Iashin, V. and Xie, W. and Rahtu, E. and Zisserman, A.},
booktitle={British Machine Vision Conference (BMVC)},
year={2022}
}
• [Project Page] • [ArXiv] • [BMVC Proceedings] • [Presentation (full)] • [Presentation (spotlight)] •
<img src="https://v-iashin.github.io/images/sparsesync/sparse_selector_teaser.png" alt="SparseSync Teaser (comparing videos with dense and sparse signals)" width="900">Audio-visual synchronisation is the task of determining the temporal offset between the audio and visual streams in a video. Synchronising 'in the wild' video clips is challenging because the synchronisation cues can be spatially small and occur only sparsely in time. However, the recent literature has mostly been dedicated to exploring videos of talking heads or playing instruments. Such videos have a dense synchronisation signal due to the strong correlation between the audio and visual streams.
<img src="https://v-iashin.github.io/images/sparsesync/sparse_selector_arch.png" alt="SparseSync Architecture" width="900">To handle the synchronisation of signals that are sparse in time, a model should be able to process longer video clips and have enough capacity to handle the diversity of scenes. To this end, we propose SparseSelector, a transformer-based architecture that enables the processing of long videos with linear complexity with respect to the number of input tokens, which grows rapidly with sampling rate, resolution, and video duration.
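A minimal sketch of the selector idea (an illustration, not the repo's implementation): a small, fixed set of trainable queries cross-attends to the input tokens, so the cost grows linearly with the number of tokens rather than quadratically as in full self-attention. Written here with NumPy for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selector_cross_attention(tokens, queries):
    """Compress N input tokens into k selected slots.

    tokens:  (N, d) audio or visual features
    queries: (k, d) trainable selector queries, with k << N
    Cost is O(N * k * d): linear in N for a fixed k, unlike the
    O(N^2 * d) of full self-attention over the tokens.
    """
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (k, N)
    return attn @ tokens                             # (k, d)

rng = np.random.default_rng(0)
out = selector_cross_attention(rng.normal(size=(4096, 64)),
                               rng.normal(size=(8, 64)))
print(out.shape)  # (8, 64)
```

The downstream synchronisation transformer then operates only on the k selected slots, so its cost no longer depends on the clip length.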
Updates
- See our newest synchronisation model called Synchformer which significantly outperforms SparseSync.
- Added a model trained on AudioSet (see pre-trained checkpoints)
Environment Preparation
During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.11 and CUDA 11.3.
Start by cloning this repo
git clone https://github.com/v-iashin/SparseSync.git
Conda
Next, install the environment.
For your convenience, we provide a conda environment:
conda env create -f conda_env.yml
Test your environment
conda activate sparse_sync
python -c "import torch; print(torch.cuda.is_available())"
# True
Docker
Download the image from Docker Hub and test if CUDA is available:
docker run \
--mount type=bind,source=/absolute/path/to/SparseSync/,destination=/home/ubuntu/SparseSync/ \
--mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SparseSync/logs/ \
--shm-size 8G \
-it --gpus '"device=0"' \
iashin/sparse_sync:latest \
python
>>> import torch; print(torch.cuda.is_available())
# True
or build it yourself
docker build - < Dockerfile --tag sparse_sync
Try one of the examples:
docker run \
--mount type=bind,source=/absolute/path/to/SparseSync/,destination=/home/ubuntu/SparseSync/ \
--mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SparseSync/logs/ \
--shm-size 8G \
-it --gpus '"device=0"' \
iashin/sparse_sync:latest \
bash
ubuntu@cfc79e3be757:~$
cd SparseSync/
ubuntu@cfc79e3be757:~/SparseSync$
python ./scripts/example.py \
--exp_name "22-09-21T21-00-52" \
--vid_path "./data/vggsound/h264_video_25fps_256side_16000hz_aac/3qesirWAGt4_20000_30000.mp4" \
--offset_sec 1.6
# Prediction Results:
# p=0.8652 (8.4451), "1.60" (18)
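The predicted class index in parentheses maps to an offset on a discrete grid. Assuming a ±2.0 s range with 0.2 s steps (21 classes) — verify against the dataset config shipped with the checkpoint — the mapping can be sketched as:

```python
# Map a predicted offset class index back to seconds, assuming a
# [-2.0, 2.0] s grid with 0.2 s steps (21 classes). This grid is an
# assumption; check the config downloaded with the checkpoint.
GRID = [round(-2.0 + 0.2 * i, 2) for i in range(21)]

def index_to_offset(idx):
    return GRID[idx]

print(index_to_offset(18))  # 1.6 -- matches the "1.60" (18) output above
```

The probability `p` is the softmax score of the predicted class.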
Prepare Data
In this project, we used the LRS3 dataset and introduced a novel VGGSound-Sparse dataset. We provide the pre-processing scripts and assume that the original videos have been downloaded from YouTube.
LRS3-H.264 and LRS3-H.264 ('No Face Crop')
Differences between LRS3, LRS3-H.264, and LRS3-H.264 ('No Face Crop')
For the setting 'dense in time and space', we rely on the LRS3 dataset.
One may access the original LRS3 dataset by following the instructions on the
project page.
However, this dataset is encoded with MPEG-4 Part 2 codec.
As per our discussion in the paper (Sec. 4), we would like to avoid this encoding.
For this reason, we obtained the original videos from YouTube using the provided
links (mind that _ and - in video ids were replaced by S).
By doing so, we could use videos with H.264 encoding, which has another benefit:
the videos are of slightly better quality.
Another difference from the original LRS3 is the way a face is cropped.
We could not replicate the same cropping algorithm, but the authors
provided bounding coordinates for a 'tight' crop.
We simply expanded the rectangular region to square proportions and ensured
that the bounding box stays within the video frame to avoid padding.
This has two benefits compared to the original LRS3: no padding is visible,
and the visual track is not smooth, which acts as a sort of natural
augmentation during training.
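That cropping step can be illustrated as follows (a hypothetical helper, not the exact code in `./scripts/make_lrs3_again.py`): expand the tight box to a square around its centre, then shift it so it stays inside the frame:

```python
def expand_to_square(x0, y0, x1, y1, frame_w, frame_h):
    """Expand a tight (x0, y0, x1, y1) face box to square proportions.

    The box is grown around its centre to the longer side, then shifted
    (not shrunk) so it fits inside the frame, which avoids padding.
    Assumes the square side does not exceed the frame dimensions.
    """
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    nx0 = min(max(cx - side / 2, 0), frame_w - side)
    ny0 = min(max(cy - side / 2, 0), frame_h - side)
    return nx0, ny0, nx0 + side, ny0 + side

# A 40x80 tight box near the left edge of a 256x256 frame becomes
# an 80x80 square shifted back inside the frame:
print(expand_to_square(10, 20, 50, 100, 256, 256))  # (0, 20.0, 80, 100.0)
```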
The trimming and cropping script is provided in ./scripts/make_lrs3_again.py.
The LRS3-H.264 ('No Face Crop') variant of the dataset does not have a face crop.
The pre-processing pipeline is therefore as follows.
First, obtain the original videos from YouTube
(ids are provided on the LRS3 project page).
Second, slice each video into clips, crop the faces (for LRS3-H.264 only), and resize them
according to the LRS3 metadata (see the link above) with the ./scripts/make_lrs3_again.py script.
For LRS3-H.264 ('No Face Crop') use:
python ./scripts/make_lrs3_again.py \
--lrs_meta_root "$LRS3_ROOT/orig_full/lrs3_v0.4/" \
--full_vids_root "$LRS3_ROOT/orig_full/data/lrs_ref/video/" \
--save_root "./data/lrs3/h264_uncropped_25fps_256side_16000hz_aac/" \
--rescale_to_px 256
where full_vids_root has full-length .mp4 videos downloaded from YouTube.
For LRS3-H.264 use:
python ./scripts/make_lrs3_again.py \
--lrs_meta_root "$LRS3_ROOT/orig_full/lrs3_v0.4/" \
--full_vids_root "$LRS3_ROOT/orig_full/data/lrs_ref/video/" \
--save_root "./data/lrs3/h264_orig_strict_crop_25fps_224side_16000hz_aac/" \
--do_face_crop \
--rescale_to_px 224
You can spawn as many processes as your machine permits to speed this up
(e.g. by running the same command in separate terminals).
The script (./scripts/make_lrs3_again.py) randomises the order of videos to avoid processing collisions.
SLURM might help here if you have a cluster at your disposal:
you may create an array of jobs, each running ./scripts/make_lrs3_again.py.
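Instead of opening multiple terminals, the same trick can be scripted; a minimal sketch (a hypothetical helper, not part of the repo):

```python
import subprocess
import sys

def run_parallel(cmd, n_workers):
    """Launch `n_workers` copies of the same command and wait for all.

    This is safe for ./scripts/make_lrs3_again.py because the script
    randomises the processing order of videos, so concurrent copies
    rarely try to process the same file.
    """
    procs = [subprocess.Popen(cmd) for _ in range(n_workers)]
    return [p.wait() for p in procs]

# e.g. eight concurrent copies of the pre-processing command from above:
# run_parallel([sys.executable, "./scripts/make_lrs3_again.py", ...], 8)
```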
See ./data/lrs3/ (LRS3_ROOT) for the expected folder structure and a few examples.
VGGSound-Sparse
VGGSound-Sparse is based on the VGGSound dataset, and you will need to obtain the original YouTube videos first. The annotations are freely downloadable.
No specific pre-processing is required for VGGSound videos, except for re-encoding the streams.
This can be done with the script in ./scripts/reencode_videos.py.
First, open the file and change the ORIG_PATH variable to a folder with a structure as in
./data/vggsound/video/:
python ./scripts/reencode_videos.py
It is also safe to parallelise this across multiple processes and, perhaps, a cluster.
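For reference, the re-encoding target can be sketched as an ffmpeg invocation. The parameters below (H.264 at 25 fps, shorter side scaled to 256 px, 16 kHz AAC audio) are inferred from the repo's folder naming (`h264_video_25fps_256side_16000hz_aac`); check `./scripts/reencode_videos.py` for the exact flags it uses:

```python
import subprocess

def reencode_cmd(src, dst, fps=25, side=256, sr=16000):
    """Build an ffmpeg command re-encoding a clip to the expected format.

    The target format is an assumption inferred from the folder naming
    convention (h264_video_25fps_256side_16000hz_aac), not copied from
    the repo's script.
    """
    # scale the shorter side to `side` px, keeping aspect ratio
    # (-2 lets ffmpeg pick an even value for the other dimension)
    vf = f"scale='if(gt(iw,ih),-2,{side})':'if(gt(iw,ih),{side},-2)'"
    return ["ffmpeg", "-y", "-i", src,
            "-vcodec", "h264", "-r", str(fps), "-vf", vf,
            "-acodec", "aac", "-ar", str(sr), dst]

# To actually run it (requires ffmpeg on PATH):
# subprocess.run(reencode_cmd("raw/clip.mp4", "out/clip.mp4"), check=True)
```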
Pre-trained Model Checkpoints
When you run an example,
the checkpoints and configs for SparseSync will be downloaded automatically.
Alternatively, you can download the pre-trained weights manually:
LRS3 ('No Face Crop') Models
| Pre-trained on | Fine-tuned on | Classes | Accuracy | config | ckpt |
| -------------- | ------------- | ------- | -------- | ------ | ---- |
