SparseSync
Source code for "Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors." (Spotlight at the BMVC 2022)
Install / Use
Audio-visual Synchronisation with Trainable Selectors
Our paper was accepted for a spotlight presentation at BMVC 2022. Please use this BibTeX if you would like to cite our work:
@InProceedings{sparse2022iashin,
title={Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors},
author={Iashin, V. and Xie, W. and Rahtu, E. and Zisserman, A.},
booktitle={British Machine Vision Conference (BMVC)},
year={2022}
}
• [Project Page] • [ArXiv] • [BMVC Proceedings] • [Presentation (full)] • [Presentation (spotlight)] •
<img src="https://v-iashin.github.io/images/sparsesync/sparse_selector_teaser.png" alt="SparseSync Teaser (comparing videos with dense and sparse signals)" width="900">Audio-visual synchronisation is the task of determining the temporal offset between the audio and visual streams in a video. Synchronising 'in the wild' video clips is challenging because the synchronisation cues can be spatially small and occur only sparsely in time. However, the recent literature has mostly been dedicated to exploring videos of talking heads or playing instruments. Such videos have a dense synchronisation signal due to the strong correlation between the audio and visual streams.
<img src="https://v-iashin.github.io/images/sparsesync/sparse_selector_arch.png" alt="SparseSync Architecture" width="900">To handle the synchronisation of signals that are sparse in time, a model should be able to process longer video clips and have enough capacity to handle the diversity of scenes. To this end, we propose SparseSelector, a transformer-based architecture that enables the processing of long videos with linear complexity with respect to the number of input tokens, which grows rapidly with sampling rate, resolution, and video duration.
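A minimal sketch of the selector idea (an illustration, not the repo's implementation): a small, fixed set of trainable queries cross-attends to the input tokens, so the cost grows linearly with the number of tokens rather than quadratically as in full self-attention. Written here with NumPy for clarity:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selector_cross_attention(tokens, queries):
    """Compress N input tokens into k selected slots.

    tokens:  (N, d) audio or visual features
    queries: (k, d) trainable selector queries, with k << N
    Cost is O(N * k * d): linear in N for a fixed k, unlike the
    O(N^2 * d) of full self-attention over the tokens.
    """
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))  # (k, N)
    return attn @ tokens                             # (k, d)

rng = np.random.default_rng(0)
out = selector_cross_attention(rng.normal(size=(4096, 64)),
                               rng.normal(size=(8, 64)))
print(out.shape)  # (8, 64)
```

The downstream synchronisation transformer then operates only on the k selected slots, so its cost no longer depends on the clip length.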
Updates
- See our newest synchronisation model called Synchformer which significantly outperforms SparseSync.
- Added a model trained on AudioSet (see pre-trained checkpoints)
Environment Preparation
During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.11 and CUDA 11.3.
Start by cloning this repo
git clone https://github.com/v-iashin/SparseSync.git
Conda
Next, install the environment.
For your convenience, we provide a conda environment:
conda env create -f conda_env.yml
Test your environment
conda activate sparse_sync
python -c "import torch; print(torch.cuda.is_available())"
# True
Docker
Download the image from Docker Hub and test if CUDA is available:
docker run \
--mount type=bind,source=/absolute/path/to/SparseSync/,destination=/home/ubuntu/SparseSync/ \
--mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SparseSync/logs/ \
--shm-size 8G \
-it --gpus '"device=0"' \
iashin/sparse_sync:latest \
python
>>> import torch; print(torch.cuda.is_available())
# True
or build it yourself
docker build - < Dockerfile --tag sparse_sync
Try one of the examples:
docker run \
--mount type=bind,source=/absolute/path/to/SparseSync/,destination=/home/ubuntu/SparseSync/ \
--mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SparseSync/logs/ \
--shm-size 8G \
-it --gpus '"device=0"' \
iashin/sparse_sync:latest \
bash
ubuntu@cfc79e3be757:~$
cd SparseSync/
ubuntu@cfc79e3be757:~/SparseSync$
python ./scripts/example.py \
--exp_name "22-09-21T21-00-52" \
--vid_path "./data/vggsound/h264_video_25fps_256side_16000hz_aac/3qesirWAGt4_20000_30000.mp4" \
--offset_sec 1.6
# Prediction Results:
# p=0.8652 (8.4451), "1.60" (18)
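The predicted class index in parentheses maps to an offset on a discrete grid. Assuming a ±2.0 s range with 0.2 s steps (21 classes) — verify against the dataset config shipped with the checkpoint — the mapping can be sketched as:

```python
# Map a predicted offset class index back to seconds, assuming a
# [-2.0, 2.0] s grid with 0.2 s steps (21 classes). This grid is an
# assumption; check the config downloaded with the checkpoint.
GRID = [round(-2.0 + 0.2 * i, 2) for i in range(21)]

def index_to_offset(idx):
    return GRID[idx]

print(index_to_offset(18))  # 1.6 -- matches the "1.60" (18) output above
```

The probability `p` is the softmax score of the predicted class.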
Prepare Data
In this project, we used the LRS3 dataset and introduced a novel VGGSound-Sparse dataset. We provide the pre-processing scripts and assume that the original videos have been downloaded from YouTube.
LRS3-H.264 and LRS3-H.264 ('No Face Crop')
Differences between LRS3, LRS3-H.264, and LRS3-H.264 ('No Face Crop')
For the setting 'dense in time and space', we rely on the LRS3 dataset.
One may access the original LRS3 dataset by following the instructions on the
project page.
However, this dataset is encoded with MPEG-4 Part 2 codec.
As per our discussion in the paper (Sec. 4), we would like to avoid this encoding.
For this reason, we obtained the original videos from YouTube using the provided
links (mind that _ and - in video ids were replaced by S).
By doing so, we could use videos with H.264 encoding, which has another benefit:
the videos are of slightly better quality.
Another difference from the original LRS3 is the way a face is cropped.
We could not replicate the same cropping algorithm, but the authors
provided bounding coordinates for a 'tight' crop.
We simply expanded the rectangular region to square proportions and ensured
that the bounding box stays within the video frame to avoid padding.
This has two benefits compared to the original LRS3: no padding is visible,
and the visual track is not smooth, which acts as a sort of natural
augmentation during training.
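That cropping step can be illustrated as follows (a hypothetical helper, not the exact code in `./scripts/make_lrs3_again.py`): expand the tight box to a square around its centre, then shift it so it stays inside the frame:

```python
def expand_to_square(x0, y0, x1, y1, frame_w, frame_h):
    """Expand a tight (x0, y0, x1, y1) face box to square proportions.

    The box is grown around its centre to the longer side, then shifted
    (not shrunk) so it fits inside the frame, which avoids padding.
    Assumes the square side does not exceed the frame dimensions.
    """
    side = max(x1 - x0, y1 - y0)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    nx0 = min(max(cx - side / 2, 0), frame_w - side)
    ny0 = min(max(cy - side / 2, 0), frame_h - side)
    return nx0, ny0, nx0 + side, ny0 + side

# A 40x80 tight box near the left edge of a 256x256 frame becomes
# an 80x80 square shifted back inside the frame:
print(expand_to_square(10, 20, 50, 100, 256, 256))  # (0, 20.0, 80, 100.0)
```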
The trimming and cropping script is provided in ./scripts/make_lrs3_again.py.
The LRS3-H.264 ('No Face Crop') variant of the dataset does not have a face crop.
The pre-processing pipeline is therefore as follows.
First, obtain the original videos from YouTube
(ids are provided on the LRS3 project page).
Second, slice each video into clips, crop the faces (for LRS3-H.264 only), and resize them
according to the LRS3 metadata (see the link above) with the ./scripts/make_lrs3_again.py script.
For LRS3-H.264 ('No Face Crop') use:
python ./scripts/make_lrs3_again.py \
--lrs_meta_root "$LRS3_ROOT/orig_full/lrs3_v0.4/" \
--full_vids_root "$LRS3_ROOT/orig_full/data/lrs_ref/video/" \
--save_root "./data/lrs3/h264_uncropped_25fps_256side_16000hz_aac/" \
--rescale_to_px 256
where full_vids_root has full-length .mp4 videos downloaded from YouTube.
For LRS3-H.264 use:
python ./scripts/make_lrs3_again.py \
--lrs_meta_root "$LRS3_ROOT/orig_full/lrs3_v0.4/" \
--full_vids_root "$LRS3_ROOT/orig_full/data/lrs_ref/video/" \
--save_root "./data/lrs3/h264_orig_strict_crop_25fps_224side_16000hz_aac/" \
--do_face_crop \
--rescale_to_px 224
You can spawn as many processes as your machine permits to speed this up
(e.g. by running the same command in separate terminals).
The script (./scripts/make_lrs3_again.py) randomises the order of videos to avoid processing collisions.
SLURM might help here if you have a cluster at your disposal:
you may create an array of jobs, each running ./scripts/make_lrs3_again.py.
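Instead of opening multiple terminals, the same trick can be scripted; a minimal sketch (a hypothetical helper, not part of the repo):

```python
import subprocess
import sys

def run_parallel(cmd, n_workers):
    """Launch `n_workers` copies of the same command and wait for all.

    This is safe for ./scripts/make_lrs3_again.py because the script
    randomises the processing order of videos, so concurrent copies
    rarely try to process the same file.
    """
    procs = [subprocess.Popen(cmd) for _ in range(n_workers)]
    return [p.wait() for p in procs]

# e.g. eight concurrent copies of the pre-processing command from above:
# run_parallel([sys.executable, "./scripts/make_lrs3_again.py", ...], 8)
```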
See ./data/lrs3/ (LRS3_ROOT) for the expected folder structure and a few examples.
VGGSound-Sparse
VGGSound-Sparse is based on the VGGSound dataset, and you will need to obtain the original YouTube videos first. The annotations are freely downloadable.
No specific pre-processing is required for VGGSound videos, except for re-encoding the streams.
This can be done with the script in ./scripts/reencode_videos.py.
First, open the file and change the ORIG_PATH variable to a folder with a structure as in
./data/vggsound/video/:
python ./scripts/reencode_videos.py
It is also safe to parallelise this across multiple processes and, perhaps, a cluster.
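For reference, the re-encoding target can be sketched as an ffmpeg invocation. The parameters below (H.264 at 25 fps, shorter side scaled to 256 px, 16 kHz AAC audio) are inferred from the repo's folder naming (`h264_video_25fps_256side_16000hz_aac`); check `./scripts/reencode_videos.py` for the exact flags it uses:

```python
import subprocess

def reencode_cmd(src, dst, fps=25, side=256, sr=16000):
    """Build an ffmpeg command re-encoding a clip to the expected format.

    The target format is an assumption inferred from the folder naming
    convention (h264_video_25fps_256side_16000hz_aac), not copied from
    the repo's script.
    """
    # scale the shorter side to `side` px, keeping aspect ratio
    # (-2 lets ffmpeg pick an even value for the other dimension)
    vf = f"scale='if(gt(iw,ih),-2,{side})':'if(gt(iw,ih),{side},-2)'"
    return ["ffmpeg", "-y", "-i", src,
            "-vcodec", "h264", "-r", str(fps), "-vf", vf,
            "-acodec", "aac", "-ar", str(sr), dst]

# To actually run it (requires ffmpeg on PATH):
# subprocess.run(reencode_cmd("raw/clip.mp4", "out/clip.mp4"), check=True)
```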
Pre-trained Model Checkpoints
When you run an example,
the checkpoints and configs for SparseSync will be downloaded automatically.
Alternatively, you can download the pre-trained weights manually:
LRS3 ('No Face Crop') Models
| Pre-trained on | Fine-tuned on | Classes | Accuracy | config | ckpt |
| -------------- | ------------- | ------- | -------- | ------ | ---- |
