


Taming Visually Guided Sound Generation

BMVC 2021 – Oral Presentation

• [Project Page] • [ArXiv] • [BMVC Proceedings] • [Poster (for PAISS)] • [Presentation on YouTube] (Can't watch YouTube?) •

Open In Colab

<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/specvqgan_vggsound_samples.jpg" alt="Generated Samples Using our Model" width="900">

Listen to the samples on our project page.

Overview

We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to form a novel sound given a set of visual cues as a prime.

The codebook is trained on spectrograms similarly to VQGAN (an upgraded VQVAE). We refer to it as the Spectrogram VQGAN.

<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/codebook.svg" alt="Spectrogram VQGAN" width="900">
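As a toy illustration of the quantization idea (a sketch, not the repository's actual code; the codebook size and feature dimension below are made up), each encoder output vector is replaced by its nearest codebook entry, and only the entry's index is kept as a discrete token:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 256)).astype(np.float32)  # K entries of dim 256 (illustrative)
z = rng.normal(size=(53, 256)).astype(np.float32)           # encoder outputs for one spectrogram

# nearest-neighbour lookup: squared L2 distance to every codebook entry
d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = d.argmin(axis=1)   # discrete codes, later modelled by the transformer
z_q = codebook[tokens]      # quantized features fed to the decoder
```

The transformer that follows only ever sees the integer `tokens`, which is what makes autoregressive sampling over sounds tractable.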

Once the spectrogram codebook is trained, we can train a transformer (a variant of GPT-2) to autoregressively sample the codebook entries as tokens, conditioned on a set of visual features:

<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/transformer.svg" alt="Vision-based Conditional Cross-modal Autoregressive Sampler" width="900">
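Schematically, the conditional sampling loop looks as follows (a sketch with a stand-in model; the sequence length, feature sizes, and `next_token_logits` are placeholders, not the repository's API):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1024                                 # codebook size (token vocabulary), illustrative
visual = rng.normal(size=(10, 2048))     # visual feature tokens used as the prime

def next_token_logits(visual, tokens):
    # stand-in for the GPT-2-style transformer conditioned on `visual` and the
    # tokens generated so far; here it just returns random logits
    return rng.normal(size=K)

tokens = []
for _ in range(16):                      # sample a short token sequence
    logits = next_token_logits(visual, tokens)
    p = np.exp(logits - logits.max())    # softmax over the vocabulary
    p /= p.sum()
    tokens.append(int(rng.choice(K, p=p)))
```

The sampled token sequence is then mapped back through the codebook and decoded into a spectrogram.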

This approach allows training a spectrogram generation model which produces long, relevant, and high-fidelity sounds while supporting tens of data classes.

<!-- The link to this section is used in demo.ipynb -->

Environment Preparation

During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.8 and CUDA 11.

Start by cloning this repo

git clone https://github.com/v-iashin/SpecVQGAN.git

Next, install the environment. For your convenience, we provide both conda and docker environments.

Conda

conda env create -f conda_env.yml

Test your environment

conda activate specvqgan
python -c "import torch; print(torch.cuda.is_available())"
# True

Docker

Download the image from Docker Hub and test if CUDA is available:

docker run \
    --mount type=bind,source=/absolute/path/to/SpecVQGAN/,destination=/home/ubuntu/SpecVQGAN/ \
    --mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SpecVQGAN/logs/ \
    --mount type=bind,source=/absolute/path/to/vggsound/features/,destination=/home/ubuntu/SpecVQGAN/data/vggsound/ \
    --shm-size 8G \
    -it --gpus '"device=0"' \
    iashin/specvqgan:latest \
    python
>>> import torch; print(torch.cuda.is_available())
# True

or build it yourself

docker build - < Dockerfile --tag specvqgan

Data

In this project, we used VAS and VGGSound datasets. VAS can be downloaded directly using the link provided in the RegNet repository. For VGGSound, however, one might need to retrieve videos directly from YouTube.

Download

The scripts will download features, check the md5 sum, unpack, and do a clean-up for each part of the dataset:

cd ./data
# 24GB
bash ./download_vas_features.sh
# 420GB (+ 420GB if you also need ResNet50 Features)
bash ./download_vggsound_features.sh

The unpacked features will be saved in ./data/downloaded_features/*. Move them to ./data/vas and ./data/vggsound so that the folder structure matches the structure of the demo files. By default, the scripts download BN Inception features; to download ResNet50 features instead, uncomment the corresponding lines in ./download_*_features.sh.
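If you prefer to script the move, something along these lines works (a sketch; `sort_parts` is our own hypothetical helper and assumes each unpacked part's name contains its dataset name):

```python
import shutil
from pathlib import Path

def sort_parts(src: Path, vas_dir: Path, vggsound_dir: Path) -> None:
    """Move every unpacked part from `src` into its dataset folder, routed by name."""
    for part in sorted(src.iterdir()):
        dst = vas_dir if 'vas' in part.name else vggsound_dir
        dst.mkdir(parents=True, exist_ok=True)
        shutil.move(str(part), str(dst / part.name))
```

For example, `sort_parts(Path('./data/downloaded_features'), Path('./data/vas'), Path('./data/vggsound'))`.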

If you wish to download the parts manually, use the following URL templates:

  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vas/*.tar
  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vggsound/*.tar

Also, make sure to check the md5 sums provided in ./data/md5sum_vas.md5 and ./data/md5sum_vggsound.md5 along with file names.
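The check can also be done from Python (a sketch; it assumes the checksum files use the usual `<md5>  <filename>` layout and that the downloaded parts sit next to them in `./data`):

```python
import hashlib
from pathlib import Path

def md5_ok(path: Path, expected: str) -> bool:
    """Compare a file's MD5 digest against the expected hex string, reading in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest() == expected

# each line of the checksum file is '<md5>  <filename>'
checksums = Path('./data/md5sum_vas.md5')
if checksums.exists():
    for line in checksums.read_text().splitlines():
        if line.strip():
            expected, name = line.split()
            print(name, 'OK' if md5_ok(Path('./data') / name, expected) else 'FAILED')
```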

Note that we distribute the features for the VGGSound dataset in 64 parts. Each part holds ~3k clips and can be used independently as a subset of the whole dataset (the parts are not class-stratified, though).

Extract Features Manually

For BN Inception features, we employ the same procedure as RegNet.

For ResNet50 features, we relied on the video_features repository (branch specvqgan) and used these commands:

# VAS (few hours on three 2080Ti)
strings=("dog" "fireworks" "drum" "baby" "gun" "sneeze" "cough" "hammer")
for class in "${strings[@]}"; do
    python main.py \
        --feature_type resnet50 \
        --device_ids 0 1 2 \
        --batch_size 86 \
        --extraction_fps 21.5 \
        --file_with_video_paths ./paths_to_mp4_${class}.txt \
        --output_path ./data/vas/features/${class}/feature_resnet50_dim2048_21.5fps \
        --on_extraction save_pickle
done

# VGGSound (6 days on three 2080Ti)
python main.py \
    --feature_type resnet50 \
    --device_ids 0 1 2 \
    --batch_size 86 \
    --extraction_fps 21.5 \
    --file_with_video_paths ./paths_to_mp4s.txt \
    --output_path ./data/vggsound/feature_resnet50_dim2048_21.5fps \
    --on_extraction save_pickle

Similar to BN Inception, we need to "tile" (cycle) a video if it is shorter than 10 s. For ResNet50, we achieve this by tiling the resulting frame-level features along the temporal dimension up to 215 steps (10 s at 21.5 fps), e.g. as follows:

```python
import pickle

import numpy as np

# `path` points to a pickled (T, 2048) ResNet50 feature array;
# `new_path` is the destination for the tiled copy
feats = pickle.load(open(path, 'rb')).astype(np.float32)
reps = 1 + (215 // feats.shape[0])
feats = np.tile(feats, (reps, 1))[:215, :]
with open(new_path, 'wb') as file:
    pickle.dump(feats, file)
```
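To see the tiling arithmetic on a toy example (made-up dimensions: 100 frames of 4-dim features instead of 2048-dim ResNet50 ones):

```python
import numpy as np

feats = np.arange(400, dtype=np.float32).reshape(100, 4)  # toy frame-level features
reps = 1 + (215 // feats.shape[0])  # 1 + 2 = 3 copies suffice to cover 215 steps
tiled = np.tile(feats, (reps, 1))[:215, :]
# the sequence cycles: step 100 repeats frame 0, and step 214 is frame 14 of the third copy
```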
<!-- <details> <summary>Downloading VGGSound from Scratch</summary>

1. We will rely on the AudioSet download script. To adapt it, we refactor `vggsound.csv` using the following script so that it can be used in an AudioSet downloader:

    ```python
    import pandas as pd

    VGGSOUND_PATH = './data/vggsound.csv'
    VGGSOUND_REF_PATH = './data/vggsound_ref.csv'

    vggsound_meta = pd.read_csv(VGGSOUND_PATH, names=['YTID', 'start_seconds', 'positive_labels', 'split'])
    vggsound_meta['end_seconds'] = vggsound_meta['start_seconds'] + 10
    vggsound_meta = vggsound_meta.drop(['split'], axis=1)
    vggsound_meta = vggsound_meta[['YTID', 'start_seconds', 'end_seconds', 'positive_labels']]
    print(list(vggsound_meta.columns))
    print(vggsound_meta.head())
    vggsound_meta.to_csv(VGGSOUND_REF_PATH, sep=',', index=None, header=None)
    ```

1. We also add 3 lines with `# placeholder` on top of `vggsound_ref.csv` to match the AudioSet style, which has some statistics there.
1. Rent an instance (GoogleCloud/AWS/Pouta) and allocate an IP. Disk 800 GB: 300 GB for video and 90 GB for audio + zipping + OS.
1. `git clone https://github.com/marl/audiosetdl` and check out the `ebd89c5` commit. This code provides a script to download AudioSet in parallel on several CPUs.
1. Create a file with the conda environment in `down_audioset.yaml` with content as follows:

    ```yaml
    name: down_audioset
    channels:
      - conda-forge
      - yaafe
      - defaults
    dependencies:
      - _libgcc_mutex=0.1=main
      - bzip2=1.0.8=h7b6447c_0
      - ca-certificates=2020.6.24=0
      - certifi=2020.6.20=py38_0
      - ffmpeg=4.2.2=h20bf706_0
      - freetype=2.10.2=h5ab3b9f_0
      - gmp=6.1.2=h6c8ec71_1
      - gnutls=3.6.5=h71b1129_1002
      - lame=3.100=h7b6447c_0
      - ld_impl_linux-64=2.33.1=h53a641e_7
      - libedit=3.1.20191231=h7b6447c_0
      - libffi=3.3=he6710b0_2
      - libflac=1.3.1=0
      - libgcc-ng=9.1.0=hdf63c60_0
      - libogg=1.3.2=0
      - libopus=1.3.1=h7b6447c_0
      - libpng=1.6.37=hbc83047_0
      - libstdcxx-ng=9.1.0=hdf63c60_0
      - libvpx=1.7.0=h439df22_0
      - mad=0.15.1b=he1b5a44_0
      - ncurses=6.2=he6710b0_1
      - nettle=3.4.1=hbb512f6_0
      - openh264=2.1.0=hd408876_0
      - openssl=1.1.1g=h516909a_0
      - pip=20.1.1=py38_1
      - python=3.8.3=hcff3b4d_2
      - python_abi=3.8=1_cp38
      - readline=8.0=h7b6447c_0
      - setuptools=47.3.1=py38_0
      - sqlite=3.32.3=h62c20be_0
      - tk=8.6.10=hbc83047_0
      - wheel=0.34.
    ```

</details> -->