Taming Visually Guided Sound Generation
BMVC 2021 – Oral Presentation
• [Project Page] • [ArXiv] • [BMVC Proceedings] • [Poster (for PAISS)] • [Presentation on YouTube] (Can't watch YouTube?) •
<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/specvqgan_vggsound_samples.jpg" alt="Generated Samples Using our Model" width="900">

Listen to the samples on our project page.
Overview
We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to form a novel sound given a set of visual cues as a prime.
The codebook is trained on spectrograms similarly to VQGAN (an upgraded VQVAE). We refer to it as Spectrogram VQGAN.
<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/codebook.svg" alt="Spectrogram VQGAN" width="900">

Once the spectrogram codebook is trained, we can train a transformer (a variant of GPT-2) to autoregressively sample the codebook entries as tokens conditioned on a set of visual features.
<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/transformer.svg" alt="Vision-based Conditional Cross-modal Autoregressive Sampler" width="900">

This approach allows training a spectrogram generation model which produces long, relevant, and high-fidelity sounds while supporting tens of data classes.
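The core codebook operation can be illustrated with a toy example (an illustration only, not the repository's actual model code; all shapes and names here are made up): each encoder output vector is replaced by the id of its nearest codebook entry.

```python
# Toy illustration of vector quantization with a codebook (not the actual
# SpecVQGAN code): replace each feature vector with its nearest codebook entry.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 256))  # 1024 entries, 256-dim each (made-up sizes)
z = rng.normal(size=(53, 256))           # mock encoder outputs for one spectrogram

# squared Euclidean distance from every feature vector to every codebook entry
dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = dists.argmin(axis=1)            # discrete token ids, one per vector
z_q = codebook[tokens]                   # quantized features fed to the decoder
```

At generation time, the transformer predicts such token ids autoregressively, and the decoder turns the corresponding codebook entries back into a spectrogram.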
- Taming Visually Guided Sound Generation
- Overview
- Environment Preparation
- Data
- Pretrained Models
- Training
- Evaluation
- Sampling Tool
- The Neural Audio Codec Demo
- Citation
- Acknowledgments
Environment Preparation
During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.8 and CUDA 11.
Start by cloning this repo
git clone https://github.com/v-iashin/SpecVQGAN.git
Next, install the environment.
For your convenience, we provide both conda and docker environments.
Conda
conda env create -f conda_env.yml
Test your environment
conda activate specvqgan
python -c "import torch; print(torch.cuda.is_available())"
# True
Docker
Download the image from Docker Hub and test if CUDA is available:
docker run \
--mount type=bind,source=/absolute/path/to/SpecVQGAN/,destination=/home/ubuntu/SpecVQGAN/ \
--mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SpecVQGAN/logs/ \
--mount type=bind,source=/absolute/path/to/vggsound/features/,destination=/home/ubuntu/SpecVQGAN/data/vggsound/ \
--shm-size 8G \
-it --gpus '"device=0"' \
iashin/specvqgan:latest \
python
>>> import torch; print(torch.cuda.is_available())
# True
or build it yourself
docker build - < Dockerfile --tag specvqgan
Data
In this project, we used VAS and VGGSound datasets. VAS can be downloaded directly using the link provided in the RegNet repository. For VGGSound, however, one might need to retrieve videos directly from YouTube.
Download
The scripts will download features, check the md5 sum, unpack, and do a clean-up for each part of the dataset:
cd ./data
# 24GB
bash ./download_vas_features.sh
# 420GB (+ 420GB if you also need ResNet50 Features)
bash ./download_vggsound_features.sh
The unpacked features are going to be saved in ./data/downloaded_features/*.
Move them to ./data/vas and ./data/vggsound such that the folder structure would match the structure of the demo files.
By default, the scripts download BN Inception features; to download ResNet50 features instead, uncomment the corresponding lines in the ./download_*_features.sh scripts.
If you wish to download the parts manually, use the following URL templates:
https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vas/*.tar
https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vggsound/*.tar
Also, make sure to check the md5 sums provided in ./data/md5sum_vas.md5 and ./data/md5sum_vggsound.md5 along with file names.
Note that we distribute features for the VGGSound dataset in 64 parts. Each part holds ~3k clips and can be used independently as a subset of the whole dataset (the parts are not class-stratified, though).
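To check the sums programmatically, one option is a small script like the following (a sketch; it assumes the checksum files use the standard `md5sum` format, i.e. `<hash>  <filename>` per line, and the paths in the commented-out call are assumptions to adjust to your layout):

```python
# Sketch: verify downloaded parts against an md5sum-style checksum file.
import hashlib
from pathlib import Path

def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 digest of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()

def verify(md5_file, data_dir):
    for line in Path(md5_file).read_text().splitlines():
        expected, name = line.split()
        path = Path(data_dir) / name
        if not path.exists():
            print(f'MISSING  {name}')
        elif file_md5(path) == expected:
            print(f'OK       {name}')
        else:
            print(f'CORRUPT  {name}')

# Example (paths are assumptions):
# verify('./data/md5sum_vas.md5', './data/downloaded_features')
```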
Extract Features Manually
For BN Inception features, we employ the same procedure as RegNet.
For ResNet50 features, we rely on the video_features repository (branch specvqgan) and use the following commands:
# VAS (few hours on three 2080Ti)
strings=("dog" "fireworks" "drum" "baby" "gun" "sneeze" "cough" "hammer")
for class in "${strings[@]}"; do
python main.py \
--feature_type resnet50 \
--device_ids 0 1 2 \
--batch_size 86 \
--extraction_fps 21.5 \
--file_with_video_paths ./paths_to_mp4_${class}.txt \
--output_path ./data/vas/features/${class}/feature_resnet50_dim2048_21.5fps \
--on_extraction save_pickle
done
# VGGSound (6 days on three 2080Ti)
python main.py \
--feature_type resnet50 \
--device_ids 0 1 2 \
--batch_size 86 \
--extraction_fps 21.5 \
--file_with_video_paths ./paths_to_mp4s.txt \
--output_path ./data/vggsound/feature_resnet50_dim2048_21.5fps \
--on_extraction save_pickle
Similar to BN Inception, we need to "tile" (cycle) a video if it is shorter than 10s. For ResNet50, we achieve this by tiling the resulting frame-level features along the temporal dimension up to 215 steps, e.g. as follows:
import pickle
import numpy as np

# load the extracted frame-level features
feats = pickle.load(open(path, 'rb')).astype(np.float32)
# repeat the features enough times to cover 215 time steps, then crop
reps = 1 + (215 // feats.shape[0])
feats = np.tile(feats, (reps, 1))[:215, :]
with open(new_path, 'wb') as file:
    pickle.dump(feats, file)
<!-- <details>
<summary>Downloading VGGSound from Scratch</summary>
1. We will rely on the AudioSet download script. To adapt it, we refactor `vggsound.csv` using the following script so that it can be used by an AudioSet downloader:
```python
import pandas as pd
VGGSOUND_PATH = './data/vggsound.csv'
VGGSOUND_REF_PATH = './data/vggsound_ref.csv'
vggsound_meta = pd.read_csv(VGGSOUND_PATH, names=['YTID', 'start_seconds', 'positive_labels', 'split'])
vggsound_meta['end_seconds'] = vggsound_meta['start_seconds'] + 10
vggsound_meta = vggsound_meta.drop(['split'], axis=1)
vggsound_meta = vggsound_meta[['YTID', 'start_seconds', 'end_seconds', 'positive_labels']]
print(list(vggsound_meta.columns))
print(vggsound_meta.head())
vggsound_meta.to_csv(VGGSOUND_REF_PATH, sep=',', index=None, header=None)
```
1. We also add 3 lines with `# placeholder` on top of `vggsound_ref.csv` to match the AudioSet style, which has some statistics there.
1. Rent an instance (GoogleCloud/AWS/Pouta) and allocate an IP. Disk: 800GB (300GB for video and 90GB for audio, plus space for zipping and the OS)
1. `git clone https://github.com/marl/audiosetdl` and check out to `ebd89c5` commit. This code provides a script to download AudioSet in parallel on several CPUs.
1. Create a file with conda environment in `down_audioset.yaml` with content as follows:
```yaml
name: down_audioset
channels:
- conda-forge
- yaafe
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- bzip2=1.0.8=h7b6447c_0
- ca-certificates=2020.6.24=0
- certifi=2020.6.20=py38_0
- ffmpeg=4.2.2=h20bf706_0
- freetype=2.10.2=h5ab3b9f_0
- gmp=6.1.2=h6c8ec71_1
- gnutls=3.6.5=h71b1129_1002
- lame=3.100=h7b6447c_0
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20191231=h7b6447c_0
- libffi=3.3=he6710b0_2
- libflac=1.3.1=0
- libgcc-ng=9.1.0=hdf63c60_0
- libogg=1.3.2=0
- libopus=1.3.1=h7b6447c_0
- libpng=1.6.37=hbc83047_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- libvpx=1.7.0=h439df22_0
- mad=0.15.1b=he1b5a44_0
- ncurses=6.2=he6710b0_1
- nettle=3.4.1=hbb512f6_0
- openh264=2.1.0=hd408876_0
- openssl=1.1.1g=h516909a_0
- pip=20.1.1=py38_1
- python=3.8.3=hcff3b4d_2
- python_abi=3.8=1_cp38
- readline=8.0=h7b6447c_0
- setuptools=47.3.1=py38_0
- sqlite=3.32.3=h62c20be_0
- tk=8.6.10=hbc83047_0
- wheel=0.34.