


Taming Visually Guided Sound Generation

BMVC 2021 – Oral Presentation

• [Project Page] • [ArXiv] • [BMVC Proceedings] • [Poster (for PAISS)] • [Presentation on YouTube] (Can't watch YouTube?) •

Open In Colab

<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/specvqgan_vggsound_samples.jpg" alt="Generated Samples Using our Model" width="900">

Listen to the samples on our project page.

Overview

We propose to tame visually guided sound generation by shrinking a training dataset to a set of representative vectors, a.k.a. a codebook. These codebook vectors can then be controllably sampled to form a novel sound given a set of visual cues as a prime.

The codebook is trained on spectrograms similarly to VQGAN (an upgraded VQVAE). We refer to it as the Spectrogram VQGAN.

<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/codebook.svg" alt="Spectrogram VQGAN" width="900">
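As a toy illustration of the quantization idea (a sketch, not the repository's actual code; the codebook size and feature dimension below are made up), each encoder output vector is replaced by its nearest codebook entry, and only the entry's index is kept as a discrete token:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 256)).astype(np.float32)  # K entries of dim 256 (illustrative)
z = rng.normal(size=(53, 256)).astype(np.float32)           # encoder outputs for one spectrogram

# nearest-neighbour lookup: squared L2 distance to every codebook entry
d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
tokens = d.argmin(axis=1)   # discrete codes, later modelled by the transformer
z_q = codebook[tokens]      # quantized features fed to the decoder
```

The transformer that follows only ever sees the integer `tokens`, which is what makes autoregressive sampling over sounds tractable.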

Once the spectrogram codebook is trained, we can train a transformer (a variant of GPT-2) to autoregressively sample the codebook entries as tokens, conditioned on a set of visual features:

<img src="https://github.com/v-iashin/v-iashin.github.io/raw/master/images/specvqgan/transformer.svg" alt="Vision-based Conditional Cross-modal Autoregressive Sampler" width="900">
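Schematically, the conditional sampling loop looks as follows (a sketch with a stand-in model; the sequence length, feature sizes, and `next_token_logits` are placeholders, not the repository's API):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1024                                 # codebook size (token vocabulary), illustrative
visual = rng.normal(size=(10, 2048))     # visual feature tokens used as the prime

def next_token_logits(visual, tokens):
    # stand-in for the GPT-2-style transformer conditioned on `visual` and the
    # tokens generated so far; here it just returns random logits
    return rng.normal(size=K)

tokens = []
for _ in range(16):                      # sample a short token sequence
    logits = next_token_logits(visual, tokens)
    p = np.exp(logits - logits.max())    # softmax over the vocabulary
    p /= p.sum()
    tokens.append(int(rng.choice(K, p=p)))
```

The sampled token sequence is then mapped back through the codebook and decoded into a spectrogram.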

This approach allows training a spectrogram generation model which produces long, relevant, and high-fidelity sounds while supporting tens of data classes.

<!-- The link to this section is used in demo.ipynb -->

Environment Preparation

During experimentation, we used Linux machines with conda virtual environments, PyTorch 1.8 and CUDA 11.

Start by cloning this repo

git clone https://github.com/v-iashin/SpecVQGAN.git

Next, install the environment. For your convenience, we provide both conda and docker environments.

Conda

conda env create -f conda_env.yml

Test your environment

conda activate specvqgan
python -c "import torch; print(torch.cuda.is_available())"
# True

Docker

Download the image from Docker Hub and test if CUDA is available:

docker run \
    --mount type=bind,source=/absolute/path/to/SpecVQGAN/,destination=/home/ubuntu/SpecVQGAN/ \
    --mount type=bind,source=/absolute/path/to/logs/,destination=/home/ubuntu/SpecVQGAN/logs/ \
    --mount type=bind,source=/absolute/path/to/vggsound/features/,destination=/home/ubuntu/SpecVQGAN/data/vggsound/ \
    --shm-size 8G \
    -it --gpus '"device=0"' \
    iashin/specvqgan:latest \
    python
>>> import torch; print(torch.cuda.is_available())
# True

or build it yourself

docker build - < Dockerfile --tag specvqgan

Data

In this project, we used VAS and VGGSound datasets. VAS can be downloaded directly using the link provided in the RegNet repository. For VGGSound, however, one might need to retrieve videos directly from YouTube.

Download

The scripts will download features, check the md5 sum, unpack, and do a clean-up for each part of the dataset:

cd ./data
# 24GB
bash ./download_vas_features.sh
# 420GB (+ 420GB if you also need ResNet50 Features)
bash ./download_vggsound_features.sh

The unpacked features will be saved in ./data/downloaded_features/*. Move them to ./data/vas and ./data/vggsound so that the folder structure matches the structure of the demo files. By default, the scripts download BN Inception features; to download ResNet50 features instead, uncomment the corresponding lines in ./download_*_features.sh.
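If you prefer to script the move, something along these lines works (a sketch; `sort_parts` is our own hypothetical helper and assumes each unpacked part's name contains its dataset name):

```python
import shutil
from pathlib import Path

def sort_parts(src: Path, vas_dir: Path, vggsound_dir: Path) -> None:
    """Move every unpacked part from `src` into its dataset folder, routed by name."""
    for part in sorted(src.iterdir()):
        dst = vas_dir if 'vas' in part.name else vggsound_dir
        dst.mkdir(parents=True, exist_ok=True)
        shutil.move(str(part), str(dst / part.name))
```

For example, `sort_parts(Path('./data/downloaded_features'), Path('./data/vas'), Path('./data/vggsound'))`.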

If you wish to download the parts manually, use the following URL templates:

  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vas/*.tar
  • https://a3s.fi/swift/v1/AUTH_a235c0f452d648828f745589cde1219a/specvqgan_public/vggsound/*.tar

Also, make sure to check the md5 sums provided in ./data/md5sum_vas.md5 and ./data/md5sum_vggsound.md5 along with file names.
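The check can also be done from Python (a sketch; it assumes the checksum files use the usual `<md5>  <filename>` layout and that the downloaded parts sit next to them in `./data`):

```python
import hashlib
from pathlib import Path

def md5_ok(path: Path, expected: str) -> bool:
    """Compare a file's MD5 digest against the expected hex string, reading in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest() == expected

# each line of the checksum file is '<md5>  <filename>'
checksums = Path('./data/md5sum_vas.md5')
if checksums.exists():
    for line in checksums.read_text().splitlines():
        if line.strip():
            expected, name = line.split()
            print(name, 'OK' if md5_ok(Path('./data') / name, expected) else 'FAILED')
```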

Note that we distribute the features for the VGGSound dataset in 64 parts. Each part holds ~3k clips and can be used independently as a subset of the whole dataset (the parts are not class-stratified, though).

Extract Features Manually

For BN Inception features, we employ the same procedure as RegNet.

For ResNet50 features, we relied on the video_features repository (branch specvqgan) and used these commands:

# VAS (few hours on three 2080Ti)
strings=("dog" "fireworks" "drum" "baby" "gun" "sneeze" "cough" "hammer")
for class in "${strings[@]}"; do
    python main.py \
        --feature_type resnet50 \
        --device_ids 0 1 2 \
        --batch_size 86 \
        --extraction_fps 21.5 \
        --file_with_video_paths ./paths_to_mp4_${class}.txt \
        --output_path ./data/vas/features/${class}/feature_resnet50_dim2048_21.5fps \
        --on_extraction save_pickle
done

# VGGSound (6 days on three 2080Ti)
python main.py \
    --feature_type resnet50 \
    --device_ids 0 1 2 \
    --batch_size 86 \
    --extraction_fps 21.5 \
    --file_with_video_paths ./paths_to_mp4s.txt \
    --output_path ./data/vggsound/feature_resnet50_dim2048_21.5fps \
    --on_extraction save_pickle

Similar to BN Inception, we need to "tile" (cycle) a video if it is shorter than 10 s. For ResNet50, we achieve this by tiling the resulting frame-level features along the temporal dimension up to 215 steps (10 s at 21.5 fps), e.g. as follows:

```python
import pickle

import numpy as np

# `path` points to a pickled (T, 2048) ResNet50 feature array;
# `new_path` is the destination for the tiled copy
feats = pickle.load(open(path, 'rb')).astype(np.float32)
reps = 1 + (215 // feats.shape[0])
feats = np.tile(feats, (reps, 1))[:215, :]
with open(new_path, 'wb') as file:
    pickle.dump(feats, file)
```
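To see the tiling arithmetic on a toy example (made-up dimensions: 100 frames of 4-dim features instead of 2048-dim ResNet50 ones):

```python
import numpy as np

feats = np.arange(400, dtype=np.float32).reshape(100, 4)  # toy frame-level features
reps = 1 + (215 // feats.shape[0])  # 1 + 2 = 3 copies suffice to cover 215 steps
tiled = np.tile(feats, (reps, 1))[:215, :]
# the sequence cycles: step 100 repeats frame 0, and step 214 is frame 14 of the third copy
```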
<!-- <details> <summary>Downloading VGGSound from Scratch</summary>

1. We will rely on the AudioSet download script. To adapt it, we refactor `vggsound.csv` using the following script so that it can be used in an AudioSet downloader:

    ```python
    import pandas as pd

    VGGSOUND_PATH = './data/vggsound.csv'
    VGGSOUND_REF_PATH = './data/vggsound_ref.csv'

    vggsound_meta = pd.read_csv(VGGSOUND_PATH, names=['YTID', 'start_seconds', 'positive_labels', 'split'])
    vggsound_meta['end_seconds'] = vggsound_meta['start_seconds'] + 10
    vggsound_meta = vggsound_meta.drop(['split'], axis=1)
    vggsound_meta = vggsound_meta[['YTID', 'start_seconds', 'end_seconds', 'positive_labels']]
    print(list(vggsound_meta.columns))
    print(vggsound_meta.head())
    vggsound_meta.to_csv(VGGSOUND_REF_PATH, sep=',', index=None, header=None)
    ```

1. We also add 3 lines with `# placeholder` on top of `vggsound_ref.csv` to match the AudioSet style, which has some statistics there.
1. Rent an instance (GoogleCloud/AWS/Pouta) and allocate an IP. Disk 800 GB: 300 GB for video and 90 GB for audio + zipping + OS.
1. `git clone https://github.com/marl/audiosetdl` and check out the `ebd89c5` commit. This code provides a script to download AudioSet in parallel on several CPUs.
1. Create a file with the conda environment in `down_audioset.yaml` with content as follows:

    ```yaml
    name: down_audioset
    channels:
      - conda-forge
      - yaafe
      - defaults
    dependencies:
      - _libgcc_mutex=0.1=main
      - bzip2=1.0.8=h7b6447c_0
      - ca-certificates=2020.6.24=0
      - certifi=2020.6.20=py38_0
      - ffmpeg=4.2.2=h20bf706_0
      - freetype=2.10.2=h5ab3b9f_0
      - gmp=6.1.2=h6c8ec71_1
      - gnutls=3.6.5=h71b1129_1002
      - lame=3.100=h7b6447c_0
      - ld_impl_linux-64=2.33.1=h53a641e_7
      - libedit=3.1.20191231=h7b6447c_0
      - libffi=3.3=he6710b0_2
      - libflac=1.3.1=0
      - libgcc-ng=9.1.0=hdf63c60_0
      - libogg=1.3.2=0
      - libopus=1.3.1=h7b6447c_0
      - libpng=1.6.37=hbc83047_0
      - libstdcxx-ng=9.1.0=hdf63c60_0
      - libvpx=1.7.0=h439df22_0
      - mad=0.15.1b=he1b5a44_0
      - ncurses=6.2=he6710b0_1
      - nettle=3.4.1=hbb512f6_0
      - openh264=2.1.0=hd408876_0
      - openssl=1.1.1g=h516909a_0
      - pip=20.1.1=py38_1
      - python=3.8.3=hcff3b4d_2
      - python_abi=3.8=1_cp38
      - readline=8.0=h7b6447c_0
      - setuptools=47.3.1=py38_0
      - sqlite=3.32.3=h62c20be_0
      - tk=8.6.10=hbc83047_0
      - wheel=0.34.
    ```

</details> -->