Multisensory
Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
This repository contains code for the paper:
Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features. arXiv, 2018
Contents
This release includes code and models for:
- On/off-screen source separation: separating the speech of an on-screen speaker from background sounds.
- Blind source separation: audio-only source separation using u-net and PIT.
- Sound source localization: visualizing the parts of a video that correspond to sound-making actions.
- Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).
Setup
- Install Python 2.7
- Install ffmpeg
- Install TensorFlow, e.g. through pip:

  ```bash
  pip install tensorflow      # for CPU evaluation only
  pip install tensorflow-gpu  # for GPU support
  ```

  We used TensorFlow version 1.8, which can be installed with:

  ```bash
  pip install tensorflow-gpu==1.8
  ```

- Install the other Python dependencies:

  ```bash
  pip install numpy matplotlib pillow scipy
  ```

- Download the pretrained models and sample data:

  ```bash
  ./download_models.sh
  ./download_sample_data.sh
  ```
Pretrained audio-visual features
We provide the weights for our fused audio-visual network, whose features were learned through self-supervised learning. See shift_example.py for a simple example that uses these pretrained features.
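As a quick sanity check before building on the model, you can list what the downloaded checkpoint contains. The sketch below assumes a TensorFlow 1.x environment and uses a placeholder checkpoint path; adjust it to wherever download_models.sh put the model (shift_example.py remains the authoritative example).

```python
# Sketch: inspect the pretrained checkpoint's variables (TensorFlow 1.x).
# The path below is a placeholder -- point it at the model downloaded by
# download_models.sh before running.
import tensorflow as tf

CHECKPOINT = '../results/nets/shift/net.tf'  # hypothetical location

# list_variables returns (name, shape) pairs for every variable in the checkpoint.
for name, shape in tf.train.list_variables(CHECKPOINT):
    print('%s %s' % (name, shape))
```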
Audio-visual source separation
To try the on/off-screen source separation model, run:
```bash
python sep_video.py ../data/translator.mp4 --model full --duration_mult 4 --out ../results/
```
This will separate the on-screen speaker's voice from that of an off-screen speaker. It will write the separated video files to ../results/ and also display them in a local webpage for easier viewing. This produces the following videos (click to watch):
| Input | On-screen | Off-screen |
| ----- | --------- | ---------- |
| <a href="https://youtu.be/4kVNzxFeboo"><img src="doc/translator_input.jpg" width="200"></a> | <a href="https://youtu.be/XvJVXsHyBKw"><img src="doc/translator_input.jpg" width="200"></a> | <a href="https://youtu.be/NFll7nfmwO8"><img src="doc/translator_input.jpg" width="200"></a> |
We can visually mask out one of the two on-screen speakers, thereby removing their voice:
```bash
python sep_video.py ../data/crossfire.mp4 --model full --mask l --out ../results/
python sep_video.py ../data/crossfire.mp4 --model full --mask r --out ../results/
```
This produces the following videos (click to watch):
| Source | Left | Right |
| ------ | ---- | ----- |
| <a href="https://youtu.be/H9CgWJToF_s"><img src="doc/crossfire_input.jpg" width="200"/></a> | <a href="https://youtu.be/9jPaA8ttI6A"><img src="doc/crossfire_l.jpg" width="200"/></a> | <a href="https://youtu.be/M4ACgIWuiWM"><img src="doc/crossfire_r.jpg" width="200"/></a> |
Blind (audio-only) source separation
This baseline trains a u-net model to minimize a permutation-invariant (PIT) loss.
```bash
python sep_video.py ../data/translator.mp4 --model unet_pit --duration_mult 4 --out ../results/
```
The model will write the two separated streams in an arbitrary order.
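To make the objective concrete, here is a minimal NumPy sketch of a two-source permutation-invariant loss. It illustrates the idea only and is not the repository's implementation; the spectrogram shapes are arbitrary.

```python
# Minimal sketch of a permutation-invariant (PIT) loss for two sources.
import itertools
import numpy as np

def pit_loss(pred, target):
    """pred, target: arrays of shape [num_sources, freq, time] (magnitude spectrograms)."""
    num_sources = pred.shape[0]
    best = np.inf
    # Try every assignment of predicted streams to reference streams and
    # keep the one with the lowest mean L1 error.
    for perm in itertools.permutations(range(num_sources)):
        err = sum(np.abs(pred[p] - target[t]).mean() for t, p in enumerate(perm))
        best = min(best, err / num_sources)
    return best

# Toy example: the targets are the predictions in swapped order.
pred = np.random.rand(2, 257, 100)
target = pred[::-1]
print(pit_loss(pred, target))  # ~0: the loss ignores the output ordering
```

Because the loss scores the best assignment of outputs to references, the network is free to emit the streams in either order, which is why the separated outputs come back unordered.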
Visualizing the locations of sound sources
To view the self-supervised network's class activation map (CAM), use the --cam flag:
```bash
python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/
```
This produces a video in which the CAM is overlaid as a heat map:
<a href = "https://youtu.be/u99MdLBDnJc"><img src="doc/crossfire_cam.jpg" width="300"/></a>
Action recognition and fine-tuning
We have provided example code for training an action recognition model (e.g. on the UCF-101 dataset) in videocls.py. This involves fine-tuning our pretrained audio-visual network. It is also possible to train this network with only visual data (no audio).
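The usual recipe looks roughly like the sketch below: restore the pretrained backbone, attach a fresh classification head, and train on the labeled videos. The backbone constructor, variable scope, and checkpoint path are placeholders here; videocls.py is the repository's actual fine-tuning script.

```python
# Sketch (TensorFlow 1.x): fine-tune a pretrained backbone for action recognition.
import tensorflow as tf

NUM_CLASSES = 101  # e.g. UCF-101
CHECKPOINT = '../results/nets/shift/net.tf'  # hypothetical pretrained model path

video = tf.placeholder(tf.float32, [None, 64, 224, 224, 3])
labels = tf.placeholder(tf.int64, [None])

# feats = build_backbone(video)               # pretrained 3D CNN (defined in this repo)
feats = tf.reduce_mean(video, axis=[1, 2, 3])  # stand-in pooled feature for this sketch

logits = tf.layers.dense(feats, NUM_CLASSES, name='action_logits')  # new head
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
train_op = tf.train.MomentumOptimizer(1e-4, 0.9).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Restore only the backbone variables; the new head keeps its random init.
    # backbone_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='backbone')
    # tf.train.Saver(backbone_vars).restore(sess, CHECKPOINT)
    # ...then run train_op on mini-batches of (video, label) pairs...
```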
Citation
If you use this code in your research, please consider citing our paper:
```
@article{multisensory2018,
  title={Audio-Visual Scene Analysis with Self-Supervised Multisensory Features},
  author={Owens, Andrew and Efros, Alexei A},
  journal={arXiv preprint arXiv:1804.03641},
  year={2018}
}
```
Updates
- 11/08/18: Fixed a bug in the class activation map example code. Added TensorFlow 1.9 compatibility.
Acknowledgements
Our u-net code draws from this implementation of pix2pix.
