SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection


<p align="center"> <img src="figures/foa_salsa_16s_tight_with_text.png" title="SALSA feature for first-order ambisonics microphone (FOA)" width="48%"> <img src="figures/mic_salsa_16s_tight_with_text.png" title="SALSA feature for microphone array (MIC)" width="48%"> <em>Visualization of SALSA features of a 16-second audio clip in multi-source scenarios for first-order ambisonics microphone (FOA) (left) and 4-channel microphone array (MIC) (right).</em> </p>

Official implementation of SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection.

Update: SALSA-Lite feature has been added to the repo.

Thi Ngoc Tho Nguyen, Karn N. Watcharasupat, Ngoc Khanh Nguyen, Douglas L. Jones, Woon-Seng Gan. SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022. [arXiv] [IEEE Xplore]

Thi Ngoc Tho Nguyen, Douglas L. Jones, Karn N. Watcharasupat, Huy Phan, Woon-Seng Gan. SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays, in Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022. [arXiv] [IEEE Xplore]

Introduction to sound event localization and detection

Sound event localization and detection (SELD) is an emerging research field that unifies the tasks of sound event detection (SED) and direction-of-arrival estimation (DOAE) by jointly recognizing the sound classes and estimating the directions of arrival (DOA), the onsets, and the offsets of detected sound events. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks.

What is SALSA?

We propose a novel feature called Spatial Cue-Augmented Log-Spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-linear spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the types of microphone array, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable for different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone array (MIC).
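
For intuition, the following is a minimal NumPy/librosa sketch of this construction for a 4-channel input. It is a simplified illustration, not the repository's exact implementation (which, among other things, averages the covariance over larger windows and applies additional normalization and frequency-range handling for the spatial cues):

```python
import numpy as np
import librosa


def salsa_sketch(audio, n_fft=512, hop=300):
    """Simplified SALSA-style feature (illustration only): multichannel
    log-linear spectrograms stacked with the normalized principal
    eigenvector of the spatial covariance matrix at each T-F bin."""
    # audio: (n_ch, n_samples), e.g. n_ch = 4 for FOA or a 4-mic array
    X = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop) for ch in audio])
    n_ch, n_freq, n_frames = X.shape
    log_spec = np.log(np.abs(X) ** 2 + 1e-8)  # multichannel log-linear spectrograms

    eig = np.zeros((n_ch - 1, n_freq, n_frames))
    for f in range(n_freq):
        for t in range(n_frames):
            t0, t1 = max(0, t - 1), min(n_frames, t + 2)
            xs = X[:, f, t0:t1]                 # a few neighboring frames
            R = xs @ xs.conj().T / xs.shape[1]  # local spatial covariance
            _, V = np.linalg.eigh(R)            # eigenvalues in ascending order
            v = V[:, -1]                        # principal eigenvector
            v = v / (v[0] + 1e-8)               # normalize by the reference channel
            eig[:, f, t] = np.real(v[1:])       # FOA-style EIV; MIC would use the phase

    return np.concatenate([log_spec, eig], axis=0)  # 7 channels for a 4-channel input
```

For FOA, the real part of the normalized eigenvector plays the role of an intensity vector (EIV); for MIC, the phase of the eigenvector is extracted instead (EPV).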

Experimental results on the TAU-NIGENS Spatial Sound Events (TNSSE) 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.

SELD performance

Our ensemble model trained on SALSA features ranked second in the team category of the SELD task of the 2021 DCASE Challenge.

Network architecture

We use a convolutional recurrent neural network (CRNN) in this code. The network consists of a CNN that is based on ResNet22 for audio tagging, a two-layer BiGRU, and fully connected (FC) layers. The network can be adapted for different input features by setting the number of input channels in the first convolutional layer to that of the input features.

<p align="center"> <img src="figures/seld_network_architecture.png" title="Network architecture for CRNN" width="40%"> </p>
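
For orientation, here is a minimal PyTorch sketch of such a CRNN. It is illustrative only: the repository's CNN backbone is ResNet22-based, and the layer sizes and output heads below are simplified assumptions. The key point is that `n_input_channels` matches the feature's channel count, e.g. 7 for SALSA:

```python
import torch
import torch.nn as nn


class MiniCRNN(nn.Module):
    """Toy CRNN sketch: CNN encoder -> two-layer BiGRU -> FC heads.
    Not the repository's exact architecture."""

    def __init__(self, n_input_channels: int = 7, n_classes: int = 12):
        super().__init__()
        self.cnn = nn.Sequential(
            # The first conv's in_channels is set to the input feature's channels
            nn.Conv2d(n_input_channels, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.AvgPool2d((1, 2)),               # downsample along frequency
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),    # collapse the frequency axis
        )
        self.gru = nn.GRU(128, 128, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.fc_sed = nn.Linear(256, n_classes)      # event activity per class
        self.fc_doa = nn.Linear(256, 3 * n_classes)  # x/y/z DOA per class

    def forward(self, x):                   # x: (batch, channels, time, freq)
        h = self.cnn(x).squeeze(-1)         # (batch, 128, time)
        h, _ = self.gru(h.transpose(1, 2))  # (batch, time, 256)
        return torch.sigmoid(self.fc_sed(h)), torch.tanh(self.fc_doa(h))
```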

Visualization of SELD output

Visualization of ground truth and predicted azimuths for the test clip `fold6_room2_mix041` of the TAU-NIGENS Spatial Sound Events 2021 dataset. The legend lists the ground-truth events in chronological order; sound classes are color-coded.

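A minimal matplotlib sketch of this style of plot, using synthetic azimuth tracks (the class labels and trajectories below are purely illustrative, not dataset values):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t = np.arange(0.0, 16.0, 0.1)                       # 16 s clip at 100 ms frames
for i, name in enumerate(["class A", "class B"]):   # illustrative labels
    gt = 40 * np.sin(0.3 * t + i) + 30 * i          # synthetic ground-truth azimuth
    pred = gt + rng.normal(0.0, 5.0, t.size)        # synthetic predictions
    line, = plt.plot(t, gt, label=name)             # ground truth: solid line
    plt.plot(t, pred, ".", color=line.get_color(), alpha=0.5)  # predictions: dots
plt.xlabel("Time (s)")
plt.ylabel("Azimuth (deg)")
plt.legend()
plt.show()
```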

Comparison with state-of-the-art SELD systems

Simple CRNN models trained on SALSA features have been shown to achieve SELD performance similar to or even better than many complex state-of-the-art systems on the 2020 and 2021 TNSSE datasets. The following table lists the performance of our models trained with the proposed SALSA features alongside other state-of-the-art SELD systems. For more results, please refer to the papers listed above.

<p align="center"> <img src="figures/SELD_performance_on_test_split_of_TNSSE2021_dataset.png" title="SELD performance on test split of TNSSE2021 dataset" width="60%"> </p>

Prepare dataset and environment

This code was tested on Ubuntu 18.04 with Python 3.7, CUDA 11.0, and PyTorch 1.7.

  1. Install the following dependencies by running `pip install -r requirements.txt`, or manually install these modules:

    • numpy
    • scipy
    • pandas
    • scikit-learn
    • h5py
    • librosa
    • tqdm
    • pytorch 1.7
    • pytorch-lightning
    • tensorboardx
    • pyyaml
    • munch
  2. Download the TAU-NIGENS Spatial Sound Events 2021 dataset here. This code also works with the TAU-NIGENS Spatial Sound Events 2020 dataset here.

  3. Extract everything into the same folder.

  4. Data file structure should look like this:

./
├── feature_extraction.py
├── ...
└── data/
    ├──foa_dev
    │   ├── fold1_room1_mix001.wav
    │   ├── fold1_room1_mix002.wav  
    │   └── ...
    ├──foa_eval
    ├──metadata_dev
    ├──metadata_eval (might not be available yet)
    ├──mic_dev
    └──mic_eval

For the TAU-NIGENS Spatial Sound Events 2021 dataset, please move the wav files from the subfolders `dev_train`, `dev_val`, and `dev_test` into the outer folder `foa_dev` or `mic_dev`, as in the script below.
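
A small convenience script for this step might look like the following (assuming the folder layout above; `data_root` is an assumption to adjust for your setup):

```python
import shutil
from pathlib import Path

# Flatten the 2021 dataset's split subfolders into the outer dev folders.
data_root = Path("./data")  # adjust to your dataset location
for fmt_dir in ("foa_dev", "mic_dev"):
    for split in ("dev_train", "dev_val", "dev_test"):
        split_dir = data_root / fmt_dir / split
        if not split_dir.is_dir():
            continue
        for wav in split_dir.glob("*.wav"):
            shutil.move(str(wav), str(data_root / fmt_dir / wav.name))
        split_dir.rmdir()  # remove the now-empty subfolder
```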

Feature extraction

*Note:* under `contrib`, you can find functionality to run SALSA and SALSA-Lite on the fly (on CPU) and for an arbitrary number of microphones. This may be useful if you are interested in more flexible setups (e.g., involving data augmentation or real-time processing).

Our code supports the following features:

| Name | Format | Component | Number of channels |
| :--- | :----: | :--- | :----: |
| melspeciv | FOA | multichannel log-mel spectrograms + intensity vector | 7 |
| linspeciv | FOA | multichannel log-linear spectrograms + intensity vector | 7 |
| melspecgcc | MIC | multichannel log-mel spectrograms + GCC-PHAT | 10 |
| linspecgcc | MIC | multichannel log-linear spectrograms + GCC-PHAT | 10 |
| SALSA | FOA | multichannel log-linear spectrograms + eigenvector-based intensity vector (EIV) | 7 |
| SALSA | MIC | multichannel log-linear spectrograms + eigenvector-based phase vector (EPV) | 7 |
| SALSA-IPD | MIC | multichannel log-linear spectrograms + interchannel phase difference (IPD) | 7 |
| SALSA-Lite | MIC | multichannel log-linear spectrograms + normalized interchannel phase difference (NIPD) | 7 |

Note: the channel counts above are calculated for four-channel inputs: for example, 4 log-linear spectrograms + 3 spatial-cue channels = 7, and 4 log-mel spectrograms + 6 GCC-PHAT channel pairs = 10.
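
Since SALSA-Lite replaces the per-bin eigenvector analysis with directly computed phase differences, it is much cheaper to compute. Below is a hedged NumPy sketch of the idea, assuming the -c/(2πf) phase scaling from the SALSA-Lite paper (a simplification, not the repository's exact code):

```python
import numpy as np
import librosa


def salsa_lite_sketch(audio, fs=24000, n_fft=512, hop=300, c=343.0):
    """Simplified SALSA-Lite feature (illustration only): log-linear
    spectrograms + normalized interchannel phase differences (NIPD)."""
    # audio: (n_ch, n_samples) microphone-array signals
    X = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop) for ch in audio])
    log_spec = np.log(np.abs(X) ** 2 + 1e-8)

    freqs = librosa.fft_frequencies(sr=fs, n_fft=n_fft)  # (n_freq,)
    scale = -c / (2 * np.pi * np.maximum(freqs, 1.0))    # avoid divide-by-zero at DC
    ipd = np.angle(X[:1].conj() * X[1:])                 # phase w.r.t. the first mic
    nipd = scale[None, :, None] * ipd                    # ~frequency-independent cue

    return np.concatenate([log_spec, nipd], axis=0)      # 7 channels for 4 mics
```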

To extract SALSA features, edit the data and feature directories accordingly in `tnsse_2021_salsa_feature_config.yml` in the `dataset/configs/` folder, then run `make salsa`.

To extract SALSA-Lite features, edit the data and feature directories accordingly in `tnsse_2021_salsa_lite_feature_config.yml` in the `dataset/configs/` folder, then run `make salsa-lite`.

To extract `linspeciv`, `melspeciv`, `linspecgcc`, or `melspecgcc` features, edit the data and feature directories accordingly in `tnsse_2021_feature_config.yml` in the `dataset/configs/` folder, then run `make feature`.

Training and inference

To train a SELD model with SALSA features, edit `feature_root_dir` and `gt_meta_root_dir` in the experiment config `experiments/configs/seld.yml`, then run `make train`.

To train a SELD model with SALSA-Lite features, edit `feature_root_dir` and `gt_meta_root_dir` in the experiment config `experiments/configs/seld_salsa_lite.yml`, then run `make train`.

To run inference, run `make inference`. To evaluate the outputs, edit the `Makefile` accordingly and run `make evaluate`.

DCASE2021 Sound Event Localization and Detection Challenge

We participated in the DCASE2021 Sound Event Localization and Detection Challenge. Our model ensemble ranked 2nd in the team ranking.
