DeepXi

Deep Xi: a deep learning approach to a priori SNR estimation, implemented in TensorFlow 2/Keras, for speech enhancement and robust ASR.


Deep Xi: A Deep Learning Approach to A Priori SNR Estimation for speech enhancement.

News

New journal paper:

  • On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation [link] [.pdf]

New trained model:

  • A trained MHANet is available in the model directory.

New journal paper:

  • Masked Multi-Head Self-Attention for Causal Speech Enhancement [link] [.pdf]

New journal paper:

  • Spectral distortion level resulting in a just-noticeable difference between an a priori signal-to-noise ratio estimate and its instantaneous case [link] [.pdf]

New conference paper:

  • Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement (INTERSPEECH 2021) [link]

Introduction

Deep Xi is implemented in TensorFlow 2/Keras and can be used for speech enhancement, noise estimation, mask estimation, and as a front-end for robust ASR. Deep Xi (where the Greek letter 'xi' or ξ is pronounced /zaɪ/ and is the symbol used in the literature for the a priori SNR) is a deep learning approach to a priori SNR estimation that was proposed in [1]. Some of its use cases include:

  • Minimum mean-square error (MMSE) approaches to speech enhancement.
  • MMSE-based noise PSD estimators, as in DeepMMSE [2].
  • Ideal binary mask (IBM) estimation for missing feature approaches.
  • Ideal ratio mask (IRM) estimation for source separation.
  • A front-end for robust ASR.
**Figure 1:** Deep Xi as a front-end for robust ASR. The noisy speech magnitude spectrogram **(a)**, a mixture of clean speech with *voice babble* noise at an SNR level of -5 dB, is the input to Deep Xi, which estimates the *a priori* SNR **(b)**. The *a priori* SNR estimate is used to compute an MMSE approach gain function, which is multiplied elementwise with the noisy speech magnitude spectrum to produce the clean speech magnitude spectrum estimate **(c)**. [MFCCs](https://github.com/anicolson/matlab_feat) are computed from the estimated clean speech magnitude spectrogram, producing the estimated clean speech cepstrogram **(d)**. The back-end system, [Deep Speech](https://github.com/mozilla/DeepSpeech), computes the hypothesis transcript from the estimated clean speech cepstrogram **(e)**.
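The MMSE use cases above share a pattern: Deep Xi supplies an a priori SNR estimate, a gain function is computed from it, and the gain is applied elementwise to the noisy magnitude spectrum. A minimal sketch using the classic Wiener gain G = ξ/(1+ξ), one of several MMSE-family gains (the array shapes and ξ values below are placeholders, not outputs of the actual model):

```python
import numpy as np

def wiener_gain(xi):
    # Wiener filter gain computed from the a priori SNR xi (linear scale)
    return xi / (1.0 + xi)

# usage: estimate the clean magnitude spectrum from the noisy one
noisy_mag = np.abs(np.random.randn(4, 257))  # placeholder |noisy STFT|, 4 frames x 257 bins
xi_hat = np.full_like(noisy_mag, 10.0)       # placeholder a priori SNR estimate (linear)
clean_mag_hat = wiener_gain(xi_hat) * noisy_mag
```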

How does Deep Xi work?

A training example is shown in Figure 2. A deep neural network (DNN) within the Deep Xi framework is fed the noisy-speech short-time magnitude spectrum as input. The training target of the DNN is a mapped version of the instantaneous a priori SNR (i.e. the mapped a priori SNR). The instantaneous a priori SNR is mapped to the interval [0,1] to improve the rate of convergence of the stochastic gradient descent algorithm used for training. The map is the cumulative distribution function (CDF) of the instantaneous a priori SNR, as given by Equation (13) in [1]. The statistics for the CDF are computed over a sample of the training set. An example of the mean and standard deviation of the sample for each frequency bin is shown in Figure 3. The training examples in each mini-batch are padded to the longest sequence length in the mini-batch. The sequence mask is used by TensorFlow to ensure that the DNN is not trained on the padding. During inference, the a priori SNR estimate is computed from the mapped a priori SNR using the sample statistics and Equation (12) from [2].
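The map and its inverse can be sketched with a Gaussian CDF, as described above. A minimal illustration, assuming scalar statistics mu and sigma (in practice these are computed per frequency bin over a sample of the training set; this helper is illustrative, not the repository's implementation):

```python
import numpy as np
from statistics import NormalDist

def map_xi_db(xi_db, mu, sigma):
    # Gaussian CDF maps the instantaneous a priori SNR (dB) into [0, 1]
    dist = NormalDist(mu, sigma)
    return np.array([dist.cdf(float(x)) for x in np.atleast_1d(xi_db)])

def unmap_xi_bar(xi_bar, mu, sigma):
    # inverse CDF recovers the a priori SNR estimate (dB) at inference time
    dist = NormalDist(mu, sigma)
    return np.array([dist.inv_cdf(float(p)) for p in np.atleast_1d(xi_bar)])
```

Because the inverse map uses the same statistics, mapping and unmapping round-trips the a priori SNR value.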

**Figure 2:** A training example for Deep Xi. Generated using eval_example.m.

**Figure 3:** The normal distribution for each frequency bin is computed from the mean and standard deviation of the instantaneous a priori SNR (dB) over a sample of the training set. Generated using eval_stats.m.

Current networks

Configurations for the following networks can be found in run.sh.

  • MHANet: Multi-head attention network [6].
  • RDLNet: Residual-dense lattice network [3].
  • ResNet: Residual network [2].
  • ResLSTM & ResBiLSTM: Residual long short-term memory (LSTM) network and residual bidirectional LSTM (ResBiLSTM) network [1].

Deep Xi utilising the MHANet (Deep Xi-MHANet) was proposed in [6]. It uses multi-head attention to efficiently model the long-range dependencies of noisy speech. Deep Xi-MHANet is shown in Figure 4. Deep Xi utilising a ResNet TCN (Deep Xi-ResNet) was proposed in [2]. It uses bottleneck residual blocks and a cyclic dilation rate. The network comprises approximately 2 million parameters and has a contextual field of approximately 8 seconds. Deep Xi utilising a ResLSTM network (Deep Xi-ResLSTM) was proposed in [1]. Each of its residual blocks contains a single LSTM cell. The network comprises approximately 10 million parameters.
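The roughly 8 second contextual field of Deep Xi-ResNet follows from the receptive-field arithmetic of its dilated causal convolutions. A hedged sketch, assuming a kernel size of 3, 40 blocks, a cyclic dilation schedule of 1, 2, 4, 8, 16, and a 16 ms frame shift (these specific values are illustrative assumptions, not taken from run.sh):

```python
def cyclic_dilations(n_blocks, max_power):
    # dilation rate cycles through 1, 2, 4, ..., 2**max_power across the blocks
    return [2 ** (i % (max_power + 1)) for i in range(n_blocks)]

def contextual_field(kernel, dilations):
    # each causal dilated conv adds (kernel - 1) * d frames of left context
    return 1 + sum((kernel - 1) * d for d in dilations)

dils = cyclic_dilations(40, max_power=4)  # 1, 2, 4, 8, 16, 1, 2, ...
frames = contextual_field(3, dils)        # 497 frames of context
```

Under these assumptions, 497 frames at a 16 ms frame shift is about 8 seconds, consistent with the contextual field quoted above.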

**Figure 4:** Deep Xi-MHANet from [6].

Available models

  • mhanet-1.1c (available in the model directory)
  • resnet-1.1n (available in the model directory)
  • resnet-1.1c (available in the model directory)

Each available model is trained using the Deep Xi dataset. Please see run.sh for more details about these networks.

There are multiple Deep Xi versions, comprising different networks and restrictions. An example of the ver naming convention is resnet-1.0c: the network type is given at the start of ver, followed by the version iteration (e.g. 1.0); versions ending in c are causal, while versions ending in n are non-causal.
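The ver convention can be decomposed mechanically. A small hypothetical helper to illustrate the pieces (not part of the repository):

```python
def parse_ver(ver):
    # e.g. "resnet-1.0c" -> network "resnet", iteration "1.0", causal True
    network, rest = ver.split('-')
    causal = rest.endswith('c')
    return {'network': network, 'iteration': rest.rstrip('cn'), 'causal': causal}
```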

Results

Note: results for the Deep Xi framework in this repository are reported for TensorFlow 2/Keras. The results in the papers were obtained using TensorFlow 1. All future work will be completed in TensorFlow 2/Keras.

DEMAND Voice Bank test set

Objective scores were obtained on the DEMAND Voicebank test set described here. Each Deep Xi model is trained on the DEMAND Voicebank training set. As in previous works, the objective scores are averaged over all tested conditions. CSIG, CBAK, and COVL are mean opinion score (MOS) predictors of the signal distortion, background-noise intrusiveness, and overall signal quality, respectively. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). The highest scores attained for each measure are indicated in boldface.

| Method | Gain | Causal | CSIG | CBAK | COVL | PESQ | STOI | SegSNR |
|--------|------|--------|------|------|------|------|------|--------|
| Noisy speech | -- | -- | 3.35 | 2.44 | 2.63 | 1.97 | 92 (91.5) | -- |
| Wiener | | Yes | 3.23 | 2.68 | 2.67 | 2.22 | -- | -- |
| SEGAN | -- | No | 3.48 | 2.94 | 2.80 | 2.16 | 93 | -- |
| WaveNet | -- | No | 3.62 | 3.23 | 2.98 | -- | -- | -- |
| [MMSE-GAN](https://ieeexplo
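Of the measures reported above, segmental SNR (SegSNR) is simple enough to sketch directly: the SNR between the clean and enhanced signals is computed per frame, clamped to a fixed range, and averaged. A minimal sketch (the frame length, hop, and clamping range are common choices, not necessarily those used for these results):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, hop=256, snr_min=-10.0, snr_max=35.0):
    # frame-wise SNR between clean and enhanced signals, clamped then averaged (dB)
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise_energy = np.sum((c - e) ** 2) + 1e-10
        snr = 10.0 * np.log10(np.sum(c ** 2) / noise_energy + 1e-10)
        snrs.append(float(np.clip(snr, snr_min, snr_max)))
    return float(np.mean(snrs))
```

The clamping prevents silent frames (near -inf SNR) and perfectly reconstructed frames (near +inf SNR) from dominating the average.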
