DeepXi

Deep Xi: a deep learning approach to a priori SNR estimation, implemented in TensorFlow 2/Keras, for speech enhancement and robust ASR.


Deep Xi: A Deep Learning Approach to A Priori SNR Estimation for speech enhancement.

News

New journal paper:

  • On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation [link] [.pdf]

New trained model:

  • A trained MHANet is available in the model directory.

New journal paper:

  • Masked Multi-Head Self-Attention for Causal Speech Enhancement [link] [.pdf]

New journal paper:

  • Spectral distortion level resulting in a just-noticeable difference between an a priori signal-to-noise ratio estimate and its instantaneous case [link] [.pdf]

New conference paper:

  • Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement (INTERSPEECH 2021) [link]

Introduction

Deep Xi is implemented in TensorFlow 2/Keras and can be used for speech enhancement, noise estimation, mask estimation, and as a front-end for robust ASR. Deep Xi (where the Greek letter 'xi' or ξ is pronounced /zaɪ/ and is the symbol used in the literature for the a priori SNR) is a deep learning approach to a priori SNR estimation that was proposed in [1]. Some of its use cases include:

  • Minimum mean-square error (MMSE) approaches to speech enhancement.
  • MMSE-based noise PSD estimators, as in DeepMMSE [2].
  • Ideal binary mask (IBM) estimation for missing feature approaches.
  • Ideal ratio mask (IRM) estimation for source separation.
  • A front-end for robust ASR.
**Figure 1:** Deep Xi as a front-end for robust ASR. The noisy speech magnitude spectrogram **(a)**, a mixture of clean speech with *voice babble* noise at an SNR level of -5 dB, is the input to Deep Xi, which estimates the *a priori* SNR **(b)**. The *a priori* SNR estimate is used to compute an MMSE approach gain function, which is multiplied elementwise with the noisy speech magnitude spectrum to produce the clean speech magnitude spectrum estimate **(c)**. [MFCCs](https://github.com/anicolson/matlab_feat) are computed from the estimated clean speech magnitude spectrogram, producing the estimated clean speech cepstrogram **(d)**. The back-end system, [Deep Speech](https://github.com/mozilla/DeepSpeech), computes the hypothesis transcript from the estimated clean speech cepstrogram **(e)**.
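The MMSE use cases above share a pattern: Deep Xi supplies an a priori SNR estimate, a gain function is computed from it, and the gain is applied elementwise to the noisy magnitude spectrum. A minimal sketch using the classic Wiener gain G = ξ/(1+ξ), one of several MMSE-family gains (the array shapes and ξ values below are placeholders, not outputs of the actual model):

```python
import numpy as np

def wiener_gain(xi):
    # Wiener filter gain computed from the a priori SNR xi (linear scale)
    return xi / (1.0 + xi)

# usage: estimate the clean magnitude spectrum from the noisy one
noisy_mag = np.abs(np.random.randn(4, 257))  # placeholder |noisy STFT|, 4 frames x 257 bins
xi_hat = np.full_like(noisy_mag, 10.0)       # placeholder a priori SNR estimate (linear)
clean_mag_hat = wiener_gain(xi_hat) * noisy_mag
```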

How does Deep Xi work?

A training example is shown in Figure 2. A deep neural network (DNN) within the Deep Xi framework is fed the noisy-speech short-time magnitude spectrum as input. The training target of the DNN is a mapped version of the instantaneous a priori SNR (i.e. the mapped a priori SNR). The instantaneous a priori SNR is mapped to the interval [0,1] to improve the rate of convergence of the stochastic gradient descent algorithm used for training. The map is the cumulative distribution function (CDF) of the instantaneous a priori SNR, as given by Equation (13) in [1]. The statistics for the CDF are computed over a sample of the training set. An example of the mean and standard deviation of the sample for each frequency bin is shown in Figure 3. The training examples in each mini-batch are padded to the longest sequence length in the mini-batch. The sequence mask is used by TensorFlow to ensure that the DNN is not trained on the padding. During inference, the a priori SNR estimate is computed from the mapped a priori SNR using the sample statistics and Equation (12) from [2].
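The map and its inverse can be sketched with a Gaussian CDF, as described above. A minimal illustration, assuming scalar statistics mu and sigma (in practice these are computed per frequency bin over a sample of the training set; this helper is illustrative, not the repository's implementation):

```python
import numpy as np
from statistics import NormalDist

def map_xi_db(xi_db, mu, sigma):
    # Gaussian CDF maps the instantaneous a priori SNR (dB) into [0, 1]
    dist = NormalDist(mu, sigma)
    return np.array([dist.cdf(float(x)) for x in np.atleast_1d(xi_db)])

def unmap_xi_bar(xi_bar, mu, sigma):
    # inverse CDF recovers the a priori SNR estimate (dB) at inference time
    dist = NormalDist(mu, sigma)
    return np.array([dist.inv_cdf(float(p)) for p in np.atleast_1d(xi_bar)])
```

Because the inverse map uses the same statistics, mapping and unmapping round-trips the a priori SNR value.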

**Figure 2:** A training example for Deep Xi. Generated using eval_example.m.

**Figure 3:** The normal distribution for each frequency bin is computed from the mean and standard deviation of the instantaneous a priori SNR (dB) over a sample of the training set. Generated using eval_stats.m.

Current networks

Configurations for the following networks can be found in run.sh.

  • MHANet: Multi-head attention network [6].
  • RDLNet: Residual-dense lattice network [3].
  • ResNet: Residual network [2].
  • ResLSTM & ResBiLSTM: Residual long short-term memory (LSTM) network and residual bidirectional LSTM (ResBiLSTM) network [1].

Deep Xi utilising the MHANet (Deep Xi-MHANet) was proposed in [6]. It uses multi-head attention to efficiently model the long-range dependencies of noisy speech. Deep Xi-MHANet is shown in Figure 4. Deep Xi utilising a ResNet TCN (Deep Xi-ResNet) was proposed in [2]. It uses bottleneck residual blocks and a cyclic dilation rate. The network comprises approximately 2 million parameters and has a contextual field of approximately 8 seconds. Deep Xi utilising a ResLSTM network (Deep Xi-ResLSTM) was proposed in [1]. Each of its residual blocks contains a single LSTM cell. The network comprises approximately 10 million parameters.
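The roughly 8 second contextual field of Deep Xi-ResNet follows from the receptive-field arithmetic of its dilated causal convolutions. A hedged sketch, assuming a kernel size of 3, 40 blocks, a cyclic dilation schedule of 1, 2, 4, 8, 16, and a 16 ms frame shift (these specific values are illustrative assumptions, not taken from run.sh):

```python
def cyclic_dilations(n_blocks, max_power):
    # dilation rate cycles through 1, 2, 4, ..., 2**max_power across the blocks
    return [2 ** (i % (max_power + 1)) for i in range(n_blocks)]

def contextual_field(kernel, dilations):
    # each causal dilated conv adds (kernel - 1) * d frames of left context
    return 1 + sum((kernel - 1) * d for d in dilations)

dils = cyclic_dilations(40, max_power=4)  # 1, 2, 4, 8, 16, 1, 2, ...
frames = contextual_field(3, dils)        # 497 frames of context
```

Under these assumptions, 497 frames at a 16 ms frame shift is about 8 seconds, consistent with the contextual field quoted above.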

**Figure 4:** Deep Xi-MHANet from [6].

Available models

  • mhanet-1.1c (available in the model directory)
  • resnet-1.1n (available in the model directory)
  • resnet-1.1c (available in the model directory)

Each available model is trained using the Deep Xi dataset. Please see run.sh for more details about these networks.

There are multiple Deep Xi versions, comprising different networks and restrictions. An example of the ver naming convention is resnet-1.0c: the network type is given at the start of ver, followed by the version iteration (e.g. 1.0); versions ending in c are causal, while versions ending in n are non-causal.
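The ver convention can be decomposed mechanically. A small hypothetical helper to illustrate the pieces (not part of the repository):

```python
def parse_ver(ver):
    # e.g. "resnet-1.0c" -> network "resnet", iteration "1.0", causal True
    network, rest = ver.split('-')
    causal = rest.endswith('c')
    return {'network': network, 'iteration': rest.rstrip('cn'), 'causal': causal}
```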

Results

Note: results for the Deep Xi framework in this repository are reported for TensorFlow 2/Keras. The results in the papers were obtained using TensorFlow 1. All future work will be completed in TensorFlow 2/Keras.

DEMAND Voice Bank test set

Objective scores were obtained on the DEMAND Voicebank test set described here. Each Deep Xi model is trained on the DEMAND Voicebank training set. As in previous works, the objective scores are averaged over all tested conditions. CSIG, CBAK, and COVL are mean opinion score (MOS) predictors of the signal distortion, background-noise intrusiveness, and overall signal quality, respectively. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). The highest scores attained for each measure are indicated in boldface.

| Method | Gain | Causal | CSIG | CBAK | COVL | PESQ | STOI | SegSNR |
|--------|------|--------|------|------|------|------|------|--------|
| Noisy speech | -- | -- | 3.35 | 2.44 | 2.63 | 1.97 | 92 (91.5) | -- |
| Wiener | | Yes | 3.23 | 2.68 | 2.67 | 2.22 | -- | -- |
| SEGAN | -- | No | 3.48 | 2.94 | 2.80 | 2.16 | 93 | -- |
| WaveNet | -- | No | 3.62 | 3.23 | 2.98 | -- | -- | -- |
| [MMSE-GAN](https://ieeexplo
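Of the measures reported above, segmental SNR (SegSNR) is simple enough to sketch directly: the SNR between the clean and enhanced signals is computed per frame, clamped to a fixed range, and averaged. A minimal sketch (the frame length, hop, and clamping range are common choices, not necessarily those used for these results):

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=512, hop=256, snr_min=-10.0, snr_max=35.0):
    # frame-wise SNR between clean and enhanced signals, clamped then averaged (dB)
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise_energy = np.sum((c - e) ** 2) + 1e-10
        snr = 10.0 * np.log10(np.sum(c ** 2) / noise_energy + 1e-10)
        snrs.append(float(np.clip(snr, snr_min, snr_max)))
    return float(np.mean(snrs))
```

The clamping prevents silent frames (near -inf SNR) and perfectly reconstructed frames (near +inf SNR) from dominating the average.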
