Speech Enhancement and Dereverberation with Diffusion-based Generative Models

This repository contains the official PyTorch implementations for the papers:

Simon Welker, Julius Richter, Timo Gerkmann, "Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain", ISCA Interspeech, Incheon, Korea, Sept. 2022. [bibtex]
Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Timo Gerkmann, "Speech Enhancement and Dereverberation with Diffusion-Based Generative Models", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2351-2364, 2023. [bibtex]
Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann, "Analysing diffusion-based generative approaches versus discriminative approaches for speech restoration", ICASSP, Rhodes Island, Greece, 2023. [bibtex]
Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, Timo Gerkmann, "EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation", ISCA Interspeech, Kos, Greece, Sept. 2024. [bibtex]
Julius Richter, Danilo de Oliveira, Timo Gerkmann, "Investigating Training Objectives for Generative Speech Enhancement", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hyderabad, India, April 2025. [bibtex]

Audio examples and supplementary materials are available on our SGMSE project page, EARS project page, and Investigating training objectives project page.

An interactive demo of generative speech enhancement in a Jupyter notebook can be found here.

Follow-up work

Please also check out our follow-up work with code available:

Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann, "StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2724-2737, 2023. [github]
Bunlong Lay, Simon Welker, Julius Richter, Timo Gerkmann, "Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement", ISCA Interspeech, Dublin, Ireland, Aug. 2023. [github]

Installation

Create a new virtual environment with Python 3.11 (we have not tested other Python versions, but they may work).
Install the package dependencies via pip install -r requirements.txt.
- Let pip resolve the dependencies for you. If you encounter any issues, please check requirements_version.txt for the exact versions we used.
If using W&B logging (default):
- Set up a wandb.ai account
- Log in via wandb login before running our code.
If not using W&B logging:
- Pass the option --nolog to train.py.
- Your logs will be stored as local CSVLogger logs in lightning_logs/.
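
When training with --nolog, the CSV logs can be inspected without extra tooling. The sketch below assumes Lightning's default CSVLogger layout (lightning_logs/version_<n>/metrics.csv). The helper names are hypothetical, and which metric columns appear depends on what train.py logs.

```python
import csv
from pathlib import Path


def latest_metrics_file(log_root="lightning_logs"):
    """Return metrics.csv of the highest-numbered version_<n> directory, or None."""
    versions = []
    for d in Path(log_root).glob("version_*"):
        suffix = d.name.split("_", 1)[1]
        metrics = d / "metrics.csv"
        if d.is_dir() and suffix.isdigit() and metrics.is_file():
            versions.append((int(suffix), metrics))
    return max(versions)[1] if versions else None


def read_rows(metrics_file):
    """Load all logged rows as dictionaries keyed by the CSV header."""
    with open(metrics_file, newline="") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    path = latest_metrics_file()
    if path is None:
        print("No CSV logs found under lightning_logs/")
    else:
        rows = read_rows(path)
        print(f"{path}: {len(rows)} logged rows")
        if rows:
            print("columns:", ", ".join(rows[0].keys()))
```

Version directories are compared numerically, so version_10 is treated as newer than version_2.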

Pretrained checkpoints

For the speech enhancement task, we offer pretrained checkpoints for models that have been trained on the VoiceBank-DEMAND and WSJ0-CHiME3 datasets, as described in our journal paper [2]. You can download them here.
- SGMSE+ trained on VoiceBank-DEMAND: gdown 1_H3EXvhcYBhOZ9QNUcD5VZHc6ktrRbwQ
- SGMSE+ trained on WSJ0-CHiME3: gdown 16K4DUdpmLhDNC7pJhBBc08pkSIn_yMPi
For the dereverberation task, we offer a checkpoint trained on our WSJ0-REVERB dataset. You can download it here.
- SGMSE+ trained on WSJ0-REVERB: gdown 1eiOy0VjHh9V9ZUFTxu1Pq2w19izl9ejD
- Note that this checkpoint works better with sampler settings --N 50 --snr 0.33.
For 48 kHz models [3], we offer pretrained checkpoints for speech enhancement, trained on the EARS-WHAM dataset, and for dereverberation, trained on the EARS-Reverb dataset. You can download them here.
- SGMSE+ trained on EARS-WHAM: gdown 1t_DLLk8iPH6nj8M5wGeOP3jFPaz3i7K5
- SGMSE+ trained on EARS-Reverb: gdown 1PunXuLbuyGkknQCn_y-RCV2dTZBhyE3V
For the models from the investigating training objectives paper [4], we offer pretrained checkpoints here.
- M1: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m1.ckpt
- M2: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m2.ckpt
- M3: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m3.ckpt
- M4: Please check our repo for EDM2SE
- M5: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m5.ckpt
- M6: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m6.ckpt
- M7: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m7.ckpt
- M8: wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/icassp2025_gense/checkpoints/m8.ckpt
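
The wget commands above share one URL pattern, so the downloads can also be scripted. This is a minimal sketch using only the Python standard library. The function names and the target directory are hypothetical, and M4 is left out because it is distributed through the EDM2SE repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = (
    "https://www2.informatik.uni-hamburg.de/sp/audio/publications/"
    "icassp2025_gense/checkpoints"
)
# M4 is not hosted at this location (see the EDM2SE repository).
AVAILABLE = ("m1", "m2", "m3", "m5", "m6", "m7", "m8")


def checkpoint_url(name):
    """Build the download URL for a model name such as "m1" or "M1"."""
    key = name.lower()
    if key not in AVAILABLE:
        raise ValueError(f"no checkpoint hosted at this location for {name!r}")
    return f"{BASE_URL}/{key}.ckpt"


def download(name, target_dir="checkpoints"):
    """Fetch one checkpoint into target_dir and return the local path."""
    url = checkpoint_url(name)  # validates the name before touching the disk
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    dest = out / f"{name.lower()}.ckpt"
    if not dest.exists():  # skip files that were already downloaded
        urlretrieve(url, dest)
    return dest


if __name__ == "__main__":
    for model in AVAILABLE:
        print(model, "->", checkpoint_url(model))
```

Calling download("m1") would then place the file at checkpoints/m1.ckpt, which can be passed to the --ckpt option.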
We offer a pretrained checkpoint for the Schrödinger bridge model trained on the EARS-WHAM + VB-DMD dataset. You can download it here. The model was trained with: batch_size=16, devices=2, backbone="ncsnpp_v2", loss_type="data_prediction", sde="sbve", sr=16000.
We provide pretrained checkpoints for SGMSE+ and the Schrödinger bridge (SB) model trained on Singing-ReverbFX here.
- SGMSE+ (artificial RIR): wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/itg2025-reverbfx/checkpoints/sgmse_artificial_rir_350k.ckpt
- SGMSE+ (natural RIR): wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/itg2025-reverbfx/checkpoints/sgmse_natural_rir_350k.ckpt
- SB (artificial RIR): wget https://www2.informatik.uni-hamburg.de/sp/audio/publications/itg2025-reverbfx/checkpoints/sb_artificial_rir_350k.ckpt
- Please cite the ReverbFX dataset paper if you use the data or the checkpoints:
```
@inproceedings{richter2025reverbfx,
  author={Julius Richter and Till Svajda and Timo Gerkmann},
  title={{ReverbFX}: A Dataset of Room Impulse Responses Derived from Reverb Effect Plugins for Singing Voice Dereverberation},
  year={2025},
  booktitle={ITG Conference on Speech Communication},
}
```

Usage:

For resuming training, you can use the --ckpt option of train.py.
For evaluating these checkpoints, use the --ckpt option of enhancement.py (see section Evaluation below).
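
As an illustration of how these options combine, the snippet below assembles an enhancement.py call for the WSJ0-REVERB checkpoint with the sampler settings recommended above. It is only a sketch: all paths are placeholders and the helper name is hypothetical. Only the flags themselves are taken from this README.

```python
import shlex


def enhancement_command(test_dir, enhanced_dir, ckpt, n_steps=None, snr=None):
    """Return the argument list for an enhancement.py run."""
    cmd = [
        "python", "enhancement.py",
        "--test_dir", test_dir,
        "--enhanced_dir", enhanced_dir,
        "--ckpt", ckpt,
    ]
    # Sampler settings are optional; omit them to use the script's defaults.
    if n_steps is not None:
        cmd += ["--N", str(n_steps)]
    if snr is not None:
        cmd += ["--snr", str(snr)]
    return cmd


# Placeholder paths; replace them with your own directories and checkpoint.
cmd = enhancement_command(
    "data/reverb/test", "out/reverb", "wsj0_reverb.ckpt", n_steps=50, snr=0.33
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) from the repository root, or the printed line can be pasted into a shell.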

Training

Training is done by executing train.py. A minimal running example with default settings (as in our paper [2]) can be run with

python train.py --base_dir <your_base_dir>

where your_base_dir should be a path to a folder containing subdirectories train/ and valid/ (optionally test/ as well). Each subdirectory must itself have two subdirectories clean/ and noisy/, with the same filenames present in both. We currently only support training with .wav files.
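
Before starting a long training run, it can help to verify this layout. The following sketch is not part of the repository. It uses a hypothetical helper, check_dataset_layout, that reports missing directories and .wav files without a counterpart in the other folder.

```python
from pathlib import Path


def check_dataset_layout(base_dir, subsets=("train", "valid")):
    """Return a list of problems found in the expected directory layout."""
    problems = []
    base = Path(base_dir)
    for subset in subsets:
        clean_dir = base / subset / "clean"
        noisy_dir = base / subset / "noisy"
        missing = [d for d in (clean_dir, noisy_dir) if not d.is_dir()]
        for d in missing:
            problems.append(f"missing directory: {d}")
        if missing:
            continue  # cannot compare filenames without both folders
        clean = {p.name for p in clean_dir.glob("*.wav")}
        noisy = {p.name for p in noisy_dir.glob("*.wav")}
        if not clean and not noisy:
            problems.append(f"no .wav files in {subset}/")
        for name in sorted(clean - noisy):
            problems.append(f"{subset}: {name} has no noisy counterpart")
        for name in sorted(noisy - clean):
            problems.append(f"{subset}: {name} has no clean counterpart")
    return problems


if __name__ == "__main__":
    # Placeholder path; point this at your own base directory.
    for problem in check_dataset_layout("path/to/your_base_dir"):
        print(problem)
```

An empty result means that train/ and valid/ both contain clean/ and noisy/ with matching filenames. To include the optional test/ split, pass subsets=("train", "valid", "test").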

To see all available training options, run python train.py --help. Note that the available options for the SDE and the backbone network change depending on which SDE and backbone you use. These can be set through the --sde and --backbone options.

Note:

Our journal paper [2] uses --backbone ncsnpp.
For the 48 kHz model [3], use --backbone ncsnpp_48k --n_fft 1534 --hop_length 384 --spec_factor 0.065 --spec_abs_exponent 0.667 --sigma-min 0.1 --sigma-max 1.0 --theta 2.0
Our Interspeech paper [1] uses --backbone dcunet. You need to pass --n_fft 512 to make it work.
- Also note that the default parameters for the spectrogram transformation in this repository are slightly different from the ones listed in the first (Interspeech) paper (--spec_factor 0.15 rather than --spec_factor 0.333), but we've found the value in this repository to generally perform better for both models [1] and [2].
For the investigating training objectives paper [4], we use --backbone ncsnpp_v2.
For the Schrödinger bridge model [4], we use e.g. --backbone ncsnpp_v2 --sde sbve --loss_type data_prediction --pesq_weight 5e-4.
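
The flag combinations in these notes can be kept in one place so they are not retyped for every run. The sketch below is one possible way to do that. The preset names and the train_command helper are made up for this example, while the flags are copied from the notes above.

```python
import shlex

# Flags copied from the notes above; the preset names are hypothetical.
PRESETS = {
    "journal": ["--backbone", "ncsnpp"],
    "interspeech": ["--backbone", "dcunet", "--n_fft", "512"],
    "48k": [
        "--backbone", "ncsnpp_48k", "--n_fft", "1534", "--hop_length", "384",
        "--spec_factor", "0.065", "--spec_abs_exponent", "0.667",
        "--sigma-min", "0.1", "--sigma-max", "1.0", "--theta", "2.0",
    ],
    "objectives": ["--backbone", "ncsnpp_v2"],
    "schroedinger_bridge": [
        "--backbone", "ncsnpp_v2", "--sde", "sbve",
        "--loss_type", "data_prediction", "--pesq_weight", "5e-4",
    ],
}


def train_command(preset, base_dir, extra=()):
    """Return the argument list for a train.py run with the given preset."""
    if preset not in PRESETS:
        raise KeyError(f"unknown preset {preset!r}; choose from {sorted(PRESETS)}")
    return ["python", "train.py", "--base_dir", base_dir, *PRESETS[preset], *extra]


if __name__ == "__main__":
    # Placeholder base directory; --nolog disables W&B logging.
    print(shlex.join(train_command("48k", "data/ears_wham", extra=["--nolog"])))
```

Further options, such as --ckpt for resuming training, can be appended through the extra argument.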

Evaluation

To evaluate on a test set, run

python enhancement.py --test_dir <your_test_dir> --enhanced_dir <your_enhanced_dir> --ckpt <path_to_model_checkpoint>

to generate the enhanced .wav files, a
