# Controllable LPCNet
Official repository for the paper "Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet" [1]. Performs pitch-shifting and time-stretching of speech recordings. Audio examples can be found here. The original LPCNet [2] code can be found here. If you use this code in an academic publication, please cite our paper.
## Installation
### With Docker
Docker installation assumes recent versions of Docker and NVidia Docker are installed.
In order to perform variable-ratio time-stretching, you must first download HTK 3.4.0 (see download instructions), which is used for forced phoneme alignment. Note that you only need to download HTK; you do not have to install it locally. However, HTK must be downloaded into this directory so that it is part of the Docker build context.
Next, build the image.

```
docker build --tag clpcnet --build-arg HTK=<path_to_htk> .
```
Now we can run a command within the Docker image.

```
docker run -itd --rm --name "clpcnet" --shm-size 32g --gpus all \
    -v <absolute_path_of_runs_directory>:/clpcnet/runs \
    -v <absolute_path_of_data_directory>:/clpcnet/data \
    clpcnet:latest \
    <command>
```
Here, `<command>` is the command you would like to execute within the container, prefaced with the correct Python path within the Docker image (e.g., `/opt/conda/envs/clpcnet/bin/python -m clpcnet.train --gpu 0`).
### Without Docker
Installation assumes we start from a clean install of Ubuntu 18 or 20 with a
recent CUDA driver and conda.
Install the apt-get dependencies.

```
sudo apt-get update && \
sudo apt-get install -y \
    ffmpeg \
    gcc-multilib \
    libsndfile1 \
    sox
```
Build the C preprocessing code.

```
make
```
Create a new conda environment and install the conda dependencies.

```
conda create -n clpcnet python=3.7 cudatoolkit=10.0 cudnn=7.6 -y
conda activate clpcnet
```
Finally, install the Python dependencies.

```
pip install -e .
```
If you would like to perform variable-ratio time-stretching, you must also download and install HTK 3.4.0 (see download instructions), which is used for forced phoneme alignment.
## Inference
`clpcnet` can be used as a library (via `import clpcnet`) or as an application from the command line.
### Library inference
To perform pitch-shifting or time-stretching on audio already loaded into memory, use `clpcnet.from_audio`. To do this with audio saved in a file, use `clpcnet.from_file`. You can use `clpcnet.to_file` or `clpcnet.from_file_to_file` to save the results to a file. To process many files at once with multiprocessing, use `clpcnet.from_files_to_files`. To perform vocoding from acoustic features, use `clpcnet.from_features`.
See the `clpcnet` API for full argument lists. Below is an example of performing constant-ratio pitch-shifting with a ratio of 0.8 and constant-ratio time-stretching with a ratio of 1.2.
```python
import clpcnet

# Load audio from disk and resample to 16 kHz
audio_file = 'audio.wav'
audio = clpcnet.load.audio(audio_file)

# Perform constant-ratio pitch-shifting and time-stretching
generated = clpcnet.from_audio(audio, constant_stretch=1.2, constant_shift=0.8)
```
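When choosing a pitch-shift ratio, it can help to think in semitones: a ratio `r` corresponds to `12 * log2(r)` semitones. The helpers below are not part of `clpcnet`; they are a small standalone sketch for converting between the two representations.

```python
import math

def ratio_to_semitones(ratio):
    """Convert a pitch-shift ratio (e.g., 0.8) to a shift in semitones."""
    return 12 * math.log2(ratio)

def semitones_to_ratio(semitones):
    """Convert a shift in semitones to a pitch-shift ratio."""
    return 2 ** (semitones / 12)

# A ratio of 0.8 lowers the pitch by roughly 3.86 semitones
print(round(ratio_to_semitones(0.8), 2))  # -3.86
```

For example, to shift speech down a perfect fourth (5 semitones), pass `constant_shift=semitones_to_ratio(-5)`.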
### Command-line inference
The command-line interface for inference wraps the arguments of `clpcnet.from_files_to_files`.
To run inference using a pretrained model, use the module entry point. For example, to resynthesize speech without modification, use the following.

```
python -m clpcnet --audio_files audio.wav --output_files output.wav
```
See the command-line interface documentation for a full list and description of arguments.
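Both the command-line interface and `clpcnet.from_files_to_files` take parallel lists of input and output paths. A small standard-library sketch for building those lists from a directory of WAV files (the directory names and the commented-out call are placeholders; see the `clpcnet` API for the exact `from_files_to_files` signature):

```python
from pathlib import Path

def parallel_paths(input_dir, output_dir, suffix='.wav'):
    """Build matching input/output path lists for batch processing."""
    inputs = sorted(Path(input_dir).glob(f'*{suffix}'))
    outputs = [Path(output_dir) / f.name for f in inputs]
    return [str(f) for f in inputs], [str(f) for f in outputs]

# Placeholder usage:
# audio_files, output_files = parallel_paths('data/wav', 'runs/output')
# clpcnet.from_files_to_files(audio_files, output_files)
```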
## Replicating results
Here we demonstrate how to replicate the results of the paper "Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet" on the VCTK dataset [3]. Time estimates are given using a 12-core, 12-thread CPU with a 2.1 GHz clock and an NVIDIA V100 GPU. First, download VCTK so that the root directory of VCTK is `./data/vctk` (time estimate: ~10 hours).
### Partition the dataset
Partitions for each dataset are saved in `./clpcnet/assets/partition/`. The VCTK partition file can be explicitly recomputed as follows.

```
# Time estimate: < 10 seconds
python -m clpcnet.partition
```
### Preprocess the dataset
```
# Compute YIN pitch and periodicity, BFCCs, and LPC coefficients
# Time estimate: ~15 minutes
python -m clpcnet.preprocess

# Compute CREPE pitch and periodicity (Section 3.1)
# Time estimate: ~2.5 hours
python -m clpcnet.pitch --gpu 0

# Perform data augmentation (Section 3.2)
# Time estimate: ~17 hours
python -m clpcnet.preprocess.augment --gpu 0
```

All files are saved to `./runs/cache/vctk` by default.
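As a quick sanity check after preprocessing, you can tally what landed in the cache. This helper is not part of `clpcnet`; it simply counts files by extension under the cache directory named above and returns an empty tally if the directory is missing.

```python
from collections import Counter
from pathlib import Path

def cache_summary(cache_dir='runs/cache/vctk'):
    """Count cached files by extension, recursively."""
    root = Path(cache_dir)
    if not root.is_dir():
        return Counter()
    return Counter(f.suffix for f in root.rglob('*') if f.is_file())

# print(cache_summary())
```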
### Train the model
```
# Time estimate: ~75 hours
python -m clpcnet.train --gpu 0
```

Checkpoints are saved to `./runs/checkpoints/clpcnet/`. Log files are saved to `./runs/logs/clpcnet`.
### Evaluate the model
We perform evaluation on VCTK [3], as well as on modified versions of DAPS [4] and RAVDESS [5]. We modify DAPS by segmenting the dataset into sentences and providing transcripts for each sentence. To reproduce results on DAPS, download our modified DAPS dataset on Zenodo and decompress the tarball within `data/`. We modify RAVDESS by performing speech enhancement with HiFi-GAN [6]. To reproduce results on RAVDESS, download our modified RAVDESS dataset on Zenodo and decompress the tarball within `data/`.
To create the DAPS partition file for evaluation, run the following.

```
# Partition the modified DAPS dataset
# Time estimate: < 10 seconds
python -m clpcnet.partition --dataset daps-segmented
```
We create two partition files for RAVDESS. The first is used for variable-ratio pitch-shifting and time-stretching, and creates pairs of audio files. The second samples files from those pairs for constant-ratio evaluation.
```
# Create pairs for variable-ratio evaluation from the modified RAVDESS dataset
# Time estimate: ~1 hour
python -m clpcnet.partition --dataset ravdess-variable --gpu 0

# Sample files for constant-ratio evaluation
# Time estimate: ~3 seconds
python -m clpcnet.partition --dataset ravdess-hifi
```
#### Constant-ratio objective evaluation
We perform constant-ratio objective evaluation on VCTK [3], as well as our modified DAPS [4] and RAVDESS [5] datasets.
```
# Prepare files for constant-ratio objective evaluation
# Files are saved to ./runs/eval/objective/constant/vctk/data/
# Time estimate:
#  - vctk: ~2 minutes
#  - daps-segmented: ~3 minutes
#  - ravdess-hifi: ~3 minutes
python -m clpcnet.evaluate.gather --dataset <dataset> --gpu 0

# Evaluate
# Results are written to ./runs/eval/objective/constant/vctk/results.json
# Time estimate:
#  - vctk: ~2 hours
#  - daps-segmented: ~4.5 hours
#  - ravdess-hifi: ~3.5 hours
python -m clpcnet.evaluate.objective.constant \
    --checkpoint ./runs/checkpoints/clpcnet/clpcnet-103.h5 \
    --dataset <dataset> \
    --gpu 0
```
`<dataset>` can be one of `vctk` (default), `daps-segmented`, or `ravdess-hifi`.
#### Variable-ratio objective evaluation
We perform variable-ratio objective evaluation on our modified RAVDESS [5] dataset.
```
# Results are written to ./runs/eval/objective/variable/ravdess-hifi/results.json
# Time estimate: ~1.5 hours
python -m clpcnet.evaluate.objective.variable \
    --checkpoint ./runs/checkpoints/clpcnet/clpcnet-103.h5 \
    --gpu 0
```
#### Constant-ratio subjective evaluation
We perform constant-ratio subjective evaluation on our modified DAPS [4] dataset.
```
# Files are written to ./runs/eval/subjective/constant/daps-segmented
# Time estimate: ~10 hours
python -m clpcnet.evaluate.subjective.constant \
    --checkpoint ./runs/checkpoints/clpcnet/clpcnet-103.h5 \
    --gpu 0
```
#### Variable-ratio subjective evaluation
We perform variable-ratio subjective evaluation on our modified RAVDESS [5] dataset.
```
# Files are written to ./runs/eval/subjective/variable/ravdess-hifi
# Tim
```
