# Controllable LPCNet
Official repository for the paper "Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet" [1]. Performs pitch-shifting and time-stretching of speech recordings. Audio examples can be found here. The original LPCNet [2] code can be found here. If you use this code in an academic publication, please cite our paper.
## Installation
### With Docker
Docker installation assumes recent versions of Docker and NVidia Docker are installed.
In order to perform variable-ratio time-stretching, you must first download HTK 3.4.0 (see download instructions), which is used for forced phoneme alignment. Note that you only need to download HTK; you do not have to install it locally. However, HTK must be downloaded into this directory so that it is part of the Docker build context.
Next, build the image.

```
docker build --tag clpcnet --build-arg HTK=<path_to_htk> .
```
Now we can run a command within the Docker image.

```
docker run -itd --rm --name "clpcnet" --shm-size 32g --gpus all \
    -v <absolute_path_of_runs_directory>:/clpcnet/runs \
    -v <absolute_path_of_data_directory>:/clpcnet/data \
    clpcnet:latest \
    <command>
```
Here, `<command>` is the command you would like to execute within the container, prefaced with the correct Python path within the Docker image (e.g., `/opt/conda/envs/clpcnet/bin/python -m clpcnet.train --gpu 0`).
### Without Docker
Installation assumes we start from a clean install of Ubuntu 18 or 20 with a
recent CUDA driver and conda.
Install the apt-get dependencies.

```
sudo apt-get update && \
sudo apt-get install -y \
    ffmpeg \
    gcc-multilib \
    libsndfile1 \
    sox
```
Build the C preprocessing code.

```
make
```
Create a new conda environment and install the conda dependencies.

```
conda create -n clpcnet python=3.7 cudatoolkit=10.0 cudnn=7.6 -y
conda activate clpcnet
```
Finally, install the Python dependencies.

```
pip install -e .
```
If you would like to perform variable-ratio time-stretching, you must also download and install HTK 3.4.0 (see download instructions), which is used for forced phoneme alignment.
## Inference
`clpcnet` can be used as a library (via `import clpcnet`) or as an application from the command line.
### Library inference
To perform pitch-shifting or time-stretching on audio already loaded into memory, use `clpcnet.from_audio`. To do this with audio saved in a file, use `clpcnet.from_file`. You can use `clpcnet.to_file` or `clpcnet.from_file_to_file` to save the results to a file. To process many files at once with multiprocessing, use `clpcnet.from_files_to_files`. To perform vocoding from acoustic features, use `clpcnet.from_features`.
See the `clpcnet` API for full argument lists. Below is an example of performing constant-ratio pitch-shifting with a ratio of 0.8 and constant-ratio time-stretching with a ratio of 1.2.
```python
import clpcnet

# Load audio from disk and resample to 16 kHz
audio_file = 'audio.wav'
audio = clpcnet.load.audio(audio_file)

# Perform constant-ratio pitch-shifting and time-stretching
generated = clpcnet.from_audio(audio, constant_stretch=1.2, constant_shift=0.8)
```
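When choosing a pitch-shift ratio, it can help to think in semitones: a ratio `r` corresponds to `12 * log2(r)` semitones. The helpers below are not part of `clpcnet`; they are a small standalone sketch for converting between the two representations.

```python
import math

def ratio_to_semitones(ratio):
    """Convert a pitch-shift ratio (e.g., 0.8) to a shift in semitones."""
    return 12 * math.log2(ratio)

def semitones_to_ratio(semitones):
    """Convert a shift in semitones to a pitch-shift ratio."""
    return 2 ** (semitones / 12)

# A ratio of 0.8 lowers the pitch by roughly 3.86 semitones
print(round(ratio_to_semitones(0.8), 2))  # -3.86
```

For example, to shift speech down a perfect fourth (5 semitones), pass `constant_shift=semitones_to_ratio(-5)`.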
### Command-line inference
The command-line interface for inference wraps the arguments of `clpcnet.from_files_to_files`.
To run inference using a pretrained model, use the module entry point. For example, to resynthesize speech without modification, use the following.

```
python -m clpcnet --audio_files audio.wav --output_files output.wav
```
See the command-line interface documentation for a full list and description of arguments.
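Both the command-line interface and `clpcnet.from_files_to_files` take parallel lists of input and output paths. A small standard-library sketch for building those lists from a directory of WAV files (the directory names and the commented-out call are placeholders; see the `clpcnet` API for the exact `from_files_to_files` signature):

```python
from pathlib import Path

def parallel_paths(input_dir, output_dir, suffix='.wav'):
    """Build matching input/output path lists for batch processing."""
    inputs = sorted(Path(input_dir).glob(f'*{suffix}'))
    outputs = [Path(output_dir) / f.name for f in inputs]
    return [str(f) for f in inputs], [str(f) for f in outputs]

# Placeholder usage:
# audio_files, output_files = parallel_paths('data/wav', 'runs/output')
# clpcnet.from_files_to_files(audio_files, output_files)
```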
## Replicating results
Here we demonstrate how to replicate the results of the paper "Neural Pitch-Shifting and Time-Stretching with Controllable LPCNet" on the VCTK dataset [3]. Time estimates are given using a 12-core, 12-thread CPU with a 2.1 GHz clock and an NVIDIA V100 GPU. First, download VCTK so that the root directory of VCTK is `./data/vctk` (time estimate: ~10 hours).
### Partition the dataset
Partitions for each dataset are saved in `./clpcnet/assets/partition/`. The VCTK partition file can be explicitly recomputed as follows.

```
# Time estimate: < 10 seconds
python -m clpcnet.partition
```
### Preprocess the dataset
```
# Compute YIN pitch and periodicity, BFCCs, and LPC coefficients
# Time estimate: ~15 minutes
python -m clpcnet.preprocess

# Compute CREPE pitch and periodicity (Section 3.1)
# Time estimate: ~2.5 hours
python -m clpcnet.pitch --gpu 0

# Perform data augmentation (Section 3.2)
# Time estimate: ~17 hours
python -m clpcnet.preprocess.augment --gpu 0
```

All files are saved to `./runs/cache/vctk` by default.
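As a quick sanity check after preprocessing, you can tally what landed in the cache. This helper is not part of `clpcnet`; it simply counts files by extension under the cache directory named above and returns an empty tally if the directory is missing.

```python
from collections import Counter
from pathlib import Path

def cache_summary(cache_dir='runs/cache/vctk'):
    """Count cached files by extension, recursively."""
    root = Path(cache_dir)
    if not root.is_dir():
        return Counter()
    return Counter(f.suffix for f in root.rglob('*') if f.is_file())

# print(cache_summary())
```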
### Train the model
```
# Time estimate: ~75 hours
python -m clpcnet.train --gpu 0
```

Checkpoints are saved to `./runs/checkpoints/clpcnet/`. Log files are saved to `./runs/logs/clpcnet`.
### Evaluate the model
We perform evaluation on VCTK [3], as well as on modified versions of DAPS [4] and RAVDESS [5]. We modify DAPS by segmenting the dataset into sentences and providing transcripts for each sentence. To reproduce results on DAPS, download our modified DAPS dataset on Zenodo and decompress the tarball within `data/`. We modify RAVDESS by performing speech enhancement with HiFi-GAN [6]. To reproduce results on RAVDESS, download our modified RAVDESS dataset on Zenodo and decompress the tarball within `data/`.
To create the DAPS partition file for evaluation, run the following.

```
# Partition the modified DAPS dataset
# Time estimate: < 10 seconds
python -m clpcnet.partition --dataset daps-segmented
```
We create two partition files for RAVDESS. The first is used for variable-ratio pitch-shifting and time-stretching, and creates pairs of audio files. The second samples files from those pairs for constant-ratio evaluation.
```
# Create pairs for variable-ratio evaluation from the modified RAVDESS dataset
# Time estimate: ~1 hour
python -m clpcnet.partition --dataset ravdess-variable --gpu 0

# Sample files for constant-ratio evaluation
# Time estimate: ~3 seconds
python -m clpcnet.partition --dataset ravdess-hifi
```
#### Constant-ratio objective evaluation
We perform constant-ratio objective evaluation on VCTK [3], as well as our modified DAPS [4] and RAVDESS [5] datasets.
```
# Prepare files for constant-ratio objective evaluation
# Files are saved to ./runs/eval/objective/constant/vctk/data/
# Time estimate:
#  - vctk: ~2 minutes
#  - daps-segmented: ~3 minutes
#  - ravdess-hifi: ~3 minutes
python -m clpcnet.evaluate.gather --dataset <dataset> --gpu 0

# Evaluate
# Results are written to ./runs/eval/objective/constant/vctk/results.json
# Time estimate:
#  - vctk: ~2 hours
#  - daps-segmented: ~4.5 hours
#  - ravdess-hifi: ~3.5 hours
python -m clpcnet.evaluate.objective.constant \
    --checkpoint ./runs/checkpoints/clpcnet/clpcnet-103.h5 \
    --dataset <dataset> \
    --gpu 0
```
`<dataset>` can be one of `vctk` (default), `daps-segmented`, or `ravdess-hifi`.
#### Variable-ratio objective evaluation
We perform variable-ratio objective evaluation on our modified RAVDESS [5] dataset.
```
# Results are written to ./runs/eval/objective/variable/ravdess-hifi/results.json
# Time estimate: ~1.5 hours
python -m clpcnet.evaluate.objective.variable \
    --checkpoint ./runs/checkpoints/clpcnet/clpcnet-103.h5 \
    --gpu 0
```
#### Constant-ratio subjective evaluation
We perform constant-ratio subjective evaluation on our modified DAPS [4] dataset.
```
# Files are written to ./runs/eval/subjective/constant/daps-segmented
# Time estimate: ~10 hours
python -m clpcnet.evaluate.subjective.constant \
    --checkpoint ./runs/checkpoints/clpcnet/clpcnet-103.h5 \
    --gpu 0
```
#### Variable-ratio subjective evaluation
We perform variable-ratio subjective evaluation on our modified RAVDESS [5] dataset.
```
# Files are written to ./runs/eval/subjective/variable/ravdess-hifi
# Tim
```
