wav2vec 2.0

callhome test | ar | en | ma | ja | ge | sp |---|---|---|---|---|---|--- Mix-Model | 56.47 | 43.93 | 50.13 | 51.75 | 53.38 | 42.65 Transformer + hkust | 48.35 | 33.77 | 37.62 | 36.99 | 44.98 | 51.54 lter/char | 45.53 | 24.05 | 43.05 | 38.4 | 41.83 | 50.41 subword+char | 55.04 | 27.58 | 37.39 | 41.4 | 46.70 | 53.4 +lm | 44.67 | 23.92 | 33.57 | 39.02 | 40.66 | 46.20

wav2vec 2.0 learns speech representations on unlabeled data as described in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (Baevski et al., 2020).

Pre-trained models

Model | Finetuning split | Dataset | Model |---|---|---|--- Wav2Vec 2.0 Base | No finetuning | Librispeech | download Wav2Vec 2.0 Base | 10 minutes | Librispeech | download Wav2Vec 2.0 Base | 100 hours | Librispeech | download Wav2Vec 2.0 Base | 960 hours | Librispeech | download Wav2Vec 2.0 Large | No finetuning | Librispeech | download Wav2Vec 2.0 Large | 10 minutes | Librispeech | download Wav2Vec 2.0 Large | 100 hours | Librispeech | download Wav2Vec 2.0 Large | 960 hours | Librispeech | download Wav2Vec 2.0 Large (LV-60) | No finetuning | Libri-Light | download Wav2Vec 2.0 Large (LV-60) | 10 minutes | Libri-Light + Librispeech | download Wav2Vec 2.0 Large (LV-60) | 100 hours | Libri-Light + Librispeech | download Wav2Vec 2.0 Large (LV-60) | 960 hours | Libri-Light + Librispeech | download

Training a new model with the CLI tools

Given a directory containing wav files to be used for pretraining (we recommend splitting each file into separate file 10 to 30 seconds in length)

Prepare training data manifest:

$ext should be set to flac, wav, or whatever format your dataset happens to use that soundfile can read.

$valid should be set to some reasonable percentage (like 0.01) of training data to use for validation. To use a pre-defined validation set (like dev-other from librispeech), set to it 0 and then overwrite valid.tsv with a separately pre-processed manifest file.

$ python examples/wav2vec/wav2vec_manifest.py /path/to/waves --dest /manifest/path --ext $ext --valid-percent $valid

Train a wav2vec 2.0 base model:

This configuration was used for the base model trained on the Librispeech dataset in the wav2vec 2.0 paper

Note that this was tested with pytorch 1.4.0 and the input is expected to be single channel, sampled at 16 kHz

$ python train.py --distributed-world-size 64 --distributed-port $PORT /manifest/path \
--save-dir /model/path fp16 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 256 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2,0.5,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 400000 \
--lr 0.0005 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.05 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.1 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 250000 --min-sample-size 32000 --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1400000 --max-update 400000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d

Note: you can simulate 64 GPUs by using k GPUs and setting --update-freq 64/k

Train a wav2vec 2.0 large model:

This configuration was used for the large model trained on the Libri-light dataset in the wav2vec 2.0 paper

$ python train.py --distributed-world-size 128 --distributed-port $PORT /manifest/path \
--save-dir /model/path --fp16 --num-workers 6 --task audio_pretraining --criterion wav2vec --arch wav2vec2 \
--log-keys '["prob_perplexity","code_perplexity","temp"]' --quantize-targets --extractor-mode default \
--conv-feature-layers '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] * 2' --final-dim 768 --latent-vars 320 \
--latent-groups 2 --latent-temp '(2.0,0.1,0.999995)' --infonce --optimizer adam \
--adam-betas '(0.9,0.98)' --adam-eps 1e-06 --lr-scheduler polynomial_decay --total-num-update 600000 \
--lr 0.0003 --warmup-updates 32000 --mask-length 10 --mask-prob 0.65 --mask-selection static --mask-other 0 \
--encoder-layerdrop 0.0 --dropout-input 0.1 --dropout-features 0.1 --feature-grad-mult 0.03 \
--loss-weights '[0.1, 10]' --conv-pos 128 --conv-pos-groups 16 --encoder-layers 24 --encoder-embed-dim 1024 \
--encoder-ffn-embed-dim 4096 --encoder-attention-heads 16 --num-negatives 100 --cross-sample-negatives 0 \
--max-sample-size 320000 --min-sample-size 32000 --dropout 0.0 --attention-dropout 0.1 --weight-decay 0.01 \
--max-tokens 1200000 --max-update 600000 --skip-invalid-size-inputs-valid-test --ddp-backend no_c10d

Note: you can simulate 128 GPUs by using k GPUs and setting --update-freq 128/k

Fine-tune a pre-trained model with CTC:

Fine-tuning a model requires parallel audio and labels file, as well as a vocabulary file in fairseq format. A letter vocabulary can be downloaded here. An example script that generates labels for the Librispeech dataset from the tsv file produced by wav2vec_manifest.py can be used as follows:

split=train
$ python libri_labels.py /path/to/tsv --output-dir /output/dir --output-name $split

Fine-tuning on 100h of Librispeech with letter targets:

valid_subset=dev_other
python train.py --distributed-world-size 24 --distributed-port $PORT /path/to/training_data --save-dir /model/path --fp16 \
--wer-args '("/path/to/lm/4-gram.bin","/path/to/lexicon",2,-1)' \
--post-process letter --valid-subset $valid_subset --no-epoch-checkpoints --best-checkpoint-metric wer --num-workers 4 \
--max-update 80000 --sentence-avg --task audio_pretraining --arch wav2vec_ctc --w2v-path /path/to/pretrained/model \
--labels ltr --apply-mask --mask-selection static --mask-other 0 --mask-length 10 --mask-prob 0.5 --layerdrop 0.1 \
--mask-channel-selection static --mask-channel-other 0 --mask-channel-length 64 --mask-channel-prob 0.5 --zero-infinity \
--feature-grad-mult 0.0 --freeze-finetune-updates 10000 --validate-after-updates 10000 --optimizer adam \
--adam-betas '(0.9, 0.98)' --adam-eps 1e-08 --lr 2e-05 --lr-scheduler tri_stage --warmup-steps 8000 --hold-steps 32000 \
--decay-steps 40000 --final-lr-scale 0.05 --final-dropout 0.0 --dropout 0.0 --activation-dropout 0.1 --criterion ctc \
--attention-dropout 0.0 --max-tokens 1280000 --seed 2337 --log-format json --log-interval 500 --ddp-backend no_c10d

Note: you can simulate 24 GPUs by using k GPUs and setting --update-freq 24/k

Decoding with a language model during training requires wav2letter python bindings. Alternatively, simply omit the --wer-args flag.

Evaluating a CTC model:

Evaluating a CTC model with a language model requires wav2letter python bindings to be installed.

Fairseq transformer language model used in the wav2vec 2.0 paper can be obtained from the wav2letter model repository. Be sure to upper-case the language model vocab after downloading it.

Letter dictionary for pre-trained models can be found here.

Next, run the evaluation command:

$subset=dev_other
python examples/speech_recognition/infer.py /checkpoint/abaevski/data/speech/libri/10h/wav2vec/raw --task audio_pretraining \
--nbest 1 --path /path/to/model --gen-subset $subset --results-path /path/to/save/results/for/sclite --w2l-decoder kenlm \
--lm-model /path/to/kenlm.bin --lm-weight 2 --word-score -1 --sil-weight 0 --criterion ctc --labels ltr --max-tokens 4000000 \
--post-process letter

To get raw numbers, use --w2l-decoder viterbi and omit the lexicon. To use the transformer language model, use --w2l-decoder fairseqlm.

wav2vec

Example to train a wav2vec model as described in wav2vec: Unsupervised Pre-training for Speech Recognition (Schneider et al., 2019).

Pre-trained models

Description | Dat

Wav2vec

Install / Use

README

wav2vec 2.0

Pre-trained models

Training a new model with the CLI tools

Prepare training data manifest:

Train a wav2vec 2.0 base model:

Train a wav2vec 2.0 large model:

Fine-tune a pre-trained model with CTC:

Evaluating a CTC model:

wav2vec

Pre-trained models