VoxPopuli

https://aclanthology.org/2021.acl-long.80

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation.

Overview

VoxPopuli provides:

  • 400K hours of unlabelled speech data for 23 languages
  • 1.8K hours of transcribed speech data for 16 languages
  • 17.3K hours of speech-to-speech interpretation data for 15x15 directions
  • 29 hours of transcribed speech data of non-native English intended for research in ASR for accented speech (15 L2 accents)

The raw data is collected from 2009-2020 European Parliament event recordings. We acknowledge the European Parliament for creating and sharing these materials.

Detailed statistics

<details><summary>Unlabelled and transcribed data</summary><p>

| Language | Code | Unlabelled Hours (v1/v2) | Transcribed Hours | Transcribed Speakers | Transcribed Tokens | LM Tokens |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| English | En | 4.5K/24.1K | 543 | 1313 | 4.8M | 60.1M |
| German | De | 4.5K/23.2K | 282 | 531 | 2.3M | 50.0M |
| French | Fr | 4.5K/22.8K | 211 | 534 | 2.1M | 58.6M |
| Spanish | Es | 4.4K/21.4K | 166 | 305 | 1.6M | 57.4M |
| Polish | Pl | 4.5K/21.2K | 111 | 282 | 802K | 13.6M |
| Italian | It | 4.6K/21.9K | 91 | 306 | 757K | 52.1M |
| Romanian | Ro | 4.5K/17.9K | 89 | 164 | 739K | 10.3M |
| Hungarian | Hu | 4.4K/17.7K | 63 | 143 | 431K | 13.0M |
| Czech | Cs | 4.5K/18.7K | 62 | 138 | 461K | 13.5M |
| Dutch | Nl | 4.5K/19.0K | 53 | 221 | 488K | 54.6M |
| Finnish | Fi | 4.4K/14.2K | 27 | 84 | 160K | 34.5M |
| Croatian | Hr | 2.7K/8.1K | 43 | 83 | 337K | 285K |
| Slovak | Sk | 4.4K/12.1K | 35 | 96 | 270K | 13.3M |
| Slovene | Sl | 4.4K/11.3K | 10 | 45 | 76K | 12.6M |
| Estonian | Et | 4.3K/10.6K | 3 | 29 | 18K | 11.3M |
| Lithuanian | Lt | 4.3K/14.4K | 2 | 21 | 10K | 11.5M |
| Portuguese | Pt | 4.4K/17.5K | - | - | - | - |
| Bulgarian | Bg | 4.3K/17.6K | - | - | - | - |
| Greek | El | 4.4K/17.7K | - | - | - | - |
| Latvian | Lv | 4.4K/13.1K | - | - | - | - |
| Maltese | Mt | 4.4K/9.1K | - | - | - | - |
| Swedish | Sv | 4.5K/16.3K | - | - | - | - |
| Danish | Da | 4.3K/13.6K | - | - | - | - |
| Total | | 100K/384K | 1791 | 4295 | 15M | 467M |

</p></details> <details><summary>Speech-to-speech interpretation data</summary><p>

| Source/Target | En | De | Fr | Es | Pl | It | Ro | Hu | Cs | Nl | Fi | Sk | Sl | Lt | Da | Total |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| En | - | 463 | 427 | 441 | 432 | 461 | 457 | 382 | 427 | 400 | 442 | 433 | 434 | 398 | 370 | 6.0K |
| De | 187 | - | 196 | 204 | 214 | 217 | 198 | 205 | 214 | 196 | 217 | 208 | 218 | 164 | 179 | 2.8K |
| Fr | 169 | 187 | - | 187 | 172 | 197 | 195 | 144 | 170 | 158 | 168 | 168 | 156 | 139 | 134 | 2.3K |
| Es | 130 | 138 | 135 | - | 118 | 148 | 128 | 93 | 118 | 115 | 124 | 114 | 108 | 83 | 86 | 1.6K |
| Pl | 68 | 66 | 54 | 55 | - | 67 | 55 | 43 | 67 | 42 | 55 | 62 | 57 | 50 | 34 | 775 |
| It | 69 | 77 | 76 | 79 | 72 | - | 75 | 61 | 68 | 64 | 71 | 66 | 70 | 53 | 60 | 961 |
| Ro | 60 | 59 | 59 | 58 | 49 | 61 | - | 38 | 50 | 43 | 48 | 50 | 46 | 38 | 29 | 688 |
| Hu | 30 | 38 | 25 | 27 | 29 | 30 | 27 | - | 27 | 20 | 31 | 29 | 26 | 21 | 18 | 378 |
| Cs | 39 | 35 | 29 | 30 | 36 | 32 | 31 | 23 | - | 23 | 29 | 55 | 29 | 25 | 18 | 434 |
| Nl | 31 | 43 | 35 | 29 | 27 | 38 | 24 | 25 | 25 | - | 32 | 25 | 23 | 19 | 25 | 401 |
| Fi | 15 | 18 | 15 | 13 | 13 | 13 | 13 | 12 | 13 | 11 | - | 14 | 12 | 11 | 9 | 182 |
| Hr | 31 | 27 | 27 | 24 | 27 | 28 | 24 | 22 | 24 | 22 | 24 | 26 | 37 | 21 | 20 | 384 |
| Sk | 21 | 22 | 14 | 16 | 19 | 16 | 16 | 14 | 32 | 13 | 16 | - | 17 | 13 | 10 | 239 |
| Sl | 6 | 6 | 4 | 5 | 5 | 6 | 5 | 4 | 5 | 4 | 5 | 6 | - | 4 | 3 | 68 |
| Lt | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | - | 0 | 13 |
| Total | 857 | 1.2K | 1.1K | 1.2K | 1.2K | 1.3K | 1.2K | 1.1K | 1.2K | 1.1K | 1.3K | 1.3K | 1.2K | 1.0K | 995 | 17.3K |

</p></details> <details><summary>Accented speech transcribed data</summary><p>

| Accent | Code | Transcribed Hours | Transcribed Speakers |
|:---:|:---:|:---:|:---:|
| Dutch | en_nl | 3.52 | 45 |
| German | en_de | 3.52 | 84 |
| Czech | en_cs | 3.30 | 26 |
| Polish | en_pl | 3.23 | 33 |
| French | en_fr | 2.56 | 27 |
| Hungarian | en_hu | 2.33 | 23 |
| Finnish | en_fi | 2.18 | 20 |
| Romanian | en_ro | 1.85 | 27 |
| Slovak | en_sk | 1.46 | 17 |
| Spanish | en_es | 1.42 | 18 |
| Italian | en_it | 1.11 | 15 |
| Estonian | en_et | 1.08 | 6 |
| Lithuanian | en_lt | 0.65 | 7 |
| Croatian | en_hr | 0.42 | 9 |
| Slovene | en_sl | 0.25 | 7 |

</p></details>

What's New

  • 2022-02-01: New labelled accented English speech data released.
  • 2022-01-15: New wav2vec 2.0 pre-trained models released.
  • 2021-07-26: New unlabelled data (additional 300K hours) released.
  • 2021-03-03: VoxPopuli released.

Getting Data

We provide raw audios as well as scripts to segment them and align them with transcriptions/interpretations. The output format is Ogg Vorbis (16,000 Hz, 16-bit, mono), which is supported by common libraries such as libsndfile and libsox (both are available from Python through frontends such as soundfile and torchaudio).
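As a minimal sketch of reading one of the resulting segments, the snippet below uses the third-party soundfile package and checks the sample rate and channel count against the format stated above. The helper names `check_format` and `load_segment` are illustrative and not part of this repo.

```python
from pathlib import Path

# Format stated in this README: 16,000 Hz, mono.
EXPECTED_SAMPLE_RATE = 16_000
EXPECTED_CHANNELS = 1


def check_format(sample_rate: int, num_channels: int) -> bool:
    """Return True if the values match the documented segment format."""
    return sample_rate == EXPECTED_SAMPLE_RATE and num_channels == EXPECTED_CHANNELS


def load_segment(path):
    """Read one .ogg segment and return (mono waveform, sample rate)."""
    import soundfile as sf  # third-party: pip install soundfile

    # always_2d=True gives shape (frames, channels) even for mono files.
    waveform, sample_rate = sf.read(str(Path(path)), always_2d=True)
    num_channels = waveform.shape[1]
    if not check_format(sample_rate, num_channels):
        raise ValueError(
            f"unexpected format in {path}: {sample_rate} Hz, {num_channels} channel(s)"
        )
    return waveform[:, 0], sample_rate
```

Calling `load_segment` on any segment produced by the scripts below should return a one-dimensional array; a `ValueError` would suggest the file was re-encoded or did not come from this pipeline.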

As the first step, clone this repo for the processing scripts

git clone https://github.com/facebookresearch/voxpopuli.git

and install required PyPI packages:

pip install -r requirements.txt

Unlabelled Data

First, download raw audios via

python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET]

which saves audios to ${ROOT}/raw_audios/[language]/[year]/[recording_id].ogg.

SUBSET specifies the data subset to download:

| --subset | # Languages | Hours | Years | Size |
|:---:|:---:|:---:|:---:|:---:|
| en, de, fr, es, pl, it, ro, hu, cs, nl, fi, hr, sk, sl, et, lt, pt, bg, el, lv, mt, sv or da | 1 | 2.7K-4.6K | 2009-2020 | 44G-75G |
| en_v2, de_v2, fr_v2, es_v2, pl_v2, it_v2, ro_v2, hu_v2, cs_v2, nl_v2, fi_v2, hr_v2, sk_v2, sl_v2, et_v2, lt_v2, pt_v2, bg_v2, el_v2, lv_v2, mt_v2, sv_v2 or da_v2 | 1 | 8.1K-24.1K | 2009-2020 | 130G-385G |
| 10k | 23 | 10K | 2019-2020 | 170G |
| 100k | 23 | 100K | 2009-2020 | 1.7T |
| 400k | 23 | 400K | 2009-2020 | 6.4T |
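Since the larger subsets run to terabytes, it can help to check a subset name before starting a download. The following is a small sketch that enumerates the names listed in the table above; `is_unlabelled_subset` is a hypothetical helper, and the download script itself may accept other values (for example `asr`, used further below).

```python
# Language codes copied from the --subset table above.
LANGUAGES = [
    "en", "de", "fr", "es", "pl", "it", "ro", "hu", "cs", "nl", "fi", "hr",
    "sk", "sl", "et", "lt", "pt", "bg", "el", "lv", "mt", "sv", "da",
]

# Per-language v1 subsets, per-language v2 subsets, and the multilingual ones.
UNLABELLED_SUBSETS = (
    set(LANGUAGES)
    | {f"{lang}_v2" for lang in LANGUAGES}
    | {"10k", "100k", "400k"}
)


def is_unlabelled_subset(name: str) -> bool:
    """Return True if `name` is one of the subsets listed in the table."""
    return name in UNLABELLED_SUBSETS
```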

Then, segment these audios via

python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET]

which outputs to ${ROOT}/unlabelled_data/[language]/[year]/[segment_id].ogg
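To sanity-check the segmentation output, one could count the segments per language and year by walking the directory layout described above. This is only a sketch using the standard library; `parse_segment_path` and `count_segments` are illustrative names, not functions from this repo.

```python
from collections import Counter
from pathlib import Path


def parse_segment_path(path):
    """Split .../[language]/[year]/[segment_id].ogg into its three parts."""
    p = Path(path)
    return p.parts[-3], p.parts[-2], p.stem


def count_segments(root):
    """Count segmented files per (language, year) under ROOT/unlabelled_data."""
    counts = Counter()
    for path in Path(root, "unlabelled_data").glob("*/*/*.ogg"):
        language, year, _ = parse_segment_path(path)
        counts[(language, year)] += 1
    return counts
```

For example, `count_segments("/data/voxpopuli")` (a hypothetical root) would return a `Counter` keyed by pairs such as `("en", "2019")`.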

Transcribed (ASR) Data

First, download raw audios via

python -m voxpopuli.download_audios --root [ROOT] --subset asr

which saves audios to ${ROOT}/raw_audios/original/[year]/[recording_id].ogg.

Then, segment these audios and align them with transcripts via

python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE]

which outputs

  • audios ${ROOT}/transcribed_data/[language]/[year]/[segment_id].ogg
  • per-split manifest (ID, transcript, speaker ID) ${ROOT}/transcribed_data/[language]/asr_[split].tsv
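The per-split manifests can be loaded with the standard csv module. The sketch below assumes the files are tab-separated with a header row (suggested by the .tsv extension, but not confirmed here) and deliberately does not hard-code column names; inspect `rows[0].keys()` to see what the file actually provides.

```python
import csv
from collections import Counter


def parse_manifest(lines):
    """Parse tab-separated manifest lines (header first) into a list of dicts."""
    # QUOTE_NONE keeps quote characters inside transcripts as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def read_manifest(path):
    """Read a manifest such as asr_[split].tsv from disk."""
    with open(path, newline="", encoding="utf-8") as f:
        return parse_manifest(f)


def count_by(rows, column):
    """Count rows per distinct value of `column`, e.g. utterances per speaker."""
    return Counter(row[column] for row in rows)
```

With a manifest loaded as `rows`, something like `count_by(rows, "speaker_id")` would give utterances per speaker, where `speaker_id` is a guess at the column name and should be replaced with the one found in the header.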

Accented transcribed data

To retrieve the transcribed accented speech data, follow the above steps with --lang [LANGUAGE]_accented (e.g. --lang en_accented). Note that the accented speech data currently consists of a test set only.

Speech-to-Speech Interpretation Data

First, follow the instructions above to set up ASR data (source audios and transcripts).

Then, download target audios via

python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE]

which saves audios to ${ROOT}/raw_audios/[target_language]/[year]/[recording_id].ogg.

Finally, segment these audios and match them with source ones via

python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE]

which outputs

  • target audios ${ROOT}/transcribed_data/[language]/[target_language]/[year]/[segment_id].ogg
  • manifest (source ID, transcript, speaker ID, target ID) ${ROOT}/transcribed_data/[language]/[target_language]/s2s.tsv

We also human-transcribe part of the target audios (for English, French and Spanish only) to allow more accurate alignments. To use them instead of machine transcriptions in the alignments, add --use-annotated-target to the command line.
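The steps in this section can be collected into one ordered list of commands. The sketch below only builds the argument lists from the flags shown above and does not run anything; `s2s_commands` is a hypothetical helper, and passing the source language to `get_asr_data --lang` is an assumption based on the phrase "source audios and transcripts".

```python
import sys


def s2s_commands(root, source_lang, target_lang, use_annotated_target=False):
    """Return the four commands for the S2S setup, in the order they should run."""
    py = [sys.executable, "-m"]
    s2s = py + [
        "voxpopuli.get_s2s_data", "--root", root,
        "--source-lang", source_lang, "--target-lang", target_lang,
    ]
    if use_annotated_target:
        # Use human transcriptions of the target audios where available.
        s2s.append("--use-annotated-target")
    return [
        # 1) Source side: raw audios for ASR, then segmentation and alignment.
        py + ["voxpopuli.download_audios", "--root", root, "--subset", "asr"],
        py + ["voxpopuli.get_asr_data", "--root", root, "--lang", source_lang],
        # 2) Target side: raw audios for the target language.
        py + ["voxpopuli.download_audios", "--root", root, "--subset", target_lang],
        # 3) Segment target audios and match them with source segments.
        s2s,
    ]

# To execute, pass each list to subprocess.run(cmd, check=True) from the repo root.
```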

Language Modeling (LM) Data

We combine VoxPopuli transcripts and text data from Europarl for LM training.

Download VoxPopuli and Europarl text data, process the raw text and generate the vocabulary via

python -m voxpopuli.get_lm_data --root [ROOT] --lang [LANGUAGE]

which outputs

  • sentences ${ROOT}/lm_data/[language]/sentences.txt
  • vocabulary ${ROOT}/lm_data/[language]/vocabulary.txt

To train an n-gram LM with KenLM, run

${KENLM_PATH}/lmplz -o ${n} --limit_vocab_file [OUT_VOCAB_FILE] < [OUT_TEXT_FILE] > ${n}gram_lm.arpa
${KENLM_PATH}/build_binary ${n}gram_lm.arpa ${n}gram_lm.bin
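The same two KenLM steps can be driven from Python. This is a sketch under the assumption that [OUT_TEXT_FILE] and [OUT_VOCAB_FILE] refer to the sentences.txt and vocabulary.txt files produced above; `kenlm_commands` and `train_lm` are illustrative names, and only the flags already shown in the shell commands are used.

```python
import subprocess
from pathlib import Path


def kenlm_commands(kenlm_path, order, vocab_file):
    """Build the lmplz and build_binary argument lists for an n-gram of `order`."""
    arpa = f"{order}gram_lm.arpa"
    lmplz = [
        str(Path(kenlm_path, "lmplz")), "-o", str(order),
        "--limit_vocab_file", str(vocab_file),
    ]
    build_binary = [str(Path(kenlm_path, "build_binary")), arpa, f"{order}gram_lm.bin"]
    return lmplz, build_binary


def train_lm(kenlm_path, order, text_file, vocab_file):
    """Run lmplz (text on stdin, ARPA on stdout), then convert ARPA to binary."""
    lmplz, build_binary = kenlm_commands(kenlm_path, order, vocab_file)
    arpa = build_binary[1]
    with open(text_file, "rb") as fin, open(arpa, "wb") as fout:
        subprocess.run(lmplz, stdin=fin, stdout=fout, check=True)
    subprocess.run(build_binary, check=True)
```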

Pre-trained Models

wav2vec 2.0

We provide pre-trained wav2vec 2.0 models (implemented in fairseq and wav2letter/flashlight) for downstream speech tasks. Each language is covered by a monolingual Base model and multilingual Large models that combine languages in the same family or all languages. See also XLS-R for larger-scale (up to 2B) multilingual models trained on VoxPopuli (400K hours).

<details><summary><b>Download</b></summary><p>

| Language(s) | Family | PT Hours | Base Model (95M) |
