GigaSpeech

This is the official repository of the GigaSpeech dataset. For details of how we created the dataset, please refer to our Interspeech paper: "GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio". A preprint is available on arXiv.

GigaSpeech version: 1.0.0 (07/05/2021)

Download

Step 1: Please fill out the Google Form here
Step 2:
- Option A: Follow the instructions in the reply email from SpeechColab to get the raw release of GigaSpeech.
- Option B: Refer to GigaSpeech On HuggingFace to get a pre-processed version of GigaSpeech via HuggingFace.

Leaderboard

| Contributor | Toolkit | Train Recipe | Train Data | Inference | Dev/Test WER |
|:---------------|:------------------|:------------------|:------------------|:------------------|:------------------:|
| Baseline | Athena | Transformer-AED + RNNLM | GigaSpeech v1.0.0 XL | model example | 13.60 / 12.70 |
| Baseline | Espnet | Conformer/Transformer-AED | GigaSpeech v1.0.0 XL | model example | 10.90 / 10.80 |
| Baseline | Kaldi | Chain + RNNLM | GigaSpeech v1.0.0 XL | model example | 14.78 / 14.84 |
| Baseline | Pika | RNN-T | GigaSpeech v1.0.0 XL | model example | 12.30 / 12.30 |
| Johns Hopkins University | Icefall | Transducer: Zipformer encoder + Embedding decoder | GigaSpeech v1.0.0 XL | model example | 10.25 / 10.38 |
| Johns Hopkins University | Icefall | Pruned Stateless RNN-T | GigaSpeech v1.0.0 XL | model example | 10.40 / 10.51 |
| Johns Hopkins University | Icefall | Conformer CTC + ngram & attention rescoring | GigaSpeech v1.0.0 XL | model example | 10.47 / 10.58 |
| Mobvoi | Wenet | Joint CTC/AED(U2++) | GigaSpeech v1.0.0 XL | model example | 10.70 / 10.60 |
| ByteDance AI Lab | NeurST | Transformer-AED | GigaSpeech v1.0.0 XL | model example | 11.89 / 11.60 |

Dataset

Audio Source

Language: English
33,000+ hours for unsupervised/semi-supervised learning
10,000 hours with high-quality human transcriptions for supervised learning

| Audio Source | Transcribed Hours | Total Hours | Acoustic Condition |
|:---------------|:-----------------:|:--------------:|:-------------------|
| Audiobook | 2,655 | 11,982 | <li>Reading</li><li>Various ages and accents</li> |
| Podcast | 3,498 | 9,254 | <li>Clean or background music</li><li>Indoor</li><li>Near-field</li><li>Spontaneous</li><li>Various ages and accents</li> |
| YouTube | 3,845 | 11,768 | <li>Clean and noisy</li><li>Indoor and outdoor</li><li>Near- and far-field</li><li>Reading and spontaneous</li><li>Various ages and accents</li> |
| Total | 10,000 | 33,005 | |

Transcribed Training Subsets

| Subset | Hours | Remarks |
|:---------------:|:-------------:|:-------------|
| XS | 10 | System building and debugging |
| S | 250 | Quick research experiments |
| M | 1,000 | Large-scale research experiments |
| L | 2,500 | Medium-scale industrial experiments |
| XL | 10,000 | Large-scale industrial experiments |

Larger subsets are supersets of smaller subsets, e.g., subset L contains all the data from subset M.

Transcribed Evaluation Subsets

| Subset | Hours | Remarks |
|:------:|:-----:|:--------|
| Dev | 12 | Randomly selected from the crawled Podcast and YouTube data |
| Test | 40 | Part of the subset was randomly selected from the crawled Podcast and YouTube data; part of it was manually collected through other channels to have better coverage. |

Evaluation subsets are annotated by professional human annotators.

Data Preparation Guidelines

We maintain the data preparation scripts for different speech recognition toolkits in this repository, under the toolkits/ folder (e.g., toolkits/kaldi for the Kaldi speech recognition toolkit). Because this is an evolving dataset, keeping the scripts here means we don't have to update the downstream toolkits whenever the dataset changes.

Preparation Scripts

To use the data preparation scripts, do the following in your toolkit (here we use Kaldi as an example):

git clone https://github.com/SpeechColab/GigaSpeech.git

cd GigaSpeech
utils/download_gigaspeech.sh /disk1/audio_data/gigaspeech
toolkits/kaldi/gigaspeech_data_prep.sh --train-subset XL /disk1/audio_data/gigaspeech ../data
cd ..

Metadata walkthrough

We save all the metadata information to a single JSON file named GigaSpeech.json. Below is a snippet of this file:

{
  "dataset": "GigaSpeech",
  "language": "EN",
  "version": "v1.0.0",
  ... ...
  "audios": [
    {
      "title": "The Architect of Hollywood",
      "url": "https://99percentinvisible.org/episode/the-architect-of-hollywood/download",
      "path": "audio/podcast/P0001/POD0000000025.opus",
      ... ...
      "segments": [
        {
          "sid": "POD0000000025_S0000103",
          "speaker": "N/A",
          "begin_time": 780.31,
          "end_time": 783.13,
          "text_tn": "FOUR O'CLOCK TOMORROW AFTERNOON <COMMA> SAID WILLIAMS <PERIOD>",
          "subsets": [
            "{XL}",
            "{L}"
          ]
        },
        ... ...
      ],
      ... ...
    },
    ... ...
  ]
}

To use the corpus, users are expected to extract the relevant information from GigaSpeech.json. For example, for the speech recognition task, one should first follow the "audios" entry to build a list of audio files. One can then follow the "url" entry to download the original audio file, or the "path" entry if preprocessed audio files have already been downloaded to disk. After that, for each audio file, one can follow the "segments" entry to obtain the trainable audio segments and their corresponding transcripts. There are also various supplementary entries, such as "subsets" and "md5", which may be helpful for your task.
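The walk described above can be sketched in Python as follows. This is a minimal illustration, not part of the repository: the names `Segment`, `iter_segments`, `load_metadata`, and `SAMPLE` are hypothetical, and `SAMPLE` is a tiny in-memory stand-in built from the snippet shown earlier. The full GigaSpeech.json may be large, so loading it in one call is an assumption that may not suit every machine.

```python
import json
from typing import Iterator, NamedTuple


class Segment(NamedTuple):
    """One trainable segment plus the audio file it belongs to (hypothetical helper type)."""
    audio_path: str
    sid: str
    begin_time: float
    end_time: float
    text: str


def load_metadata(json_path: str) -> dict:
    """Load the whole metadata file into memory (assumes it fits in RAM)."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def iter_segments(metadata: dict, subset: str) -> Iterator[Segment]:
    """Yield every segment whose "subsets" list contains the given tag, e.g. "{XL}"."""
    for audio in metadata["audios"]:
        for seg in audio["segments"]:
            if subset in seg["subsets"]:
                yield Segment(
                    audio_path=audio["path"],
                    sid=seg["sid"],
                    begin_time=seg["begin_time"],
                    end_time=seg["end_time"],
                    text=seg["text_tn"],
                )


# Stand-in for GigaSpeech.json, copied from the snippet in this README.
SAMPLE = {
    "dataset": "GigaSpeech",
    "audios": [
        {
            "path": "audio/podcast/P0001/POD0000000025.opus",
            "segments": [
                {
                    "sid": "POD0000000025_S0000103",
                    "speaker": "N/A",
                    "begin_time": 780.31,
                    "end_time": 783.13,
                    "text_tn": "FOUR O'CLOCK TOMORROW AFTERNOON <COMMA> SAID WILLIAMS <PERIOD>",
                    "subsets": ["{XL}", "{L}"],
                }
            ],
        }
    ],
}

if __name__ == "__main__":
    for seg in iter_segments(SAMPLE, "{L}"):
        print(seg.sid, seg.audio_path, seg.begin_time, seg.end_time)
```

With the real file, one would call `iter_segments(load_metadata("GigaSpeech.json"), "{XL}")`. Subset tags are matched exactly as they appear in the snippet, including the braces.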

The metadata file GigaSpeech.json is version controlled and is expected to be updated over time. In future releases, we plan to add speaker information to the metadata file, so that it will be suitable for speaker identification/verification tasks. We also plan to add more data from different sources to increase the diversity.

We also provide some convenient command-line tools based on jq, e.g., utils/ls_audio.sh, utils/show_segment_info.sh, utils/ls_md5.sh.
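For users who prefer Python over the jq-based tools, a downloaded file can be checked against an "md5" value with the standard library alone. The sketch below is an illustration, not one of the repository's utilities: `file_md5` and `is_intact` are hypothetical names, and because the snippet above elides where the "md5" entry sits in the JSON, the expected checksum is passed in as a plain argument.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: str, expected_md5: str) -> bool:
    """Compare a file's digest with the expected value, ignoring letter case."""
    return file_md5(path) == expected_md5.lower()
```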

Audio Processing

Resampling: GigaSpeech audio files are resampled to a 16 kHz sampling rate and compressed in the Opus format. Opus compression, however, does not depend on the input sample rate; it uses the bandwidth instead. Timestamps are measured in 48 kHz units even if the full bandwidth is not used. Likewise, the output sample rate may be freely chosen. For example, audio can be input at 16 kHz yet be set to encode only narrowband audio. For this reason, we recommend that users explicitly resample the decoded audio to 16 kHz before training and testing. For opus-to-wav conversion, refer to our example tool utils/opus_to_wav.py.
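One way to make the resampling explicit is to pass the target rate to the decoder. The sketch below assumes the `ffmpeg` command-line tool is installed and uses its `-i` (input) and `-ar` (output sample rate) options. It is not the implementation of utils/opus_to_wav.py, and `build_ffmpeg_cmd` and `opus_to_wav` are hypothetical names.

```python
import subprocess
from typing import List


def build_ffmpeg_cmd(opus_path: str, wav_path: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that decodes an Opus file and resamples it to sample_rate Hz."""
    return ["ffmpeg", "-i", opus_path, "-ar", str(sample_rate), wav_path]


def opus_to_wav(opus_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_ffmpeg_cmd(opus_path, wav_path, sample_rate), check=True)


if __name__ == "__main__":
    # Print the command for the audio path shown in the metadata snippet.
    print(" ".join(build_ffmpeg_cmd("audio/podcast/P0001/POD0000000025.opus", "POD0000000025.wav")))
```

Keeping the command construction separate from the `subprocess.run` call makes it easy to inspect or log the exact command before running it on a large number of files.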

Text Pre-Processing

Punctuation: We keep four punctuation tags in the normalized text (see the text_tn entry in GigaSpeech.json):
```
<COMMA>
<PERIOD>
<QUESTIONMARK>
<EX
```
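For recipes that expect unpunctuated transcripts, one option is to strip these tags from `text_tn`. The following is a minimal sketch under the assumption that every tag is an upper-case word in angle brackets, as in the list above. `strip_tags` is a hypothetical helper, and it would also remove any other token of that shape.

```python
import re

# Matches tokens such as <COMMA> or <PERIOD>: upper-case letters inside angle brackets.
_TAG = re.compile(r"<[A-Z]+>")


def strip_tags(text_tn: str) -> str:
    """Remove angle-bracket tags and collapse the leftover whitespace."""
    return " ".join(_TAG.sub(" ", text_tn).split())


if __name__ == "__main__":
    print(strip_tags("FOUR O'CLOCK TOMORROW AFTERNOON <COMMA> SAID WILLIAMS <PERIOD>"))
    # prints: FOUR O'CLOCK TOMORROW AFTERNOON SAID WILLIAMS
```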
