SGEM

Official PyTorch implementation of SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization (INTERSPEECH 2023 Oral Presentation)


Introduction

This repository contains the official PyTorch implementation of the following paper:

SGEM: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization
Changhun Kim, Joonhyung Park, Hajin Shim and Eunho Yang
Conference of the International Speech Communication Association (INTERSPEECH), 2023 (Oral Presentation, 348/2293 = 15.18%)

Abstract: Automatic speech recognition (ASR) models are frequently exposed to data distribution shifts in many real-world scenarios, leading to erroneous predictions. To tackle this issue, an existing test-time adaptation (TTA) method has recently been proposed to adapt the pre-trained ASR model on unlabeled test instances without source data. Despite decent performance gain, this work relies solely on naive greedy decoding and performs adaptation across timesteps at a frame level, which may not be optimal given the sequential nature of the model output. Motivated by this, we propose a novel TTA framework, dubbed SGEM, for general ASR models. To treat the sequential output, SGEM first exploits beam search to explore candidate output logits and selects the most plausible one. Then, it utilizes generalized entropy minimization and negative sampling as unsupervised objectives to adapt the model. SGEM achieves state-of-the-art performance for three mainstream ASR models under various domain shifts.
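As a rough illustration of the entropy objective named in the abstract, the sketch below computes a generalized entropy over per-frame logits in plain Python. It assumes the Rényi entropy as the generalization and uses a made-up order `alpha=0.5`; the function names are hypothetical, and beam-search candidate selection and negative sampling are not shown. It is not the loss code from this repository.

```python
import math


def softmax(logits):
    """Convert one frame of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha > 0; alpha close to 1 falls back to Shannon entropy."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if abs(alpha - 1.0) < 1e-8:
        return -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def mean_frame_entropy(frame_logits, alpha=0.5):
    """Average the entropy over all frames of one utterance (alpha is an assumed value)."""
    entropies = [renyi_entropy(softmax(frame), alpha) for frame in frame_logits]
    return sum(entropies) / len(entropies)


# Toy utterance: 3 frames over a 4-token vocabulary (made-up numbers).
toy_logits = [[4.0, 0.5, 0.1, 0.1], [0.2, 3.5, 0.3, 0.1], [1.0, 1.0, 1.0, 1.0]]
print(mean_frame_entropy(toy_logits, alpha=0.5))
```

Minimizing such a quantity on unlabeled test utterances pushes the model toward more confident predictions; a uniform frame gives the maximum value, log of the vocabulary size, for any order.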

Environmental Setup

conda create -y -n sgem python=3.7
conda activate sgem
pip install -r requirements.txt

Datasets

LibriSpeech
- You can get test-other.tar.gz from LibriSpeech using the link above.
CHiME-3
- You need to manually download the CHiME-3 dataset using the link above with a standard Linguistic Data Consortium account.
TED-LIUM 2
- You can get the TED-LIUM 2 dataset using the link above.
- You also need to preprocess the data with data/preprocess_ted.py and data/preprocess_ted.sh.
CommonVoice
- You can get the Common Voice Corpus 5.1 dataset using the link above.
Valentini
- You can get noisy_testset_wav.zip and testset_txt.zip in the Valentini dataset using the link above.
L2-Arctic
- You can get the L2-Arctic dataset using the link above.
- The speaker used for each native language is as follows:
Language | Speaker
--- | ---
Arabic | SKA
Mandarin | BWC
Hindi | RRBI
Korean | HKK
Spanish | EBVS
Vietnamese | PNV
MS-SNSD
- All background noises used in the paper are included in the res folder. (res/*.wav)
- Set speech_dir and snr_lower in conf/noisyspeech_synthesizer.cfg.
- You can make synthetic distribution shift datasets with the following command:
```
python corpus/noisyspeech_synthesizer.py
```
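To make the role of the SNR setting concrete, the sketch below mixes a speech signal with background noise at a target signal-to-noise ratio in plain Python. It only illustrates the general technique; it is not the code in corpus/noisyspeech_synthesizer.py, and the function names and toy signals are hypothetical.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio equals snr_db."""
    # Loop the noise clip so it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    # Solve 10 * log10(P_speech / (scale^2 * P_noise)) = snr_db for scale.
    scale = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR of a mixture when the clean speech is known."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(power(speech) / power(residual))


# Toy signals: a sine wave as "speech" and an alternating pattern as "noise".
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.3 if i % 2 == 0 else -0.3 for i in range(100)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
print(round(measured_snr_db(speech, mixed), 3))
```

A lower SNR means a larger noise scale and therefore a stronger distribution shift relative to the clean test set.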

Pre-trained Models

CTC-based Model
- The CTC-based model is downloaded automatically if you set asr to facebook/wav2vec2-base-960h.
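For orientation, the sketch below shows how a checkpoint with this name is typically fetched and used for greedy decoding through the Hugging Face Transformers API. Whether main.py loads the model in exactly this way is an assumption; the helper names `load_ctc_model` and `transcribe` are hypothetical.

```python
ASR_NAME = "facebook/wav2vec2-base-960h"  # checkpoint name given in this README


def load_ctc_model(asr_name=ASR_NAME):
    """Download (on first use) and return the processor and CTC model."""
    # Imported lazily so that defining the helpers does not require the heavy dependencies.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(asr_name)
    model = Wav2Vec2ForCTC.from_pretrained(asr_name)
    model.eval()
    return processor, model


def transcribe(waveform, processor, model, sampling_rate=16000):
    """Greedy CTC decoding of one mono waveform (a 1-D float array at 16 kHz)."""
    import torch

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)  # most likely token per frame
    return processor.batch_decode(predicted_ids)[0]
```

The first call to `load_ctc_model()` needs network access; later calls reuse the local Hugging Face cache.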

Conformer

You need to download the Conformer model yourself using the following command:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_ctc_small_ls/versions/1.0.0/zip -P pretrained_models

Transducer

You need to download the Transducer model yourself using the following command:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_transducer_small/versions/1.6.0/zip -P pretrained_models

4-gram Language Model for CTC-based Model

You need to download the language model yourself using the following commands:

git lfs install
git clone https://huggingface.co/patrickvonplaten/wav2vec2-base-100h-with-lm pretrained_models/wav2vec2-base-100h-with-lm

Run

You can run main.py using the command below:

python main.py \
    --config-name [CONFIG.YAML] \
    dataset_name=[DATASET_NAME] \
    dataset_dir=[DATASET_DIR]

Currently available parameters are as follows:

Parameter | Value
--- | ---
CONFIG.YAML | config.yaml, config_{sgem\|suta}_{ctc\|conformer\|transducer}.yaml
DATASET_NAME | librispeech, chime, ted, commonvoice, valentini, l2arctic
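The config name pattern expands to six method/model combinations. As a convenience, the sketch below enumerates them and builds the matching main.py command lines; the helper names and the dataset path `/path/to/LibriSpeech` are hypothetical placeholders, and nothing is executed.

```python
from itertools import product

METHODS = ["sgem", "suta"]
MODELS = ["ctc", "conformer", "transducer"]
DATASETS = ["librispeech", "chime", "ted", "commonvoice", "valentini", "l2arctic"]


def config_names():
    """Expand config_{sgem|suta}_{ctc|conformer|transducer}.yaml into concrete names."""
    return [f"config_{method}_{model}.yaml" for method, model in product(METHODS, MODELS)]


def build_command(config_name, dataset_name, dataset_dir):
    """Return the argument list for one run of main.py."""
    if dataset_name not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset_name}")
    return [
        "python", "main.py",
        "--config-name", config_name,
        f"dataset_name={dataset_name}",
        f"dataset_dir={dataset_dir}",
    ]


if __name__ == "__main__":
    # Print one command per config for a single dataset (placeholder path).
    for name in config_names():
        print(" ".join(build_command(name, "librispeech", "/path/to/LibriSpeech")))
```

Each returned list can be passed to `subprocess.run` to launch the run.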

Contact

If you have any questions or comments, feel free to contact us via changhun.kim@kaist.ac.kr.

Citation

@inproceedings{kim2023sgem,
  title={{SGEM}: Test-Time Adaptation for Automatic Speech Recognition via Sequential-Level Generalized Entropy Minimization},
  author={Kim, Changhun and Park, Joonhyung and Shim, Hajin and Yang, Eunho},
  booktitle={Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2023}
}

Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).
