# Sequential2.0: A Self-Supervised Speech Translation Model Based on DeBERTa and Squeezeformer Using Pseudo Languages
Authors: FelHong Liu, RongCai Zhao
Paper link:
## Model Checkpoints
### Pre-trained Models
| Model | Pre-training updates | Dataset | Link |
| --- | --- | --- | --- |
| Sequential2.0 (from HuBERT-base) | 400K + 25K | LibriSpeech 960h | Download |
| Sequential2.0 (from HuBERT-base) | 400K + 100K | LibriSpeech 960h | Download |
| Sequential2.0 (from fat_en_zh) | 400K + 25K | AiShell 10Kh | Download |
| Sequential2.0 (from fat_en_zh) | 400K + 100K | AiShell 10Kh | Download |
### Fine-tuned Models
| Model | Pre-training updates | Finetuning split | Link |
| --- | --- | --- | --- |
| Sequential2.0 (from HuBERT-base) | 400K + 25K | LibriSpeech 10h | Download |
| Sequential2.0 (from HuBERT-base) | 400K + 100K | LibriSpeech 100h | Download |
| Sequential2.0 (from fat_en_zh) | 400K + 25K | ted_en_zh 10h | Download |
| Sequential2.0 (from fat_en_zh) | 400K + 100K | ted_en_zh 100h | Download |
### Pre-trained k-means Models for Pseudo Characters
| Number of Clusters | Link |
| --- | --- |
| 25 | Download |
| 100 | Download |
| 500 | Download |
### Pre-trained BPE Models for Pseudo Subwords
| Number of Clusters | Number of Subwords | Link |
| --- | --- | --- |
| 25 | 1000 | Download |
| 25 | 3000 | Download |
| 25 | 10000 | Download |
| 25 | 30000 | Download |
| 100 | 3000 | Download |
| 100 | 10000 | Download |
| 100 | 30000 | Download |
| 500 | 3000 | Download |
| 500 | 10000 | Download |
| 500 | 30000 | Download |
## Usage
### Dependencies
```
torch==1.9.0+cu111
torchaudio==0.9.0
tqdm==4.62.3
hydra-core==1.0.7
omegaconf==2.0.6
einops==0.3.0
fire==0.4.0
fairseq==1.0.0a0+bba000d
paddlepaddle==2.4.1
paddlespeech==1.4.1
```
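The CUDA-specific `torch` pin is not served from PyPI's default index. A minimal sketch of installing the pinned PyTorch packages from the official PyTorch wheel index (versions copied from the list above):

```bash
# The +cu111 build of torch 1.9.0 lives on the PyTorch wheel index, not PyPI;
# torchaudio 0.9.0 is the matching release.
pip install torch==1.9.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
```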
### Installation
```bash
git clone git@github.com:961241279/Sequential2.0.git
cd Sequential2.0
pip install -e .
```
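An optional sanity check that the editable install picked up the core dependencies:

```bash
# Should print the pinned versions from the dependency list above.
python -c "import torch, fairseq; print(torch.__version__, fairseq.__version__)"
```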
### Download the manifests generated by Paddle
- Please download the files from: manifests
- Unzip them and put the files under `data/`, as in the sketch below.
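For example, assuming the archive downloads as `manifests.zip` (the actual file name comes from the link above):

```bash
# Hypothetical archive name; substitute whatever the link above provides.
mkdir -p data
unzip manifests.zip -d data/
```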
### Creating Pseudo Subword Tokens
- Create wav2vec-style manifest files.
Please set `LIBRISPEECH_PATH` to your LibriSpeech folder, which contains the three subfolders `train-clean-100`, `train-clean-360`, and `train-other-500`.
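For example (the path below is illustrative):

```bash
export LIBRISPEECH_PATH=/path/to/LibriSpeech
ls $LIBRISPEECH_PATH  # should list train-clean-100, train-clean-360, train-other-500
```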
```bash
# librispeech:
mkdir -p manifest/librispeech/train-960
python -m examples.wav2vec.wav2vec_manifest $LIBRISPEECH_PATH --dest manifest/librispeech/train-960 --ext flac --valid-percent 0.01 --path-must-contain train

# aishell:
python utils/aishell.py --tgt-dir=YOUR_DATASET_DIR --src-dir=manifest/aishell

# ted_en_zh:
python utils/ted_en_zh.py --tgt-dir=YOUR_DATASET_DIR --src-dir=manifest/ted_en_zh
```
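You can sanity-check the output: a wav2vec-style manifest is a TSV whose first line is the data root and whose remaining lines are `relative/path.flac<TAB>num_samples`.

```bash
# First line: data root; following lines: <relative path>\t<number of samples>.
head -n 3 manifest/librispeech/train-960/train.tsv
```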
- Train a k-means model and get cluster indices.
Please make sure that you have downloaded the pre-trained HuBERT-base checkpoint to `HUBERT_PATH`. Note that this step requires a GPU for feature extraction and 64 GB of main memory for k-means training. Extracting HuBERT features takes about 15 minutes, training k-means may take about an hour, and dumping the cluster IDs for the whole LibriSpeech 960h dataset takes more than two hours.
```bash
HUBERT_PATH="save/pretrained/hubert_base_ls960.pt"
FAT_PATH="save/pretrained/fat_en_zh.pdparams"
mkdir -p save/pretrained
if ! [ -f $HUBERT_PATH ]; then
  wget https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt -O $HUBERT_PATH
fi
if ! [ -f $FAT_PATH ]; then
  wget https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/paddle.98.pdparams --no-check-certificate -O $FAT_PATH
fi
bash scripts/pl/extract-features.sh $HUBERT_PATH 9 2 2 500 False
bash scripts/pl/extract-features.sh $FAT_PATH 9 2 2 500 True
```
where 9, 2, 2, and 500 mean that we use the 9th layer of HuBERT, a kernel size of 2 and a stride of 2 for average pooling, and 500 clusters for k-means.
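As a quick sanity check on the resulting token rate (assuming HuBERT's standard 20 ms frame stride, i.e. one feature per 320 samples of 16 kHz audio):

```bash
echo $(( 16000 / 320 ))      # 50 HuBERT frames per second before pooling
echo $(( 16000 / 320 / 2 ))  # 25 pseudo-character tokens per second after kernel-2, stride-2 pooling
```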
- Train the BPE model and create pseudo subword tokens.
```bash
bash scripts/pl/create-pseudo-language.sh labels/hubert_base-l9-k2s2-fp16-ls0.1/c500 30000
bash scripts/pl/create-pseudo-language.sh labels/fat-l9-k2s2-fp16-ls0.1/c500 30000
```
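The tables above also list 25- and 100-cluster variants with 1k to 30k subwords. Assuming the label directories follow the same naming scheme (an assumption; check the actual directory names under `labels/`), those configurations would presumably be produced analogously:

```bash
# Hypothetical: 25 clusters, 3000 BPE subwords, following the naming scheme above.
bash scripts/pl/extract-features.sh $HUBERT_PATH 9 2 2 25 False
bash scripts/pl/create-pseudo-language.sh labels/hubert_base-l9-k2s2-fp16-ls0.1/c25 3000
```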
### Pre-training Sequential2.0
```bash
bash scripts/sequntial2.0-pt.sh wav2seq-hubert-base-ls960
bash scripts/sequntial2.0-pt.sh wav2seq-fat-base-ls960
```
### Fine-tuning Sequential2.0 on LibriSpeech
To fine-tune a pretrained checkpoint on LibriSpeech with 10h of labeled data, use this command:
```bash
bash scripts/sequntial2.0-ft-ls.sh $pretrained_ckpt ft-ls-10h
```
where `$pretrained_ckpt` is your pretrained checkpoint.
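For example, with one of the pre-trained checkpoints from the table above (the path below is illustrative):

```bash
pretrained_ckpt=save/pretrained/wav2seq-hubert-base-ls960.pt  # illustrative path
bash scripts/sequntial2.0-ft-ls.sh $pretrained_ckpt ft-ls-10h
```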
With 100h of supervised data, use this command:
```bash
bash scripts/sequntial2.0-ft-ls.sh $pretrained_ckpt ft-ls-100h
```
Please make sure that your manifest files are stored in `manifest/librispeech`.
We provide our manifests here for reproducibility. Please make sure to change the first line of all `.tsv` files so that the data path is set correctly.
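For example, since the first line of a wav2vec-style `.tsv` is the data root, it can be rewritten in place (the path below is illustrative):

```bash
# Rewrite the first line (data root) of every manifest; adjust the path first.
sed -i '1s|.*|/path/to/LibriSpeech|' manifest/librispeech/*.tsv
```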
We use a pretrained subword tokenizer (link) to convert LibriSpeech transcripts into subword tokens.
