ConST
code for paper "Cross-modal Contrastive Learning for Speech Translation" (NAACL 2022)
ConST: Cross-modal Contrastive Learning for Speech Translation
This is an implementation of the NAACL 2022 paper "Cross-modal Contrastive Learning for Speech Translation" (read paper here). The implementation is based on the fairseq codebase.
CONTRIBUTION: You are more than welcome to test our code on your machines and report feedback on results, bugs, and performance!
👀 Overview
The motivation of our ConST model is to learn similar representations for semantically similar speech and text.
<div align="center"> <img src="ConST/resources/motivation_figure.png" width="80%"> </div>

ConST (1) inherits the advantages of multi-task learning (as shown in our previous paper XSTNet (with code)), and (2) employs a contrastive learning approach to bridge the gap between low-level speech representation and text embedding.
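The contrastive idea described above can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: it assumes an InfoNCE-style loss over mean-pooled speech and text vectors compared by cosine similarity, and the function names and the temperature value are hypothetical choices made for illustration.

```python
import math


def mean_pool(frames):
    """Average a sequence of frame vectors into one sentence-level vector."""
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def contrastive_loss(speech_vecs, text_vecs, temperature=0.1):
    """InfoNCE-style loss: speech_vecs[i] should match text_vecs[i].

    Every other text vector in the batch serves as a negative example.
    The temperature of 0.1 is an arbitrary illustrative value.
    """
    total = 0.0
    for i, speech in enumerate(speech_vecs):
        logits = [cosine(speech, text) / temperature for text in text_vecs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(speech_vecs)


# Toy batch: two utterances and their transcripts in a 2-d embedding space.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = contrastive_loss(speech, aligned_text)
loss_shuffled = contrastive_loss(speech, shuffled_text)
```

In this toy batch the loss is close to zero when each speech vector points the same way as its own transcript, and large when the pairs are shuffled. That gap is the training signal that pulls matching speech and text representations together.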
<div align="center"> <img src="ConST/resources/ConST_figure.png" width="100%"> </div>

Result on MuST-C En-X dataset
We report case-sensitive detokenized BLEU, computed with the sacrebleu toolkit.
| Model        | En-De | En-Es | En-Fr | En-It | En-Nl | En-Pt | En-Ro | En-Ru | Avg.  |
| ------------ |:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| ConST-base   | 25.7  | 30.4  | 36.8  | 26.3  | 30.6  | 32.0  | 24.8  | 17.3  | 28.0  |
| ConST-expand | 28.3  | 32.0  | 38.3  | 27.2  | 31.7  | 33.1  | 25.6  | 18.9  | 29.4  |
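As a quick arithmetic check, the Avg. column can be reproduced from the eight per-language scores in the table. The snippet below is only a sanity check on the reported numbers; the variable names are illustrative.

```python
# Per-language BLEU scores copied from the table, in the order
# En-De, En-Es, En-Fr, En-It, En-Nl, En-Pt, En-Ro, En-Ru.
bleu = {
    "ConST-base": [25.7, 30.4, 36.8, 26.3, 30.6, 32.0, 24.8, 17.3],
    "ConST-expand": [28.3, 32.0, 38.3, 27.2, 31.7, 33.1, 25.6, 18.9],
}


def average(scores):
    """Unweighted mean over language pairs, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


averages = {model: average(scores) for model, scores in bleu.items()}
for model, value in averages.items():
    print(f"{model}: {value}")
```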
🤗 Huggingface Space Demo available now!
Experience our end-to-end voice translation system on Huggingface Space now! Record a sentence in English and translate it into other languages! You are a TRANSLATOR!
HERE IS THE WEBSITE:
https://huggingface.co/spaces/ReneeYe/ConST-speech2text-translator
<div align="center"> <img src="ConST/resources/demo_screenshot.png" width="100%"> </div>

P.S. Since huggingface space only provides CPU, it takes 12-20 seconds to run inference and generate the translation result.
⬇️ Download Trained Models
The models are trained with PyTorch. You may download all the models at 🤗huggingface model.

| Datasets | Model    | SPM & Vocab      |
|:--------:|:--------:|:----------------:|
| En-De    | Download | SPM model; Vocab |
| En-Es    | Download | SPM model; Vocab |
| En-Fr    | Download | SPM model; Vocab |
| En-It    | Download | SPM model; Vocab |
| En-Nl    | Download | SPM model; Vocab |
| En-Pt    | Download | SPM model; Vocab |
| En-Ro    | Download | SPM model; Vocab |
| En-Ru    | Download | SPM model; Vocab |
Training & Generation Instruction
⚙️ Requirements and Installation
- PyTorch version >= 1.5.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
git clone git@github.com:ReneeYe/ConST.git
cd ConST
pip3 install -r requirements.txt
pip3 install --editable ./
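After installation, a short script can confirm that the environment meets the version requirements listed above (Python >= 3.6, PyTorch >= 1.5.0). This is a minimal sketch and not part of the repository; the helper names `parse_version`, `meets_minimum` and `check_environment` are hypothetical.

```python
import re
import sys


def parse_version(text):
    """Turn a string such as '1.13.1+cu117' into a tuple like (1, 13, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '1.5' and '1.5.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(found, minimum):
    """True if the version string `found` is at least `minimum`."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Report whether Python and PyTorch satisfy the stated minimums."""
    python_version = "%d.%d.%d" % sys.version_info[:3]
    report = {"python": meets_minimum(python_version, "3.6")}
    try:
        import torch
        report["torch"] = meets_minimum(str(torch.__version__), "1.5.0")
    except ImportError:
        # PyTorch is not installed in this environment.
        report["torch"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'missing or too old'}")
```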
📉 Pre-processing and Training
The data pre-processing instructions are here. To train the model, taking En-De as an example, you may run:
bash ConST/scripts/train_en2x.sh de checkpoint/model_saved
🤖️ Inference, Generation and Evaluation
We strongly recommend averaging checkpoints once you have identified the checkpoint with the highest BLEU on the dev set.
python3 ConST/scripts/average_checkpoints.py --inputs checkpoint/model_saved \
--num-update-checkpoints 10 --checkpoint-upper-bound ${step-to-get-the-best-dev} \
--output ${path-to-averaged-ckpt}
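Conceptually, checkpoint averaging takes the element-wise mean of each parameter across several saved checkpoints. The sketch below illustrates that idea with plain Python lists standing in for tensors. It is a toy under that assumption, not the logic of `average_checkpoints.py`, which operates on PyTorch checkpoint files.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is modelled as a dict mapping a parameter name to a
    flat list of floats (a stand-in for a tensor).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = state_dicts[0].keys()
    for state in state_dicts:
        if state.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    count = len(state_dicts)
    averaged = {}
    for name in names:
        # zip(*...) walks the same position of this parameter in every checkpoint.
        columns = zip(*(state[name] for state in state_dicts))
        averaged[name] = [sum(column) / count for column in columns]
    return averaged


# Two toy checkpoints with one weight vector and one bias each.
checkpoints = [
    {"encoder.weight": [1.0, 2.0], "encoder.bias": [0.0]},
    {"encoder.weight": [3.0, 4.0], "encoder.bias": [1.0]},
]
averaged = average_checkpoints(checkpoints)
```

Averaging the weights of several nearby checkpoints is often observed to give slightly better and more stable BLEU than any single checkpoint.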
Then generate and evaluate your model.
fairseq-generate data/ --gen-subset tst-COMMON_st --task speech_to_text --prefix-size 1 \
--max-tokens 4000000 --max-source-positions 4000000 --beam 10 \
--config-yaml config_st.yaml --path ${path-to-averaged-ckpt} \
--scoring sacrebleu
✏️ Citation
@InProceedings{ye2022cross,
author = {Rong Ye and Mingxuan Wang and Lei Li},
booktitle = {Proc. of NAACL},
title = {Cross-modal Contrastive Learning for Speech Translation},
year = {2022}
}
