ConST

Code for the paper "Cross-modal Contrastive Learning for Speech Translation" (NAACL 2022).

ConST: Cross-modal Contrastive Learning for Speech Translation

This is an implementation of the NAACL 2022 paper "Cross-modal Contrastive Learning for Speech Translation" (read paper here). The implementation is based on the fairseq codebase.

CONTRIBUTION: You are more than welcome to test our code on your own machines and report feedback on results, bugs, and performance!

👀 Overview

The motivation of our ConST model is to learn similar representations for semantically similar speech and text.

<div align="center"> <img src="ConST/resources/motivation_figure.png" width="80%"> </div>

ConST (1) inherits the advantages of multi-task learning (as shown in our previous paper XSTNet (with code)), and (2) employs a contrastive learning approach to bridge the gap between low-level speech representations and text embeddings.
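The contrastive idea can be illustrated with a minimal sketch. This is a generic InfoNCE-style loss over sentence-level vectors written in plain Python, not the code from this repository: the function names, the temperature value of 0.1, and the toy vectors are all assumptions made for illustration. Each speech vector is pulled toward its paired text vector, while the other text vectors in the batch act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(speech_vecs, text_vecs, temperature=0.1):
    """InfoNCE-style loss; speech_vecs[i] is paired with text_vecs[i].

    All other text vectors in the batch serve as negative examples.
    The temperature default is a hypothetical value, not taken from the paper.
    """
    losses = []
    for i, u in enumerate(speech_vecs):
        logits = [cosine(u, v) / temperature for v in text_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy sentence-level vectors (hypothetical values standing in for
# pooled speech representations and pooled text embeddings).
speech = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(speech, text_aligned)
loss_swapped = contrastive_loss(speech, text_swapped)
print(loss_aligned < loss_swapped)  # correctly paired batches get the lower loss
```

In this toy setup the loss is small when each speech vector is closest to its own text vector and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.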

<div align="center"> <img src="ConST/resources/ConST_figure.png" width="100%"> </div>

Results on the MuST-C En-X dataset

We report case-sensitive detokenized BLEU, computed with the sacrebleu toolkit.

| Model        | En-De | En-Es | En-Fr | En-It | En-Nl | En-Pt | En-Ro | En-Ru | Avg.  |
| ------------ |:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| ConST-base   | 25.7  | 30.4  | 36.8  | 26.3  | 30.6  | 32.0  | 24.8  | 17.3  | 28.0  |
| ConST-expand | 28.3  | 32.0  | 38.3  | 27.2  | 31.7  | 33.1  | 25.6  | 18.9  | 29.4  |
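As a quick sanity check on the table, the Avg. column can be reproduced from the eight per-language scores. The sketch below assumes Avg. is the unweighted mean over the eight language pairs, rounded to one decimal; that reading is an assumption, although the numbers agree with it.

```python
# BLEU scores copied from the table above, in the order
# En-De, En-Es, En-Fr, En-It, En-Nl, En-Pt, En-Ro, En-Ru.
scores = {
    "ConST-base": [25.7, 30.4, 36.8, 26.3, 30.6, 32.0, 24.8, 17.3],
    "ConST-expand": [28.3, 32.0, 38.3, 27.2, 31.7, 33.1, 25.6, 18.9],
}

# Unweighted mean over the eight directions, rounded to one decimal place.
averages = {
    model: round(sum(values) / len(values), 1)
    for model, values in scores.items()
}
print(averages)  # {'ConST-base': 28.0, 'ConST-expand': 29.4}
```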

🤗 Huggingface Space Demo available now!

Experience our end-to-end voice translation system on Huggingface Space now! Record a sentence in English and translate it into other languages! You are a TRANSLATOR!

HERE IS THE WEBSITE:

https://huggingface.co/spaces/ReneeYe/ConST-speech2text-translator

<div align="center"> <img src="ConST/resources/demo_screenshot.png" width="100%"> </div>

P.S. Since the Huggingface Space only provides CPUs, it takes 12-20 seconds to run inference and generate the translation result.

⬇️ Download Trained Models

The models are trained with PyTorch. You may download all the models at 🤗huggingface model.

| Datasets | Model    | SPM & Vocab      |
|:--------:|:--------:|:----------------:|
| En-De    | Download | SPM model; Vocab |
| En-Es    | Download | SPM model; Vocab |
| En-Fr    | Download | SPM model; Vocab |
| En-It    | Download | SPM model; Vocab |
| En-Nl    | Download | SPM model; Vocab |
| En-Pt    | Download | SPM model; Vocab |
| En-Ro    | Download | SPM model; Vocab |
| En-Ru    | Download | SPM model; Vocab |

Training & Generation Instruction

⚙️ Requirements and Installation

  • PyTorch version >= 1.5.0
  • Python version >= 3.6
  • For training new models, you'll also need an NVIDIA GPU and NCCL

git clone git@github.com:ReneeYe/ConST.git
cd ConST
pip3 install -r requirements.txt
pip3 install --editable ./
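Before installing, it may help to confirm that the environment meets the minimum versions listed above. The following is a small sketch of such a check; the helper names `version_tuple` and `meets_minimum` are hypothetical and not part of this repository, and the parsing only looks at the leading numeric parts of a version string.

```python
import re
import sys


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    core = version.split("+")[0]  # drop a local suffix such as '+cu117'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    while len(parts) < 3:  # pad '1.5' to (1, 5, 0)
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


print("Python >= 3.6:", sys.version_info >= (3, 6))

try:
    import torch  # only needed for the PyTorch check
    print("PyTorch >= 1.5.0:", meets_minimum(torch.__version__, "1.5.0"))
except ImportError:
    print("PyTorch is not installed")
```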

📉 Pre-processing and Training

The instructions for data pre-processing are here. To train the model, taking En-De as an example, you may run:

bash ConST/scripts/train_en2x.sh de checkpoint/model_saved

🤖️ Inference, Generation and Evaluation

We strongly recommend that you average the checkpoints after you find the best checkpoint, i.e. the one with the highest BLEU on the dev set.

python3 ConST/scripts/average_checkpoints.py --inputs checkpoint/model_saved \
--num-update-checkpoints 10 --checkpoint-upper-bound ${step-to-get-the-best-dev} \
--output ${path-to-averaged-ckpt}
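Conceptually, the averaging step above picks the most recent update-based checkpoints at or below a given step and takes the element-wise mean of their parameters. The sketch below illustrates that idea in plain Python; it is not the script's implementation. The `checkpoint_<epoch>_<updates>.pt` naming pattern, the function names, and the toy weights are assumptions for illustration, and real checkpoints hold tensors rather than lists.

```python
import re


def select_last_checkpoints(filenames, n, upper_bound):
    """Return the n checkpoints with the largest update step <= upper_bound."""
    # Assumed naming pattern: checkpoint_<epoch>_<updates>.pt
    pattern = re.compile(r"checkpoint_\d+_(\d+)\.pt")
    candidates = []
    for name in filenames:
        match = pattern.fullmatch(name)
        if match and int(match.group(1)) <= upper_bound:
            candidates.append((int(match.group(1)), name))
    candidates.sort(reverse=True)  # newest update step first
    return [name for _, name in candidates[:n]]


def average_parameters(states):
    """Element-wise mean of parameter dicts that share keys and lengths."""
    averaged = {}
    for key in states[0]:
        columns = zip(*(state[key] for state in states))
        averaged[key] = [sum(column) / len(states) for column in columns]
    return averaged


files = [
    "checkpoint_1_1000.pt",
    "checkpoint_2_2000.pt",
    "checkpoint_3_3000.pt",
    "checkpoint_best.pt",
]
chosen = select_last_checkpoints(files, n=2, upper_bound=2000)

# Stand-in for loaded weights: {parameter name: flat list of floats}.
fake_weights = {
    "checkpoint_1_1000.pt": {"w": [1.0, 2.0]},
    "checkpoint_2_2000.pt": {"w": [3.0, 4.0]},
}
averaged = average_parameters([fake_weights[name] for name in chosen])
print(chosen)    # ['checkpoint_2_2000.pt', 'checkpoint_1_1000.pt']
print(averaged)  # {'w': [2.0, 3.0]}
```

In this toy run the checkpoint at step 3000 is skipped because it lies above the upper bound, which mirrors why the upper bound is set to the step of the best dev checkpoint.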

Then generate and evaluate your model.

fairseq-generate data/ --gen-subset tst-COMMON_st --task speech_to_text --prefix-size 1 \
--max-tokens 4000000 --max-source-positions 4000000 --beam 10 \
--config-yaml config_st.yaml  --path ${path-to-averaged-ckpt} \
--scoring sacrebleu

✏️ Citation

@InProceedings{ye2022cross,
  author    = {Rong Ye and Mingxuan Wang and Lei Li},
  booktitle = {Proc. of NAACL},
  title     = {Cross-modal Contrastive Learning for Speech Translation},
  year      = {2022}
}
