This is an implementation of the NAACL 2022 paper "Cross-modal Contrastive Learning for Speech Translation" (read paper here). The implementation is based on the fairseq codebase.
CONTRIBUTION: You are more than welcome to test our code on your machines and report feedback on results, bugs, and performance!
The motivation of our ConST model is to learn similar representations for semantically similar speech and text.
ConST (1) inherits the advantages of multi-task learning (as shown in our previous paper XSTNet (with code)), and (2) employs a contrastive learning approach to bridge the gap between low-level speech representations and text embeddings.
We report case-sensitive detokenized BLEU computed with the sacrebleu toolkit.
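To make the contrastive idea concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss over paired speech and text vectors. It is an illustration only, not the repository's implementation: the function names, the temperature value, and the use of in-batch negatives with cosine similarity are assumptions made for this example, and all vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(speech_vecs, text_vecs, temperature=0.1):
    """Mean InfoNCE-style loss; speech_vecs[i] and text_vecs[i] form a positive pair.

    The other text vectors in the batch act as negatives (an assumption of this sketch).
    """
    total = 0.0
    for i, u in enumerate(speech_vecs):
        logits = [cosine(u, v) / temperature for v in text_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(speech_vecs)

# Toy batch: in `aligned`, each speech vector is close to its own text vector.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(speech, [[0.9, 0.1], [0.1, 0.9]])
swapped = contrastive_loss(speech, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < swapped)
```

The loss is small when each speech vector is most similar to its paired text vector and large when the pairs are mismatched, which is the behavior a contrastive objective is meant to reward.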
Model | En-De | En-Es | En-Fr | En-It | En-Nl | En-Pt | En-Ro | En-Ru | Avg. |
---|---|---|---|---|---|---|---|---|---|
ConST-base | 25.7 | 30.4 | 36.8 | 26.3 | 30.6 | 32.0 | 24.8 | 17.3 | 28.0 |
ConST-expand | 28.3 | 32.0 | 38.3 | 27.2 | 31.7 | 33.1 | 25.6 | 18.9 | 29.4 |
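The "Avg." column is the unweighted mean of the eight language-pair scores. As a quick sanity check, the following sketch recomputes it from the numbers in the table; the variable names are chosen for this example only.

```python
# BLEU scores copied from the table, in the order De, Es, Fr, It, Nl, Pt, Ro, Ru.
scores = {
    "ConST-base": [25.7, 30.4, 36.8, 26.3, 30.6, 32.0, 24.8, 17.3],
    "ConST-expand": [28.3, 32.0, 38.3, 27.2, 31.7, 33.1, 25.6, 18.9],
}

# Unweighted mean per model, rounded to one decimal like the table.
averages = {model: round(sum(v) / len(v), 1) for model, v in scores.items()}
print(averages)  # {'ConST-base': 28.0, 'ConST-expand': 29.4}
```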
Experience our end-to-end voice translation system on Huggingface Space now! Record a sentence in English and translate it into other languages! You are a TRANSLATOR!
HERE IS THE WEBSITE:
https://huggingface.co/spaces/ReneeYe/ConST-speech2text-translator
P.S. Since Hugging Face Spaces provides only CPUs, inference takes 12-20 seconds to generate the translation result.
The models are trained with PyTorch. You can download all of the models from the 🤗 Hugging Face model hub.
Datasets | Model | SPM & Vocab |
---|---|---|
En-De | Download | SPM model; Vocab |
En-Es | Download | SPM model; Vocab |
En-Fr | Download | SPM model; Vocab |
En-It | Download | SPM model; Vocab |
En-Nl | Download | SPM model; Vocab |
En-Pt | Download | SPM model; Vocab |
En-Ro | Download | SPM model; Vocab |
En-Ru | Download | SPM model; Vocab |
- PyTorch version >= 1.5.0
- Python version >= 3.6
- For training new models, you'll also need an NVIDIA GPU and NCCL
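A quick way to check these requirements before installing is sketched below. The helper names `parse_version` and `meets_minimum` are made up for this example; only `sys.version_info`, `torch.__version__`, and `torch.cuda.is_available()` are real interfaces. The script does not check for NCCL.

```python
import re
import sys

def parse_version(text):
    """Turn a version string such as '1.13.1+cu117' into (1, 13, 1)."""
    numbers = re.findall(r"\d+", str(text).split("+")[0])
    parts = [int(n) for n in numbers[:3]]
    parts += [0] * (3 - len(parts))  # pad '1.5' to (1, 5, 0)
    return tuple(parts)

def meets_minimum(version_text, minimum):
    """True if version_text is at least the (major, minor, patch) minimum."""
    return parse_version(version_text) >= minimum

print("Python >= 3.6:", sys.version_info >= (3, 6))
try:
    import torch
    print("PyTorch >= 1.5.0:", meets_minimum(torch.__version__, (1, 5, 0)))
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch is not installed")
```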
git clone git@github.com:ReneeYe/ConST.git
cd ConST
pip3 install -r requirements.txt
pip3 install --editable ./
Instructions for data pre-processing are here. To train the model, taking En-De as an example, run:
bash ConST/scripts/train_en2x.sh de checkpoint/model_saved
We strongly recommend averaging checkpoints once you have found the best checkpoint, i.e. the one with the highest BLEU on the dev set.
python3 ConST/scripts/average_checkpoints.py --inputs checkpoint/model_saved \
--num-update-checkpoints 10 --checkpoint-upper-bound ${step-to-get-the-best-dev} \
--output ${path-to-averaged-ckpt}
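Conceptually, checkpoint averaging takes the element-wise mean of each parameter across the selected checkpoints. The toy sketch below shows the idea on plain Python lists; it is not the code in `average_checkpoints.py`, which works on saved PyTorch checkpoints, and the function name and sample values are invented for illustration.

```python
def average_states(states):
    """Element-wise mean of parameter dicts that share the same keys and lengths."""
    averaged = {}
    for name in states[0]:
        # Group the i-th element of this parameter from every checkpoint.
        columns = zip(*(state[name] for state in states))
        averaged[name] = [sum(col) / len(states) for col in columns]
    return averaged

# Two hypothetical checkpoints with a weight vector "w" and a bias "b".
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
merged = average_states(ckpts)
print(merged)  # {'w': [2.0, 3.0], 'b': [0.5]}
```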
Then generate and evaluate your model.
fairseq-generate data/ --gen-subset tst-COMMON_st --task speech_to_text --prefix-size 1 \
--max-tokens 4000000 --max-source-positions 4000000 --beam 10 \
--config-yaml config_st.yaml --path ${path-to-averaged-ckpt} \
--scoring sacrebleu
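If you save the generation log and want to inspect hypothesis/reference pairs yourself, a sketch like the following can pull them out. It assumes the usual tab-separated fairseq-generate layout, where `T-<id>` lines carry the reference and `D-<id>` lines carry the score and the detokenized hypothesis; the function name and the sample lines are hypothetical and not real system output.

```python
def collect_pairs(lines):
    """Return (hypothesis, reference) pairs ordered by sentence id."""
    refs, hyps = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        tag, _, idx = fields[0].partition("-")
        if not idx.isdigit():
            continue  # skip logging lines and anything without a numeric id
        if tag == "T" and len(fields) >= 2:
            refs[int(idx)] = fields[1]
        elif tag == "D" and len(fields) >= 3:
            hyps[int(idx)] = fields[2]
    return [(hyps[i], refs[i]) for i in sorted(hyps) if i in refs]

# Hypothetical log excerpt, written by hand for this example.
sample = [
    "T-1\tGuten Morgen.",
    "H-1\t-0.21\tGuten Morgen .",
    "D-1\t-0.21\tGuten Morgen.",
    "T-0\tVielen Dank.",
    "H-0\t-0.35\tDanke sehr .",
    "D-0\t-0.35\tDanke sehr.",
]
pairs = collect_pairs(sample)
print(pairs)
```

The resulting lists of hypotheses and references can then be passed to a scoring tool of your choice.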
@InProceedings{ye2022cross,
author = {Rong Ye and Mingxuan Wang and Lei Li},
booktitle = {Proc. of NAACL},
title = {Cross-modal Contrastive Learning for Speech Translation},
year = {2022}
}