Corpora

Parallel corpora for the biomedical domain

Welcome to the repositories of the WMT Biomedical Translation Task

Here we host various datasets that we have compiled for the Biomedical Translation Task at WMT.

  • Medline dataset of titles and abstracts of scientific publications (FR/EN, PT/EN, ES/EN, DE/EN, ZH/EN, RO/EN, IT/EN, RU/EN)
  • Scielo dataset of scientific publications (FR/EN, PT/EN, ES/PT)
  • EDP dataset of scientific publications (FR/EN)
  • ReBEC dataset of clinical trials (PT/EN)

List of corpora

Medline corpus

| datasets | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
| ---------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| training | WMT'16 | | | WMT'19 | WMT'20 | | WMT'22<sup>1</sup> | | |
| test set | | | WMT'18 | WMT'19 | WMT'20 | WMT'21 | WMT'22 | WMT'23 | WMT'24 |

<sup>1</sup> The parallel abstracts can be retrieved from Medline using our script: `wmtbio22_train_data.py`. The script uses Biopython, and you need a valid email address to access the data in Medline.
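For readers unfamiliar with Biopython, the following is a minimal sketch of how abstracts can be fetched from PubMed/Medline through `Bio.Entrez`. It is not the `wmtbio22_train_data.py` script, whose internals are not shown here. The function names `batch_pmids` and `fetch_abstracts`, the batch size, and the simple email check are all assumptions made for illustration.

```python
from typing import Iterable, List


def batch_pmids(pmids: Iterable[str], size: int = 100) -> List[List[str]]:
    """Group PMIDs into batches so each request stays small (hypothetical helper)."""
    items = [p.strip() for p in pmids if p.strip()]
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_abstracts(pmids: Iterable[str], email: str) -> List[str]:
    """Fetch plain-text abstracts for the given PMIDs (illustrative sketch only)."""
    # NCBI asks every Entrez client to identify itself with an email address.
    if "@" not in email:
        raise ValueError("a valid email address is required to query Entrez")

    # Imported lazily so the helper above works without Biopython installed.
    from Bio import Entrez  # requires: pip install biopython

    Entrez.email = email
    texts = []
    for batch in batch_pmids(pmids):
        # One efetch call per batch; IDs are passed as a comma-separated string.
        handle = Entrez.efetch(
            db="pubmed", id=",".join(batch), rettype="abstract", retmode="text"
        )
        try:
            texts.append(handle.read())
        finally:
            handle.close()
    return texts
```

Calling `fetch_abstracts(my_pmids, email="you@example.org")` requires network access; `my_pmids` and the address are placeholders. Each returned string holds the abstracts of one batch as plain text.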

| training | 2016 | 2019 | 2020 | 2022 |
| ---------- | ------ | ------ | ------ | ------ |
| en/es | x | x | | x |
| en/fr | x | x | | x |
| en/pt | x | x | | x |
| en/de | | x | | x |
| en/it | | | x | x |
| en/ru | | | x | x |

| test set | 2018 | 2019 | 2020 | 2021 |
| ---------- | ------ | ------ | ------ | ------ |
| en/es | x | x | x | x |
| en/fr | x | x | x | x |
| en/pt | x | x | x | x |
| en/de | x | x | x | x |
| en/zh | x | x | x | x |
| en/ro | x | | | |
| en/it | | | x | x |
| en/ru | | | x | x |

Scielo corpus

| test set | 2016 | 2017 |
| ---------- | ------ | ------ |
| en/es, en/fr, en/pt | test WMT'16 | test WMT'17 |

| training | parallel | monolingual |
| ---------- | ------ | ------ |
| en/es, en/fr, en/pt | training | monolingual |

EDP corpus

| test set | 2017 |
| ---------- | ------ |
| en/fr | test WMT'17 |

ReBEC corpus

| training | |
| ---------- | ------ |
| en/pt | dataset |

Publications

Please cite our publications if you use our corpora.

(WMT'22 Biomedical Task) Neves M, Jimeno Yepes A, Siu A, Roller R, Thomas P, Vicente Navarro M, Yeganova L, Wiemann D, Di Nunzio GM, Vezzani F, Gerardin C, Bawden R, Johan Estrada D, Lima-Lopez S, Farre-Maduel E, Krallinger M, Grozea C, Névéol A. Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports. PDF and BibTeX

(WMT'21 Biomedical Task) Yeganova L, Wiemann D, Neves M, Vezzani F, Siu A, Jauregi Unanue I, Oronoz M, Mah N, Névéol A, Martinez D, Bawden R, Di Nunzio GM, Roller R, Thomas P, Grozea C, Perez-de-Viñaspre O, Vicente Navarro M and Jimeno Yepes A. Findings of the WMT 2021 Biomedical Translation Shared Task: Summaries of Animal Experiments as New Test Set, 6th Conference on Machine Translation, EMNLP 2021. PDF and BibTeX

(WMT'20 Biomedical Task) Bawden R, Di Nunzio GM, Grozea C, Jauregi Unanue I, Jimeno Yepes A, Mah N, Martinez D, Névéol A, Neves M, Oronoz M, Perez de Viñaspre O, Piccardi M, Roller R, Siu A, Thomas P, Vezzani F, Vicente Navarro M, Wiemann D, Yeganova L. Findings of the WMT 2020 Biomedical Translation Shared Task: Basque, Italian and Russian as New Additional Languages, 5th Conference on Machine Translation, EMNLP 2020, online. PDF and BibTeX

(Survey of Authors’ Abstract Writing Practice) Névéol A, Jimeno Yepes A, Neves M. MEDLINE as a Parallel Corpus: a Survey to Gain Insight on French-, Spanish- and Portuguese-Speaking Authors’ Abstract Writing Practice, 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France. PDF and BibTeX

(WMT'19 Biomedical Task) Bawden R, Bretonnel Cohen K, Grozea C, Jimeno Yepes A, Kittner M, Krallinger M, Mah N, Névéol A, Neves M, Soares F, Siu A, Verspoor K, Vicente Navarro M. Findings of the WMT 2019 Biomedical Translation Shared Task: Evaluation for MEDLINE Abstracts and Biomedical Terminologies, 4th Conference on Machine Translation, ACL 2019, Florence, Italy. PDF and BibTeX

(WMT'18 Biomedical Task) Neves M, Jimeno Yepes A, Névéol A, Grozea C, Siu A, Kittner M, Verspoor K. Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets, Proceedings of the Third Conference on Machine Translation (WMT) at EMNLP, 2018, Brussels, Belgium. PDF and BibTeX

(Parallel Biomedical Corpora) Névéol A, Jimeno Yepes A, Neves M, Verspoor K. Parallel Corpora for the Biomedical Domain, International Conference on Language Resources and Evaluation (LREC), 2018, Miyazaki, Japan. PDF and BibTeX

(WMT'17 Biomedical Task) Jimeno Yepes A, Névéol A, Neves M, Verspoor K, Bojar O, Boyer A, Grozea C, Haddow B, Kittner M, Lichtblau Y, Pecina P, Roller R, Rosa R, Siu A, Thomas P, Trescher S. Findings of the WMT 2017 Biomedical Translation Shared Task, Proceedings of the Second Conference on Machine Translation (WMT17) at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark. PDF and BibTeX

(WMT'16 Biomedical Task) Bojar O, Chatterjee R, Federmann C, Graham Y, Haddow B, Huck M, Jimeno Yepes A, Koehn P, Logacheva V, Monz C, Negri M, Névéol A, Neves M, Popel M, Post M, Rubino R, Scarton C, Specia L, Turchi M, Verspoor K and Zampieri M. Findings of the 2016 Conference on Machine Translation, ACL 2016, Proceedings of the First Conference on Machine Translation (WMT16), pp. 131-198, 2016, Berlin, Germany. PDF and BibTeX

(Scielo corpus) Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia. PDF and BibTeX

Support or Contact

Please contact us by email, and join our discussion forum.

  • Antonio Jimeno Yepes (RMIT University, Australia)
  • Aurélie Névéol (LIMSI, CNRS, France)
  • Mariana Neves (German Federal Institute for Risk Assessment, Germany)
