MultiSim

Code and data for using the MultiSim benchmark from the ACL 2023 paper "Revisiting non-English Text Simplification: A Unified Multilingual Benchmark".
HuggingFace
The data is available on HuggingFace as `MichaelR207/MultiSimV2`.
Usage
from datasets import load_dataset
dataset = load_dataset("MichaelR207/MultiSimV2")
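Once the data is loaded, it can help to check which splits and columns are present before training or evaluating. The following is a minimal sketch and not part of the MultiSim code: `summarize_splits` is a hypothetical helper, and it assumes only that `load_dataset` returns a mapping from split names to sized objects (as a `DatasetDict` does). No MultiSim split or column names are assumed; the demo runs offline on a stand-in with made-up field names.

```python
def summarize_splits(dataset_dict):
    """Return {split_name: (num_rows, column_names)} for a mapping of splits.

    Hypothetical helper (not part of MultiSim). Works with a
    datasets.DatasetDict as well as a plain dict of lists.
    """
    summary = {}
    for split_name, split in dataset_dict.items():
        # datasets.Dataset exposes `column_names`; plain lists do not,
        # so fall back to the keys of the first row when rows are dicts.
        columns = getattr(split, "column_names", None)
        if columns is None:
            has_dict_rows = len(split) > 0 and isinstance(split[0], dict)
            columns = sorted(split[0]) if has_dict_rows else []
        summary[split_name] = (len(split), list(columns))
    return summary


if __name__ == "__main__":
    # Stand-in data with made-up field names, so the sketch runs offline.
    toy = {
        "train": [{"complex": "a long sentence", "simple": "a sentence"}] * 3,
        "test": [{"complex": "another long one", "simple": "another"}],
    }
    for name, (rows, cols) in summarize_splits(toy).items():
        print(f"{name}: {rows} rows, columns={cols}")
```

With the real data, pass the object returned by `load_dataset("MichaelR207/MultiSimV2")` in place of `toy`.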
Citation
If you use this benchmark, please cite our paper:
@inproceedings{ryan-etal-2023-revisiting,
title = "Revisiting non-{E}nglish Text Simplification: A Unified Multilingual Benchmark",
author = "Ryan, Michael and
Naous, Tarek and
Xu, Wei",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.269",
pages = "4898--4927",
abstract = "Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in developing more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications outperforming fine-tuned models in most languages. We validate these findings through human evaluation.",
}
Contact
Michael Ryan: Scholar | Twitter | GitHub | LinkedIn | ResearchGate | Personal Website | michaeljryan@gatech.edu
Data Availability
Public Datasets
Most of the public datasets are available as a part of this MultiSim Repo. A few are still pending availability. For all resources we provide alternative download links.

| Dataset | Language | Availability in MultiSim Repo | Alternative Link |
|---|---|---|---|
| ASSET | English | Available | https://huggingface.co/datasets/asset |
| WikiAuto | English | Available | https://huggingface.co/datasets/wiki_auto |
| CLEAR | French | Available | http://natalia.grabar.free.fr/resources.php#remi |
| WikiLargeFR | French | Available | http://natalia.grabar.free.fr/resources.php#remi |
| GEOLino | German | Available | https://github.com/Jmallins/ZEST-data |
| TextComplexityDE | German | Available | https://github.com/babaknaderi/TextComplexityDE |
| AdminIT | Italian | Available | https://github.com/Unipisa/admin-It |
| Simpitiki | Italian | Available | https://github.com/dhfbk/simpitiki# |
| PaCCSS-IT | Italian | Available | http://www.italianlp.it/resources/paccss-it-parallel-corpus-of-complex-simple-sentences-for-italian/ |
| Terence and Teacher | Italian | Available | http://www.italianlp.it/resources/terence-and-teacher/ |
| Easy Japanese | Japanese | Available | https://www.jnlp.org/GengoHouse/snow/t15 |
| Easy Japanese Extended | Japanese | Available | https://www.jnlp.org/GengoHouse/snow/t23 |
| RuAdapt Encyclopedia | Russian | Available | https://github.com/Digital-Pushkin-Lab/RuAdapt |
| RuAdapt Fairytales | Russian | Available | https://github.com/Digital-Pushkin-Lab/RuAdapt |
| RuSimpleSentEval | Russian | Available | https://github.com/dialogue-evaluation/RuSimpleSentEval |
| RuWikiLarge | Russian | Available | https://github.com/dialogue-evaluation/RuSimpleSentEval |
| SloTS | Slovene | Available | https://github.com/sabina-skubic/text-simplification-slovene |
| SimplifyUR | Urdu | Pending | https://github.com/harisbinzia/SimplifyUR |
| PorSimples | Brazilian Portuguese | Available | sandra@icmc.usp.br |
On Request Datasets
On-request datasets must be obtained from the authors of the original papers. Contact information for each dataset is provided below.

| Dataset | Language | Contact |
|---|---|---|
| CBST | Basque | http://www.ixa.eus/node/13007?language=en <br/> itziar.gonzalezd@ehu.eus |
| DSim | Danish | sk@eyejustread.com |
| Newsela EN | English | https://newsela.com/data/ |
| Newsela ES | Spanish | https://newsela.com/data/ |
| German News | German | ebling@cl.uzh.ch |
| Simple German | German | ebling@cl.uzh.ch |
| Simplext | Spanish | horacio.saggion@upf.edu |
| RuAdapt Literature | Russian | Partially Available: https://github.com/Digital-Pushkin-Lab/RuAdapt <br/> Full Dataset: anna.dmitrieva@helsinki.fi |
Specific Citations
Please cite the individual datasets that you use within the MultiSim benchmark as appropriate. BibTeX entries for each dataset are included below.
AdminIT
@inproceedings{miliani-etal-2022-neural,
title = "Neural Readability Pairwise Ranking for Sentences in {I}talian Administrative Language",
author = "Miliani, Martina and
Auriemma, Serena and
Alva-Manchego, Fernando and
Lenci, Alessandro",
booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
month = nov,
year = "2022",
address = "Online only",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.aacl-main.63",
pages = "849--866",
abstract = "Automatic Readability Assessment aims at assigning a complexity level to a given text, which could help improve the accessibility to information in specific domains, such as the administrative one. In this paper, we investigate the behavior of a Neural Pairwise Ranking Model (NPRM) for sentence-level readability assessment of Italian administrative texts. To deal with data scarcity, we experiment with cross-lingual, cross- and in-domain approaches, and test our models on Admin-It, a new parallel corpus in the Italian administrative language, containing sentences simplified using three different rewriting strategies. We show that NPRMs are effective in zero-shot scenarios ({\textasciitilde}0.78 ranking accuracy), especially with ranking pairs containing simplifications produced by overall rewriting at the sentence-level, and that the best results are obtained by adding in-domain data (achieving perfect performance for such sentence pairs). Finally, we investigate where NPRMs failed, showing that the characteristics of the training data, rather than its size, have a bigger effect on a model{'}s performance.",
}
ASSET
@inproceedings{alva-manchego-etal-2020-asset,
title = "{ASSET}: {A} Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations",
author = "Alva-Manchego, Fernando and
Martin, Louis and
Bordes, Antoine and
Scarton, Carolina and
Sagot, Beno{\^\i}t and
Specia, Lucia",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.424",
pages = "4668--4679",
}
CBST
@article{10.1007/s10579-017-9407-6,
title={{The corpus of Basque simplified texts (CBST)}},
author={Gonzalez-Dios, Itziar and Aranzabe, Mar{\'\i}a Jes{\'u}s and D{\'\i}az de Ilarraza, Arantza},
journal={Language Resources and Evaluation},
volume={52},
number={1},
pages={217--247},
year={2018},
publisher={Springer}
}
CLEAR
@inproceedings{grabar-cardon-2018-clear,
title = "{CLEAR} {--} Simple Corpus for Medical {F}rench",
author = "Grabar, Natalia and
Cardon, R{\'e}mi",
booktitle = "Proceedings of the 1st Workshop on Automatic Text Adaptation ({ATA})",
month = nov,
year = "2018",
address = "Tilburg, the Netherlands",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W18-7002",
doi = "10.18653/v1/W18-7002",
pages = "3--9",
}
DSim
@inproceedings{klerke-sogaard-2012-dsim,
title = "{DS}im, a {D}anish Parallel Corpus for Text Simplification",
author = "Klerke, Sigrid and
S{\o}gaard, Anders",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Associ
