MultiSim

Code and Data for using the MultiSim Benchmark


Figure showing four complex and simple sentence pairs. One pair in English, one in Japanese, one in Urdu, and one in Russian. The English complex sentence reads "He settled in London, devoting himself chiefly to practical teaching." which is paired with the simple sentence "He lived in London. He was a teacher."

Code and data for the MultiSim benchmark, introduced in the ACL 2023 paper "Revisiting non-English Text Simplification: A Unified Multilingual Benchmark".

HuggingFace

The data is available on HuggingFace as MichaelR207/MultiSimV2.

Usage

from datasets import load_dataset

dataset = load_dataset("MichaelR207/MultiSimV2")
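Once the dataset is loaded, a common next step is to flatten it into complex-simple sentence pairs. The following is a minimal sketch, not the authors' code. The column names `original` and `simple`, the `train` split name, and the helpers `iter_pairs` and `load_multisim_pairs` are assumptions made for illustration; check the dataset card for the actual schema before relying on them.

```python
from typing import Any, Iterable, Iterator, Mapping, Tuple


def iter_pairs(
    records: Iterable[Mapping[str, Any]],
    complex_key: str = "original",  # assumed column name
    simple_key: str = "simple",     # assumed column name
) -> Iterator[Tuple[str, str]]:
    """Yield (complex, simple) sentence pairs from dict-like records.

    The simple side may be a single string or a list of reference
    simplifications; each reference becomes its own pair.
    """
    for record in records:
        source = record[complex_key]
        target = record[simple_key]
        references = [target] if isinstance(target, str) else list(target)
        for reference in references:
            # Skip records with an empty source or an empty reference.
            if source and reference:
                yield source, reference


def load_multisim_pairs(split: str = "train"):
    """Load MultiSimV2 and return its pairs (split name is an assumption)."""
    from datasets import load_dataset  # imported lazily; needs network access

    dataset = load_dataset("MichaelR207/MultiSimV2")
    return list(iter_pairs(dataset[split]))
```

Calling `load_multisim_pairs()` downloads the data, so `iter_pairs` is kept separate and can be tried on in-memory records first, for example a single record built from the English pair shown in the figure above.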

Citation

If you use this benchmark, please cite our paper:

@inproceedings{ryan-etal-2023-revisiting,
    title = "Revisiting non-{E}nglish Text Simplification: A Unified Multilingual Benchmark",
    author = "Ryan, Michael  and
      Naous, Tarek  and
      Xu, Wei",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.269",
    pages = "4898--4927",
    abstract = "Recent advancements in high-quality, large-scale English resources have pushed the frontier of English Automatic Text Simplification (ATS) research. However, less work has been done on multilingual text simplification due to the lack of a diverse evaluation benchmark that covers complex-simple sentence pairs in many languages. This paper introduces the MultiSim benchmark, a collection of 27 resources in 12 distinct languages containing over 1.7 million complex-simple sentence pairs. This benchmark will encourage research in developing more effective multilingual text simplification models and evaluation metrics. Our experiments using MultiSim with pre-trained multilingual language models reveal exciting performance improvements from multilingual training in non-English settings. We observe strong performance from Russian in zero-shot cross-lingual transfer to low-resource languages. We further show that few-shot prompting with BLOOM-176b achieves comparable quality to reference simplifications outperforming fine-tuned models in most languages. We validate these findings through human evaluation.",
}

Contact

Data Availability

Public Datasets

Most of the public datasets are available as a part of this MultiSim Repo. A few are still pending availability. For all resources we provide alternative download links.

| Dataset | Language | Availability in MultiSim Repo | Alternative Link |
|---|---|---|---|
| ASSET | English | Available | https://huggingface.co/datasets/asset |
| WikiAuto | English | Available | https://huggingface.co/datasets/wiki_auto |
| CLEAR | French | Available | http://natalia.grabar.free.fr/resources.php#remi |
| WikiLargeFR | French | Available | http://natalia.grabar.free.fr/resources.php#remi |
| GEOLino | German | Available | https://github.com/Jmallins/ZEST-data |
| TextComplexityDE | German | Available | https://github.com/babaknaderi/TextComplexityDE |
| AdminIT | Italian | Available | https://github.com/Unipisa/admin-It |
| Simpitiki | Italian | Available | https://github.com/dhfbk/simpitiki# |
| PaCCSS-IT | Italian | Available | http://www.italianlp.it/resources/paccss-it-parallel-corpus-of-complex-simple-sentences-for-italian/ |
| Terence and Teacher | Italian | Available | http://www.italianlp.it/resources/terence-and-teacher/ |
| Easy Japanese | Japanese | Available | https://www.jnlp.org/GengoHouse/snow/t15 |
| Easy Japanese Extended | Japanese | Available | https://www.jnlp.org/GengoHouse/snow/t23 |
| RuAdapt Encyclopedia | Russian | Available | https://github.com/Digital-Pushkin-Lab/RuAdapt |
| RuAdapt Fairytales | Russian | Available | https://github.com/Digital-Pushkin-Lab/RuAdapt |
| RuSimpleSentEval | Russian | Available | https://github.com/dialogue-evaluation/RuSimpleSentEval |
| RuWikiLarge | Russian | Available | https://github.com/dialogue-evaluation/RuSimpleSentEval |
| SloTS | Slovene | Available | https://github.com/sabina-skubic/text-simplification-slovene |
| SimplifyUR | Urdu | Pending | https://github.com/harisbinzia/SimplifyUR |
| PorSimples | Brazilian Portuguese | Available | sandra@icmc.usp.br |

On Request Datasets

The authors of the original papers must be contacted for on-request datasets. Contact information for the authors of each dataset is provided below.

| Dataset | Language | Contact |
|---|---|---|
| CBST | Basque | http://www.ixa.eus/node/13007?language=en <br/> itziar.gonzalezd@ehu.eus |
| DSim | Danish | sk@eyejustread.com |
| Newsela EN | English | https://newsela.com/data/ |
| Newsela ES | Spanish | https://newsela.com/data/ |
| German News | German | ebling@cl.uzh.ch |
| Simple German | German | ebling@cl.uzh.ch |
| Simplext | Spanish | horacio.saggion@upf.edu |
| RuAdapt Literature | Russian | Partially Available: https://github.com/Digital-Pushkin-Lab/RuAdapt <br/> Full Dataset: anna.dmitrieva@helsinki.fi |

Specific Citations

Please cite the individual datasets that you use within the MultiSim benchmark as appropriate. BibTeX entries for the datasets are included below.

AdminIT

@inproceedings{miliani-etal-2022-neural,
    title = "Neural Readability Pairwise Ranking for Sentences in {I}talian Administrative Language",
    author = "Miliani, Martina  and
      Auriemma, Serena  and
      Alva-Manchego, Fernando  and
      Lenci, Alessandro",
    booktitle = "Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing",
    month = nov,
    year = "2022",
    address = "Online only",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.aacl-main.63",
    pages = "849--866",
    abstract = "Automatic Readability Assessment aims at assigning a complexity level to a given text, which could help improve the accessibility to information in specific domains, such as the administrative one. In this paper, we investigate the behavior of a Neural Pairwise Ranking Model (NPRM) for sentence-level readability assessment of Italian administrative texts. To deal with data scarcity, we experiment with cross-lingual, cross- and in-domain approaches, and test our models on Admin-It, a new parallel corpus in the Italian administrative language, containing sentences simplified using three different rewriting strategies. We show that NPRMs are effective in zero-shot scenarios ({\textasciitilde}0.78 ranking accuracy), especially with ranking pairs containing simplifications produced by overall rewriting at the sentence-level, and that the best results are obtained by adding in-domain data (achieving perfect performance for such sentence pairs). Finally, we investigate where NPRMs failed, showing that the characteristics of the training data, rather than its size, have a bigger effect on a model{'}s performance.",
}

ASSET

@inproceedings{alva-manchego-etal-2020-asset,
    title = "{ASSET}: {A} Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations",
    author = "Alva-Manchego, Fernando  and
      Martin, Louis  and
      Bordes, Antoine  and
      Scarton, Carolina  and
      Sagot, Beno{\^\i}t  and
      Specia, Lucia",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.424",
    pages = "4668--4679",
}

CBST

@article{10.1007/s10579-017-9407-6,
  title={{The corpus of Basque simplified texts (CBST)}},
  author={Gonzalez-Dios, Itziar and Aranzabe, Mar{\'\i}a Jes{\'u}s and D{\'\i}az de Ilarraza, Arantza},
  journal={Language Resources and Evaluation},
  volume={52},
  number={1},
  pages={217--247},
  year={2018},
  publisher={Springer}
}

CLEAR

@inproceedings{grabar-cardon-2018-clear,
    title = "{CLEAR} {--} Simple Corpus for Medical {F}rench",
    author = "Grabar, Natalia  and
      Cardon, R{\'e}mi",
    booktitle = "Proceedings of the 1st Workshop on Automatic Text Adaptation ({ATA})",
    month = nov,
    year = "2018",
    address = "Tilburg, the Netherlands",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W18-7002",
    doi = "10.18653/v1/W18-7002",
    pages = "3--9",
}

DSim

@inproceedings{klerke-sogaard-2012-dsim,
    title = "{DS}im, a {D}anish Parallel Corpus for Text Simplification",
    author = "Klerke, Sigrid  and
      S{\o}gaard, Anders",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Associ
