LEXTREME

This repository provides scripts for evaluating NLP models on the LEXTREME benchmark, a set of diverse multilingual tasks in legal NLP

Install / Use

/learn @JoelNiklaus/LEXTREME
README

LEXTREME: A Multi-Lingual Benchmark Dataset for Legal Language Understanding

Lately, propelled by the phenomenal advances around the transformer architecture, the legal NLP field has enjoyed spectacular growth. To measure progress, well-curated and challenging benchmarks are crucial. However, most benchmarks are English-only, and in legal NLP specifically no multilingual benchmark is available yet. Additionally, many benchmarks are saturated, with the best models clearly outperforming the best humans and achieving near-perfect scores. We survey the legal NLP literature and select 11 datasets covering 24 languages, creating LEXTREME. To provide a fair comparison, we propose two aggregate scores, one based on the datasets and one on the languages. The best baseline (XLM-R large) achieves both a dataset aggregate score and a language aggregate score of 61.3. This indicates that LEXTREME is still very challenging and leaves ample room for improvement. To make it easy for researchers and practitioners to use, we release LEXTREME on Hugging Face together with all the code required to evaluate models and a public Weights and Biases project with all the runs.

Leaderboard

LEXTREME Scores

The final LEXTREME score is computed as the harmonic mean of the dataset aggregate score and the language aggregate score, thus weighting datasets and languages equally and promoting model fairness and robustness, following Shavrina and Malykh (2021) and Chalkidis et al.
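Concretely, the final score is a plain harmonic mean of the two aggregates. A minimal sketch (the 61.3 values are the XLM-R large aggregates reported above):

```python
from statistics import harmonic_mean

def lextreme_score(dataset_agg: float, language_agg: float) -> float:
    """Final LEXTREME score: harmonic mean of the two aggregate scores."""
    return harmonic_mean([dataset_agg, language_agg])

# XLM-R large reaches 61.3 on both aggregates, so its final score is also 61.3.
print(round(lextreme_score(61.3, 61.3), 1))  # 61.3
```

The harmonic mean penalizes imbalance: a model that is strong on datasets but weak on languages (or vice versa) scores lower than its arithmetic mean would suggest.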

We evaluated both multilingual and monolingual models. The multilingual models are the following:

| Model | Source | Parameters | Vocabulary Size | Pretraining Specs | Pretraining Corpora | Pretraining Languages |
| --- | --- | --- | --- | --- | --- | --- |
| MiniLM | Wang et al. (2020) | 118M | 250K | 1M steps / BS 256 | 2.5TB CC100 | 100 |
| DistilBERT | Sanh et al. (2019) | 135M | 120K | BS up to 4000 | Wikipedia | 104 |
| mDeBERTa-v3 | He et al. (2020, 2021) | 278M | 128K | 500K steps / BS 8192 | 2.5TB CC100 | 100 |
| XLM-R base | Conneau et al. (2020) | 278M | 250K | 1.5M steps / BS 8192 | 2.5TB CC100 | 100 |
| XLM-R large | Conneau et al. (2020) | 560M | 250K | 1.5M steps / BS 8192 | 2.5TB CC100 | 100 |
| Legal-XLM-R-base | Niklaus et al. (2023) | 184M | 128K | 1M steps / BS 512 | 689GB MLP | 24 |
| Legal-XLM-R-large | Niklaus et al. (2023) | 435M | 128K | 500K steps / BS 512 | 689GB MLP | 24 |
| Legal-XLM-LF-base | Niklaus et al. (2023) | 208M | 128K | 50K steps / BS 512 | 689GB MLP | 24 |
| Legal-mono-R-base | Niklaus et al. (2023) | 111M | 32K | 200K steps / BS 512 | 689GB MLP | 1 |
| Legal-mono-R-large | Niklaus et al. (2023) | 337M | 32K | 500K steps / BS 512 | 689GB MLP | 1 |

In the following, we report results for the multilingual models.

Dataset aggregate scores for multilingual models. The best scores are in bold.

We compute the dataset aggregate score by taking the successive harmonic mean of (1.) the languages inside the configurations (e.g., de,fr,it within SJP), (2.) the configurations inside the datasets (e.g., OTS-UL, OTS-CT within OTS), and (3.) the datasets inside LEXTREME (BCD, GAM, etc.).
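The three-level aggregation can be sketched as follows. The scores and the dataset/configuration nesting below are illustrative only, not the official numbers; datasets with a single configuration are modeled as one entry:

```python
from statistics import harmonic_mean

# Illustrative structure: dataset -> configuration -> {language: score}
scores = {
    "SJP": {"default": {"de": 70.0, "fr": 68.0, "it": 69.5}},
    "OTS": {"OTS-UL": {"en": 60.0}, "OTS-CT": {"en": 55.0}},
}

def dataset_aggregate(scores: dict) -> float:
    """Successive harmonic mean: languages -> configurations -> datasets."""
    per_dataset = []
    for configs in scores.values():
        # (1) harmonic mean over the languages inside each configuration
        config_scores = [harmonic_mean(list(langs.values()))
                         for langs in configs.values()]
        # (2) harmonic mean over the configurations inside the dataset
        per_dataset.append(harmonic_mean(config_scores))
    # (3) harmonic mean over the datasets
    return harmonic_mean(per_dataset)
```

Because the harmonic mean is applied at every level, a single weak language or configuration drags the aggregate down more than an arithmetic mean would.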

| Model | BCD | GAM | GLC | SJP | OTS | C19 | MEU | GLN | LNR | LNB | MAP | Agg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiniLM | 53.0 | 73.3 | 42.1 | 67.7 | 44.1 | 5.0 | 29.7 | 74.0 | 84.5 | 93.6 | 57.8 | 56.8 |
| DistilBERT | 54.5 | 69.5 | 62.8 | 66.8 | 56.1 | 25.9 | 36.4 | 71.0 | 85.3 | 89.6 | 60.8 | 61.7 |
| mDeBERTa-v3 | 60.2 | 71.3 | 52.2 | 69.1 | 66.5 | 29.7 | 37.4 | 73.3 | 85.1 | 94.8 | 67.2 | 64.3 |
| XLM-R-base | 63.5 | 72.0 | 57.4 | 69.3 | 67.8 | 26.4 | 33.3 | **74.6** | **85.8** | 94.1 | 62.0 | 64.2 |
| XLM-R-large | 58.7 | 73.1 | 57.4 | 69.0 | **75.0** | 29.0 | **42.2** | 74.1 | 85.0 | **95.3** | 68.0 | 66.1 |
| Legal-XLM-R-base | 62.5 | 72.4 | 68.9 | 70.2 | 70.8 | 30.7 | 38.6 | 73.6 | 84.1 | 94.1 | **69.2** | 66.8 |
| Legal-XLM-R-large | 63.3 | 73.9 | 59.3 | 70.1 | 74.9 | **34.6** | 39.7 | 73.1 | 83.9 | 94.6 | 67.3 | 66.8 |
| Legal-XLM-LF-base | **72.4** | **74.6** | **70.2** | **72.9** | 69.8 | 26.3 | 33.1 | 72.1 | 84.7 | 93.3 | 66.2 | **66.9** |

Language aggregate scores for multilingual models. The best scores are in bold.

We compute the language aggregate score by taking the successive harmonic mean of (1.) the configurations inside the datasets, (2.) the datasets for the given language (e.g., MAP and MEU for lv), and (3.) the languages inside LEXTREME (bg, cs, etc.).

| Model | bg | cs | da | de | el | en | es | et | fi | fr | ga | hr | hu | it | lt | lv | mt | nl | pl | pt | ro | sk | sl | sv | Agg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiniLM | 52.7 | 48.6 | 42.8 | 54.6 | 50.3 | 34.3 | 40.1 | 46.3 | 42.2 | 39.0 | 42.8 | 29.7 | 29.6 | 40.5 | 44.2 | 40.8 | 40.8 | 29.5 | 22.7 | 61.6 | 59.6 | 44.3 | 30.0 | 43.4 | 40.5 |
| DistilBERT | 54.2 | 48.6 | 46.0 | 60.1 | 58.8 | 48.0 | 50.0 | 48.8 | 49.6 | 47.9 | 51.4 | 35.9 | 31.2 | 50.1 | 51.9 | 41.5 | 44.4 | 34.6 | 34.5 | 63.2 | 63.8 | 51.3 | 36.2 | 50.1 | 46.7 |
| mDeBERTa-v3 | 54.1 | 51.3 | 51.7 | 63.6 | 57.7 | 50.7 | 53.3 | 50.8 | 54.6 | 49.2 | 54.9 | 37.4 | 37.5 | 55.1 | 53.9 | 47.0 | 52.5 | 42.1 | 41.0 | 65.7 | 65.3 | 55.4 | 37.5 | 56.1 | 50.5 |
| XLM-R-base | 56.4 | 48.3 | 48.3 | 60.6 | 57.6 | 50.1 | 47.2 | 46.7 | 48.6 | 49.4 | 50.1 | 33.6 | 32.8 | 53.4 | 50.0 | 44.1 | 43.8 | 35.2 | 41.3 | 66.1 | 63.7 | 45.3 | 33.7 | 50.0 | 47.1 |
| XLM-R-large

View on GitHub

GitHub Stars: 23
Category: Legal
Updated: 6 months ago
Forks: 4

Languages

Python

Security Score

67/100

Audited on Sep 27, 2025

No findings