CLUB: Catalan Language Understanding Benchmark
Official repository of the Catalan Language Understanding Benchmark (CLUB), used to evaluate NLP models.
Tasks and datasets
The CLUB benchmark consists of the following tasks: Part-of-Speech Tagging (POS), Named Entity Recognition (NER), Text Classification (TC), Semantic Textual Similarity (STS), Question Answering (QA), Textual Entailment (TE), and Text Summarization. For more information, refer to the Hugging Face dataset cards and Zenodo links below; a loading sketch follows the list:
- AnCora (POS):
  - Splits info:
    - train: 13,123 examples
    - validation: 1,709 examples
    - test: 1,846 examples
  - dataset card: https://huggingface.co/datasets/universal_dependencies
  - data source: https://github.com/UniversalDependencies/UD_Catalan-AnCora
- AnCora-ner (NER):
  - Splits info:
    - train: 10,628 examples
    - validation: 1,427 examples
    - test: 1,526 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina/ancora-ca-ner
  - data source: https://zenodo.org/record/4762031
- TeCla (TC):
- STS-ca (STS):
  - Splits info:
    - train: 2,073 examples
    - validation: 500 examples
    - test: 500 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina/sts-ca
  - data source: https://doi.org/10.5281/zenodo.4529183
- ViquiQuAD (QA):
  - Splits info:
    - train: 11,255 examples
    - validation: 1,492 examples
    - test: 1,429 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina
  - data source: https://doi.org/10.5281/zenodo.4562344
- XQuAD (QA):
  - Splits info:
    - test: 1,190 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina/xquad-ca
  - data source: https://doi.org/10.5281/zenodo.4526223
- TECA (Textual Entailment):
  - Splits info:
    - train: 16,930 examples
    - validation: 2,116 examples
    - test: 2,117 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina/teca
  - data source: https://doi.org/10.5281/zenodo.4593271
- CaSum (text summarization):
  - Splits info:
    - train: 197,735 examples
    - validation: 10,000 examples
    - test: 10,000 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina/casum
- VilaSum (text summarization):
  - Splits info:
    - train: 13,843 examples
  - dataset card: https://huggingface.co/datasets/projecte-aina/vilasum
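Each dataset with a Hugging Face card can be pulled directly with the `datasets` library. A minimal sketch (assuming `pip install datasets`; recent releases may additionally require `trust_remote_code=True` for datasets that ship a loading script):

```python
# Minimal sketch: load CLUB datasets with the Hugging Face `datasets` library.
# Recent `datasets` releases may additionally require trust_remote_code=True
# for script-based datasets.
from datasets import load_dataset

ner = load_dataset("projecte-aina/ancora-ca-ner")  # NER splits
sts = load_dataset("projecte-aina/sts-ca")         # STS splits

print(ner)              # split names and sizes
print(ner["train"][0])  # one tokenized, tagged example
```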
BERTa
BERTa is a transformer-based masked language model for the Catalan language, based on the RoBERTa base model.
Pretrained model: https://huggingface.co/PlanTL-GOB-ES/roberta-base-ca
Training corpora: https://doi.org/10.5281/zenodo.4519348
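As a quick sanity check, a minimal sketch (assuming `pip install transformers torch`) that loads the pretrained checkpoint and runs masked-token prediction; the example sentence is illustrative:

```python
# Minimal sketch: load BERTa and predict a masked token.
# The example sentence is illustrative.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="PlanTL-GOB-ES/roberta-base-ca")

# RoBERTa-style models use "<mask>" as the mask token.
for pred in fill_mask("El català és una llengua <mask>."):
    print(f"{pred['score']:.3f}  {pred['sequence']}")
```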
Fine-tune and evaluate on CLUB
To fine-tune and evaluate your model on the CLUB benchmark, run the following commands:
```bash
bash setup_venv.sh
bash run_club.sh <model_name_on_HF>
```
The commands above run fine-tuning and evaluation on CLUB; the results are written to the results-<model_name_on_HF>.json file and the logs to the run_club-<model_name_on_HF>.log file.
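Downstream, the results file is plain JSON. A minimal sketch for inspecting it (the filename is illustrative and should follow the pattern above for your model; the exact key layout depends on the evaluation scripts):

```python
# Minimal sketch: inspect the JSON results file produced by run_club.sh.
# The filename below is illustrative; substitute the actual
# results-<model_name_on_HF>.json file produced by your run.
import json

with open("results-roberta-base-ca.json") as f:
    results = json.load(f)

# The exact key layout depends on the evaluation scripts, so pretty-print it.
print(json.dumps(results, indent=2, ensure_ascii=False))
```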
Fine-tuning and evaluation
For each model we used the same fine-tuning setting across tasks: 10 training epochs, an effective batch size of 32 instances, a maximum input length of 512 tokens (128 tokens for Textual Entailment), and a learning rate of 5e-5. The remaining hyperparameters are set to the default values of the Hugging Face Transformers scripts. We then select the best checkpoint as the one that maximises the task-specific metric on the corresponding validation set, and finally evaluate it on the test set.
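As an illustration of this setting, a sketch of the equivalent Hugging Face `TrainingArguments` (not the repository's actual configuration; `output_dir`, the per-device/accumulation split, and the metric name are assumptions):

```python
# Sketch of the fine-tuning setting described above, expressed as Hugging Face
# TrainingArguments. Not the repository's actual configuration: output_dir,
# the per-device/accumulation split, and the metric name are illustrative.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="club-finetune",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,  # 8 x 4 = effective batch size of 32
    learning_rate=5e-5,
    evaluation_strategy="epoch",    # evaluate on the validation set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,    # keep the checkpoint that maximises...
    metric_for_best_model="f1",     # ...the task-specific metric (F1 here)
)
# The 512-token maximum input length (128 for Textual Entailment) is applied
# at tokenization time, e.g. tokenizer(..., truncation=True, max_length=512).
```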
Results
Evaluation results obtained by running the scripts above with <model_name_on_HF> set to PlanTL-GOB-ES/roberta-base-ca:
| Model       | NER (F1) | POS (F1) | STS (Pearson) | TC (accuracy) | QA (ViquiQuAD) (F1/EM) | QA (XQuAD) (F1/EM) | TE (TECA) (accuracy) |
| ----------- | :------: | :------: | :-----------: | :-----------: | :--------------------: | :----------------: | :------------------: |
| BERTa       | 89.63    | 98.93    | 81.20         | 74.04         | 86.99/73.25            | 67.81/49.43        | 79.12                |
| mBERT       | 86.38    | 98.82    | 76.34         | 70.56         | 86.97/72.22            | 67.15/46.51        | 74.78                |
| XLM-RoBERTa | 87.66    | 98.89    | 75.40         | 71.68         | 85.50/70.47            | 67.10/46.42        | 75.44                |
| WikiBERT-ca | 77.66    | 97.60    | 77.18         | 73.22         | 85.45/70.75            | 65.21/36.60        | x                    |
How to cite
If you use any of these resources (datasets or models) in your work, please cite our latest paper:
@inproceedings{armengol-estape-etal-2021-multilingual,
title = "Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? {A} Comprehensive Assessment for {C}atalan",
author = "Armengol-Estap{\'e}, Jordi and
Carrino, Casimiro Pio and
Rodriguez-Penagos, Carlos and
de Gibert Bonet, Ona and
Armentano-Oller, Carme and
Gonzalez-Agirre, Aitor and
Melero, Maite and
Villegas, Marta",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.437",
doi = "10.18653/v1/2021.findings-acl.437",
pages = "4933--4946",
}