ViDeBERTa
ViDeBERTa: A powerful pre-trained language model for Vietnamese, EACL 2023
Paper: https://aclanthology.org/2023.findings-eacl.79.pdf

Contributors
- Tran Cong Dao
- Pham Nhut Huy
- Nguyen Tuan Anh
- Hy Truong Son (Corresponding author / PI)
Main components
<a name="pretraining"></a> Pre-training
Code architecture
- bash: Bash scripts to run the pipeline
- config: model configurations (JSON files)
- dataset: datasets folder (stores both the original txt datasets and the on-disk datasets loaded back via datasets.load_from_disk)
- source: main Python scripts for pre-tokenizing, tokenizer training, and model pre-training
- tokenizer: folder storing the trained tokenizers
Pre-tokenizer
- Split the original txt datasets into train, validation, and test sets (90% / 5% / 5%).
- Use the PyVi library to word-segment the datasets.
- Save the segmented datasets to disk.
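The pre-tokenizer steps above can be sketched as follows. This is a minimal sketch, not the repo's actual script: the file paths, the one-line-per-document corpus layout, and the `split_bounds` helper name are all assumptions.

```python
def split_bounds(n, train_frac=0.90, val_frac=0.05):
    """Index boundaries for a 90% / 5% / 5% train/validation/test split."""
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return train_end, val_end

def pretokenize(corpus_path="dataset/corpus.txt", out_dir="dataset"):
    # Third-party imports kept local so the pure helper above stays importable.
    from datasets import load_dataset   # HuggingFace datasets
    from pyvi import ViTokenizer        # Vietnamese word segmentation

    raw = load_dataset("text", data_files={"all": corpus_path})["all"]
    train_end, val_end = split_bounds(len(raw))
    splits = {
        "train": raw.select(range(0, train_end)),
        "validation": raw.select(range(train_end, val_end)),
        "test": raw.select(range(val_end, len(raw))),
    }
    # PyVi joins the syllables of one Vietnamese word with underscores,
    # e.g. "Hà Nội" -> "Hà_Nội".
    segment = lambda batch: {"text": [ViTokenizer.tokenize(t) for t in batch["text"]]}
    for name, ds in splits.items():
        ds.map(segment, batched=True).save_to_disk(f"{out_dir}/{name}")
```

The saved directories are what the later steps reload with `datasets.load_from_disk`.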
Pre-train_tokenizer
- Load datasets
- Train the tokenizers with SentencePiece models
- Save tokenizers
Pre-train_model
- Load datasets
- Load tokenizers
- Pre-train DeBERTa-v3
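The model-construction part of this step might look like the following. A hedged sketch: DeBERTa-v3 proper is pre-trained with ELECTRA-style replaced token detection, for which `transformers` ships no ready-made trainer, so this only shows building a DebertaV2-architecture model from a config; every size below is an illustrative assumption, not the paper's configuration.

```python
from transformers import DebertaV2Config, DebertaV2ForMaskedLM

def build_model(vocab_size, hidden_size=768, num_layers=12, num_heads=12):
    """Construct an untrained DeBERTa-v2/v3-style encoder for pre-training."""
    cfg = DebertaV2Config(
        vocab_size=vocab_size,          # must match the SentencePiece vocab
        hidden_size=hidden_size,
        num_hidden_layers=num_layers,
        num_attention_heads=num_heads,
        intermediate_size=4 * hidden_size,
    )
    return DebertaV2ForMaskedLM(cfg)
```

With the tokenizer and the on-disk datasets loaded, such a model plugs into transformers' `Trainer` together with a masked-LM data collator.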
<a name="videberta"></a> Model
<a name="finetuning"></a> Fine-tuning
Code architecture
- POS tagging and NER (POS_NER)
- Question Answering (QA and QA2)
- Open-domain Question Answering (OPQA)
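For the POS tagging / NER fine-tuning, one detail worth making explicit is aligning word-level labels to subword tokens: the SentencePiece tokenizer can split a (PyVi-segmented) word into several pieces, and only the first piece should carry the label. A small, hypothetical helper (`-100` is the index PyTorch's cross-entropy loss ignores):

```python
def align_labels_to_tokens(word_labels, word_ids, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    word_ids comes from a fast tokenizer's BatchEncoding.word_ids():
    None for special tokens, otherwise the index of the source word.
    Only the first subword of each word keeps the label; the rest
    (and special tokens) get ignore_index so the loss skips them.
    """
    labels, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            labels.append(ignore_index)
        else:
            labels.append(word_labels[wid])
        prev = wid
    return labels
```

For example, `align_labels_to_tokens([3, 7], [None, 0, 0, 1, None])` returns `[-100, 3, -100, 7, -100]`: the second subword of word 0 and both special tokens are masked out.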
