
<div align="center">🍜VPhoBertTagger</div>

Token classification using Phobert Models for 🇻🇳Vietnamese

<div align="center">🏞️Environments🏞️</div>

Get started in seconds with verified environments. Run the script below to install all dependencies:

```bash
bash ./install_dependencies.sh
```

<div align="center">📚Dataset📚</div>

The input data format of 🍜VPhoBertTagger follows the VLSP-2016 format: four columns separated by a tab character, containing the word, POS tag, chunk tag, and named-entity tag. Each word-segmented word is put on a separate line, and there is an empty line after each sentence. For details, see the sample data in the 'datasets/samples' directory. The table below shows an example Vietnamese sentence from the dataset.

| Word     | POS | Chunk | NER   |
|----------|-----|-------|-------|
| Dương    | Np  | B-NP  | B-PER |
| là       | V   | B-VP  | O     |
| một      | M   | B-NP  | O     |
| chủ      | N   | B-NP  | O     |
| cửa hàng | N   | B-NP  | O     |
| lâu      | A   | B-AP  | O     |
| năm      | N   | B-NP  | O     |
| ở        | E   | B-PP  | O     |
| Hà Nội   | Np  | B-NP  | B-LOC |
| .        | CH  | O     | O     |
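For illustration, a file in this four-column format can be read with a few lines of Python. This is a minimal sketch; `read_vlsp2016` is a hypothetical helper for clarity, not a function from this repository:

```python
def read_vlsp2016(path):
    """Read a tab-separated VLSP-2016 file into a list of sentences,
    where each sentence is a list of (word, pos, chunk, ner) tuples."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                     # empty line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            word, pos, chunk, ner = line.split("\t")
            current.append((word, pos, chunk, ner))
    if current:                              # file may lack a trailing blank line
        sentences.append(current)
    return sentences
```

Note that multi-syllable words such as "cửa hàng" or "Hà Nội" occupy a single line because the text is already word-segmented.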

The dataset must be placed in a directory with the structure below.

```
├── data_dir
|  └── train.txt
|  └── dev.txt
|  └── test.txt
```
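Before launching training, it can help to verify that the three split files are present. The `check_data_dir` helper below is an illustrative sketch, not part of VPhoBertTagger:

```python
from pathlib import Path

def check_data_dir(data_dir):
    """Return the list of expected split files missing from data_dir;
    an empty list means the layout matches the structure above."""
    expected = ("train.txt", "dev.txt", "test.txt")
    return [name for name in expected if not (Path(data_dir) / name).is_file()]
```

An empty return value means the directory is ready to pass as `--data_dir`.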

<div align="center">🎓Training🎓</div>

The commands below fine-tune PhoBERT for the token-classification task. Pre-trained models are downloaded automatically from Hugging Face.

```bash
python main.py train --task vlsp2016 --run_test --data_dir ./datasets/vlsp2016 --model_name_or_path vinai/phobert-base --model_arch softmax --output_dir outputs --max_seq_length 256 --train_batch_size 32 --eval_batch_size 32 --learning_rate 3e-5 --epochs 20 --early_stop 2 --overwrite_data
```

or

```bash
bash ./train.sh
```

Arguments:

  • type (str, *required): The process type to run. Must be one of [train, test, predict, demo].
  • task (str, *optional): Training task, selected from [vlsp2016, vlsp2018_l1, vlsp2018_l2, vlsp2018_join]. Default: vlsp2016.
  • data_dir (Union[str, os.PathLike], *required): The input data directory. Should contain the .csv files (or other data files) for the task.
  • overwrite_data (bool, *optional): Whether to overwrite the split dataset. Default=False.
  • load_weights (Union[str, os.PathLike], *optional): Path to a pre-trained weights file.
  • model_name_or_path (str, *required): Pre-trained model, selected from [vinai/phobert-base, vinai/phobert-large, ...]. Default=vinai/phobert-base.
  • model_arch (str, *required): Token-classification model architecture, selected from [softmax, crf, lstm_crf].
  • output_dir (Union[str, os.PathLike], *required): The output directory where model predictions and checkpoints will be written.
  • max_seq_length (int, *optional): The maximum total input sequence length after WordPiece tokenization. Longer sequences are truncated; shorter ones are padded. Default=190.
  • train_batch_size (int, *optional): Total batch size for training. Default=32.
  • eval_batch_size (int, *optional): Total batch size for evaluation. Default=32.
  • learning_rate (float, *optional): The initial learning rate for Adam. Default=1e-4.
  • classifier_learning_rate (float, *optional): The initial learning rate for the classifier head. Default=5e-4.
  • epochs (float, *optional): Total number of training epochs to perform. Default=100.0.
  • weight_decay (float, *optional): Weight decay, if applied. Default=0.01.
  • adam_epsilon (float, *optional): Epsilon for the Adam optimizer. Default=5e-8.
  • max_grad_norm (float, *optional): Max gradient norm. Default=1.0.
  • early_stop (float, *optional): Number of evaluation steps without improvement before early stopping. Default=10.0.
  • no_cuda (bool, *optional): Whether to disable CUDA even when it is available. Default=False.
  • run_test (bool, *optional): Whether to evaluate the best model on the test set after training. Default=False.
  • seed (int, *optional): Random seed for initialization. Default=42.
  • num_workers (int, *optional): How many subprocesses to use for data loading; 0 means the data is loaded in the main process. Default=0.
  • save_step (int, *optional): The number of steps between model checkpoints. Default=10000.
  • gradient_accumulation_steps (int, *optional): Number of update steps to accumulate before performing a backward/update pass. Default=1.
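As a sketch, the argument surface above maps naturally onto `argparse`. The snippet below mirrors a subset of the options with their documented defaults; it is hypothetical, and the real `main.py` may structure its CLI differently:

```python
import argparse

def build_parser():
    # Illustrative reconstruction of part of the CLI described above.
    p = argparse.ArgumentParser(description="VPhoBertTagger CLI (sketch)")
    p.add_argument("type", choices=["train", "test", "predict", "demo"])
    p.add_argument("--task", default="vlsp2016",
                   choices=["vlsp2016", "vlsp2018_l1", "vlsp2018_l2", "vlsp2018_join"])
    p.add_argument("--data_dir")
    p.add_argument("--model_name_or_path", default="vinai/phobert-base")
    p.add_argument("--model_arch", choices=["softmax", "crf", "lstm_crf"])
    p.add_argument("--output_dir", default="outputs")
    p.add_argument("--max_seq_length", type=int, default=190)
    p.add_argument("--train_batch_size", type=int, default=32)
    p.add_argument("--eval_batch_size", type=int, default=32)
    p.add_argument("--learning_rate", type=float, default=1e-4)
    p.add_argument("--epochs", type=float, default=100.0)
    p.add_argument("--run_test", action="store_true")
    p.add_argument("--overwrite_data", action="store_true")
    return p
```

For example, `build_parser().parse_args(["train", "--model_arch", "crf"])` selects the CRF head with all other options at their defaults.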

<div align="center">📈Tensorboard📈</div>

The command below starts TensorBoard to help you follow the fine-tuning process.

```bash
tensorboard --logdir runs --host 0.0.0.0 --port=6006
```

<div align="center">🥇Performances🥇</div>

All experiments were performed on an RTX 3090 with 24 GB VRAM and an Intel Xeon® E5-2678 v3 CPU with 64 GB RAM, both of which are available for rent on vast.ai. The pre-trained models used for comparison are available on Hugging Face.

VLSP 2016

<details> <summary>Click to expand!</summary> <table align="center"> <thead> <tr> <th align="center" rowspan="2" colspan="2">Model</th> <th align="center" colspan="4">BIO-Metrics</th> <th align="center" colspan="5">NE-Metrics</th> <th align="center" rowspan="2">Log</th> </tr> <tr> <th align="center">Accuracy</th> <th align="center">Precision</th> <th align="center">Recall</th> <th align="center">F1-score</th> <th align="center">Accuracy<br>(w/o 'O')</th> <th align="center">Accuracy</th> <th align="center">Precision</th> <th align="center">Recall</th> <th align="center">F1-score</th> </tr> </thead> <tbody> <tr> <td align="left" rowspan="3">Bert-base-multilingual-cased [1]</td> <td align="left">Softmax</td> <td align="center">0.9905</td> <td align="center">0.9239</td> <td align="center">0.8776</td> <td align="center">0.8984</td> <td align="center">0.9068</td> <td align="center">0.9905</td> <td align="center">0.8938</td> <td align="center">0.8941</td> <td align="center">0.8939</td> <td align="left"> <a href="./statics/confusion_matrix/bert_ml_vlsp2016.png">Matrix</a> <br/> <a href="./statics/train_logs/bert_ml_vlsp2016.log">Log</a> </td> </tr> <tr> <td align="left">CRF</td> <td align="center">0.9903</td> <td align="center">0.9241</td> <td align="center">0.8880</td> <td align="center">0.9048</td> <td align="center">0.9087</td> <td align="center">0.9903</td> <td align="center">0.8951</td> <td align="center">0.8945</td> <td align="center">0.8948</td> <td align="left"> <a href="./statics/confusion_matrix/bert_ml_crf_vlsp2016.png">Matrix</a> <br/> <a href="./statics/train_logs/bert_ml_crf_vlsp2016.log">Log</a> </td> </tr> <tr> <td align="left">LSTM_CRF</td> <td align="center">0.9905</td> <td align="center">0.9183</td> <td align="center">0.8898</td> <td align="center">0.9027</td> <td align="center">0.9178</td> <td align="center">0.9905</td> <td align="center">0.8879</td> <td align="center">0.8992</td> <td align="center">0.8935</td> <td align="left"> <a href="./statics/confusion_matrix/bert_ml_lstm_crf_vlsp2016.png">Matrix</a> <br/> <a href="./statics/train_logs/bert_ml_lstm_crf_vlsp2016.log">Log</a> </td> </tr> <tr> <td align="left" rowspan="3">PhoBert-base [2]</td> <td align="left">Softmax</td> <td align="center">0.9950</td> <td align="center">0.9312</td> <td align="center">0.9404</td> <td align="center">0.9348</td> <td align="center">0.9570</td> <td align="center">0.9950</td> <td align="center">0.9434</td> <td align="center">0.9466</td> <td align="center">0.9450</td> <td align="left"> <a href="./statics/confusion_matrix/phobert_softmax_vlsp2016.png">Matrix</a> <br/> <a href="./statics/train_logs/phobert_softmax_vlsp2016.log">Log</a> </td> </tr> <tr> <td align="left">CRF</td> <td align="center">0.9949</td> <td align="center">0.9497</td> <td align="center">0.9248</td> <td align="center">0.9359</td> <td align="center">0.9525</td> <td align="center">0.9949</td> <td align="center">0.9516</td> <td align="center">0.9456</td> <td align="center">0.948</td> </tr> </tbody> </table> </details>
