# RuTaBERT
RuTaBERT is a framework for solving column type and property annotation problems based on fine-tuning a pre-trained language model (e.g., BERT) using a large-scale corpus of Russian-language tables.
Model for solving the Column Type Annotation problem with BERT, trained on the RWT-RuTaBERT dataset.
The RWT-RuTaBERT dataset contains 1 441 349 columns from Russian-language Wikipedia tables, with headers matched to 170 DBpedia semantic types. It has a fixed train / test split:
| Split | Columns | Tables | Avg. columns per table |
|-------|-----------| ------- | ---------------------- |
| Test | 115 448 | 55 080 | 2.096 |
| Train | 1 325 901 | 633 426 | 2.093 |
## Table of contents
- [Benchmark](#benchmark)
- [Project structure](#project-structure)
- [Configuration](#configuration)
- [Dataset files](#dataset-files)
- [Training](#training)
- [Testing](#testing)
- [Inference](#inference)
## Benchmark
We trained RuTaBERT with two table serialization strategies (illustrated by the sketch below):
- Neighboring column serialization;
- Multi-column serialization (based on Doduo's approach).
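To make the difference concrete, here is a minimal Python sketch of the two strategies. The function names and the exact token layout are illustrative assumptions, not the repository's actual implementation:

```python
from typing import List

Table = List[List[str]]  # a table as a list of columns, each a list of cell values


def serialize_neighboring(table: Table, target: int) -> str:
    """Neighboring column serialization: the target column is flanked by its
    immediate left/right neighbors (the exact layout is an assumption)."""
    left = " ".join(table[target - 1]) if target > 0 else ""
    right = " ".join(table[target + 1]) if target + 1 < len(table) else ""
    return f"[CLS] {left} [SEP] {' '.join(table[target])} [SEP] {right}"


def serialize_multi_column(table: Table) -> str:
    """Multi-column (Doduo-style) serialization: every column of the table is
    packed into one sequence, one [CLS] per column, so all columns are
    annotated jointly in a single pass."""
    return " ".join("[CLS] " + " ".join(col) for col in table) + " [SEP]"
```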
Benchmark results on the RWT-RuTaBERT dataset:

| Serialization strategy | micro-F1 | macro-F1 | weighted-F1 |
| ---------------------- | -------- | -------- | ----------- |
| Multi-column           | 0.962    | 0.891    | 0.9621      |
| Neighboring column     | 0.964    | 0.904    | 0.9639      |
Training parameters:

| Parameter        | Value                    |
| ---------------- | ------------------------ |
| Batch size       | 32                       |
| Epochs           | 30                       |
| Loss function    | Cross-entropy            |
| GD optimizer     | AdamW(lr=5e-5, eps=1e-8) |
| GPUs             | 4 NVIDIA A100 (80 GB)    |
| Random seed      | 2024                     |
| Validation split | 5%                       |
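As a minimal sketch, the loss and optimizer rows above translate into the following PyTorch setup. The model class here is a generic HuggingFace BERT classifier picked for illustration; RuTaBERT's actual model lives in the `model` directory of this repository:

```python
import torch
from transformers import BertForSequenceClassification

# Illustrative stand-in for RuTaBERT's model; the shortcut name is an
# assumption, set `pretrained_model_name` in config.json as described below.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=170,  # 170 DBpedia semantic types
)

# Loss and optimizer exactly as in the training-parameters table.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-8)
```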
## Project structure
```
📦RuTaBERT
 ┣ 📂checkpoints
 ┃ ┗ Saved PyTorch models (.pt)
 ┣ 📂data
 ┃ ┣ 📂inference
 ┃ ┃ ┗ Tables for inference (.csv)
 ┃ ┣ 📂test
 ┃ ┃ ┗ Test dataset files (.csv)
 ┃ ┣ 📂train
 ┃ ┃ ┗ Train dataset files (.csv)
 ┃ ┗ Directory for storing dataset files
 ┣ 📂dataset
 ┃ ┗ Dataset wrapper classes, dataloaders
 ┣ 📂logs
 ┃ ┗ Log files (train / test / error)
 ┣ 📂model
 ┃ ┗ Model and metrics
 ┣ 📂trainer
 ┃ ┗ Trainer
 ┣ 📂utils
 ┃ ┗ Helper functions
 ┗ Entry points (train.py, test.py, inference.py), configuration, etc.
```
## Configuration
The model configuration is stored in the file `config.json`.
The configuration parameters are listed below:
| argument | description |
|-------------|-------------|
| num_labels | Number of labels used for classification |
| num_gpu | Number of GPUs to use |
| save_period_in_epochs | Checkpoint saving period, in epochs |
| metrics | Classification metrics to compute |
| pretrained_model_name | BERT shortcut name from HuggingFace |
| table_serialization_type | Method of serializing a table into a sequence |
| batch_size | Batch size |
| num_epochs | Number of training epochs |
| random_seed | Random seed |
| logs_dir | Directory for logging |
| train_log_filename | File name for train logging |
| test_log_filename | File name for test logging |
| start_from_checkpoint | Flag to start training from checkpoint |
| checkpoint_dir | Directory for storing checkpoints of model |
| checkpoint_name | File name of a checkpoint (model state) |
| inference_model_name | File name of a model for inference |
| inference_dir | Directory for storing inference tables .csv |
| dataloader.valid_split | Size of the validation split |
| dataloader.num_workers | Number of dataloader workers |
| dataset.num_rows | Number of rows to read from the dataset; if null, all rows are read |
| dataset.data_dir | Directory for storing train/test/inference files |
| dataset.train_path | Directory for storing train dataset files .csv |
| dataset.test_path | Directory for storing test dataset files .csv |
We recommend changing ONLY these parameters:

- `num_gpu`: any positive integer, or `0` to train / test on the CPU.
- `save_period_in_epochs`: any positive integer, measured in epochs.
- `table_serialization_type`: `"column_wise"` or `"table_wise"`.
- `pretrained_model_name`: a BERT shortcut name from the HuggingFace PyTorch pretrained models.
- `batch_size`: any positive integer.
- `num_epochs`: any positive integer.
- `random_seed`: any integer.
- `start_from_checkpoint`: `true` or `false`.
- `checkpoint_name`: the name of any model saved in the `checkpoints` directory.
- `inference_model_name`: the name of any model saved in the `checkpoints` directory. We recommend using the best models: `model_best_f1_weighted.pt`, `model_best_f1_macro.pt`, `model_best_f1_micro.pt`.
- `dataloader.valid_split`: a real number in the range [0.0, 1.0] (0.0 stands for 0% of the train subset, 0.5 for 50%), or a positive integer denoting a fixed size of the validation subset.
- `dataset.num_rows`: `null` reads all lines in the dataset files; a positive integer limits the number of lines read.
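For orientation, a hypothetical `config.json` excerpt might look as follows. The nesting of the `dataloader.*` and `dataset.*` keys is inferred from the dotted names above, and every value (including the model name) is illustrative; consult the shipped `config.json` for the actual layout:

```json
{
    "num_labels": 170,
    "num_gpu": 1,
    "save_period_in_epochs": 1,
    "pretrained_model_name": "bert-base-multilingual-cased",
    "table_serialization_type": "column_wise",
    "batch_size": 32,
    "num_epochs": 30,
    "random_seed": 2024,
    "start_from_checkpoint": false,
    "dataloader": {
        "valid_split": 0.05,
        "num_workers": 4
    },
    "dataset": {
        "num_rows": null,
        "data_dir": "data/"
    }
}
```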
## Dataset files
Before training / testing the model you need to:

- Download the dataset repository into the same directory as RuTaBERT; example source directory structure:
```
├── src
│   ├── RuTaBERT
│   ├── RuTaBERT-Dataset
│   │   ├── move_dataset.sh
```
- Run the `move_dataset.sh` script from the dataset repository to move the dataset files into the RuTaBERT `data` directory:
```
RuTaBERT-Dataset$ ./move_dataset.sh
```
- Configure the `config.json` file before training.
## Training
RuTaBERT supports training / testing both locally and inside a Docker container. It also supports the Slurm workload manager.
### Locally
- Create a virtual environment:
```
RuTaBERT$ virtualenv venv
```
or
```
RuTaBERT$ python -m virtualenv venv
```
- Install the requirements, then start training and testing:
```
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 train.py 2> logs/error_train.log &&\
    python3 test.py 2> logs/error_test.log
```
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`training_results.csv`, `train.log`, `test.log`, `error_train.log`, `error_test.log`).
### Docker
Requirements:
- Docker installation guide (ubuntu);
- NVIDIA driver;
- NVIDIA Container Toolkit installation guide (ubuntu);
- Make sure all dependencies are installed.
- Build the image:
```
RuTaBERT$ sudo docker build -t rutabert .
```
- Run the image:
```
RuTaBERT$ sudo docker run -d --runtime=nvidia --gpus=all \
    --mount source=rutabert_logs,target=/app/rutabert/logs \
    --mount source=rutabert_checkpoints,target=/app/rutabert/checkpoints \
    rutabert
```
- Move the models and logs out of the container after training / testing:
```
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_checkpoints/_data ./checkpoints
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_logs/_data ./logs
```
- Don't forget to remove the volumes after training! Docker won't do it for you.
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`training_results.csv`, `train.log`, `test.log`, `error_train.log`, `error_test.log`).
### Slurm
- Create a virtual environment:
```
RuTaBERT$ virtualenv venv
```
or
```
RuTaBERT$ python -m virtualenv venv
```
- Run the Slurm script:
```
RuTaBERT$ sbatch run.slurm
```
- Check the job status:
```
RuTaBERT$ squeue
```
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`train.log`, `test.log`, `error_train.log`, `error_test.log`).
## Testing
- Make sure the data is placed in the `data/test` directory.
- (Optional) Download pre-trained models:
```
RuTaBERT$ ./download.sh table_wise
```
or
```
RuTaBERT$ ./download.sh column_wise
```
- Configure which model to test in `config.json`.
- Run:
```
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 test.py 2> logs/error_test.log
```
- Output will be in the `logs/` directory (`test.log`, `error_test.log`).
## Inference
- Make sure the data is placed in the `data/inference` directory.
- (Optional) Download pre-trained models:
```
RuTaBERT$ ./download.sh table_wise
```
or
```
RuTaBERT$ ./download.sh column_wise
```
- Configure which model to use for inference in `config.json`.
- Run:
```
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 inference.py
```
- Predicted labels will be written to `data/inference/result.csv` (a minimal sketch for reading this file follows below).
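The exact column layout of `result.csv` is not documented here, so the following is only a minimal sketch (with assumed column names) of how the predictions could be inspected:

```python
import pandas as pd

# Load the inference output; "predicted_label" is an assumed column name,
# check the actual header of the result.csv produced by inference.py.
results = pd.read_csv("data/inference/result.csv")

print(results.head())  # first few annotated columns
print(results["predicted_label"].value_counts())  # distribution of predicted types
```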
