# RuTaBERT
RuTaBERT is a framework for solving column type and property annotation problems based on fine-tuning a pre-trained language model (e.g., BERT) using a large-scale corpus of Russian-language tables.
Model for solving the Column Type Annotation problem with BERT, trained on the RWT-RuTaBERT dataset.
The RWT-RuTaBERT dataset contains 1 441 349 columns from Russian-language Wikipedia tables, with headers matched to 170 DBpedia semantic types. It has a fixed train / test split:
| Split | Columns | Tables | Avg. columns per table |
|-------|-----------| ------- | ---------------------- |
| Test | 115 448 | 55 080 | 2.096 |
| Train | 1 325 901 | 633 426 | 2.093 |
## Table of contents
- [Benchmark](#benchmark)
- [Project structure](#project-structure)
- [Configuration](#configuration)
- [Dataset files](#dataset-files)
- [Training](#training)
- [Testing](#testing)
- [Inference](#inference)
## Benchmark
We trained RuTaBERT with two table serialization strategies (illustrated by the sketch below):
- Neighboring column serialization;
- Multi-column serialization (based on Doduo's approach).
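To make the difference concrete, here is a minimal Python sketch of the two strategies. The function names and the exact token layout are illustrative assumptions, not the repository's actual implementation:

```python
from typing import List

Table = List[List[str]]  # a table as a list of columns, each a list of cell values


def serialize_neighboring(table: Table, target: int) -> str:
    """Neighboring column serialization: the target column is flanked by its
    immediate left/right neighbors (the exact layout is an assumption)."""
    left = " ".join(table[target - 1]) if target > 0 else ""
    right = " ".join(table[target + 1]) if target + 1 < len(table) else ""
    return f"[CLS] {left} [SEP] {' '.join(table[target])} [SEP] {right}"


def serialize_multi_column(table: Table) -> str:
    """Multi-column (Doduo-style) serialization: every column of the table is
    packed into one sequence, one [CLS] per column, so all columns are
    annotated jointly in a single pass."""
    return " ".join("[CLS] " + " ".join(col) for col in table) + " [SEP]"
```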
Benchmark results on the RWT-RuTaBERT dataset:

| Serialization strategy | micro-F1 | macro-F1 | weighted-F1 |
| ---------------------- | -------- | -------- | ----------- |
| Multi-column           | 0.962    | 0.891    | 0.9621      |
| Neighboring column     | 0.964    | 0.904    | 0.9639      |
Training parameters:

| Parameter        | Value                    |
| ---------------- | ------------------------ |
| Batch size       | 32                       |
| Epochs           | 30                       |
| Loss function    | Cross-entropy            |
| GD optimizer     | AdamW(lr=5e-5, eps=1e-8) |
| GPUs             | 4 NVIDIA A100 (80 GB)    |
| Random seed      | 2024                     |
| Validation split | 5%                       |
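As a minimal sketch, the loss and optimizer rows above translate into the following PyTorch setup. The model class here is a generic HuggingFace BERT classifier picked for illustration; RuTaBERT's actual model lives in the `model` directory of this repository:

```python
import torch
from transformers import BertForSequenceClassification

# Illustrative stand-in for RuTaBERT's model; the shortcut name is an
# assumption, set `pretrained_model_name` in config.json as described below.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=170,  # 170 DBpedia semantic types
)

# Loss and optimizer exactly as in the training-parameters table.
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, eps=1e-8)
```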
## Project structure
```
📦RuTaBERT
 ┣ 📂checkpoints
 ┃ ┗ Saved PyTorch models (.pt)
 ┣ 📂data
 ┃ ┣ 📂inference
 ┃ ┃ ┗ Tables for inference (.csv)
 ┃ ┣ 📂test
 ┃ ┃ ┗ Test dataset files (.csv)
 ┃ ┣ 📂train
 ┃ ┃ ┗ Train dataset files (.csv)
 ┃ ┗ Directory for storing dataset files
 ┣ 📂dataset
 ┃ ┗ Dataset wrapper classes, dataloaders
 ┣ 📂logs
 ┃ ┗ Log files (train / test / error)
 ┣ 📂model
 ┃ ┗ Model and metrics
 ┣ 📂trainer
 ┃ ┗ Trainer
 ┣ 📂utils
 ┃ ┗ Helper functions
 ┗ Entry points (train.py, test.py, inference.py), configuration, etc.
```
## Configuration
The model configuration is stored in the file `config.json`.
The configuration parameters are listed below:
| argument | description |
|-------------|-------------|
| num_labels | Number of labels used for classification |
| num_gpu | Number of GPUs to use |
| save_period_in_epochs | Checkpoint saving period, in epochs |
| metrics | Classification metrics to compute |
| pretrained_model_name | BERT shortcut name from HuggingFace |
| table_serialization_type | Method of serializing a table into a sequence |
| batch_size | Batch size |
| num_epochs | Number of training epochs |
| random_seed | Random seed |
| logs_dir | Directory for logging |
| train_log_filename | File name for train logging |
| test_log_filename | File name for test logging |
| start_from_checkpoint | Flag to start training from checkpoint |
| checkpoint_dir | Directory for storing checkpoints of model |
| checkpoint_name | File name of a checkpoint (model state) |
| inference_model_name | File name of a model for inference |
| inference_dir | Directory for storing inference tables .csv |
| dataloader.valid_split | Size of the validation split |
| dataloader.num_workers | Number of dataloader workers |
| dataset.num_rows | Number of rows to read from the dataset; if null, all rows are read |
| dataset.data_dir | Directory for storing train/test/inference files |
| dataset.train_path | Directory for storing train dataset files .csv |
| dataset.test_path | Directory for storing test dataset files .csv |
We recommend changing ONLY these parameters:

- `num_gpu`: any positive integer, or `0` to train / test on the CPU.
- `save_period_in_epochs`: any positive integer, measured in epochs.
- `table_serialization_type`: `"column_wise"` or `"table_wise"`.
- `pretrained_model_name`: a BERT shortcut name from the HuggingFace PyTorch pretrained models.
- `batch_size`: any positive integer.
- `num_epochs`: any positive integer.
- `random_seed`: any integer.
- `start_from_checkpoint`: `true` or `false`.
- `checkpoint_name`: the name of any model saved in the `checkpoints` directory.
- `inference_model_name`: the name of any model saved in the `checkpoints` directory. We recommend using the best models: `model_best_f1_weighted.pt`, `model_best_f1_macro.pt`, `model_best_f1_micro.pt`.
- `dataloader.valid_split`: a real number in the range [0.0, 1.0] (0.0 stands for 0% of the train subset, 0.5 for 50%), or a positive integer denoting a fixed size of the validation subset.
- `dataset.num_rows`: `null` reads all lines in the dataset files; a positive integer limits the number of lines read.
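For orientation, a hypothetical `config.json` excerpt might look as follows. The nesting of the `dataloader.*` and `dataset.*` keys is inferred from the dotted names above, and every value (including the model name) is illustrative; consult the shipped `config.json` for the actual layout:

```json
{
    "num_labels": 170,
    "num_gpu": 1,
    "save_period_in_epochs": 1,
    "pretrained_model_name": "bert-base-multilingual-cased",
    "table_serialization_type": "column_wise",
    "batch_size": 32,
    "num_epochs": 30,
    "random_seed": 2024,
    "start_from_checkpoint": false,
    "dataloader": {
        "valid_split": 0.05,
        "num_workers": 4
    },
    "dataset": {
        "num_rows": null,
        "data_dir": "data/"
    }
}
```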
## Dataset files
Before training / testing the model you need to:

- Download the dataset repository into the same directory as RuTaBERT; example source directory structure:
```
├── src
│   ├── RuTaBERT
│   ├── RuTaBERT-Dataset
│   │   ├── move_dataset.sh
```
- Run the `move_dataset.sh` script from the dataset repository to move the dataset files into the RuTaBERT `data` directory:
```
RuTaBERT-Dataset$ ./move_dataset.sh
```
- Configure the `config.json` file before training.
## Training
RuTaBERT supports training / testing both locally and inside a Docker container. It also supports the Slurm workload manager.
### Locally
- Create a virtual environment:
```
RuTaBERT$ virtualenv venv
```
or
```
RuTaBERT$ python -m virtualenv venv
```
- Install the requirements, then start training and testing:
```
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 train.py 2> logs/error_train.log &&\
    python3 test.py 2> logs/error_test.log
```
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`training_results.csv`, `train.log`, `test.log`, `error_train.log`, `error_test.log`).
### Docker
Requirements:
- Docker installation guide (ubuntu);
- NVIDIA driver;
- NVIDIA Container Toolkit installation guide (ubuntu);
- Make sure all dependencies are installed.
- Build the image:
```
RuTaBERT$ sudo docker build -t rutabert .
```
- Run the image:
```
RuTaBERT$ sudo docker run -d --runtime=nvidia --gpus=all \
    --mount source=rutabert_logs,target=/app/rutabert/logs \
    --mount source=rutabert_checkpoints,target=/app/rutabert/checkpoints \
    rutabert
```
- Move the models and logs out of the container after training / testing:
```
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_checkpoints/_data ./checkpoints
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_logs/_data ./logs
```
- Don't forget to remove the volumes after training! Docker won't do it for you.
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`training_results.csv`, `train.log`, `test.log`, `error_train.log`, `error_test.log`).
### Slurm
- Create a virtual environment:
```
RuTaBERT$ virtualenv venv
```
or
```
RuTaBERT$ python -m virtualenv venv
```
- Run the Slurm script:
```
RuTaBERT$ sbatch run.slurm
```
- Check the job status:
```
RuTaBERT$ squeue
```
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`train.log`, `test.log`, `error_train.log`, `error_test.log`).
## Testing
- Make sure the data is placed in the `data/test` directory.
- (Optional) Download pre-trained models:
```
RuTaBERT$ ./download.sh table_wise
```
or
```
RuTaBERT$ ./download.sh column_wise
```
- Configure which model to test in `config.json`.
- Run:
```
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 test.py 2> logs/error_test.log
```
- Output will be in the `logs/` directory (`test.log`, `error_test.log`).
## Inference
- Make sure the data is placed in the `data/inference` directory.
- (Optional) Download pre-trained models:
```
RuTaBERT$ ./download.sh table_wise
```
or
```
RuTaBERT$ ./download.sh column_wise
```
- Configure which model to use for inference in `config.json`.
- Run:
```
RuTaBERT$ source venv/bin/activate &&\
    pip install -r requirements.txt &&\
    python3 inference.py
```
- Predicted labels will be written to `data/inference/result.csv` (a minimal sketch for reading this file follows below).
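The exact column layout of `result.csv` is not documented here, so the following is only a minimal sketch (with assumed column names) of how the predictions could be inspected:

```python
import pandas as pd

# Load the inference output; "predicted_label" is an assumed column name,
# check the actual header of the result.csv produced by inference.py.
results = pd.read_csv("data/inference/result.csv")

print(results.head())  # first few annotated columns
print(results["predicted_label"].value_counts())  # distribution of predicted types
```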
