Magritte
A repository for MaGRiTTE: a project to Machine Generate Representation of Tabular files with Transformer Encoders.
MaGRiTTE
This repository contains the artifacts and source code for MaGRiTTE: a project to Machine Generate Representation of Tabular files with Transformer Encoders.
To reproduce the results of the paper, please follow the instructions in the following sections.
Set up the environment
To set up the environment, we recommend using a virtual environment to avoid conflicts with other packages. If using conda, run:
conda env create -f environment.yml
to create the environment, and then
conda activate magritte
to activate it. If using pip, run:
pip3 install -r requirements.txt
Use Case: Pollution Data
The input data to reproduce the use case presented in the paper is available in the data/massbay folder.
Two scripts can be used to integrate these files.
- A hand-crafted script can be launched with python3 use_case_manual.py. The script will generate the result file results/massbay/manual_integrated.csv.
- The script that uses MaGRiTTE can be launched with python3 use_case_magritte.py. The script will generate the result file results/massbay/magritte_integrated.csv.
Running the MaGRiTTE version of the script requires downloading the corresponding weights for the model. The weights can be downloaded at url (see also the Model Weights section below).
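Once both scripts have run, their outputs can be compared. The following is a minimal sketch, not part of the repository: the helper names load_rows and compare_results are hypothetical, and it assumes both result files are comma-separated with a single header row. Rows are compared as sets, so row order and duplicate rows are ignored.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV file and return (header, set of data rows as tuples)."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        rows = {tuple(row) for row in reader}
    return header, rows


def compare_results(manual_path, magritte_path):
    """Summarize how two integrated files differ, ignoring row order."""
    manual_header, manual_rows = load_rows(manual_path)
    magritte_header, magritte_rows = load_rows(magritte_path)
    return {
        "same_header": manual_header == magritte_header,
        "shared_rows": len(manual_rows & magritte_rows),
        "only_manual": len(manual_rows - magritte_rows),
        "only_magritte": len(magritte_rows - manual_rows),
    }
```

Calling compare_results("results/massbay/manual_integrated.csv", "results/massbay/magritte_integrated.csv") from the repository root returns a dictionary of counts that shows at a glance whether the two integration approaches agree.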
Datasets
The folder data contains the datasets used for the experiments. The datasets are organized in subfolders, each containing the data for a specific task.
Due to the space policy of GitHub, we publish some of our datasets as compressed tar.gz archives, or on external servers.
To simplify download and extraction, we provide a script that automatically downloads and extracts the data (download_data.sh).
Run the download_data.sh script in the root repository folder to download and extract the data (requires a *nix system with megatools installed).
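If the archives were obtained manually, for instance on a system without megatools, the extraction step can be approximated in Python. This is only a sketch and does not reproduce download_data.sh: the function extract_archives is hypothetical, and it assumes the tar.gz archives sit directly inside the given folder and come from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_archives(folder):
    """Extract every .tar.gz archive found directly in `folder` into that folder."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.tar.gz")):
        # Only extract archives you trust: members are written relative to `folder`.
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extractall(path=folder)
        extracted.append(archive.name)
    return extracted
```

The function returns the names of the archives it unpacked, so an empty list indicates that no tar.gz file was found in the folder.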
The datasets are organized as follows:
- /gittables: contains the data from the gittables dataset, organized for the pretraining task. To download and extract the dataset, run download.sh in the folder.
- dialect_detection: contains the data for the finetuning task for dialect detection. To extract the files automatically, run extract.sh in the folder.
- row_classification: contains the data for the finetuning task for row classification. Every dataset is contained in a separate tar.gz file, and their annotations are contained in the jsonl files train_dev_annotations.jsonl and test_annotations.jsonl.
- estimate: contains the data for the finetuning task for estimate.
- columntype: contains the data for the finetuning task for column type detection. There are three versions of the dataset, one for each scenario: (1) the unprepared folder contains raw files, (2) the autoclean folder contains the files after automated cleaning with MaGRiTTE, and (3) the clean folder contains the ground truth cleaned files. The annotations for the column types of each dataset are contained in csv files in each folder.
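The jsonl annotation files mentioned above can be inspected with a few lines of Python. This sketch assumes only that each non-empty line holds one JSON object, which is what the jsonl format implies; the function name load_annotations is hypothetical and the fields inside each record are not assumed.

```python
import json
from pathlib import Path


def load_annotations(path):
    """Parse a JSON Lines file: one JSON value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                raise ValueError(f"{path}: invalid JSON on line {line_number}") from error
    return records
```

Printing the first element of the returned list is a quick way to discover which keys the annotations actually provide before writing any task-specific code.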
Model Weights
The folder weights contains the weights for the model used for the experiments.
Due to the space policy of GitHub, we publish the weights on external servers.
The main weights can be found at https://mega.nz/file/wJMhDbIa#Zmr23xd67xktcZtpvuu781Om2uwb1PWtQina6a_zwKg.
To simplify download and extraction, we provide a script that automatically downloads and extracts the weights (download_weights.sh).
Run the download_weights.sh script in the root repository folder to download and extract the weights (requires a *nix system with megatools installed).
Alternatively, the weights can be manually downloaded from the links provided in the links.txt file.
The data folder contains the data used for the experiments, arranged in several subfolders.
Training the MaGRiTTE model
Once the environment is set up and the data has been downloaded, the three folders configs, embedder, and training can be used to pretrain/finetune MaGRiTTE on several tasks.
- configs: contains the configuration files in .jsonnet format used to train the models and run the experiments.
- embedder: contains the source code of the MaGRiTTE model, organized in subfolders depending on the pretraining/finetuning tasks.
- training: contains the scripts to train the models.
- data: contains the datasets with the ground truths for the finetuning tasks.
- experiments: contains the scripts to run the experiments for testing purposes after finetuning.
- weights: contains the weights of the models used for the paper.
For example, to finetune the model on the dialect detection task, it is sufficient to run:
python3 training/train_dialect.py
which reads the configuration file stored in configs/dialect.jsonnet.
Each training run saves intermediate artifacts in a corresponding tensorboard folder under results/{pretrain,dialect,rowclass,estimate}. To visualize the state of training, you can run e.g.
tensorboard --logdir results/pretrain/tensorboard
and open the browser at localhost:6006.
The model weights are saved in the folder weights. If you would like to skip the training phase, we provide the weights of the models used for the paper. Refer to the section Model Weights for instructions on how to download the weights.
Running the experiments
The experiments folder contains the scripts to run the experiments for testing purposes after finetuning. The scripts are organized in subfolders depending on the finetuning tasks.
Each script loads the corresponding trained model from the weights folder and runs the experiments on the dev/test set.
To run the experiments, run e.g.
python3 experiments/dialect/magritte.py
The results for each task are saved in the folder results, under a corresponding subfolder.
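To run several experiments in sequence, the commands can be assembled programmatically. The sketch below is not part of the repository: experiment_commands is a hypothetical helper, and it assumes that each task subfolder of experiments contains a script named magritte.py, as experiments/dialect/magritte.py does. Subfolders without such a script are skipped.

```python
import sys
from pathlib import Path


def experiment_commands(experiments_dir="experiments", script_name="magritte.py"):
    """Build one command per task subfolder that contains `script_name`."""
    root = Path(experiments_dir)
    commands = []
    for script in sorted(root.glob(f"*/{script_name}")):
        # Use the current interpreter so the active environment is reused.
        commands.append([sys.executable, script.as_posix()])
    return commands
```

Each returned command can be passed to subprocess.run(command, check=True) from the repository root. Building the list first makes it easy to review which scripts will be launched before any of them runs.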
The folder plots can be used after the experiments to generate the plots for the paper. The plots are generated using Jupyter notebooks, one for each of the tasks, which read the results from the results folder and save the corresponding image in .png format.
