Transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.


Transnormer models are byte-level sequence-to-sequence models for normalizing historical German text.

> [!NOTE]
> • If you are only interested in using a Transnormer model for normalizing your data, see the first section of the README (Public models).
> • If you want to train your own Transnormer model or are interested in the implementation details, see the rest of the README.

Public models

We release Transnormer models and evaluation results on the Hugging Face Hub.

Overview

This is an overview of the published models and their evaluation results in comparison with an identity baseline:

| Model | Test set | Time period | WordAcc | WordAcc (-i) |
| --- | --- | --- | --- | --- |
| Identity baseline | DTA reviEvalCorpus-v1 | 1780-1899 | 91.45 | 93.25 |
| transnormer-19c-beta-v02 | DTA reviEvalCorpus-v1 | 1780-1899 | 98.88 | 99.34 |
| Identity baseline | DTAK-transnormer-v1 (18c) | 1700-1799 | 88.94 | 90.42 |
| transnormer-18-19c-beta-v01 | DTAK-transnormer-v1 (18c) | 1700-1799 | 99.53 | 99.62 |
| Identity baseline | DTAK-transnormer-v1 (19c) | 1800-1899 | 95.00 | 95.31 |
| transnormer-18-19c-beta-v01 | DTAK-transnormer-v1 (19c) | 1800-1899 | 99.46 | 99.53 |

Notes:

  • The metric WordAcc is a harmonized word accuracy (Bawden et al. 2022); more details can be found below.
  • (-i) denotes a case-insensitive version, i.e. deviations in casing between prediction and gold normalization are ignored.
  • For the identity baseline we only replace outdated characters with their modern counterparts (e.g. "ſ" -> "s", "aͤ" -> "ä") and otherwise treat the original as the normalization.
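The identity baseline described above can be sketched as a plain character substitution. The mapping table below is illustrative, not the full replacement table used in the repository:

```python
# A minimal sketch of the identity baseline: only outdated characters are
# replaced with their modern counterparts; otherwise the original text is
# treated as the normalization. The mapping here covers only a few examples.
OUTDATED_CHARS = {
    "ſ": "s",        # long s
    "a\u0364": "ä",  # a with combining Latin small letter e (U+0364)
    "o\u0364": "ö",
    "u\u0364": "ü",
}

def identity_baseline(text: str) -> str:
    """Replace outdated characters; leave everything else unchanged."""
    for old, new in OUTDATED_CHARS.items():
        text = text.replace(old, new)
    return text

print(identity_baseline("Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."))
# → Die Königinn saß auf des Pallastes mittlerer Tribune.
```

Note that the baseline leaves historical spellings like "Königinn" and "Pallastes" untouched, which is why it already reaches a WordAcc above 88 but falls well short of the trained models.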

How to normalize with the models

Transnormer models are easy to use for generating normalizations with the transformers library:

from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]

The folder demo/ contains notebooks and scripts that demonstrate in more detail how to apply the models.

Training and test configurations for public models

To make training and evaluation for public models reproducible, the training_config and test_config files (see documentation below) for these models are published in the folder configs/.


Installation

To reproduce model training and evaluation, install the dependencies and code as described in this section, then refer to the documentation in the section on Usage.

1. Set up environment

1.a On a GPU <!-- omit in toc -->

If you have a GPU available, you should first install and set up a conda environment.

conda create -y --name <environment-name> python=3.9 pip
conda activate <environment-name>

conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html

1.b On a CPU <!-- omit in toc -->

Set up a virtual environment, e.g.:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip

2.a Install package from GitHub

pip install git+https://github.com/ybracke/transnormer.git

2.b Editable install for developers

# Clone repo from GitHub
git clone git@github.com:ybracke/transnormer.git
cd ./transnormer
# install package in editable mode
pip install -e .
# install development requirements
pip install -r requirements-dev.txt

3. Requirements

To train a Transnormer model you need the following resources:

Usage

Quickstart

  1. Prepare environment (see below)
  2. Prepare data (see below)

Quickstart Training

  1. Specify the training parameters in the training config file
  2. Run training script: $ python3 src/transnormer/models/model_train.py.

For more details, see below.

Quickstart Generation and Evaluation

  1. Specify the generation parameters in the test config file
  2. Specify file paths in pred_eval.sh, then run: bash pred_eval.sh

For more details, see sections on Generation and Evaluation.

Preparation 1: Virtual environment

venv <!-- omit in toc -->

source .venv/bin/activate

Conda <!-- omit in toc -->

conda activate <environment-name>
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
  • If you have multiple GPUs available and want to use only one (e.g. the GPU with index 1):
    • export CUDA_VISIBLE_DEVICES=1
    • Set gpu = "cuda:0" in config file
  • export TOKENIZERS_PARALLELISM=false to get rid of parallelism warning messages

Preparation 2: Data preparation

The training and test data must be in JSONL format, where each record is a parallel training sample (e.g. a sentence), stored as one JSON object per line. Each record must contain at least the following fields:

{
    "orig" : "Eyn Theylſtueck", // original spelling
    "norm" : "Ein Teilstück"    // normalized spelling
}

See repository transnormer-data for more information.
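A minimal sketch (not part of the repository) of loading such parallel samples, keeping only records that carry both required fields:

```python
import json

def load_parallel_jsonl(lines):
    """Parse JSONL lines into records with at least "orig" and "norm" fields.

    Blank lines and records missing either field are skipped.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "orig" in record and "norm" in record:
            records.append(record)
    return records

sample = '{"orig": "Eyn Theylſtueck", "norm": "Ein Teilstück"}'
print(load_parallel_jsonl([sample]))
# → [{'orig': 'Eyn Theylſtueck', 'norm': 'Ein Teilstück'}]
```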

1. Model training

  1. Specify the training parameters in the config file

  2. Run training script: $ python3 src/transnormer/models/model_train.py. Training can take multiple hours, so consider using nohup: $ nohup nice python3 src/transnormer/models/model_train.py &

Training config file

The file training_config.toml specifies the training configuration, e.g. training data, base model, and training hyperparameters. Update the file before fine-tuning a model.

The following paragraphs provide detailed explanations of each section and parameter within the configuration file to facilitate effective model training.

1. Select GPU <!-- omit in toc -->

The gpu parameter sets the GPU device used for training. You can set it to the desired GPU identifier, such as "cuda:0", ensuring compatibility with the CUDA environment. Remember to set the appropriate CUDA visible devices beforehand, if required (e.g. export CUDA_VISIBLE_DEVICES=1 to use only the GPU with index 1).

2. Random Seed (Reproducibility) <!-- omit in toc -->

The random_seed parameter defines a fixed random seed (42 in the default settings) to ensure reproducibility of the training process.

3. Data Paths and Subset Sizes <!-- omit in toc -->

The [data] section references the training and evaluation data. paths_train and paths_validation are lists of paths to JSONL files or to directories that only contain JSONL files. See data preparation for more information on the data format. Additionally, n_examples_train and n_examples_validation specify the number of examples to be used from each dataset split during training.

Both paths_{split} and n_examples_{split} are lists. The number at n_examples_{split}[i] refers to the number of examples drawn from the file(s) at paths_{split}[i].
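Putting the parameters above together, the top of a training config file might look like the following. The parameter names (gpu, random_seed, paths_train, paths_validation, n_examples_train, n_examples_validation) come from this documentation; the paths and counts are illustrative placeholders, not the published settings (see the folder configs/ for the actual files):

```toml
gpu = "cuda:0"
random_seed = 42

[data]
paths_train = ["data/train/dtak-18c.jsonl", "data/train/dtak-19c.jsonl"]
paths_validation = ["data/validation/"]
n_examples_train = [100000, 100000]   # examples taken from each training path
n_examples_validation = [5000]        # examples taken from the validation path
```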
