# Transnormer
Transnormer models are byte-level sequence-to-sequence models for normalizing historical German text.
> [!NOTE]
> - If you are only interested in using a Transnormer model for normalizing your data, see the first section of the README (Public models).
> - If you want to train your own Transnormer model or are interested in the implementation details, see the rest of the README.
## Public models
We release Transnormer models and evaluation results on the Hugging Face Hub.
### Overview
This is an overview of the published models and their evaluation results in comparison with an identity baseline:
| Model | Test set | Time period | WordAcc | WordAcc (-i) |
| --- | --- | --- | --- | --- |
| Identity baseline | DTA reviEvalCorpus-v1 | 1780-1899 | 91.45 | 93.25 |
| transnormer-19c-beta-v02 | DTA reviEvalCorpus-v1 | 1780-1899 | 98.88 | 99.34 |
| Identity baseline | DTAK-transnormer-v1 (18c) | 1700-1799 | 88.94 | 90.42 |
| transnormer-18-19c-beta-v01 | DTAK-transnormer-v1 (18c) | 1700-1799 | 99.53 | 99.62 |
| Identity baseline | DTAK-transnormer-v1 (19c) | 1800-1899 | 95.00 | 95.31 |
| transnormer-18-19c-beta-v01 | DTAK-transnormer-v1 (19c) | 1800-1899 | 99.46 | 99.53 |
Notes:
- The metric WordAcc is a harmonized word accuracy (Bawden et al. 2022); more details can be found below.
- (-i) denotes a case-insensitive version (i.e. deviations in casing between prediction and gold normalization are ignored).
- For the identity baseline, we only replace outdated characters with their modern counterparts (e.g. "ſ" -> "s", "aͤ" -> "ä") and otherwise treat the original text as the normalization.
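The baseline's replacement step can be sketched in a few lines of Python. Note that the replacement table below is illustrative and not the full character set used for the published results:

```python
# Minimal sketch of the identity baseline: replace outdated characters
# with their modern counterparts, otherwise leave the text unchanged.
# This mapping is an illustrative subset, not the full table.
REPLACEMENTS = {
    "ſ": "s",   # long s
    "aͤ": "ä",   # a with combining small e (U+0364)
    "oͤ": "ö",
    "uͤ": "ü",
}

def identity_baseline(text: str) -> str:
    """Return the input with only the character replacements applied."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

print(identity_baseline("Eyn Theylſtueck"))  # -> Eyn Theylstueck
```

Everything the table does not cover (e.g. outdated spellings like "Eyn") is passed through unchanged, which is why the baseline's WordAcc stays well below the models'.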
### How to normalize with the models
Transnormer models are easy to use for generating normalizations with the `transformers` library:
```python
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
The folder `demo/` contains notebooks and scripts that demonstrate in more detail how to apply the models.
### Training and test configurations for public models

To make training and evaluation for public models reproducible, the `training_config` and `test_config` files (see documentation below) for these models are published in the folder `configs/`.
## Installation
In order to reproduce model training and evaluation, install the dependencies and code as described in this section and refer to the documentation in the section on Usage.
### 1. Set up environment

#### 1.a On a GPU <!-- omit in toc -->

If you have a GPU available, you should first install and set up a conda environment.

```bash
conda install -y pip
conda create -y --name <environment-name> python=3.9 pip
conda activate <environment-name>
conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
```
#### 1.b On a CPU <!-- omit in toc -->

Set up a virtual environment, e.g.:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```
### 2.a Install package from GitHub

```bash
pip install git+https://github.com/ybracke/transnormer.git
```
### 2.b Editable install for developers

```bash
# Clone repo from GitHub
git clone git@github.com:ybracke/transnormer.git
cd ./transnormer
# Install package in editable mode
pip install -e .
# Install development requirements
pip install -r requirements-dev.txt
```
### 3. Requirements
To train a Transnormer model you need the following resources:
- A pre-trained encoder-decoder model (available on the Hugging Face Model Hub)
- A parallel corpus of historical language documents with (gold-)normalized labels (also available on the Hugging Face Hub)
- A file specifying the training configurations, see [Training config file](#training-config-file)
## Usage

### Quickstart

#### Quickstart Training

1. Specify the training parameters in the [training config file](#training-config-file)
2. Run the training script:

```bash
python3 src/transnormer/models/model_train.py
```

For more details, see below.
#### Quickstart Generation and Evaluation

1. Specify the generation parameters in the test config file
2. Specify file paths in `pred_eval.sh`, then run:

```bash
bash pred_eval.sh
```

For more details, see the sections on Generation and Evaluation.
### Preparation 1: Virtual environment

#### venv <!-- omit in toc -->

```bash
source .venv/bin/activate
```

#### Conda <!-- omit in toc -->

```bash
conda activate <environment-name>
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
```

- If you have multiple GPUs available and want to use only one (e.g. the GPU with index 1): `export CUDA_VISIBLE_DEVICES=1`
- Set `gpu = "cuda:0"` in the config file
- Run `export TOKENIZERS_PARALLELISM=false` to get rid of parallelism warning messages
### Preparation 2: Data preparation
The training and test data must be in JSONL format, where each record is a parallel training sample, e.g. a sentence. The records in the files must at least have the following format:
```jsonc
{
  "orig": "Eyn Theylſtueck", // original spelling
  "norm": "Ein Teilstück"    // normalized spelling
}
```
See the repository `transnormer-data` for more information.
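A file in this format can be produced with the standard `json` module. A minimal sketch, where the file name and sample sentences are made up for illustration:

```python
import json

# Hypothetical parallel samples in the expected format:
# one {"orig": ..., "norm": ...} record per line of the JSONL file.
samples = [
    {"orig": "Eyn Theylſtueck", "norm": "Ein Teilstück"},
    {"orig": "Die Königinn ſaß", "norm": "Die Königin saß"},
]

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in samples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back, line by line.
with open("train.jsonl", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

print(data[0]["norm"])  # -> Ein Teilstück
```

Note that strict JSON does not allow comments; the `//` annotations in the example above are for explanation only and must not appear in actual data files.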
### 1. Model training

1. Specify the training parameters in the [config file](#training-config-file)
2. Run the training script:

```bash
python3 src/transnormer/models/model_train.py
```

Training can take multiple hours, so consider using `nohup`:

```bash
nohup nice python3 src/transnormer/models/model_train.py &
```
### Training config file

The file `training_config.toml` specifies the training configurations, e.g. training data, base model, and training hyperparameters. Update the file before fine-tuning a model.
The following paragraphs provide detailed explanations of each section and parameter within the configuration file to facilitate effective model training.
#### 1. Select GPU <!-- omit in toc -->

The `gpu` parameter sets the GPU device used for training. You can set it to the desired GPU identifier, such as `"cuda:0"`, ensuring compatibility with the CUDA environment. Remember to set the appropriate CUDA visible devices beforehand, if required (e.g. `export CUDA_VISIBLE_DEVICES=1` to use only the GPU with index 1).
#### 2. Random Seed (Reproducibility) <!-- omit in toc -->

The `random_seed` parameter defines a fixed random seed (42 in the default settings) to ensure reproducibility of the training process.
#### 3. Data Paths and Subset Sizes <!-- omit in toc -->

The `[data]` section references the training and evaluation data. `paths_train` and `paths_validation` are lists of paths to JSONL files or to directories that only contain JSONL files. See [data preparation](#preparation-2-data-preparation) for more information on the data format. Additionally, `n_examples_train` and `n_examples_validation` specify the number of examples to be used from each dataset split during training.

Both `paths_{split}` and `n_examples_{split}` are lists: the number at `n_examples_{split}[i]` refers to the number of examples drawn from the data at `paths_{split}[i]`.
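Putting the parameters described so far together, a `training_config.toml` might look like the following sketch. All paths and sizes are placeholder values, and the actual config file may contain further parameters (e.g. for the base model and training hyperparameters):

```toml
# Illustrative sketch of a training_config.toml; values are placeholders.
gpu = "cuda:0"
random_seed = 42

[data]
# Parallel lists: n_examples_*[i] applies to the data at paths_*[i].
paths_train = ["data/train/"]
paths_validation = ["data/validation/"]
n_examples_train = [100000]
n_examples_validation = [10000]
```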