# Transnormer
Transnormer models are byte-level sequence-to-sequence models for normalizing historical German text.
> [!NOTE]
> - If you are only interested in using a Transnormer model for normalizing your data, see the first section of the README (Public models).
> - If you want to train your own Transnormer model or are interested in the implementation details, see the rest of the README.
## Public models
We release Transnormer models and evaluation results on the Hugging Face Hub.
### Overview
This is an overview of the published models and their evaluation results in comparison with an identity baseline:
| Model | Test set | Time period | WordAcc | WordAcc (-i) |
| --- | --- | --- | --- | --- |
| Identity baseline | DTA reviEvalCorpus-v1 | 1780-1899 | 91.45 | 93.25 |
| transnormer-19c-beta-v02 | DTA reviEvalCorpus-v1 | 1780-1899 | 98.88 | 99.34 |
| Identity baseline | DTAK-transnormer-v1 (18c) | 1700-1799 | 88.94 | 90.42 |
| transnormer-18-19c-beta-v01 | DTAK-transnormer-v1 (18c) | 1700-1799 | 99.53 | 99.62 |
| Identity baseline | DTAK-transnormer-v1 (19c) | 1800-1899 | 95.00 | 95.31 |
| transnormer-18-19c-beta-v01 | DTAK-transnormer-v1 (19c) | 1800-1899 | 99.46 | 99.53 |
Notes:
- The metric WordAcc is a harmonized word accuracy (Bawden et al. 2022); more details can be found below.
- (-i) denotes a case-insensitive version (i.e. deviations in casing between prediction and gold normalization are ignored).
- For the identity baseline, we only replace outdated characters with their modern counterparts (e.g. "ſ" -> "s", "aͤ" -> "ä") and otherwise treat the original text as the normalization.
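The baseline's replacement step can be sketched in a few lines of Python. Note that the replacement table below is illustrative and not the full character set used for the published results:

```python
# Minimal sketch of the identity baseline: replace outdated characters
# with their modern counterparts, otherwise leave the text unchanged.
# This mapping is an illustrative subset, not the full table.
REPLACEMENTS = {
    "ſ": "s",   # long s
    "aͤ": "ä",   # a with combining small e (U+0364)
    "oͤ": "ö",
    "uͤ": "ü",
}

def identity_baseline(text: str) -> str:
    """Return the input with only the character replacements applied."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

print(identity_baseline("Eyn Theylſtueck"))  # -> Eyn Theylstueck
```

Everything the table does not cover (e.g. outdated spellings like "Eyn") is passed through unchanged, which is why the baseline's WordAcc stays well below the models'.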
### How to normalize with the models
Transnormer models are easy to use for generating normalizations with the `transformers` library:
```python
from transformers import pipeline

transnormer = pipeline(model='ybracke/transnormer-19c-beta-v02')
sentence = "Die Königinn ſaß auf des Pallaſtes mittlerer Tribune."
print(transnormer(sentence, num_beams=4, max_length=128))
# >>> [{'generated_text': 'Die Königin saß auf des Palastes mittlerer Tribüne.'}]
```
The folder `demo/` contains notebooks and scripts that demonstrate in more detail how to apply the models.
### Training and test configurations for public models

To make training and evaluation for public models reproducible, the `training_config` and `test_config` files (see documentation below) for these models are published in the folder `configs/`.
## Installation
In order to reproduce model training and evaluation, install the dependencies and code as described in this section and refer to the documentation in the section on Usage.
### 1. Set up environment

#### 1.a On a GPU <!-- omit in toc -->

If you have a GPU available, you should first install and set up a conda environment.

```bash
conda install -y pip
conda create -y --name <environment-name> python=3.9 pip
conda activate <environment-name>
conda install -y cudatoolkit=11.3.1 cudnn=8.3.2 -c conda-forge
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install torch==1.12.1+cu113 torchvision torchaudio -f https://download.pytorch.org/whl/torch_stable.html
```
#### 1.b On a CPU <!-- omit in toc -->

Set up a virtual environment, e.g.:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
```
### 2.a Install package from GitHub

```bash
pip install git+https://github.com/ybracke/transnormer.git
```
### 2.b Editable install for developers

```bash
# Clone repo from GitHub
git clone git@github.com:ybracke/transnormer.git
cd ./transnormer
# Install package in editable mode
pip install -e .
# Install development requirements
pip install -r requirements-dev.txt
```
### 3. Requirements
To train a Transnormer model you need the following resources:
- A pre-trained encoder-decoder model (available on the Hugging Face Model Hub)
- A parallel corpus of historical language documents with (gold-)normalized labels (also available on the Hugging Face Hub)
- A file specifying the training configurations, see [Training config file](#training-config-file)
## Usage

### Quickstart

#### Quickstart Training

1. Specify the training parameters in the [training config file](#training-config-file)
2. Run the training script:

```bash
python3 src/transnormer/models/model_train.py
```

For more details, see below.
#### Quickstart Generation and Evaluation

1. Specify the generation parameters in the test config file
2. Specify file paths in `pred_eval.sh`, then run:

```bash
bash pred_eval.sh
```

For more details, see the sections on Generation and Evaluation.
### Preparation 1: Virtual environment

#### venv <!-- omit in toc -->

```bash
source .venv/bin/activate
```

#### Conda <!-- omit in toc -->

```bash
conda activate <environment-name>
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
```

- If you have multiple GPUs available and want to use only one (e.g. the GPU with index 1): `export CUDA_VISIBLE_DEVICES=1`
- Set `gpu = "cuda:0"` in the config file
- Run `export TOKENIZERS_PARALLELISM=false` to get rid of parallelism warning messages
### Preparation 2: Data preparation
The training and test data must be in JSONL format, where each record is a parallel training sample, e.g. a sentence. The records in the files must at least have the following format:
```jsonc
{
  "orig": "Eyn Theylſtueck", // original spelling
  "norm": "Ein Teilstück"    // normalized spelling
}
```
See the repository `transnormer-data` for more information.
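A file in this format can be produced with the standard `json` module. A minimal sketch, where the file name and sample sentences are made up for illustration:

```python
import json

# Hypothetical parallel samples in the expected format:
# one {"orig": ..., "norm": ...} record per line of the JSONL file.
samples = [
    {"orig": "Eyn Theylſtueck", "norm": "Ein Teilstück"},
    {"orig": "Die Königinn ſaß", "norm": "Die Königin saß"},
]

# Write one JSON object per line (JSONL).
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in samples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read the file back, line by line.
with open("train.jsonl", encoding="utf-8") as f:
    data = [json.loads(line) for line in f]

print(data[0]["norm"])  # -> Ein Teilstück
```

Note that strict JSON does not allow comments; the `//` annotations in the example above are for explanation only and must not appear in actual data files.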
### 1. Model training

1. Specify the training parameters in the [config file](#training-config-file)
2. Run the training script:

```bash
python3 src/transnormer/models/model_train.py
```

Training can take multiple hours, so consider using `nohup`:

```bash
nohup nice python3 src/transnormer/models/model_train.py &
```
### Training config file

The file `training_config.toml` specifies the training configurations, e.g. training data, base model, and training hyperparameters. Update the file before fine-tuning a model.
The following paragraphs provide detailed explanations of each section and parameter within the configuration file to facilitate effective model training.
#### 1. Select GPU <!-- omit in toc -->

The `gpu` parameter sets the GPU device used for training. You can set it to the desired GPU identifier, such as `"cuda:0"`, ensuring compatibility with the CUDA environment. Remember to set the appropriate CUDA visible devices beforehand, if required (e.g. `export CUDA_VISIBLE_DEVICES=1` to use only the GPU with index 1).
#### 2. Random Seed (Reproducibility) <!-- omit in toc -->

The `random_seed` parameter defines a fixed random seed (42 in the default settings) to ensure reproducibility of the training process.
#### 3. Data Paths and Subset Sizes <!-- omit in toc -->

The `[data]` section references the training and evaluation data. `paths_train` and `paths_validation` are lists of paths to JSONL files or to directories that only contain JSONL files. See [data preparation](#preparation-2-data-preparation) for more information on the data format. Additionally, `n_examples_train` and `n_examples_validation` specify the number of examples to be used from each dataset split during training.

Both `paths_{split}` and `n_examples_{split}` are lists: the number at `n_examples_{split}[i]` refers to the number of examples drawn from the data at `paths_{split}[i]`.
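Putting the parameters described so far together, a `training_config.toml` might look like the following sketch. All paths and sizes are placeholder values, and the actual config file may contain further parameters (e.g. for the base model and training hyperparameters):

```toml
# Illustrative sketch of a training_config.toml; values are placeholders.
gpu = "cuda:0"
random_seed = 42

[data]
# Parallel lists: n_examples_*[i] applies to the data at paths_*[i].
paths_train = ["data/train/"]
paths_validation = ["data/validation/"]
n_examples_train = [100000]
n_examples_validation = [10000]
```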