Historical Text Normalization

Compiled tools, datasets, and other resources for historical text normalization.

The resources provided here were originally published along with the following publication:

@inproceedings{bollmann2019-largescale,
  author = {Bollmann, Marcel},
  title = {A Large-Scale Comparison of Historical Text Normalization Systems},
  booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  location = {Minneapolis, Minnesota},
  publisher = {Association for Computational Linguistics},
  year = {2019},
  pages = {3885--3898},
  url = {http://www.aclweb.org/anthology/N19-1389},
}

If you use an original part of this repository (such as the provided scripts or the previously unpublished dataset splits), I would appreciate it if you cited the above-mentioned publication. If you use one of the referenced datasets and/or tools, please remember to (also) cite these accordingly.

For further reading, additional details, background information, and supplementary material are available via the repository on GitHub.

Datasets

Language   | Source Corpus | Time Period  | Genre            | Tokens (total) | Source of Splits
-----------|---------------|--------------|------------------|----------------|-------------------
English¹   | ICAMET        | 1386-1698    | Letters          | 188,158        | HistCorp
German     | Anselm        | 14th-16th c. | Religion         | 71,570         | prev. unpublished
German     | RIDGES        | 1482-1652    | Science          | 71,570         | prev. unpublished
Hungarian  | HGDS          | 1440-1541    | Religion         | 172,064        | HistCorp
Icelandic  | IcePaHC       | 15th c.      | Religion         | 65,267         | HistCorp
Portuguese | Post Scriptum | 15th-19th c. | Letters          | 306,946        | prev. unpublished
Slovene    | goo300k       | 1750-1899    | Mixed            | 326,538        | KonvNormSl 1.0
Spanish    | Post Scriptum | 15th-19th c. | Letters          | 132,248        | prev. unpublished
Swedish    | GaW           | 1527-1812    | Official Records | 65,571         | HistCorp

¹Due to licensing restrictions, the ICAMET dataset may not be distributed further, but the HistCorp website contains instructions on how to obtain the same dataset splits.

Helpful Scripts

The scripts/ directory contains a collection of scripts that were used for the normalization experiments in Bollmann (2019). This includes preprocessing scripts, evaluation and significance testing scripts, and more. For details, please see the README file in the scripts/ folder.
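
As a point of reference, word-level accuracy, one of the metrics reported in Bollmann (2019), boils down to a line-by-line comparison of predictions against the gold column. The sketch below is not one of the provided scripts; it assumes a gold file in the repository's tab-separated format and a prediction file with one normalization per line, and the file names in the usage comment are hypothetical:

# Illustrative sketch only -- not one of the scripts shipped in scripts/.
# Assumes: the gold file is tab-separated ("historical<TAB>gold normalization"),
# and the prediction file holds one predicted normalization per line, in the same order.
import sys

def word_accuracy(gold_path, pred_path):
    with open(gold_path, encoding="utf-8") as g:
        gold = [line.rstrip("\n").split("\t")[1] for line in g if line.strip()]
    with open(pred_path, encoding="utf-8") as p:
        pred = [line.rstrip("\n") for line in p if line.strip()]
    pairs = list(zip(gold, pred))
    correct = sum(1 for g_norm, p_norm in pairs if g_norm == p_norm)
    return correct / len(pairs)

if __name__ == "__main__":
    # e.g.: python3 word_accuracy.py german-anselm.dev.txt dev.memo.pred
    print(f"Word accuracy: {word_accuracy(sys.argv[1], sys.argv[2]):.4f}")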

TL;DR: The Recommended Normalization Approach

In most cases, you want to combine a naive memorization baseline (for in-vocabulary tokens) with a good, trained model (for out-of-vocabulary tokens).

  • If you have little training data (<500 tokens), you probably want to use the Norma tool (which already includes a naive memorization component); see "Using Norma" for details.

  • Otherwise, the results from Bollmann (2019) suggest using cSMTiser as the trained model in this scenario; see below under "Using cSMTiser".

The naive memorization component can be trained as follows:

scripts/memorizer.py train german-lexicon.txt german-anselm.train.txt

Apply it via:

scripts/memorizer.py apply german-lexicon.txt german-anselm.dev.txt > dev.memo.pred

To combine naive memorization with a trained model (for out-of-vocabulary tokens), first train and apply one of the normalizers discussed below. If the predictions of that trained model are in dev.model.pred, you can then apply this combined strategy via:

scripts/memorizer.py combine german-lexicon.txt dev.model.pred german-anselm.dev.txt

This will output a new prediction file that returns the learned memorization if possible, and the corresponding line from dev.model.pred otherwise.
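
Conceptually, this combined strategy is just a dictionary lookup with a fallback. The Python sketch below is not the actual scripts/memorizer.py (in particular, it ignores that script's command-line interface and lexicon file handling); it only illustrates the idea, assuming the tab-separated training files described further below:

from collections import Counter, defaultdict

def build_lexicon(train_path):
    # For every historical form seen in training, count how often each gold
    # normalization occurs and keep the most frequent one ("naive memorization").
    counts = defaultdict(Counter)
    with open(train_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            hist, gold = line.rstrip("\n").split("\t")[:2]
            counts[hist][gold] += 1
    return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}

def combine(lexicon, historical_tokens, model_predictions):
    # Use the memorized normalization if the historical form was seen in
    # training; otherwise fall back to the trained model's prediction.
    return [lexicon.get(hist, model_pred)
            for hist, model_pred in zip(historical_tokens, model_predictions)]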

Tools

Several normalization tools are evaluated in Bollmann (2019); instructions for using them with the data in this repository are given below.

The detailed instructions below assume that the data files are provided in the same format as contained in this repository; i.e., as tab-separated text files where the first column contains a historical word form and the second column contains its normalization.
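
For illustration, the first few lines of such a file could look as follows (made-up historical/modern German pairs; the two columns are separated by a literal tab character):

vnnd	und
iar	jahr
stat	stadt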

Using Norma

Norma (and at least one of its dependencies) needs to be compiled manually on your system before it can be used. Detailed instructions for this can be found in the Norma repository.

To use Norma, you need to:

  1. Prepare a configuration file; you can use the recommended configuration file, but should adjust the filenames given inside.

  2. Prepare a lexicon of contemporary word forms. You can use the contemporary datasets provided here for this purpose, and create a lexicon file with the following command (example given for German):

    norma_lexicon -w datasets/modern/combined.de.uniq -a lexicon.de.fsm -l lexicon.de.sym -c
    

    Make sure that the names of the lexicon files match what is given in your norma.cfg before you start training.
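
The word list passed via -w (here datasets/modern/combined.de.uniq) is presumably a plain-text file with one contemporary word form per line, along the lines of this made-up excerpt; if in doubt, inspect the provided file directly:

und
jahr
stadt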

Data files for Norma need to be in two-column, tab-separated format. To train a new model, use:

normalize -c norma.cfg -f german-anselm.train.txt -s -t --saveonexit

The names of the saved model files are defined in norma.cfg. Generating normalizations is done via:

normalize -c norma.cfg -f german-anselm.dev.txt -s > german-anselm.predictions

Using Marian

You need to install the Marian framework and clone the normalization-NMT repository on your local machine. You then need to:

  1. Preprocess the input into separate source/target files with whitespace-separated characters. This format can easily be generated as follows:

    mkdir preprocessed
    scripts/convert_to_charseq.py german-anselm.{train,test,dev}.txt --to preprocessed
    

    This will create the preprocessed input files (named train.src, train.trg, etc.) in the preprocessed/ subdirectory; a short illustration of this format is given after this list.

  2. Edit the train_seq2seq.sh script that comes with normalization-NMT to point to the correct paths (for Marian and the preprocessed input), as well as adjust the GPU memory settings and device ID to the correct values for your system. As an example, check out the modified script used for the experiments in Bollmann (2019).
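
To make the character-separated format concrete: assuming one word form per line (an assumption based on step 1 above, not verified against convert_to_charseq.py), the made-up example pairs from the dataset format section would roughly end up as:

preprocessed/train.src:

v n n d
i a r

preprocessed/train.trg:

u n d
j a h r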

Then, training the model is as simple as calling:

bash train_seq2seq.sh

Generating normalizations is best done by calling marian-decoder directly, like this:

cat preprocessed/dev.src | $MARIAN_PATH/marian-decoder -c $MODELDIR/model.npz.best-perplexity.npz.decoder.yaml -m $MODELDIR/model.npz.best-perplexity.npz --quiet-translation --device 0 --mini-batch 16 --maxi-batch 100 --maxi-batch-sort src -w 10000 --beam-size 5 | sed 's/ //g' > german-anselm.predictions

Marian outputs predictions in the same format as the input files, i.e. with whitespace-separated characters, which is why the output is piped through sed 's/ //g' to obtain the regular word forms. You can skip this step, of course, but the regular (non-space-separated) representation is what the tab-separated gold files and the other scripts in this repository, such as the memorizer combination step, expect.
