Histnorm
Compiled tools, datasets, and other resources for historical text normalization.
Historical Text Normalization
The resources provided here were originally published alongside the following paper:
- Marcel Bollmann. 2019. A Large-Scale Comparison of Historical Text Normalization Systems. In Proceedings of NAACL-HLT 2019.
@inproceedings{bollmann2019-largescale,
author = {Bollmann, Marcel},
title = {A Large-Scale Comparison of Historical Text Normalization Systems},
booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
location = {Minneapolis, Minnesota},
publisher = {Association for Computational Linguistics},
year = {2019},
pages = {3885--3898},
url = {http://www.aclweb.org/anthology/N19-1389},
}
If you use an original part of this repository (such as the provided scripts or the previously unpublished dataset splits), I would appreciate it if you cited the above-mentioned publication. If you use one of the referenced datasets and/or tools, please remember to (also) cite these accordingly.
Additional details and background information can be found in:
- Marcel Bollmann. 2018. Normalization of Historical Texts with Neural Network Models. Bochumer Linguistische Arbeitsberichte, 22.
The following additional material is also available:
- Table 6 with raw accuracy numbers for the plots provided in Figures 3 and 4 in Bollmann (2019).
Datasets
| Language   | Source Corpus | Time Period  | Genre            | Tokens (total) | Source of Splits  |
| ---------- | ------------- | ------------ | ---------------- | -------------- | ----------------- |
| English¹   | ICAMET        | 1386-1698    | Letters          | 188,158        | HistCorp          |
| German     | Anselm        | 14th-16th c. | Religion         | 71,570         | prev. unpublished |
| German     | RIDGES        | 1482-1652    | Science          | 71,570         | prev. unpublished |
| Hungarian  | HGDS          | 1440-1541    | Religion         | 172,064        | HistCorp          |
| Icelandic  | IcePaHC       | 15th c.      | Religion         | 65,267         | HistCorp          |
| Portuguese | Post Scriptum | 15th-19th c. | Letters          | 306,946        | prev. unpublished |
| Slovene    | goo300k       | 1750-1899    | Mixed            | 326,538        | KonvNormSl 1.0    |
| Spanish    | Post Scriptum | 15th-19th c. | Letters          | 132,248        | prev. unpublished |
| Swedish    | GaW           | 1527-1812    | Official Records | 65,571         | HistCorp          |
¹Due to licensing restrictions, the ICAMET dataset may not be distributed further, but the HistCorp website contains instructions on how to obtain the same dataset splits.
Helpful Scripts
The scripts/ directory contains a collection of scripts that were
used in the process of running the normalization experiments in Bollmann (2019).
This includes preprocessing scripts, evaluation and significance
testing scripts, and more. For more details, please see the README file in
the scripts/ folder.
TL;DR: The Recommended Normalization Approach
In most cases, you want to combine a naive memorization baseline (for in-vocabulary tokens) with a good, trained model (for out-of-vocabulary tokens).
- If you have little training data (<500 tokens), you probably want to use the Norma tool (which already includes a naive memorization component); see "Using Norma" for details.
- Otherwise, the results from Bollmann (2019) suggest using cSMTiser as the trained model in this scenario; see below under "Using cSMTiser".
The naive memorization component can be trained as follows:
scripts/memorizer.py train german-lexicon.txt german-anselm.train.txt
Apply it via:
scripts/memorizer.py apply german-lexicon.txt german-anselm.dev.txt > dev.memo.pred
To combine naive memorization with a trained model (for out-of-vocabulary
tokens), first train and apply one of the normalizers discussed below. If the
predictions of that trained model are in dev.model.pred, you can then apply
this combined strategy via:
scripts/memorizer.py combine german-lexicon.txt dev.model.pred german-anselm.dev.txt
This will output a new prediction file that returns the learned memorization if
possible, and the corresponding line from dev.model.pred otherwise.
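The combined strategy can be illustrated with a minimal Python sketch. This is not the implementation in scripts/memorizer.py; the function names `train_memorizer` and `combine` and the toy word pairs are hypothetical, and the sketch assumes that "memorization" means picking the most frequent normalization seen in training for each historical form.

```python
from collections import Counter, defaultdict


def train_memorizer(pairs):
    """Map each historical form to its most frequent normalization.

    Hypothetical helper for illustration; not the repository's memorizer.py.
    """
    counts = defaultdict(Counter)
    for historical, normalized in pairs:
        counts[historical][normalized] += 1
    # most_common(1) returns a one-element list of (normalization, count)
    return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}


def combine(lexicon, tokens, model_predictions):
    """Use the memorized form for known tokens, the model's output otherwise."""
    return [lexicon.get(tok, pred) for tok, pred in zip(tokens, model_predictions)]


# Toy training pairs (made up for this example)
train = [("vnd", "und"), ("vnd", "und"), ("vnd", "unt"), ("jn", "in")]
lexicon = train_memorizer(train)

# "vnd" is in-vocabulary, so the memorized form wins over the model's guess;
# "czu" is out-of-vocabulary, so the model's prediction is kept.
combined = combine(lexicon, ["vnd", "czu"], ["vnnd", "zu"])
print(combined)  # ['und', 'zu']
```

The point of the design is that the trained model is only trusted where the training data offers no direct evidence for a token.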
Tools
The following tools are evaluated in Bollmann (2019):
- Norma, described in Bollmann (2012)
- Marian (NMT) for normalization, described in Tang et al. (2018)
- XNMT, following the model of Bollmann (2018)
- cSMTiser (wrapping Moses)
The detailed instructions below assume that the data files are provided in the same format as contained in this repository; i.e., as tab-separated text files where the first column contains a historical word form and the second column contains its normalization.
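As a rough illustration of this format, the following Python sketch reads two-column, tab-separated data into a list of pairs. The helper name `read_pairs` and the sample words are hypothetical, and skipping blank lines is an assumption rather than documented behavior of the provided datasets or scripts.

```python
import io


def read_pairs(handle):
    """Read (historical, normalized) pairs from a two-column, tab-separated stream."""
    pairs = []
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines carry no token and are skipped
        columns = line.split("\t")
        if len(columns) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(columns)}")
        pairs.append((columns[0], columns[1]))
    return pairs


# In-memory sample standing in for a file such as german-anselm.train.txt
sample = io.StringIO("vnd\tund\njn\tin\n")
pairs = read_pairs(sample)
print(pairs)  # [('vnd', 'und'), ('jn', 'in')]
```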
Using Norma
Norma (and at least one of its dependencies) needs to be compiled manually on your system before it can be used. Detailed instructions for this can be found in the Norma repository.
To use Norma, you need to:
- Prepare a configuration file; you can use the recommended configuration file, but should adjust the filenames given inside.
- Prepare a lexicon of contemporary word forms. You can use the contemporary datasets provided here for this purpose, and create a lexicon file with the following command (example given for German):
  norma_lexicon -w datasets/modern/combined.de.uniq -a lexicon.de.fsm -l lexicon.de.sym -c
  Make sure that the names of the lexicon files match what is given in your norma.cfg before you start training.
Data files for Norma need to be in two-column, tab-separated format. To train a new model, use:
normalize -c norma.cfg -f german-anselm.train.txt -s -t --saveonexit
The names of the saved model files are defined in norma.cfg. Generating
normalizations is done via:
normalize -c norma.cfg -f german-anselm.dev.txt -s > german-anselm.predictions
Using Marian
You need to install the Marian framework and clone the normalization-NMT repository on your local machine. You then need to:
- Preprocess the input to be in separate source/target files with whitespace-separated characters. This format can be easily generated as follows:
  mkdir preprocessed
  scripts/convert_to_charseq.py german-anselm.{train,test,dev}.txt --to preprocessed
  This will create the preprocessed input files (named train.src, train.trg, etc.) in the preprocessed/ subdirectory.
- Edit the train_seq2seq.sh script that comes with normalization-NMT to point to the correct paths (for Marian and the preprocessed input), as well as adjust the GPU memory settings and device ID to the correct values for your system. As an example, check out the modified script used for the experiments in Bollmann (2019).
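To make the whitespace-separated character format concrete, here is a minimal Python sketch of the conversion in both directions. It is not the code of scripts/convert_to_charseq.py; the function names `to_charseq` and `from_charseq` are hypothetical, and the sketch assumes that word forms themselves contain no spaces.

```python
def to_charseq(word):
    """Turn a word form into whitespace-separated characters."""
    return " ".join(word)


def from_charseq(line):
    """Join a character sequence back into a word (same effect as sed 's/ //g')."""
    return line.replace(" ", "")


# Toy (historical, normalized) pairs, made up for this example
pairs = [("vnd", "und"), ("jn", "in")]

# One line per token: source side from column 1, target side from column 2
src_lines = [to_charseq(historical) for historical, _ in pairs]
trg_lines = [to_charseq(normalized) for _, normalized in pairs]

print(src_lines)                   # ['v n d', 'j n']
print(from_charseq(trg_lines[0]))  # und
```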
Then, training the model is as simple as calling:
bash train_seq2seq.sh
Generating normalizations is best done by calling marian-decoder directly,
like this:
cat preprocessed/dev.src | $MARIAN_PATH/marian-decoder -c $MODELDIR/model.npz.best-perplexity.npz.decoder.yaml -m $MODELDIR/model.npz.best-perplexity.npz --quiet-translation --device 0 --mini-batch 16 --maxi-batch 100 --maxi-batch-sort src -w 10000 --beam-size 5 | sed 's/ //g' > german-anselm.predictions
Marian outputs predictions in the same format as the input files, i.e. with
whitespace-separated characters, which is why we pipe it through sed 's/ //g'
to obtain the regular representations. You can skip this part, of course, but
it's required if you
