Uctodata
Datafiles for the tokenizer ucto.
uctodata 0.10 CLST/ILK 2009 - 2024
https://github.com/LanguageMachines/uctodata/
Website and documentation: https://languagemachines.github.io/ucto
uctodata provides datafiles for the tokenizer ucto for several languages. The
language code is passed to ucto with the -L parameter (e.g. ucto -L nld input.txt).
The following codes are available:
eng - English
eng-twitter - English twitter texts
nld - Dutch
nld-historical - Historical Dutch texts
nld-twitter - Dutch twitter texts
deu - German
fra - French
ita - Italian
spa - Spanish
por - Portuguese
rus - Russian
swe - Swedish
tur - Turkish
fry - Frisian
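As a rough illustration of how the language codes above might be used in a script, the sketch below checks a code against that list before calling ucto. The helper names is_ucto_lang and tokenize_with are hypothetical and are not part of ucto or uctodata. The only ucto invocation used is the -L form shown above.

```shell
#!/bin/sh
# Language codes copied from the list in this README.
UCTO_LANGS="eng eng-twitter nld nld-historical nld-twitter deu fra ita spa por rus swe tur fry"

# is_ucto_lang CODE
# Hypothetical helper: succeeds if CODE is one of the codes listed above.
is_ucto_lang() {
    for code in $UCTO_LANGS; do
        [ "$code" = "$1" ] && return 0
    done
    return 1
}

# tokenize_with CODE FILE
# Hypothetical helper: runs "ucto -L CODE FILE" only if CODE is a listed code,
# otherwise prints an error to stderr and returns status 2.
tokenize_with() {
    if is_ucto_lang "$1"; then
        ucto -L "$1" "$2"
    else
        echo "unknown language code: $1" >&2
        return 2
    fi
}
```

With these definitions loaded, tokenize_with nld input.txt would run the same command as the example above, while an unlisted code is rejected before ucto is started.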
uctodata is architecture independent.
To install uctodata, first check whether your distribution's package manager offers an up-to-date package.
To compile and install manually from source instead:
$ bash bootstrap.sh
$ ./configure
$ make
$ make install
