# KazNLP: NLP tools for the Kazakh language
This project aims at building free/open-source language processing tools for Kazakh. The proposed set of tools is designed to tackle a wide range of NLP problems, which can be divided into pre-processing, core processing, and application-level routines. The tools are implemented in the Python 3 programming language and released under CC-BY-SA or compatible licenses.
If you would like to cite the project in your research or elsewhere, for now, please cite the individual tools separately, using the corresponding citing instructions. A single paper describing the toolkit as a whole is coming soon...
The project is implemented by the computer science lab of the National Laboratory Astana, Nazarbayev University.
Contact info: work it out from the code below, or just run it (requires Python 3.6+ or an online interpreter):

```python
frst_name = 'Aibek'
last_name = 'Makazhanov'
sep1, sep2 = '.', '@'
print(f'\n{frst_name.lower()}{sep1}{last_name.lower()}{sep2}nu{sep1}edu{sep1}kz\n')
```
<hr>
## Contents

1. Installation
2. Initial normalization module
   - 2.1 Example usage
   - 2.2 Citing
3. Tokenizers
   - 3.1 Example usage
   - 3.2 Citing
4. Language identification
   - 4.1 Example usage
   - 4.2 Citing
5. Morphological processing
   - 5.1 Analyzer
   - 5.2 Tagger
   - 5.3 Example usage
   - 5.4 Citing
6. References
<hr>

## <a name="ch1"></a> 1. Installation
For software-engineering reasons, the tools are not yet released as a single PyPI package, so for now there is no installation per se. To use the tools, just clone this repository like so:
> git clone https://github.com/nlacslab/kaznlp.git
Alternatively, you can download and unzip the archived version: https://github.com/nlacslab/kaznlp/archive/master.zip
The archive or the cloned repository will contain the kaznlp directory, which hosts all the code and models.
In order to import the tools correctly, your own code must be in the same directory as kaznlp (not inside kaznlp).
The current release contains tutorial.py, which is located in the root of the repo, i.e. in the same directory as kaznlp.
To check that everything is working, run this test code like so:
> python tutorial.py
This should generate a lengthy output, which is discussed in the usage-example sections for each tool. For now, just make sure that there are no error messages.
If you encounter problems with the above command, make sure that you are using Python 3.6 or a higher version.
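As a quick sanity check (a minimal snippet for illustration, not part of the toolkit), you can verify the interpreter version from within Python itself:

```python
import sys

# the KazNLP code uses f-strings, which require Python 3.6+
if sys.version_info < (3, 6):
    raise RuntimeError('Python 3.6 or higher is required')
print('Python version OK:', sys.version.split()[0])
```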
## <a name="ch2"></a> 2. Initial normalization module
Noisy User-Generated Text (NUGT) is notoriously difficult to process due to the rapid introduction of neologisms, e.g. esketit (stands for let’s get it, pronounced [ɛɕˈkerɛ]), and peculiar spelling, e.g. b4 (stands for before). Moreover, speakers of more than one language tend to mix them in NUGT (a phenomenon commonly referred to as code-switching) and/or use transliteration (spelling in non-national alphabets). All of this increases lexical variety, thereby aggravating the most prominent problems of CL/NLP, such as out-of-vocabulary lexica and data sparseness.
The Kazakhstani segment of the Internet is not exempt from NUGT, and the following cases are the usual suspects in wreaking the “spelling mayhem”:
- spontaneous transliteration – switching alphabets while respecting no particular rules or standards, e.g. the Kazakh word “біз” (we as a pronoun; awl as a noun) can be spelled in three additional ways: “биз”, “быз”, and “biz”;
- use of homoglyphs – interchangeable use of identical or similar-looking Latin and Cyrillic letters, e.g. the Cyrillic letters “е” (U+0435), “с” (U+0441), “і” (U+0456), and “р” (U+0440) in the Kazakh word «есірткі» (drugs) can be replaced with the Latin homoglyphs “e” (U+0065), “c” (U+0063), “i” (U+0069), and “p” (U+0070), which, although they appear identical, have different Unicode values;
- code switching – use of Russian words and expressions in Kazakh text and vice versa;
- word transformations – excessive duplication of letters, e.g. “керемееет” instead of “керемет” (great), or segmentation of words, e.g. “к е р е м е т”.
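To see why homoglyphs are a problem, the snippet below (an illustration, not part of the toolkit) rebuilds the «есірткі» example from the list above and shows that two visually identical strings differ at the code-point level:

```python
cyr = 'есірткі'  # "drugs", spelled entirely in Cyrillic
# swap Cyrillic "с" (U+0441) and "р" (U+0440) for their Latin
# homoglyphs "c" (U+0063) and "p" (U+0070)
mix = cyr.replace('\u0441', '\u0063').replace('\u0440', '\u0070')
print(cyr == mix)  # False: the strings look the same but are not equal
for a, b in zip(cyr, mix):
    if a != b:
        print(f'U+{ord(a):04X} vs U+{ord(b):04X}')
```

Any tool that compares strings byte-by-byte (dictionary lookup, frequency counting, etc.) will treat these as two different words, which is exactly the sparseness problem that homoglyph resolution addresses.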
We propose an approach for initial normalization of Kazakh NUGT. An important distinction must be drawn here: unlike lexical normalization, initial normalization does not attempt to recover the standard spelling of ill-formed words; in fact, we do not even bother detecting them. All that we really care about at this point is providing an intermediate representation of the input NUGT that will not necessarily match its lexically normalized version, but will be less sparse. Thus, we aim at improving the performance of downstream applications by reducing the vocabulary size (effectively, the parameter space) and the OOV rate.
Our approach amounts to applying the following straightforward procedures:
- noise reduction - removes or replaces "function" symbols (e.g. non-breaking spaces (U+00A0) become regular spaces, while "invisible" spaces (U+200B) get removed) and some other rarely used (in Kaznet) symbols;
- homoglyph resolution - given a mixed-script word (i.e. a word with both Latin and Cyrillic letters), tries to convert it to a single-script token by making appropriate substitutions for homoglyphs;
- transliteration (optional) - translates symbols of the Latin alphabet and national symbols of the Kazakh Cyrillic alphabet into Russian Cyrillic, which is, in our opinion, a common denominator for the three alphabets used in the Kazakh-Russian environment. See [1] for details;
- desegmentation (optional) - joins space-separated segmented words, e.g. "L O V E" becomes "LOVE";
- deduplication (optional) - collapses consecutive occurrences of the same character, e.g. "yesss" becomes "yes";
- emoji resolution (optional) - replaces emojis with their official Unicode descriptions, e.g. ☺ becomes "<emj>smilingface</emj>".
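The desegmentation and deduplication steps are easy to approximate with regular expressions. The sketch below is a hypothetical re-implementation for illustration only; the toolkit's own logic may differ (for one thing, naive deduplication like this would also collapse legitimate double letters):

```python
import re

def dedup(text):
    # collapse runs of two or more identical characters: "керемееет" -> "керемет"
    return re.sub(r'(.)\1+', r'\1', text)

def deseg(text):
    # join runs of three or more space-separated single characters:
    # "L O V E" -> "LOVE"
    return re.sub(r'(?<!\S)(?:\S ){2,}\S(?!\S)',
                  lambda m: m.group(0).replace(' ', ''), text)

print(dedup('керемееет'))      # -> керемет
print(deseg('к е р е м е т'))  # -> керемет
```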
### <a name="ch21"></a> 2.1 Example usage
```python
# =======================
# INITIAL NORMALIZATION =
# =======================

# import the initial normalization module
from kaznlp.normalization.ininorm import Normalizer

txt = 'Қайыpлы таӊ! Əнші бaлааапaн ☺️☺️☺️ L O V E 🇰🇿'
ininormer = Normalizer()

# by default only cleaning and script fixing are performed;
# returns a tuple: normalized text (a string) and stats (a dictionary).
# the stats dictionary has the following structure:
# stats = {'cleaned': number of "noisy" characters either deleted or replaced,
#          'l2c': number of mixed-script words converted to all-Cyrillic script,
#          'c2l': number of mixed-script words converted to all-Latin script}
(norm_txt, stats) = ininormer.normalize(txt)
print(f'Normalized text: {norm_txt.rjust(39)}')
print(f'Normalization stats: {stats}')

# stats can be omitted by setting the [stats] flag to False;
# in that case a single string is returned (not a tuple)
norm_txt = ininormer.normalize(txt, stats=False)

# let's compare the texts before and after normalization
voc = lambda x: len(set([c for c in x]))
print(f'\nOriginal text: {txt.rjust(49)}\t(len: {len(txt)}; vocab (chars): {voc(txt)})')
print(f'Normalized text: {norm_txt.rjust(39)}\t(len: {len(norm_txt)}; vocab (chars): {voc(norm_txt)})')
```

As we can see, the normalized string is shorter than the original and has fewer unique characters:

```
Normalized text: Қайырлы таң! Әнші балааапан ☺️☺️☺️ L O V E 🇰🇿
Normalization stats: {'cleaned': 3, 'l2c': 2,
```