Tokenizer
A tokenizer for Icelandic text.
Overview
Tokenization is a necessary first step in many natural language processing tasks, such as word counting, parsing, spell checking, corpus generation, and statistical analysis of text.
Tokenizer is a compact pure-Python (>=3.9) executable program and module for tokenizing Icelandic text. It converts input text to streams of tokens, where each token is a separate word, punctuation sign, number/amount, date, e-mail, URL/URI, etc. It also segments the token stream into sentences, considering corner cases such as abbreviations and dates in the middle of sentences.
The package contains a dictionary of common Icelandic abbreviations,
in the file src/tokenizer/Abbrev.conf.
Tokenizer is an independent spinoff from the Greynir project (GitHub repository here), by the same authors. The Greynir natural language parser for Icelandic uses Tokenizer on its input.
Tokenizer is licensed under the MIT license.
Indicative performance
Time to tokenize 1 MB of a wide selection of texts from the Icelandic Gigaword Corpus using a 64-bit 2.6 GHz Intel Core i9:
| | Time (sec) |
|---------------|------------:|
| CPython 3.12 | 25.27 |
| PyPy 3.11 | 8.08 |
Running tokenization with PyPy is about 3x faster than with CPython.
Deep vs. shallow tokenization
Tokenizer can do both deep and shallow tokenization.
Shallow tokenization simply returns each sentence as a string (or as a line of text in an output file), where the individual tokens are separated by spaces.
Deep tokenization returns token objects that have been annotated with the token type and further information extracted from the token, for example a (year, month, day) tuple in the case of date tokens.
In shallow tokenization, tokens are in most cases kept intact, although
consecutive white space is always coalesced. The input strings
"800 MW", "21. janúar" and "800 7000" thus become
two tokens each, output with a single space between them.
In deep tokenization, the same strings are represented by single token objects,
of type TOK.MEASUREMENT, TOK.DATEREL and TOK.TELNO, respectively.
The text associated with a single token object may contain spaces,
although consecutive whitespace is always coalesced into a single space " ".
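The coalescing behavior can be sketched in isolation with a plain regular expression (an illustration only, not the tokenizer's actual implementation):

```python
import re

def coalesce_whitespace(s: str) -> str:
    # Collapse any run of whitespace into a single space and trim
    # the ends, as happens to the text of a single token
    return re.sub(r"\s+", " ", s).strip()

print(coalesce_whitespace("800   MW"))  # -> "800 MW"
```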
By default, the command line tool performs shallow tokenization. If you
want deep tokenization with the command line tool, use the --json or
--csv switches.
From Python code, call split_into_sentences() for shallow tokenization,
or tokenize() for deep tokenization. These functions are documented with
examples below.
Installation
To install:
$ pip install tokenizer
Command line tool
After installation, the tokenizer can be invoked directly from the command line:
$ tokenize input.txt output.txt
Input and output files are assumed to be UTF-8 encoded. If the file names
are not given explicitly, stdin and stdout are used for input and output,
respectively.
Empty lines in the input are treated as hard sentence boundaries.
By default, the output consists of one sentence per line, where each
line ends with a single newline character (ASCII LF, chr(10), \n).
Within each line, tokens are separated by spaces.
The following (mutually exclusive) options can be specified on the command line:
| Option | Description |
|-------------|-----------------------------------------------------------|
| --csv | Deep tokenization. Output token objects in CSV format, one per line. Each line contains: token kind (number), normalized text, value (if applicable), original text with preserved whitespace, and character span indices. Sentences are separated by lines containing 0,"","","","". |
| --json | Deep tokenization. Output token objects in JSON format, one per line. Each JSON object contains: k (token kind), t (normalized text), v (value if applicable), o (original text with preserved whitespace), s (character span indices). |
Other options can be specified on the command line:
| Option | Description |
|------------------------------|-----------------------------------------------------------|
| -n, --normalize | Normalize punctuation: quotes output in Icelandic form („these“), ellipsis as single character (…), year ranges with en-dash (1914–1918), and em-dashes centered with spaces ( — ). This option is only applicable to shallow tokenization. |
| -s, --one_sent_per_line | Input contains strictly one sentence per line, i.e. every newline is a sentence boundary. |
| -o, --original | Output original token text, i.e. bypass shallow tokenization. This effectively runs the tokenizer as a sentence splitter only. |
| -m, --convert_measurements | The degree sign in tokens denoting temperature is normalized (200° C -> 200 °C). |
| -p, --coalesce_percent | A number followed by a token denoting a percentage word form (prósent, prósentustig, hundraðshlutar) is coalesced into a single token. |
| -g, --keep_composite_glyphs | Do not replace composite glyphs using Unicode COMBINING codes with their accented/umlaut counterparts. |
| -e, --replace_html_escapes | HTML escape codes are replaced by the characters they denote, such as &aacute; -> á. |
| -c, --convert_numbers | English-style decimal points and thousands separators in numbers changed to Icelandic style. |
Type tokenize -h or tokenize --help to get a short help message.
Example
$ echo "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize
3. janúar sl. keypti ég 64kWst rafbíl .
Hann kostaði €30.000 .
$ echo "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize --csv
19,"3. janúar","0|1|3","3.janúar","0-1-2-2-3-4-5-6-7"
6,"sl.","síðastliðinn"," sl.","1-2-3"
6,"keypti",""," keypti","1-2-3-4-5-6"
6,"ég",""," ég","3-4"
22,"64kWst","J|230400000.0"," 64kWst","1-2-3-4-5-6"
6,"rafbíl",""," rafbíl","1-2-3-4-5-6"
1,".",".",".","0"
0,"","","",""
6,"Hann",""," Hann","1-2-3-4"
6,"kostaði",""," kostaði","1-2-3-4-5-6-7"
13,"€30.000","30000|EUR"," € 30.000","1-3-4-5-6-7-8"
1,".",".",".","0"
0,"","","",""
$ echo "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000." | tokenize --json
{"k":"BEGIN SENT"}
{"k":"DATEREL","t":"3. janúar","v":[0,1,3],"o":"3.janúar","s":[0,1,2,2,3,4,5,6,7]}
{"k":"WORD","t":"sl.","v":["síðastliðinn"],"o":" sl.","s":[1,2,3]}
{"k":"WORD","t":"keypti","o":" keypti","s":[1,2,3,4,5,6]}
{"k":"WORD","t":"ég","o":" ég","s":[3,4]}
{"k":"MEASUREMENT","t":"64kWst","v":["J",230400000.0],"o":" 64kWst","s":[1,2,3,4,5,6]}
{"k":"WORD","t":"rafbíl","o":" rafbíl","s":[1,2,3,4,5,6]}
{"k":"PUNCTUATION","t":".","v":".","o":".","s":[0]}
{"k":"END SENT"}
{"k":"BEGIN SENT"}
{"k":"WORD","t":"Hann","o":" Hann","s":[1,2,3,4]}
{"k":"WORD","t":"kostaði","o":" kostaði","s":[1,2,3,4,5,6,7]}
{"k":"AMOUNT","t":"€30.000","v":[30000,"EUR"],"o":" € 30.000","s":[1,3,4,5,6,7,8]}
{"k":"PUNCTUATION","t":".","v":".","o":".","s":[0]}
{"k":"END SENT"}
CSV Output Format
When using --csv, each token is output as a CSV row with the following five fields:
- Token kind (number): Numeric code representing the token type (e.g., 6 for WORD, 19 for DATEREL, 1 for PUNCTUATION)
- Normalized text: The processed text of the token
- Value: The parsed value, if applicable (e.g., date tuples, amounts, abbreviation expansions), or empty string
- Original text: The original text including preserved whitespace
- Span indices: Character indices mapping each character in the normalized text to its position in the original text, separated by hyphens
Sentences are separated by rows containing 0,"","","","".
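Since the rows are plain CSV, a downstream consumer can read them with Python's standard csv module. The following sketch (using rows copied from the sample output above) shows one way to split fields and decode the hyphen-separated span indices:

```python
import csv
import io

# A few rows taken from the sample --csv output above
sample = '''6,"sl.","síðastliðinn"," sl.","1-2-3"
6,"keypti",""," keypti","1-2-3-4-5-6"
0,"","","",""
'''

for kind, text, value, original, spans in csv.reader(io.StringIO(sample)):
    if kind == "0" and text == "":
        # A row of 0,"","","","" marks a sentence boundary
        print("-- sentence boundary --")
        continue
    # Span indices are hyphen-separated integers
    indices = [int(i) for i in spans.split("-")]
    print(kind, text, value, original, indices)
```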
JSON Output Format
When using --json, each token is output as a JSON object on a separate line with the following fields:
- k (kind): The token type description (e.g., "WORD", "DATEREL", "PUNCTUATION")
- t (text): The normalized/processed text of the token
- v (value): The parsed value, if applicable (e.g., date tuples, amounts, abbreviation expansions)
- o (original): The original text including preserved whitespace
- s (span): Character indices mapping each character in the normalized text to its position in the original text
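Each line is a standalone JSON object and can be decoded with the standard json module. The sketch below (using a token copied from the sample output above) also checks the span mapping: each character of t corresponds, via s, to an index into o. This reconstruction is an illustration of the format, not part of the tokenizer API:

```python
import json

# One token object from the sample --json output above
line = '{"k":"AMOUNT","t":"€30.000","v":[30000,"EUR"],"o":" € 30.000","s":[1,3,4,5,6,7,8]}'
tok = json.loads(line)

# Each character of the normalized text maps to a position
# in the original text via the span array
for ch, pos in zip(tok["t"], tok["s"]):
    assert ch == tok["o"][pos]

print(tok["k"], tok["t"], tok["v"])  # -> AMOUNT €30.000 [30000, 'EUR']
```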
Python module
Shallow tokenization example
An example of shallow tokenization from Python code goes something like this:
from tokenizer import split_into_sentences
# A string to be tokenized, containing two sentences
s = "3.janúar sl. keypti ég 64kWst rafbíl. Hann kostaði € 30.000."
# Obtain a generator of sentence strings
g = split_into_sentences(s)
# Loop through the sentences
for sentence in g:
    # Obtain the individual token strings
    tokens = sentence.split()
    # Print the tokens, separated by vertical bars
    print("|".join(tokens))
The program outputs:
3.|janúar|sl.|keypti|ég|64kWst|rafbíl|.
Hann|kostaði|€30.000|.
Deep tokenization example
To do deep tokenization from within Python code:
from tokenizer import tokenize, TOK
text = ("Málinu var vísað til stjórnskipunar- og eftirlitsnefndar "
    "skv. 3. gr. XVII. kafla laga nr. 10/2007 þann 3. janúar 2010.")

for token in tokenize(text):
    print(TOK.descr[token.kind], token.txt or "-", token.val or "")