==================
`langid.py` readme
==================

Introduction
------------

langid.py is a standalone Language Identification (LangID) tool.

The design principles are as follows:

- Fast
- Pre-trained over a large number of languages (currently 97)
- Not sensitive to domain-specific features (e.g. HTML/XML markup)
- Single .py file with minimal dependencies
- Deployable as a web service

All that is required to run langid.py is Python 2.7 or later and numpy.
The main script langid/langid.py is cross-compatible with both Python2 and Python3, but the accompanying training tools are still Python2-only.

langid.py is WSGI-compliant. langid.py will use fapws3 as a web server if available, and default to wsgiref.simple_server otherwise.

langid.py comes pre-trained on 97 languages (ISO 639-1 codes given):

af, am, an, ar, as, az, be, bg, bn, br, 
bs, ca, cs, cy, da, de, dz, el, en, eo, 
es, et, eu, fa, fi, fo, fr, ga, gl, gu, 
he, hi, hr, ht, hu, hy, id, is, it, ja, 
jv, ka, kk, km, kn, ko, ku, ky, la, lb, 
lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, 
nb, ne, nl, nn, no, oc, or, pa, pl, ps, 
pt, qu, ro, ru, rw, se, si, sk, sl, sq, 
sr, sv, sw, ta, te, th, tl, tr, ug, uk, 
ur, vi, vo, wa, xh, zh, zu

The training data was drawn from 5 different sources:

- JRC-Acquis
- ClueWeb 09
- Wikipedia
- Reuters RCV2
- Debian i18n

Usage
-----

::

    langid.py [options]

    Options:
      -h, --help            show this help message and exit
      -s, --serve           launch web service
      --host=HOST           host/ip to bind to
      --port=PORT           port to listen on
      -v                    increase verbosity (repeat for greater effect)
      -m MODEL              load model from file
      -l LANGS, --langs=LANGS
                            comma-separated set of target ISO639 language codes
                            (e.g en,de)
      -r, --remote          auto-detect IP address for remote access
      -b, --batch           specify a list of files on the command line
      --demo                launch an in-browser demo application
      -d, --dist            show full distribution over languages
      -u URL, --url=URL     langid of URL
      --line                process pipes line-by-line rather than as a document
      -n, --normalize       normalize confidence scores to probability values

The simplest way to use langid.py is as a command-line tool, invoked with python langid.py. If you installed langid.py as a Python module (e.g. via pip install langid), you can invoke langid instead of python langid.py (the two are equivalent). This displays a prompt. Enter text to identify, and hit enter::

    This is a test
    ('en', -54.41310358047485)
    Questa e una prova
    ('it', -35.41771221160889)

langid.py can also detect when the input is redirected (only tested under Linux), and in this case will process until EOF rather than until newline like in interactive mode::

    python langid.py < README.rst
    ('en', -22552.496054649353)

The value returned is an unnormalized log-probability score for the language. Computing a normalized probability estimate is disabled by default, but can be enabled with the -n flag::

    python langid.py -n < README.rst
    ('en', 1.0)

More details are provided in this README in the section on Probability Normalization.

You can also use langid.py as a Python library::

    python
    Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
    [GCC 4.6.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import langid
    >>> langid.classify("This is a test")
    ('en', -54.41310358047485)

Finally, langid.py can use Python's built-in wsgiref.simple_server (or fapws3 if available) to provide language identification as a web service. To do this, launch python langid.py -s, and access http://localhost:9008/detect . The web service supports GET, POST and PUT. If GET is performed with no data, a simple HTML forms interface is displayed.

The response is generated in JSON; here is an example::

    {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

A utility such as curl can be used to access the web service::

    curl -d "q=This is a test" localhost:9008/detect

    {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}
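The same request can be issued from Python. The following is a minimal sketch using only the Python 3 standard library. It assumes the service was started with python langid.py -s and is reachable at localhost:9008 as above. The helper names detect and parse_detect_response are illustrative and are not part of langid.py, and treating any responseStatus other than 200 as a failure is an assumption.

```python
import json
import urllib.parse
import urllib.request


def parse_detect_response(body):
    """Extract (language, confidence) from a JSON body shaped like the example above."""
    payload = json.loads(body)
    # Assumption: any status other than 200 signals a failed request.
    if payload.get("responseStatus") != 200:
        raise ValueError("service reported an error: %r" % (payload.get("responseDetails"),))
    data = payload["responseData"]
    return data["language"], data["confidence"]


def detect(text, url="http://localhost:9008/detect"):
    """POST `text` as a q=... form field, mirroring `curl -d "q=..."`."""
    form = urllib.parse.urlencode({"q": text}).encode("utf-8")
    with urllib.request.urlopen(url, data=form) as response:
        return parse_detect_response(response.read().decode("utf-8"))
```

With the service running, detect("This is a test") should yield a (language, confidence) tuple corresponding to the JSON printed by curl.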

You can also use HTTP PUT::

    curl -T readme.rst localhost:9008/detect

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  2871  100   119  100  2752    117   2723  0:00:01  0:00:01 --:--:--  2727
    {"responseData": {"confidence": -22552.496054649353, "language": "en"}, "responseDetails": null, "responseStatus": 200}

If no "q=XXX" key-value pair is present in the HTTP POST payload, langid.py will interpret the entire file as a single query. This allows for redirection via curl::

    echo "This is a test" | curl -d @- localhost:9008/detect

    {"responseData": {"confidence": -54.41310358047485, "language": "en"}, "responseDetails": null, "responseStatus": 200}

langid.py will attempt to discover the host IP address automatically. Often, this is set to localhost (127.0.1.1), even though the machine has a different external IP address. To have langid.py attempt to discover the external IP address instead, start it with the -r flag.

langid.py supports constraining of the output language set using the -l flag and a comma-separated list of ISO639-1 language codes (the -n flag enables probability normalization)::

    python langid.py -n -l it,fr
    Io non parlo italiano
    ('it', 0.99999999988965627)
    Je ne parle pas français
    ('fr', 1.0)
    I don't speak english
    ('it', 0.92210605672341062)

When using langid.py as a library, the set_languages method can be used to constrain the language set::

    python
    Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
    [GCC 4.6.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import langid
    >>> langid.classify("I do not speak english")
    ('en', 0.57133487679900674)
    >>> langid.set_languages(['de','fr','it'])
    >>> langid.classify("I do not speak english")
    ('it', 0.99999835791478453)
    >>> langid.set_languages(['en','it'])
    >>> langid.classify("I do not speak english")
    ('en', 0.99176190378750373)
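Conceptually, constraining the language set means the best-scoring language is chosen from the allowed candidates only, which is why an English sentence is labelled 'it' once 'en' is excluded. The sketch below illustrates that idea with made-up scores; best_language is a hypothetical helper, and this is not a description of how langid.py implements set_languages internally.

```python
def best_language(log_scores, allowed=None):
    """Return the (language, score) pair with the highest score among `allowed`."""
    if allowed is None:
        candidates = dict(log_scores)
    else:
        candidates = {lang: s for lang, s in log_scores.items() if lang in allowed}
    if not candidates:
        raise ValueError("no candidate languages left after filtering")
    lang = max(candidates, key=candidates.get)
    return lang, candidates[lang]


# Made-up log-scores for one English sentence; less negative is better.
scores = {"en": -40.0, "it": -55.0, "fr": -58.0, "de": -61.0}

unconstrained = best_language(scores)                           # picks "en"
constrained = best_language(scores, allowed={"de", "fr", "it"})  # "en" is excluded
```

The constrained call can only return one of the allowed codes, even when none of them is a good fit for the input.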

Batch Mode
----------

langid.py supports batch mode processing, which can be invoked with the -b flag. In this mode, langid.py reads a list of paths to files to classify as arguments. If no arguments are supplied, langid.py reads the list of paths from stdin; this is useful for combining langid.py with UNIX utilities such as find.

In batch mode, langid.py uses multiprocessing to invoke multiple instances of the classifier, utilizing all available CPUs to classify documents in parallel.
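As a rough sketch of driving batch mode from a script, the following Python 3 code gathers file paths (much as find would) and passes them to langid.py -b on stdin, one per line. The helper names collect_paths and run_batch, the .txt suffix, and the command tuple are assumptions for illustration. The output format of batch mode is not described here, so the raw stdout is returned unparsed.

```python
import os
import subprocess


def collect_paths(root, suffix=".txt"):
    """Walk `root` and return the sorted paths of files ending in `suffix`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


def run_batch(paths, command=("python", "langid.py", "-b")):
    """Feed one path per line to batch mode on stdin and return its raw stdout."""
    listing = "\n".join(paths) + "\n"
    result = subprocess.run(
        list(command), input=listing, capture_output=True, text=True, check=True
    )
    return result.stdout
```

A call such as run_batch(collect_paths("corpus")) assumes langid.py is in the current directory; adjust the command tuple if langid was installed as a module.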

Probability Normalization
-------------------------

The probabilistic model implemented by langid.py involves the multiplication of a large number of probabilities. For computational reasons, the actual calculations are implemented in the log-probability space (a common numerical technique for dealing with vanishingly small probabilities). One side-effect of this is that it is not necessary to compute a full probability in order to determine the most probable language in a set of candidate languages. However, users sometimes find it helpful to have a "confidence" score for the probability prediction. Thus, langid.py implements a re-normalization that produces an output in the 0-1 range.
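To make the idea concrete, here is a minimal sketch of one common way to turn unnormalized log-scores into values in the 0-1 range that sum to 1. It uses the standard shift-by-maximum trick and made-up scores; normalize_log_scores is a hypothetical helper, and langid.py's own re-normalization may be implemented differently.

```python
import math


def normalize_log_scores(log_scores):
    """Map unnormalized log-probabilities to probabilities that sum to 1."""
    peak = max(log_scores)
    # Subtracting the largest score keeps every exponent <= 0, so the best
    # candidate maps to exp(0) = 1 and the sum can never underflow to zero.
    shifted = [math.exp(s - peak) for s in log_scores]
    total = sum(shifted)
    return [v / total for v in shifted]


# Made-up log-scores for three candidate languages.
scores = {"en": -54.4, "it": -60.0, "fr": -75.2}
probs = dict(zip(scores, normalize_log_scores(list(scores.values()))))
best = max(probs, key=probs.get)
```

The ranking of languages is unchanged by this step; only the scale of the scores differs, which is why normalization is not needed just to pick the most probable language.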

langid.py disables probability normalization by default. For command-line usage of langid.py, it can be enabled by passing the -n flag. For probability normalization in library use, the user must instantiate their own LanguageIdentifier. An example of such usage is as follows::

    >>> from langid.langid import LanguageIdentifier, model
    >>> identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    >>> identifier.classify("This is a test")
    ('en', 0.9999999909903544)

Training a model
----------------

We provide a full set of training tools to train a model for langid.py on user-supplied data. The system is parallelized to fully utilize modern multiprocessor machines, using a sharding technique similar to MapReduce to allow parallelization while running in constant memory.

The full training can be performed using the tool train.py. For research purposes, the process has been broken down into individual steps, and command-line drivers for each step are provided. This allows the user to inspect the intermediates produced, and also allows for some parameter tuning without repeating some of the more expensive steps in the computation. By far the most expensive step is the computation of information gain, which will make up more than 90% of the total computation time.
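For readers unfamiliar with information gain, the sketch below computes the textbook definition for a single binary feature: the entropy of the labels minus the entropy that remains once the feature's presence or absence is known. This is a toy illustration in Python 3 on a four-document corpus, with hypothetical function names. It is not the implementation used by the training tools, which may differ in its details.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(has_feature, labels):
    """IG of a binary feature: H(labels) - H(labels | feature present or absent)."""
    present = [lab for flag, lab in zip(has_feature, labels) if flag]
    absent = [lab for flag, lab in zip(has_feature, labels) if not flag]
    total = len(labels)
    conditional = sum(
        (len(part) / total) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


# Toy corpus: four documents, two languages.
labels = ["en", "en", "de", "de"]
perfect = information_gain([True, True, False, False], labels)  # feature only in "en"
useless = information_gain([True, False, True, False], labels)  # feature evenly spread
```

A feature that separates the classes completely gains the full label entropy, while one spread evenly across classes gains nothing. The same function can be applied to domain labels in place of language labels.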

The tools are:

- index.py - index a corpus. Produce a list of file, corpus, language pairs.
- tokenize.py - take an index and tokenize the corresponding files
- DFfeatureselect.py - choose features by document frequency
- IGweight.py - compute the IG weights for language and for domain
- LDfeatureselect.py - take the IG weights and use them to select a feature set
- scanner.py - build a scanner on the basis of a feature set
- NBtrain.py - learn NB parameters using an indexed corpus and a scanner

The tools can be found in the langid/train subfolder.

Each tool can be called with --help as the only parameter to provide an overview of the functionality.
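As a small convenience, the sketch below builds one --help invocation per tool, in the order the tools are listed above. The function name help_commands and the python2 interpreter name are assumptions (the training tools are described as Python2-only). No other command-line arguments are shown because they are not documented in this section.

```python
import os

# Tool names in the order they are listed above.
TRAINING_TOOLS = [
    "index.py",
    "tokenize.py",
    "DFfeatureselect.py",
    "IGweight.py",
    "LDfeatureselect.py",
    "scanner.py",
    "NBtrain.py",
]


def help_commands(train_dir=os.path.join("langid", "train"), interpreter="python2"):
    """Build one `--help` invocation (as an argv list) per training tool."""
    return [
        [interpreter, os.path.join(train_dir, tool), "--help"]
        for tool in TRAINING_TOOLS
    ]
```

Each returned list can be passed to subprocess.call to print that tool's overview.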

To train a model, we require multiple corpora of monolingual documents. Each document should be a sing
