Epitran
A tool for transcribing orthographic text as IPA (International Phonetic Alphabet)
Usage
The Python modules epitran and epitran.vector can be used to easily write more sophisticated Python programs for deploying the Epitran mapping tables, preprocessors, and postprocessors. This is documented below.
If you wish to use Epitran to convert English to IPA, you must install Flite (including lex_lookup) as detailed below.
Using the epitran Module
The Epitran class
The most general functionality in the epitran module is encapsulated in the very simple Epitran class:
Epitran(code, preproc=True, postproc=True, ligatures=False, cedict_file=None, tones=False)
Its constructor takes one positional argument, code: the ISO 639-3 code of the language to be transliterated, plus a hyphen, plus a four-letter code for the script (e.g. 'Latn' for Latin script, 'Cyrl' for Cyrillic script, and 'Arab' for a Perso-Arabic script). It also takes optional keyword arguments:
- preproc and postproc enable pre- and post-processors. These are enabled by default.
- ligatures enables non-standard IPA ligatures like "ʤ" and "ʨ".
- cedict_file gives the path to the CC-CEDict dictionary file (relevant only when working with Mandarin Chinese; because of licensing restrictions, it cannot be distributed with Epitran).
- tones allows IPA tones (˩˨˧˦˥) to be included and is needed for tonal languages like Vietnamese and Hokkien. By default, this value is False and IPA tones are removed from the transcription.
- For more options, type help(epitran.Epitran.__init__) in a Python session.
>>> import epitran
>>> epi = epitran.Epitran('uig-Arab') # Uyghur in Perso-Arabic script
It is now possible to use the Epitran class for English, Mandarin Chinese (Simplified and Traditional) and Cantonese (Traditional) G2P as well as the other languages that use Epitran's "classic" model. For Chinese and Cantonese, it is necessary to point the constructor to a copy of the CC-CEDict dictionary or CC-Canto. E.g.:
>>> import epitran
>>> epi = epitran.Epitran('cmn-Hans', cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')
The most useful public method of the Epitran class is transliterate:
Epitran.transliterate(text, normpunc=False, ligatures=False). Convert text (in Unicode-encoded orthography of the language specified in the constructor) to IPA, which is returned. normpunc enables punctuation normalization and ligatures enables non-standard IPA ligatures like "ʤ" and "ʨ". Usage is illustrated below:
>>> epi.transliterate('Düğün')
'dy\u0270yn'
>>> print(epi.transliterate('Düğün'))
dyɰyn
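Internally, classic Epitran modes apply a grapheme-to-IPA mapping table to the input, matching the longest grapheme at each position. The sketch below illustrates that general idea with a toy, hand-written table; it is not Epitran's actual Turkish mapping or implementation.

```python
# Toy longest-match grapheme-to-IPA mapping (hypothetical table; the real
# Epitran mapping files are CSV tables shipped with the package).
TOY_MAP = {"D": "d", "d": "d", "ü": "y", "ğ": "ɰ", "n": "n"}

def toy_transliterate(text, table):
    """Greedily consume the longest matching grapheme at each position."""
    longest = max(len(key) for key in table)
    out, i = [], 0
    while i < len(text):
        for size in range(longest, 0, -1):  # try longest graphemes first
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:  # unmapped character: pass it through unchanged
            out.append(text[i])
            i += 1
    return "".join(out)

print(toy_transliterate("Düğün", TOY_MAP))  # dyɰyn
```

This reproduces the transliteration shown above for the toy table, but real mapping tables also interact with the pre- and post-processors described later.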
Epitran.word_to_tuples(word, normpunc=False):
Takes a word (a Unicode string) in a supported orthography as input and returns a list of tuples with each tuple corresponding to an IPA segment of the word. The tuples have the following structure:
(
character_category :: String,
is_upper :: Integer,
orthographic_form :: Unicode String,
phonetic_form :: Unicode String,
segments :: List<Tuples>
)
Note that word_to_tuples is not implemented for all language-script pairs.
The codes for character_category are drawn from the initial characters of the two-character sequences listed under "General Category" in Chapter 4 of the Unicode Standard. For example, "L" corresponds to letters and "P" corresponds to punctuation. The above data structure is likely to change in subsequent versions of the library. The structure of segments is as follows:
(
segment :: Unicode String,
vector :: List<Integer>
)
Here is an example of an interaction with word_to_tuples:
>>> import epitran
>>> epi = epitran.Epitran('tur-Latn')
>>> epi.word_to_tuples('Düğün')
[('L', 1, 'D', 'd', [('d', [-1, -1, 1, -1, -1, -1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])]), ('L', 0, 'ü', 'y', [('y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), ('L', 0, 'ğ', 'ɰ', [('ɰ', [-1, 1, -1, 1, 0, -1, -1, 0, 1, -1, -1, 0, -1, 0, -1, 1, -1, 0, -1, 1, -1])]), ('L', 0, 'ü', 'y', [('y', [1, 1, -1, 1, -1, -1, -1, 0, 1, -1, -1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1])]), ('L', 0, 'n', 'n', [('n', [-1, 1, 1, -1, -1, -1, 1, -1, 1, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, 0, -1])])]
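The orthographic and phonetic forms of the word can be recovered by joining the corresponding tuple fields. The snippet below works on the word_to_tuples output shown above, with the feature vectors elided for brevity:

```python
# word_to_tuples output for 'Düğün' (feature vectors elided for brevity).
tuples = [
    ("L", 1, "D", "d", [("d", [])]),
    ("L", 0, "ü", "y", [("y", [])]),
    ("L", 0, "ğ", "ɰ", [("ɰ", [])]),
    ("L", 0, "ü", "y", [("y", [])]),
    ("L", 0, "n", "n", [("n", [])]),
]

orth = "".join(t[2] for t in tuples)  # orthographic_form fields
ipa = "".join(t[3] for t in tuples)   # phonetic_form fields

print(orth)  # Düğün
print(ipa)   # dyɰyn
```

Note that for languages with preprocessors, the joined orthographic forms may differ from the original input word, as discussed in the section on preprocessors below.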
The Backoff class
Sometimes, when parsing text in more than one script, it is useful to employ a graceful backoff. If one language mode does not work, it can be useful to fall back to another, and so on. This functionality is provided by the Backoff class:
Backoff(lang_script_codes, cedict_file=None)
Note that the Backoff class does not currently support parameterized preprocessor and postprocessor application and does not support non-standard ligatures. It also does not support punctuation normalization. lang_script_codes is a list of codes like eng-Latn or hin-Deva. For example, if one was transcribing a Hindi text with many English loanwords and some stray characters of Simplified Chinese, one might use the following code:
>>> from epitran.backoff import Backoff
>>> backoff = Backoff(['hin-Deva', 'eng-Latn', 'cmn-Hans'], cedict_file='cedict_1_0_ts_utf-8_mdbg.txt')
>>> backoff.transliterate('हिन्दी')
'ɦindiː'
>>> backoff.transliterate('English')
'ɪŋɡlɪʃ'
>>> backoff.transliterate('中文')
'ʈ͡ʂoŋwən'
Backoff works on a token-by-token basis: tokens that contain mixed scripts will be returned as the empty string, since they cannot be fully converted by any of the modes.
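The token-by-token fallback logic can be sketched in plain Python. The "modes" below are toy stand-ins for real Epitran language modes, and the script test (every character must belong to the mode's script) is an illustrative assumption, not Epitran's actual mechanism:

```python
import unicodedata

# Hypothetical sketch of Backoff's per-token fallback. Each toy "mode"
# converts a token only if every character belongs to its script.
def make_mode(script_name, table):
    def mode(token):
        if token and all(script_name in unicodedata.name(ch, "") for ch in token):
            return table.get(token, token)
        return None  # wrong script: signal failure so the next mode is tried
    return mode

modes = [
    make_mode("DEVANAGARI", {"हिन्दी": "ɦindiː"}),
    make_mode("LATIN", {"English": "ɪŋɡlɪʃ"}),
]

def backoff_transliterate(token):
    for mode in modes:
        result = mode(token)
        if result is not None:
            return result
    return ""  # no mode can fully convert a mixed-script token

print(backoff_transliterate("हिन्दी"))     # ɦindiː
print(backoff_transliterate("English"))    # ɪŋɡlɪʃ
print(backoff_transliterate("Englishही"))  # '' (mixed scripts)
```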
The Backoff class has the following public methods:
- transliterate: returns a unicode string of IPA phonemes
- trans_list: returns a list of IPA unicode strings, each of which is a phoneme
- xsampa_list: returns a list of X-SAMPA (ASCII) strings, each of which is a phoneme
Consider the following example:
>>> backoff.transliterate('हिन्दी')
'ɦindiː'
>>> backoff.trans_list('हिन्दी')
['ɦ', 'i', 'n', 'd', 'iː']
>>> backoff.xsampa_list('हिन्दी')
['h\\', 'i', 'n', 'd', 'i:']
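The relationship between trans_list and xsampa_list is a segment-by-segment conversion from IPA to its ASCII X-SAMPA equivalent. A toy mapping covering just the segments in the example above (Epitran's real tables are far more complete):

```python
# Toy IPA-to-X-SAMPA table for the segments of 'हिन्दी' only; this is an
# illustrative subset, not Epitran's full conversion table.
IPA_TO_XSAMPA = {"ɦ": "h\\", "i": "i", "n": "n", "d": "d", "iː": "i:"}

segments = ["ɦ", "i", "n", "d", "iː"]  # output of trans_list above
xsampa = [IPA_TO_XSAMPA[seg] for seg in segments]
print(xsampa)  # ['h\\', 'i', 'n', 'd', 'i:']
```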
DictFirst
The DictFirst class provides a simple alternative to the Backoff class. It
requires a dictionary of words known to be of Language A, one word per line in a
UTF-8 encoded text file. It accepts three arguments: the language-script code
for Language A, that for Language B, and a path to the dictionary file. It has one public method, transliterate, which works like Epitran.transliterate except that it returns the Language A transliteration if the input token is in the dictionary; otherwise, it returns the Language B transliteration of the token:
>>> import dictfirst
>>> df = dictfirst.DictFirst('tpi-Latn', 'eng-Latn', '../sample-dict.txt')
>>> df.transliterate('pela')
'pela'
>>> df.transliterate('pelo')
'pɛlow'
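The DictFirst strategy amounts to a dictionary membership test followed by a choice of transliterator. The sketch below uses toy stand-in transliterators and an inline word set in place of the dictionary file:

```python
# Hypothetical sketch of the DictFirst strategy. In real use the word set
# is loaded from a UTF-8 dictionary file, one word per line.
tpi_words = {"pela"}

def trans_a(token):  # toy stand-in for the tpi-Latn mode
    return {"pela": "pela"}.get(token, token)

def trans_b(token):  # toy stand-in for the eng-Latn mode
    return {"pelo": "pɛlow"}.get(token, token)

def dict_first(token):
    # In-dictionary tokens get the Language A transliteration;
    # everything else falls through to Language B.
    return trans_a(token) if token in tpi_words else trans_b(token)

print(dict_first("pela"))  # pela
print(dict_first("pelo"))  # pɛlow
```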
Preprocessors, postprocessors, and their pitfalls
In order to build a maintainable orthography-to-phoneme mapper, it is sometimes necessary to employ preprocessors that make contextual substitutions of symbols before text is passed to an orthography-to-IPA mapping system that preserves relationships between input and output characters. This is particularly true of languages with a poor sound-symbol correspondence (like French and English). Languages like French are particularly good targets for this approach because the pronunciation of a given string of letters is highly predictable even though the individual symbols often do not map neatly onto sounds. (Sound-symbol correspondence is so poor in English that effective English G2P systems rely heavily on pronouncing dictionaries.)
Preprocessing the input words to allow for straightforward grapheme-to-phoneme mappings (as is done in the current version of epitran for some languages) is advantageous because the restricted regular expression language used to write the preprocessing rules is more powerful than the language for the mapping rules and allows the equivalent of many mapping rules to be written with a single rule. Without them, providing epitran support for languages like French and German would not be practical. However, they do present some problems. Specifically, when using a language with a preprocessor, one must be aware that the input word will not always be identical to the concatenation of the orthographic strings (orthographic_form) output by Epitran.word_to_tuples. Instead, the output of word_to_tuples will reflect the output of the preprocessor, which may delete, insert, and change letters in order to allow direct orthography-to-phoneme mapping at the next step. The same is true of other methods that rely on Epitran.word_to_tuples, such as VectorsWithIPASpace.word_to_segs from the epitran.vector module.
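The division of labor between contextual preprocessing rules and one-to-one mapping can be illustrated with a toy French-like example. The rules and table below are simplified inventions for illustration, not Epitran's actual French preprocessor:

```python
import re

# Hypothetical illustration: one contextual regex rule replaces what would
# otherwise require many one-to-one mapping rules. These French-like rules
# are toy examples, not Epitran's real preprocessor files.
PRE_RULES = [
    (re.compile(r"eau"), "o"),  # trigraph 'eau' rewritten to one vowel
    (re.compile(r"ch"), "ʃ"),   # digraph collapsed before mapping
]
CHAR_MAP = {"ʃ": "ʃ", "a": "a", "t": "t", "o": "o"}

def preprocess(word):
    for pattern, repl in PRE_RULES:
        word = pattern.sub(repl, word)
    return word

def to_ipa(word):
    word = preprocess(word)  # contextual rewriting first...
    return "".join(CHAR_MAP.get(ch, ch) for ch in word)  # ...then 1:1 mapping

print(to_ipa("chateau"))  # ʃato
```

This also demonstrates the pitfall described above: the preprocessed string ("ʃato") that feeds the mapping step no longer matches the original orthography ("chateau"), so concatenated orthographic_form values can diverge from the input word.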
For information on writing new pre- and post-processors, see the section on "Extending Epitran with map files, preprocessors and postprocessors", below.
Using the epitran.vector Module
The epitran.vector module is also very simple. It contains one class, VectorsWithIPASpace, including one method of interest, word_to_segs:
The constructor for VectorsWithIPASpace takes two arguments:
- code: the language-script code for the language to be processed.
- spaces: the codes for the punctuation/symbol/IPA space in which the characters/segments from the data are expected to reside. The available spaces are listed below.
Its principal method is word_to_segs:
VectorWithIPASpace.word_to_segs(word, normpunc=False). word is a Unicode string. If the keyword argument normpunc is set to True, punctuation discovered in word is normalized to ASCII equivalents.
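What normpunc-style normalization might look like can be sketched as a simple character substitution pass; the table below is an illustrative assumption, not Epitran's actual normalization mapping:

```python
# Hypothetical sketch of punctuation normalization (normpunc): fold
# typographic punctuation to ASCII equivalents. Illustrative table only.
PUNC_NORM = {
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
    "\u2013": "-", "\u2014": "-",  # en and em dashes
    "\u2026": "...",               # ellipsis
}

def normalize_punc(text):
    return "".join(PUNC_NORM.get(ch, ch) for ch in text)

print(normalize_punc("\u201cnaive\u201d \u2013 test"))  # "naive" - test
```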
A typical interaction with the VectorsWithIPASpace class takes place through its word_to_segs method.