
===========================
HornMorpho, version 5.3.4
===========================

Sept. 14, 2025


Introduction

HornMorpho (HM) is a Python program that performs morphological analysis and generation for various languages of the Horn of Africa. The languages supported in Version 5.3 are Amharic (አማርኛ), Oromo (Afaan Oromoo, Oromiffa), Tigrinya (Tigrigna, ትግርኛ), and Tigre (ትግሬ, ትግራይት). Most examples within this document are Amharic; future versions will include more examples from the other languages.

If your application can benefit from explicit linguistic information about the structure and grammatical properties of words in these languages, then you may want to use HM. HM can tell you, for example, that the verb የማይደረገው is negative; that the noun አባቴን is the object of some verb; that the stem (the word without prefixes and suffixes) of the verb የምንፈልጋቸው is -ፈልግ-; and that the lemma (basic form) of the verb እንደሚመኟቸው is ተመኘ, which indicates that this verb has something to do with ‘longing’. HM can also tell you that the word እንደሚመኟቸው consists of five segments (morphemes): እንደም+ይ+መኝ+ኡ+ኣቸው.

HM is a rule-based program; that is, the knowledge in the program is based on explicit linguistic rules and a lexicon, a dictionary of basic word forms (stems and roots), rather than on machine learning of the knowledge from a corpus.

  • For Amharic, the lexicon is extracted mainly from Amsalu Aklilu’s Amharic-English Dictionary (Addis Ababa, Kuraz, 2004). The rules come from many grammars of the language.
  • For Tigrinya, the lexicon is from Thomas Leiper Kane’s Tigrinya-English Dictionary (Kensington, MD, USA, Dunwoody Press, 2000). The rules come mainly from Wolf Leslau’s Documents Tigrigna, Grammaire et Textes (Paris, Librairie C. Klincksieck, 1941) and Amanuel Sahle’s ሰዋስው ትግርኛ ብሰፊሕ (Lawrenceville, NJ, Red Sea Press, 1998).
  • For Oromo, the lexicon is from two dictionaries, Gene B. Gragg’s Oromo Dictionary (African Studies Center, Michigan State University, 1982) and Tamene Bitima’s A Dictionary of Oromo Technical Terms (Oromo-English) (Rüdiger Köppe, Köln, 2000). The rules come mainly from Catherine Griefenow-Mewis’s A Grammatical Sketch of Written Oromo (Köln, Rüdiger Köppe Verlag, 2001).
  • For Tigre, all of the words and rules are from the Mansaʿ dialect of the language. The lexicon is still quite limited, containing only several hundred noun and adjective roots and 86 verb roots. The roots are taken from Saleh Mahmud Idris’s A Comparative Study of the Tigre Dialects (Aachen, Shaker [Semitica et Semitohamitica Berolinensia 18], 2015) and from Shlomo Raz’s Tigre Grammar and Texts (Malibu, CA, USA, Undena Publications, 1983). The rules come from Raz.

Though HM does not make use of machine learning, it is possible to use its output in models that do. For example, `Gezmu & Nürnberger (2023) <https://dl.acm.org/doi/10.1145/3610773>`__ use HM’s segmentation of Amharic words for neural machine translation.

HM assigns a part-of-speech (POS) to each word, but if you want a POS tagger, you should look elsewhere. A word’s POS often depends on the other words in the sentence in which it occurs, and HM analyzes words without looking at their context.

HM has a list of Amharic person and place names, but if you want named entity recognition, you should look for a program that has been trained to do this. If a name is not in HM’s list for Amharic, it will just be treated as an unknown word, and this will be true for almost all names in Tigrinya, Oromo, and Tigre.

Version 5 replaces Version 4.5 for Amharic. For other languages, see Version 4.3. Version 5 is not backward compatible with earlier versions. If you have used earlier versions of HM and would like to switch to Version 5, please contact gasser@iu.edu for help.


Installation

It is highly recommended that you install the program in a `virtual environment <https://realpython.com/python-virtual-environments-a-primer/>`__, but this is not required. If you are using a virtual environment, you will need to create the environment and activate it before running pip install.

First download the wheel file from the dist/ folder: `HornMorpho-5.3.4-py3-none-any.whl <https://github.com/hltdi/HornMorpho/blob/master/dist/HornMorpho-5.3.4-py3-none-any.whl>`__

Then, to install from the wheel file, run the following in a terminal (a command shell, not a Python shell) from the folder containing the wheel file

::

    pip install HornMorpho-5.3.4-py3-none-any.whl

If this fails, it may mean that you don’t have `wheel <https://pypi.org/project/wheel/>`__ installed, so try again after installing wheel.

Then, to use the program, do the following in a Python shell

::

    import hm

The first time you use HornMorpho, you will need to download the data for the languages that you will be using. Each language’s data is stored in a compressed .tgz archive. To download a language’s archive, do this

::

    hm.download(language)

where language is 'a' for Amharic, 't' for Tigrinya, 'o' for Oromo, or 'te' for Tigre. This will download the compressed file from the HornMorpho GitHub repository and then uncompress it. If you try to use any of the functions described below without first downloading the data for the relevant language, you will be prompted to download the data.
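If a script refers to languages by name rather than by code, the four codes listed above can be kept in a small lookup table. The following is only a sketch: ``LANGUAGE_CODES`` and ``language_code()`` are hypothetical helpers that are not part of HM; only the four codes and ``hm.download()`` come from this document.

```python
# Hypothetical helper (not part of HornMorpho): map language names to the
# codes that hm.download() and the analysis functions expect.
LANGUAGE_CODES = {
    "amharic": "a",
    "tigrinya": "t",
    "oromo": "o",
    "tigre": "te",
}


def language_code(name):
    """Return the HM code for a language name, ignoring case and outer whitespace."""
    key = name.strip().lower()
    if key not in LANGUAGE_CODES:
        known = ", ".join(sorted(LANGUAGE_CODES))
        raise ValueError(f"unsupported language {name!r}; expected one of: {known}")
    return LANGUAGE_CODES[key]


# With HornMorpho installed, the data for two languages could be fetched like this:
#     import hm
#     for name in ("Amharic", "Tigre"):
#         hm.download(language_code(name))
```

Raising an error for an unknown name keeps a typo from reaching the download step.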

If you have problems with installation, contact gasser@iu.edu.


Quickstart

If you aren’t interested in learning more about what HM can do and just want to use it to analyze the words in a corpus of sentences, this section has the minimum that you’ll need to know.

To analyze the words in a corpus, use the function anal_corpus(), passing the sentences as a list of strings, using the keyword data, or as a path to a file containing the sentences, using the keyword path.

::

    (1)

    c = hm.anal_corpus('a', data=["በሶ የበላው አበበ አይደለም ።", "ጫላ ጩቤዬን ጨብጧል ።"])

This returns an instance of the class Corpus, which has a write() method that you can call to write the analyses to a file, using the keyword path, or to standard output if you specify no path. You can tell which word attributes you want to write with the keyword attribs. Some possible attributes are part-of-speech ('pos'), morphological features ('um'), segmentation into morphemes ('seg'), and lemma ('lemma').

::

    (2)

    c.write(attribs=['pos', 'um', 'lemma'])
    በሶ የበላው አበበ አይደለም ።
    በሶ N SG በሶ
    የበላው V *RELC;3;DEF;MASC;PFV;SG በላ
    አበበ V 3;MASC;PFV;SG አበበ
    አይደለም COP 3;MASC;NEG;PRS;SG ነው
    ። PUNCT

    ጫላ ጩቤዬን ጨብጧል ።
    ጫላ PROPN SG ጫላ
    ጩቤዬን N ACC;PSS1S;SG ጩቤ
    ጨብጧል V 3;MASC;PRF;SG ጨበጠ
    ። PUNCT
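If the analyses have been written to a file, they may need to be read back into Python structures for further processing. The sketch below rests on assumptions that this document does not confirm: that each sentence is written on its own line, followed by one line per word; that columns are separated by whitespace and appear in the order given to attribs; and that sentences are separated by blank lines. ``parse_written()`` is a hypothetical helper, not an HM function, and it would misalign columns for a word that lacks a middle attribute.

```python
import re


def parse_written(text, attribs):
    """Parse text laid out as a sentence line followed by one line per word.

    Returns a list of (sentence, records) pairs, where each record is a dict
    holding the word and whichever attributes are present on its line.
    """
    sentences = []
    # Sentences are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        sentence, word_lines = lines[0], lines[1:]
        records = []
        for line in word_lines:
            fields = line.split()
            record = {"word": fields[0]}
            # Punctuation lines carry fewer columns; zip() stops at the shorter one.
            record.update(zip(attribs, fields[1:]))
            records.append(record)
        sentences.append((sentence, records))
    return sentences


# Sample text copied from the second sentence of example (2).
sample = """\
ጫላ ጩቤዬን ጨብጧል ።
ጫላ PROPN SG ጫላ
ጩቤዬን N ACC;PSS1S;SG ጩቤ
ጨብጧል V 3;MASC;PRF;SG ጨበጠ
። PUNCT
"""

parsed = parse_written(sample, ["pos", "um", "lemma"])
sentence, words = parsed[0]
print(words[1]["lemma"])           # lemma of the second word
print(words[1]["um"].split(";"))   # morphological features as a list
```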


Overview of the program

HM is a rule-based morphological analyzer and generator, implemented in the form of finite-state transducers weighted with feature structures. For the theory behind the program, see `Gasser (2011) <https://www.researchgate.net/publication/228910448_HornMorpho_a_system_for_morphological_processing_of_Amharic_Oromo_and_Tigrinya>`__.

Most users of HM will be interested in morphological analysis. The program also works in the opposite direction, performing morphological generation, taking as input the root and grammatical features of a word and returning the word form. Documentation of the generation functions is forthcoming.

The simplest HM function, anal, takes a word and returns an instance of the Word class. An HM Word is a list of Python dicts, each representing a separate analysis of the input word. [1]_ You can use the usual Python ways of accessing the elements in a list or dict. For example, here is how you would analyze the Amharic word የቤታችን. The first argument to anal specifies the language; 'a' is Amharic, 't' Tigrinya, 'o' Oromo, 'te' Tigre.

::

    (3)

    w = hm.anal('a', "የቤታችን")

The keys in the dict for an analysis of a word represent different pieces of information that you may be interested in. For example, you may want the lemma of the input word. This is the basic form of the word. For nouns in all of the languages, this is the stem of the word without any prefixes or suffixes. Here’s how you’d get the lemma for the above analysis of the word የቤታችን. w[0] returns the first analysis dict in the list of analyses, and w[0]['lemma'] returns the value associated with the key 'lemma' in this dict. [2]_

::

    (4)

    >>> w[0]['lemma']
    'ቤት'
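Because a word can have more than one analysis, ``w[0]`` gives only the first of them. A minimal sketch of collecting the lemmas from all analyses follows. ``unique_lemmas()`` is a hypothetical helper rather than an HM function, and the list ``w`` here is a hand-written stand-in for the result of ``hm.anal()``, containing only the 'lemma' key shown in example (4).

```python
def unique_lemmas(analyses):
    """Return the distinct lemmas across all analyses, in first-seen order."""
    lemmas = []
    for analysis in analyses:
        lemma = analysis.get("lemma")
        # Skip analyses without a lemma and lemmas already collected.
        if lemma is not None and lemma not in lemmas:
            lemmas.append(lemma)
    return lemmas


# Stand-in for hm.anal('a', "የቤታችን"); a real Word holds more keys per analysis.
w = [{"lemma": "ቤት"}]
print(unique_lemmas(w))
```

Since a Word is described as a list of dicts, the same function should accept the object returned by ``hm.anal()`` directly, although that has not been checked against the library here.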

Other dict keys are described `below <#keywords>`__.

You will probably not want to use HM to analyze individual words, as in the above example. There are also functions for analyzing sentences and corpora of sentences, `anal_sentence() <#anal_sentence>`__ and `anal_corpus() <#anal_corpus>`__, described below. These functions call anal() on the words in the sentences.

Morphological segmentation

Morphemes
^^^^^^^^^

A morphologically complex word consists of multiple morphemes, that is, more than one meaningful unit. One morpheme, the stem, is the part that conveys the basic meaning (the lexical meaning) of the word. The other morphemes, those that appear before the stem (as prefixes), after the stem (as suffixes) or within the stem (as infixes), modify the lexical meaning in various ways. For example, the Amharic word ለቤቶቻችን ‘for our houses’ consists of the stem ቤት and three additional morphemes, the prefix ለ- and the suffixes -ኦች and -ኣችን. [3]_
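The hyphen notation in this example (a trailing hyphen for a prefix, a leading hyphen for a suffix, no hyphen for the stem) can be turned into a simple data structure. This is only an illustrative sketch: ``Segmentation`` and ``from_hyphen_notation()`` are hypothetical names and do not reflect how HM represents morphemes internally. Because the morphemes are listed in their underlying forms, joining them gives a '+'-separated segmentation like the one in the Introduction, not the surface spelling of the word.

```python
from dataclasses import dataclass, field


@dataclass
class Segmentation:
    """A word split into prefixes, a stem, and suffixes (hypothetical structure)."""
    prefixes: list = field(default_factory=list)
    stem: str = ""
    suffixes: list = field(default_factory=list)

    def joined(self, sep="+"):
        """Join all morphemes in order, using sep as the boundary marker."""
        return sep.join([*self.prefixes, self.stem, *self.suffixes])


def from_hyphen_notation(morphemes):
    """Build a Segmentation from morphemes written as 'x-', 'stem', or '-x'."""
    seg = Segmentation()
    for m in morphemes:
        if m.endswith("-") and not m.startswith("-"):
            seg.prefixes.append(m[:-1])   # 'x-' is a prefix
        elif m.startswith("-") and not m.endswith("-"):
            seg.suffixes.append(m[1:])    # '-x' is a suffix
        else:
            seg.stem = m.strip("-")       # bare form, or '-x-', is the stem
    return seg


# The morphemes of ለቤቶቻችን as given in the paragraph above.
seg = from_hyphen_notation(["ለ-", "ቤት", "-ኦች", "-ኣችን"])
print(seg.joined())
```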

Segmentation
^^^^^^^^^^^^

A morphological segmentation of a word consists of a representation of the sequence of morphemes that make up the word. Morphological segmentation may be useful in NLP applications that make use of subword units, for example, language models. In these cases it provides
