postagga

"But if thought corrupts language, language can also corrupt thought."

George Orwell, Politics and the English Language

postagga is a suite of tools to assist you in generating efficient and self-contained natural language processors. You can use postagga to process annotated text samples into full-fledged parsers capable of understanding "free speech" input as structured data. Ah, and you'll be able to do this easily. You're welcome.

Getting postagga

You can refer to postagga as a lib in your Clojure project. Grab it from Clojars by adding its dependency vector to the dependencies in your project.clj.

You can also clone the project and walk around the source and models:

git clone https://github.com/turbopape/postagga.git

The models are included under the models folder.

In JVM Clojure, provided you have cloned the repository:

;; ...
 (def fr-model (load-edn "models/fr_tb_v_model.edn")) ;; for French for instance
;; ...

We also shipped two light models as vars defined in namespaces, one for French and one for English, because artifact size is a concern for JavaScript. You can use these models by requiring the two namespaces:

(ns your-cool.bot
  (:require [postagga.en-fn-v-model :refer [en-model]]   ;; for English
            [postagga.fr-tb-v-model :refer [fr-model]])) ;; for French
;; ...

These namespaces make it easy for you to ship parsers for ClojureScript.

You can see an example of how to work with such a model, while keeping your code compatible across Clojure AND ClojureScript (thanks to reader conditionals), in the Test File.

How does it work?

To do its magic, postagga extracts the phrase structure of your input and compares it against its semantic rules. When a rule matches, that rule also tells postagga where in the structure to pick up the meaningful information.

Let's study a simple example. Look at the next sentence:

"Rafik loves apples"

That is our "Natural language input."

The first step in understanding this sentence is to extract some structure from it so that it is easier to interpret. One common way to do this is to extract its grammatical phrase structure, which is close enough to what "function" words are actually meant to provide:

Noun Verb Noun

That was the phrase structure analysis, or as we call it, POS (Part Of Speech) tagging. These tags qualify parts of the sentence, as the name implies, and will be used as a high-fidelity mechanism to write rules for parsers of such phrases.

postagga has tools that enable you to train POS Taggers for any language you want, without relying on external libs. Actually, it does not care about the meaning of the tags at all. However, you should be consistent and clear enough when annotating your input data samples with tags. On one hand, your parser will be more reliable. On the other hand you'll do yourself a great favor maintaining your parser.

Now comes the parser part. Actually, postagga offers a parser that needs semantic rules to be able to map a particular phrase structure into data. In our example, we know that the first Noun depicts a subject carrying out some action. This action is represented by the Verb following it. Finally, the Noun coming after the Verb will undergo this action.

postagga parsers let you express such rules so they can extract the data for you. You literally tell the parser to take the first Noun and call it Subject, take the Verb and label it Action, take the last Noun and call it Object, and finally package it all into the following data structure:

{:Subject "Rafik" :Action "Loves" :Object "Apples"}

Naturally, postagga can handle much more complex sentences!
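To make the idea concrete, here is a minimal sketch in plain Clojure of what such a rule boils down to. It does not use postagga's parser or its rule syntax: svo-rule and apply-rule are hypothetical names made up for this illustration, and the tags are assumed to be already computed.

```clojure
;; Hypothetical rule, not postagga's rule format: a fixed phrase
;; structure paired with the labels we want in the output map.
(def svo-rule
  {:tags   ["Noun" "Verb" "Noun"]
   :labels [:Subject :Action :Object]})

(defn apply-rule
  "Returns a map of label -> token when `tags` matches the rule's
  phrase structure exactly, nil otherwise."
  [{rule-tags :tags labels :labels} tokens tags]
  (when (= rule-tags tags)
    (zipmap labels tokens)))

(apply-rule svo-rule
            ["rafik" "loves" "apples"]
            ["Noun" "Verb" "Noun"])
;;=> {:Subject "rafik", :Action "loves", :Object "apples"}
```

A sentence whose tags do not match the rule simply yields nil, which is the cue to try the next rule.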

postagga parsers are eventually compiled into self-contained packages, without a single third-party dependency. From there they can easily run on servers (Clojure version) and in the browser (ClojureScript). Now your bots can really get what you're trying to tell them!

The postagga workflow

Training a POS Tagger

First of all, you need to train a POS Tagger that can qualify parts of your natural text. postagga relies on Hidden Markov Models, decoded with the Viterbi algorithm. This algorithm makes use of a set of matrices describing, for example, which states (the POS tags) we have and how likely we are to transition from one POS tag to another.

All of these constitute a model, which is computed from what we call an annotated text corpus. The postagga.trainer namespace is used to create models out of such an annotated text corpus. To train a model, make sure you have an annotated corpus like so:

[ ;; A vector of sentences, each one a vector of [word tag] pairs:
 [["-" "PONCT"] ["guerre" "NC"] ["d'" "P"] ["indochine" "NPP"]]
 [["-" "PONCT"] ["colloque" "NC"] ["sur" "P"] ["les" "DET"] ["fraudes" "NC"]]
 [["-" "PONCT"] ["dernier" "ADJ"] ["résumé" "NC"] [":" "PONCT"] ["l'" "DET"] ["\"" "PONCT"] ["affaire" "NC"] ["des" "P+D"] ["piastres" "NC"] ["\"" "PONCT"]]
 [["catégories" "NC"] [":" "PONCT"] ["guerre" "NC"] ["d'" "P"] ["indochine" "NPP"] ["." "PONCT"]]
 [["indochine" "NPP"] ["française" "ADJ"] ["." "PONCT"]]
 [["quatrième" "ADJ"] ["république" "NC"] ["." "PONCT"]]
 ;; etc...
]

Say you have this corpus - that is: a vector of annotated sentences in a var unsurprisingly named corpus. To train a model, just issue:

(require '[postagga.trainer :refer [train]])

(def model (train corpus)) ;;<- Beware, these can be large vars so avoid realizing all of them like printing in your REPL!!!
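To get a feel for the kind of statistics a model is built from, the sketch below counts tag-to-tag transitions over a corpus in the format shown above. It is only an illustration: transition-counts is a hypothetical helper, not a postagga function, and it makes no claim about what train actually stores in the model it returns.

```clojure
;; Hypothetical helper for illustration only; use postagga.trainer/train
;; to build a real model.
(defn transition-counts
  "Counts how often each POS tag is directly followed by another one,
  over a corpus shaped as a vector of sentences of [word tag] pairs."
  [corpus]
  (->> corpus
       (mapcat (fn [sentence]
                 ;; sliding window of two consecutive tags
                 (map vec (partition 2 1 (map second sentence)))))
       frequencies))

(def tiny-corpus
  [[["-" "PONCT"] ["guerre" "NC"] ["d'" "P"] ["indochine" "NPP"]]
   [["catégories" "NC"] [":" "PONCT"] ["guerre" "NC"] ["d'" "P"]
    ["indochine" "NPP"] ["." "PONCT"]]])

(get (transition-counts tiny-corpus) ["NC" "P"])
;;=> 2
```

Dividing each count by the total number of transitions leaving the same tag would turn these counts into transition probabilities.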

We processed one annotated corpus for English:

postagga-fn-en.edn Generated from the Framenet Project

We also processed two annotated corpora for French:

postagga-sequoia-fr.edn Generated from the Sequoia Corpus from INRIA.
postagga-tb-fr.edn Generated from the Free French tree Bank.

We exposed two of these models as Clojure namespaces so you can embed them without using the resource functionality, as it is specific to Clojure (JVM). We chose the two lightest ones to limit the risk of network issues: postagga.en-fn-v-model for English and postagga.fr-tb-v-model for French.

The suite of tools used to process these corpora is in the corpuscule project. Please refer to the licensing of these corpora to see to what extent you can use work derived from them.

We then trained a model out of the above English corpus:

en_fn_v_model.edn

... and two models out of the two French corpora above, one of which is fr_tb_v_model.edn.

Now you can use that model to assign POS tags to speech:

(Note: sentences must be fed as a vector of lowercase tokens)

(require '[postagga.tagger :refer [viterbi]])

(viterbi model ["je" "suis" "heureux"])
;;=> ["CLS" "V" "ADJ"]
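For readers curious about what happens under the hood, here is a compact, self-contained sketch of Viterbi decoding over a toy Hidden Markov Model. It is not postagga's implementation and does not use its model format: viterbi-sketch, toy-model, the map layout and every probability below are made up for this example, and unseen words or transitions get an arbitrary small fallback probability.

```clojure
;; Illustrative sketch only; the model layout and all numbers are invented.
(def toy-model
  {:states ["CLS" "V" "ADJ"]
   :start  {"CLS" 0.8 "V" 0.1 "ADJ" 0.1}
   :trans  {"CLS" {"CLS" 0.05 "V" 0.9 "ADJ" 0.05}
            "V"   {"CLS" 0.2  "V" 0.1 "ADJ" 0.7}
            "ADJ" {"CLS" 0.3  "V" 0.3 "ADJ" 0.4}}
   :emit   {"CLS" {"je" 0.9}
            "V"   {"suis" 0.8}
            "ADJ" {"heureux" 0.9}}})

(defn viterbi-sketch
  "Returns the most likely tag sequence for `observations`."
  [{:keys [states start trans emit]} observations]
  (let [score (fn [m k] (get m k 1e-6)) ; fallback for unseen events
        ;; best path ending in each state after the first word
        init  (into {} (for [s states]
                         [s {:p    (* (score start s)
                                      (score (emit s) (first observations)))
                             :path [s]}]))
        ;; extend the best path into each state by one more word
        step  (fn [prev obs]
                (into {} (for [s states]
                           (let [best (apply max-key
                                             #(* (:p (prev %)) (score (trans %) s))
                                             states)
                                 {:keys [p path]} (prev best)]
                             [s {:p    (* p
                                          (score (trans best) s)
                                          (score (emit s) obs))
                                 :path (conj path s)}]))))
        final (reduce step init (rest observations))]
    (:path (apply max-key :p (vals final)))))

(viterbi-sketch toy-model ["je" "suis" "heureux"])
;;=> ["CLS" "V" "ADJ"]
```

Multiplying raw probabilities keeps the sketch short, but it underflows on long sentences; summing log-probabilities is the usual remedy.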

Patching Viterbi's output

When the tagger encounters a word it doesn't know about (that is, a word that was not in the corpus used to generate the Viterbi models), it assigns it a tag picked more or less arbitrarily by the algorithm. To improve detection, it is possible to patch the output: look each word up in a dictionary of terms of a known type and force the tags accordingly. For instance, if you have a dictionary of proper nouns for a given language, you can patch your HMM-generated POS tags by forcing every word that is an entry in this dictionary to have the "NPP" tag.
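The patching step itself can be pictured in a few lines of plain Clojure. This is a sketch under simplifying assumptions: the dictionary is an ordinary set, and patch-tags and proper-nouns are hypothetical names for this illustration, not part of postagga.

```clojure
;; Hypothetical sketch: a plain set stands in for the dictionary.
(def proper-nouns #{"rafik" "paris"})

(defn patch-tags
  "Forces `forced-tag` on every token found in `dictionary`,
  keeping the tagger's guess everywhere else."
  [dictionary forced-tag tokens tags]
  (mapv (fn [token tag]
          (if (contains? dictionary token) forced-tag tag))
        tokens
        tags))

(patch-tags proper-nouns "NPP"
            ["rafik" "aime" "paris"]
            ["ADJ" "V" "NC"])
;;=> ["NPP" "V" "NPP"]
```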

We provide two dictionaries for proper nouns:

fr_tr_names.cljc for French,
en_tr_names.cljc for English.

You will see how to integrate patching into the parsing phase later on.

Technically, dictionaries are tries, which speeds up lookups across many entries. This may change over time and should be considered a mere implementation detail.
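If tries are new to you, the sketch below shows the general idea with nested Clojure maps, one level per character. It is an assumption-laden illustration: trie-add, trie-contains? and the ::end marker are invented for this example, and the layout of postagga's own dictionary files may well differ.

```clojure
;; Illustrative only: a character trie made of nested maps.
;; The ::end key marks the end of a stored word.
(defn trie-add
  "Adds `word` to `trie`, sharing prefixes with existing entries."
  [trie word]
  (assoc-in trie (conj (vec word) ::end) true))

(defn trie-contains?
  "True only when `word` was stored as a complete entry."
  [trie word]
  (true? (get-in trie (conj (vec word) ::end))))

(def names (reduce trie-add {} ["rafik" "rafael" "paris"]))

(trie-contains? names "rafik") ;;=> true
(trie-contains? names "raf")   ;;=> false
```

Words that share a prefix, such as "rafik" and "rafael", share the same path for "raf", so a lookup walks at most one map per character.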

Meaning of tags

A reference to the meaning of tags is provided:

For English
