Epic
**Archived** Epic is a high-performance statistical parser written in Scala, along with a framework for building complex structured prediction models.
Archived
NLP, like all of AI, has changed a lot since I wrote this back in 2012-2014. I don't have the time to maintain this library, much less modernize it. Maybe one day...
Epic
(c) 2014 David Hall.
Epic is a structured prediction framework for Scala. It also includes classes for training high-accuracy syntactic parsers, part-of-speech taggers, named entity recognizers, and more.
Epic is distributed under the Apache License, Version 2.0.
The current version is 0.3.
Documentation
Documentation will (eventually) live at the GitHub wiki: https://github.com/dlwh/epic/wiki
See some example usages at https://github.com/dlwh/epic-demo.
Using Epic
Epic can be used programmatically or from the command line, using either pretrained models (see below) or with models you have trained yourself.
Currently, Epic has support for three kinds of models: parsers, sequence labelers, and segmenters. Parsers produce syntactic representations of sentences. Sequence labelers are things like part-of-speech taggers. These associate each word in a sentence with a label. For instance, a part-of-speech tagger can identify nouns, verbs, etc. Segmenters break a sentence into a sequence of fields. For instance, a named entity recognition system might identify all the people, places and things in a sentence.
Command-line Usage
Epic bundles command-line interfaces for using parsers, NER systems, and POS taggers (and, more generally, segmentation and tagging systems). There are three classes, one for each kind of system:

- `epic.parser.ParseText` runs a parser.
- `epic.sequences.SegmentText` runs an NER system, or any kind of segmentation system.
- `epic.sequences.TagText` runs a POS tagger, or any kind of tagging system.
All of these systems expect plain text files as input, along with a path to a model file. The syntax is:
```
java -Xmx4g -cp /path/to/epic-assembly-0.3-SNAPSHOT.jar epic.parser.ParseText --model /path/to/model.ser.gz --nthreads <number of threads> [files]
```
Currently, all text is output to standard out. In the future, we will support output in a way that differentiates the files. If no files are given, the system will read from standard input. By default, the system will use all available cores for execution.
Models can be downloaded from http://www.scalanlp.org/models/ or from Maven Central. (See below.)
Programmatic Usage
Epic also supports programmatic usage. All of the models assume that text has been segmented and tokenized.
Preprocessing text
To preprocess text so that the models can use it, you will need to segment the text into sentences and tokenize each sentence into individual words. Epic comes with classes to do both: a sentence segmenter, and an `epic.preprocess.TreebankTokenizer`, which takes a sentence string and returns a sequence of tokens. All told, the pipeline looks like this:
```scala
val text = getSomeText()

val sentenceSplitter = MLSentenceSegmenter.bundled().get
val tokenizer = new epic.preprocess.TreebankTokenizer()

val sentences: IndexedSeq[IndexedSeq[String]] = sentenceSplitter(text).map(tokenizer).toIndexedSeq

for (sentence <- sentences) {
  // use the sentence tokens
}
```
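Epic's segmenter and tokenizer are trained components, but the shape of the pipeline can be sketched with naive rule-based stand-ins. This is an illustration only: `naiveSplit` and `naiveTokenize` below are hypothetical helpers, not Epic APIs, and they ignore the hard cases (abbreviations, quotes, hyphenation) that the trained models handle.

```scala
// Illustration only: naive stand-ins for Epic's segmenter/tokenizer pipeline.
// Splits sentences at sentence-final punctuation followed by whitespace,
// then splits tokens at whitespace and around punctuation marks.
def naiveSplit(text: String): IndexedSeq[String] =
  text.split("(?<=[.!?])\\s+").toIndexedSeq.filter(_.nonEmpty)

def naiveTokenize(sentence: String): IndexedSeq[String] =
  sentence.split("\\s+|(?=[.,!?])|(?<=[.,!?])").toIndexedSeq.filter(_.nonEmpty)
```

The result has the same shape the models expect: an `IndexedSeq` of sentences, each an `IndexedSeq[String]` of tokens.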
Parser
To use the parser programmatically, deserialize a parser model, either with epic.models.deserialize[Parser[AnnotatedLabel, String]](path) or with the ParserSelector. Then give the parser segmented and tokenized text:
```scala
val parser = epic.models.deserialize[Parser[AnnotatedLabel, String]](path)
// or:
val parser = epic.models.ParserSelector.loadParser("en").get // or another two-letter language code

val tree = parser(sentence)
println(tree.render(sentence))
```
Trees have a number of methods on them. See the class definition or API docs.
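To give a sense of what rendering a tree over a token sequence looks like, here is a minimal stand-in. The `SimpleTree` class below is hypothetical, not Epic's `Tree`: it just pairs a label with a token span and prints the familiar bracketed parse format.

```scala
// Hypothetical stand-in for a parse tree: a label, a token span, and children.
// Leaves render their covered words; internal nodes render their children.
final case class SimpleTree(label: String, span: Range, children: Seq[SimpleTree]) {
  def render(words: IndexedSeq[String]): String =
    if (children.isEmpty) s"($label ${words.slice(span.start, span.end).mkString(" ")})"
    else s"($label ${children.map(_.render(words)).mkString(" ")})"
}
```

For the sentence "dogs bark", a tree with an `NP` over token 0 and a `VP` over token 1 renders as `(S (NP dogs) (VP bark))`.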
Part-of-Speech Tagger
Using a Part-of-Speech tagger is similar to using a parser: load a model, tokenize some text, run the tagger. All taggers are (currently) linear chain conditional random fields, or CRFs. (You don't need to understand them to use them. They are just a machine learning method for assigning a sequence of tags to a sequence of words.)
```scala
val tagger = epic.models.deserialize[CRF[AnnotatedLabel, String]](path)
// or:
val tagger = epic.models.PosTagSelector.loadTagger("en").get // or another two-letter language code

val tags = tagger.bestSequence(sentence)
println(tags.render)
```
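The decoding step a linear-chain model performs can be sketched without Epic: given a score for each label at each position and a score for each pair of adjacent labels, Viterbi search finds the highest-scoring tag sequence. In this self-contained sketch the score maps are made up; in a real CRF they come from learned features, and this is not Epic's implementation.

```scala
// Viterbi decoding for a linear-chain model (illustration, not Epic's API).
// emit(i)(l) scores label l at position i; trans((p, l)) scores the
// transition p -> l. Assumes at least one position and nonempty labels.
def viterbi(
    labels: IndexedSeq[String],
    emit: IndexedSeq[Map[String, Double]],
    trans: Map[(String, String), Double]
): IndexedSeq[String] = {
  val n = emit.length
  // best(i)(l) = (score of best sequence ending in l at i, backpointer label)
  val best = Array.fill(n)(collection.mutable.Map.empty[String, (Double, String)])
  for (l <- labels) best(0)(l) = (emit(0)(l), "")
  for (i <- 1 until n; l <- labels) {
    val (bestPrev, score) = labels.map { p =>
      (p, best(i - 1)(p)._1 + trans((p, l)) + emit(i)(l))
    }.maxBy(_._2)
    best(i)(l) = (score, bestPrev)
  }
  // Pick the best final label, then follow backpointers to recover the path.
  var seq = List(best(n - 1).maxBy(_._2._1)._1)
  for (i <- (n - 1) to 1 by -1) seq = best(i)(seq.head)._2 :: seq
  seq.toIndexedSeq
}
```

The dynamic program is quadratic in the number of labels and linear in sentence length, which is why chain CRFs are fast to decode.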
Named Entity Recognition
Using a named entity recognizer is similar to using a POS tagger: load a model, tokenize some text, run the recognizer. All NER systems are (currently) linear chain semi-Markov conditional random fields, or SemiCRFs. (You don't need to understand them to use them. They are just a machine learning method for segmenting text into fields.)
```scala
val ner = epic.models.deserialize[SemiCRF[AnnotatedLabel, String]](path)
// or:
val ner = epic.models.NerSelector.loadNer("en").get // or another two-letter language code

val segments = ner.bestSequence(sentence)
println(segments.render)
```
The outside label of a SemiCRF is the label that is considered not part of a "real" segment. For instance, in NER, it is the label given to words that are not part of any named entity.
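Concretely, the outside label is what lets a flat per-word labeling be read back as segments. The helper below is hypothetical (not Epic's SemiCRF API): it collapses maximal runs of non-outside labels into (label, begin, end) spans, skipping words labeled with the outside label.

```scala
// Illustration of the "outside" label idea (hypothetical helper, not Epic API):
// collapse a per-word labeling into (label, begin, end) segments, where end is
// exclusive, treating outside-labeled words as belonging to no segment.
def segmentsOf(labels: IndexedSeq[String], outside: String = "O"): IndexedSeq[(String, Int, Int)] = {
  val out = collection.mutable.ArrayBuffer.empty[(String, Int, Int)]
  var i = 0
  while (i < labels.length) {
    if (labels(i) == outside) i += 1
    else {
      val start = i
      val label = labels(i)
      while (i < labels.length && labels(i) == label) i += 1
      out += ((label, start, i))
    }
  }
  out.toIndexedSeq
}
```

For the labeling `PER PER O LOC`, this recovers a two-word `PER` segment and a one-word `LOC` segment, and the `O` word belongs to neither.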
Pre-trained Models
Epic provides a number of pre-trained models. These are available as Maven artifacts from Maven Central, and can be loaded at runtime. To use a specific model, just depend on it (or alternatively download the jar file). You can then load the parser by calling, for example:
```scala
epic.parser.models.en.span.EnglishSpanParser.load()
```
This will load the model and return a Parser object. If you don't want to hardwire a dependency on one model, either for internationalization or to try different models, use epic.models.ParserSelector.loadParser(language), where language is the two-letter code for the language you want to use.
The following models are available at this time:

AS OF WRITING ONLY MODELS FOR ENGLISH ARE AVAILABLE! Write me if you want these other models.
- Parser
  - English: `"org.scalanlp" %% "epic-parser-en-span" % "2015.1.25"`
- POS Taggers
  - English: `"org.scalanlp" %% "epic-pos-en" % "2015.1.25"`
- Named Entity Recognizers
  - English: `"org.scalanlp" %% "epic-ner-en-conll" % "2015.1.25"`
There is also a meta-dependency that includes the above three models:
"org.scalanlp" %% "english" % "2015.1.25"
I meant to name that "epic-english" but messed up. So it's that for now. Expect it to change.
TODO:

- Parser
  - English: `"org.scalanlp" %% "epic-parser-en-span" % "2014.9.15-SNAPSHOT"`
  - Basque: `"org.scalanlp" %% "epic-parser-eu-span" % "2014.9.15-SNAPSHOT"`
  - French: `"org.scalanlp" %% "epic-parser-fr-span" % "2014.9.15-SNAPSHOT"`
  - German: `"org.scalanlp" %% "epic-parser-de-span" % "2014.9.15-SNAPSHOT"`
  - Hungarian: `"org.scalanlp" %% "epic-parser-hu-span" % "2014.9.15-SNAPSHOT"`
  - Korean: `"org.scalanlp" %% "epic-parser-ko-span" % "2014.9.15-SNAPSHOT"`
  - Polish: `"org.scalanlp" %% "epic-parser-pl-span" % "2014.9.15-SNAPSHOT"`
  - Swedish: `"org.scalanlp" %% "epic-parser-sv-span" % "2014.9.15-SNAPSHOT"`
- POS Taggers
  - Basque: `"org.scalanlp" %% "epic-pos-eu" % "2014.9.15-SNAPSHOT"`
  - French: `"org.scalanlp" %% "epic-pos-fr" % "2014.9.15-SNAPSHOT"`
  - German: `"org.scalanlp" %% "epic-pos-de" % "2014.9.15-SNAPSHOT"`
  - Hungarian: `"org.scalanlp" %% "epic-pos-hu" % "2014.9.15-SNAPSHOT"`
  - Polish: `"org.scalanlp" %% "epic-pos-pl" % "2014.9.15-SNAPSHOT"`
  - Swedish: `"org.scalanlp" %% "epic-pos-sv" % "2014.9.15-SNAPSHOT"`
- Named Entity Recognizers
  - English: `"org.scalanlp" %% "epic-ner-en-conll" % "2014.9.15-SNAPSHOT"`
If you use any of the parser models in research publications, please cite:
David Hall, Greg Durrett, and Dan Klein. 2014. Less Grammar, More Features. In ACL.
If you use the other things, just link to Epic.
Building Epic
In order to do anything besides use pre-trained models, you will probably need to build Epic.
To build, you need a release of SBT 0.13.2. Then run:

```
$ sbt assembly
```

which will compile everything, run tests, and build a fat jar that includes all dependencies.
Training Models
Training Parsers
There are several different discriminative parsers you can train, and the trainer main class has lots of options. To get a sense of them, run the following command:
```
$ java -cp target/scala-2.10/epic-assembly-0.2-SNAPSHOT.jar epic.parser.models.ParserTrainer --help
```

You'll get a list of all the available options (so many!). The important ones are:
```
--treebank.path "path/to/treebank"
--cache.path "constraint.cache"
--modelFactory XXX         # the kind of parser to train. See below.
--opt.useStochastic true   # turn on stochastic gradient
--opt.regularization 1.0   # regularization constant. You need to regularize, badly.
```

There are four kinds of base models you can train, and you can tie them together with an EPParserModel if you want. The four base models are:
- epic.parser.models.LatentModelFactory: Latent annotation (like the Berkeley parser)
- epic.parser.models.LexModelFactory: Lexical annotation (kind of like the Collins parser)
- epic.parser.models.StructModelFactory: