Epic
**Archived** Epic is a high-performance statistical parser written in Scala, along with a framework for building complex structured prediction models.
Archived
NLP, like all of AI, has changed a lot since I wrote this back in 2012-2014. I don't have the time to maintain this library, much less modernize it. Maybe one day...
Epic
(c) 2014 David Hall.
Epic is a structured prediction framework for Scala. It also includes classes for training high-accuracy syntactic parsers, part-of-speech taggers, named entity recognizers, and more.
Epic is distributed under the Apache License, Version 2.0.
The current version is 0.3.
Documentation
Documentation will (eventually) live at the GitHub wiki: https://github.com/dlwh/epic/wiki
See some example usages at https://github.com/dlwh/epic-demo.
Using Epic
Epic can be used programmatically or from the command line, using either pretrained models (see below) or with models you have trained yourself.
Currently, Epic has support for three kinds of models: parsers, sequence labelers, and segmenters. Parsers produce syntactic representations of sentences. Sequence labelers are things like part-of-speech taggers. These associate each word in a sentence with a label. For instance, a part-of-speech tagger can identify nouns, verbs, etc. Segmenters break a sentence into a sequence of fields. For instance, a named entity recognition system might identify all the people, places and things in a sentence.
Command-line Usage
Epic bundles command-line interfaces for using parsers, NER systems, and POS taggers (and, more generally, segmentation and tagging systems). There are three classes, one for each kind of system:

- `epic.parser.ParseText` runs a parser.
- `epic.sequences.SegmentText` runs an NER system, or any kind of segmentation system.
- `epic.sequences.TagText` runs a POS tagger, or any kind of tagging system.
All of these systems expect plain text files as input, along with a path to a model file. The syntax is:
```
java -Xmx4g -cp /path/to/epic-assembly-0.3-SNAPSHOT.jar epic.parser.ParseText --model /path/to/model.ser.gz --nthreads <number of threads> [files]
```
Currently, all text is output to standard out. In the future, we will support output in a way that differentiates the files. If no files are given, the system will read from standard input. By default, the system will use all available cores for execution.
Models can be downloaded from http://www.scalanlp.org/models/ or from Maven Central. (See below.)
Programmatic Usage
Epic also supports programmatic usage. All of the models assume that text has been segmented and tokenized.
Preprocessing text
To preprocess text so that the models can use it, you will need to segment the text into sentences and tokenize each sentence into individual words. Epic comes with classes to do both: a sentence segmenter, and an `epic.preprocess.TreebankTokenizer`, which takes a sentence string and returns a sequence of tokens. All told, the pipeline looks like this:
```scala
val text = getSomeText()

val sentenceSplitter = MLSentenceSegmenter.bundled().get
val tokenizer = new epic.preprocess.TreebankTokenizer()

val sentences: IndexedSeq[IndexedSeq[String]] = sentenceSplitter(text).map(tokenizer).toIndexedSeq

for (sentence <- sentences) {
  // use the sentence tokens
}
```
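Epic's segmenter and tokenizer are trained components, but the shape of the pipeline can be sketched with naive rule-based stand-ins. This is an illustration only: `naiveSplit` and `naiveTokenize` below are hypothetical helpers, not Epic APIs, and they ignore the hard cases (abbreviations, quotes, hyphenation) that the trained models handle.

```scala
// Illustration only: naive stand-ins for Epic's segmenter/tokenizer pipeline.
// Splits sentences at sentence-final punctuation followed by whitespace,
// then splits tokens at whitespace and around punctuation marks.
def naiveSplit(text: String): IndexedSeq[String] =
  text.split("(?<=[.!?])\\s+").toIndexedSeq.filter(_.nonEmpty)

def naiveTokenize(sentence: String): IndexedSeq[String] =
  sentence.split("\\s+|(?=[.,!?])|(?<=[.,!?])").toIndexedSeq.filter(_.nonEmpty)
```

The result has the same shape the models expect: an `IndexedSeq` of sentences, each an `IndexedSeq[String]` of tokens.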
Parser
To use the parser programmatically, deserialize a parser model, either with epic.models.deserialize[Parser[AnnotatedLabel, String]](path) or with the ParserSelector. Then give the parser segmented and tokenized text:
```scala
val parser = epic.models.deserialize[Parser[AnnotatedLabel, String]](path)
// or:
val parser = epic.models.ParserSelector.loadParser("en").get // or another two-letter language code

val tree = parser(sentence)
println(tree.render(sentence))
```
Trees have a number of methods on them. See the class definition or API docs.
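To give a sense of what rendering a tree over a token sequence looks like, here is a minimal stand-in. The `SimpleTree` class below is hypothetical, not Epic's `Tree`: it just pairs a label with a token span and prints the familiar bracketed parse format.

```scala
// Hypothetical stand-in for a parse tree: a label, a token span, and children.
// Leaves render their covered words; internal nodes render their children.
final case class SimpleTree(label: String, span: Range, children: Seq[SimpleTree]) {
  def render(words: IndexedSeq[String]): String =
    if (children.isEmpty) s"($label ${words.slice(span.start, span.end).mkString(" ")})"
    else s"($label ${children.map(_.render(words)).mkString(" ")})"
}
```

For the sentence "dogs bark", a tree with an `NP` over token 0 and a `VP` over token 1 renders as `(S (NP dogs) (VP bark))`.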
Part-of-Speech Tagger
Using a Part-of-Speech tagger is similar to using a parser: load a model, tokenize some text, run the tagger. All taggers are (currently) linear chain conditional random fields, or CRFs. (You don't need to understand them to use them. They are just a machine learning method for assigning a sequence of tags to a sequence of words.)
```scala
val tagger = epic.models.deserialize[CRF[AnnotatedLabel, String]](path)
// or:
val tagger = epic.models.PosTagSelector.loadTagger("en").get // or another two-letter language code

val tags = tagger.bestSequence(sentence)
println(tags.render)
```
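The decoding step a linear-chain model performs can be sketched without Epic: given a score for each label at each position and a score for each pair of adjacent labels, Viterbi search finds the highest-scoring tag sequence. In this self-contained sketch the score maps are made up; in a real CRF they come from learned features, and this is not Epic's implementation.

```scala
// Viterbi decoding for a linear-chain model (illustration, not Epic's API).
// emit(i)(l) scores label l at position i; trans((p, l)) scores the
// transition p -> l. Assumes at least one position and nonempty labels.
def viterbi(
    labels: IndexedSeq[String],
    emit: IndexedSeq[Map[String, Double]],
    trans: Map[(String, String), Double]
): IndexedSeq[String] = {
  val n = emit.length
  // best(i)(l) = (score of best sequence ending in l at i, backpointer label)
  val best = Array.fill(n)(collection.mutable.Map.empty[String, (Double, String)])
  for (l <- labels) best(0)(l) = (emit(0)(l), "")
  for (i <- 1 until n; l <- labels) {
    val (bestPrev, score) = labels.map { p =>
      (p, best(i - 1)(p)._1 + trans((p, l)) + emit(i)(l))
    }.maxBy(_._2)
    best(i)(l) = (score, bestPrev)
  }
  // Pick the best final label, then follow backpointers to recover the path.
  var seq = List(best(n - 1).maxBy(_._2._1)._1)
  for (i <- (n - 1) to 1 by -1) seq = best(i)(seq.head)._2 :: seq
  seq.toIndexedSeq
}
```

The dynamic program is quadratic in the number of labels and linear in sentence length, which is why chain CRFs are fast to decode.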
Named Entity Recognition
Using a named entity recognizer is similar to using a POS tagger: load a model, tokenize some text, run the recognizer. All NER systems are (currently) linear chain semi-Markov conditional random fields, or SemiCRFs. (You don't need to understand them to use them. They are just a machine learning method for segmenting text into fields.)
```scala
val ner = epic.models.deserialize[SemiCRF[AnnotatedLabel, String]](path)
// or:
val ner = epic.models.NerSelector.loadNer("en").get // or another two-letter language code

val segments = ner.bestSequence(sentence)
println(segments.render)
```
The outside label of a SemiCRF is the label that is considered not part of a "real" segment. For instance, in NER, it is the label given to words that are not part of any named entity.
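Concretely, the outside label is what lets a flat per-word labeling be read back as segments. The helper below is hypothetical (not Epic's SemiCRF API): it collapses maximal runs of non-outside labels into (label, begin, end) spans, skipping words labeled with the outside label.

```scala
// Illustration of the "outside" label idea (hypothetical helper, not Epic API):
// collapse a per-word labeling into (label, begin, end) segments, where end is
// exclusive, treating outside-labeled words as belonging to no segment.
def segmentsOf(labels: IndexedSeq[String], outside: String = "O"): IndexedSeq[(String, Int, Int)] = {
  val out = collection.mutable.ArrayBuffer.empty[(String, Int, Int)]
  var i = 0
  while (i < labels.length) {
    if (labels(i) == outside) i += 1
    else {
      val start = i
      val label = labels(i)
      while (i < labels.length && labels(i) == label) i += 1
      out += ((label, start, i))
    }
  }
  out.toIndexedSeq
}
```

For the labeling `PER PER O LOC`, this recovers a two-word `PER` segment and a one-word `LOC` segment, and the `O` word belongs to neither.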
Pre-trained Models
Epic provides a number of pre-trained models. These are available as Maven artifacts from Maven Central, and can be loaded at runtime. To use a specific model, just depend on it (or alternatively download the jar file). You can then load the parser by calling, for example:
```scala
epic.parser.models.en.span.EnglishSpanParser.load()
```
This will load the model and return a Parser object. If you don't want to hardwire a dependency on one model, either for internationalization or to try different models, use epic.models.ParserSelector.loadParser(language), where language is the two-letter code for the language you want to use.
The following models are available at this time:

AS OF WRITING ONLY MODELS FOR ENGLISH ARE AVAILABLE! Write me if you want these other models.
- Parser
  - English: `"org.scalanlp" %% "epic-parser-en-span" % "2015.1.25"`
- POS Taggers
  - English: `"org.scalanlp" %% "epic-pos-en" % "2015.1.25"`
- Named Entity Recognizers
  - English: `"org.scalanlp" %% "epic-ner-en-conll" % "2015.1.25"`
There is also a meta-dependency that includes the above three models:
"org.scalanlp" %% "english" % "2015.1.25"
I meant to name that "epic-english" but messed up. So it's that for now. Expect it to change.
TODO:

- Parser
  - English: `"org.scalanlp" %% "epic-parser-en-span" % "2014.9.15-SNAPSHOT"`
  - Basque: `"org.scalanlp" %% "epic-parser-eu-span" % "2014.9.15-SNAPSHOT"`
  - French: `"org.scalanlp" %% "epic-parser-fr-span" % "2014.9.15-SNAPSHOT"`
  - German: `"org.scalanlp" %% "epic-parser-de-span" % "2014.9.15-SNAPSHOT"`
  - Hungarian: `"org.scalanlp" %% "epic-parser-hu-span" % "2014.9.15-SNAPSHOT"`
  - Korean: `"org.scalanlp" %% "epic-parser-ko-span" % "2014.9.15-SNAPSHOT"`
  - Polish: `"org.scalanlp" %% "epic-parser-pl-span" % "2014.9.15-SNAPSHOT"`
  - Swedish: `"org.scalanlp" %% "epic-parser-sv-span" % "2014.9.15-SNAPSHOT"`
- POS Taggers
  - Basque: `"org.scalanlp" %% "epic-pos-eu" % "2014.9.15-SNAPSHOT"`
  - French: `"org.scalanlp" %% "epic-pos-fr" % "2014.9.15-SNAPSHOT"`
  - German: `"org.scalanlp" %% "epic-pos-de" % "2014.9.15-SNAPSHOT"`
  - Hungarian: `"org.scalanlp" %% "epic-pos-hu" % "2014.9.15-SNAPSHOT"`
  - Polish: `"org.scalanlp" %% "epic-pos-pl" % "2014.9.15-SNAPSHOT"`
  - Swedish: `"org.scalanlp" %% "epic-pos-sv" % "2014.9.15-SNAPSHOT"`
- Named Entity Recognizers
  - English: `"org.scalanlp" %% "epic-ner-en-conll" % "2014.9.15-SNAPSHOT"`
If you use any of the parser models in research publications, please cite:
David Hall, Greg Durrett, and Dan Klein. 2014. Less Grammar, More Features. In ACL.
If you use the other things, just link to Epic.
Building Epic
In order to do anything besides use pre-trained models, you will probably need to build Epic.
To build, you need a release of SBT 0.13.2. Then run:

```
$ sbt assembly
```

which will compile everything, run tests, and build a fat jar that includes all dependencies.
Training Models
Training Parsers
There are several different discriminative parsers you can train, and the trainer main class has lots of options. To get a sense of them, run the following command:
```
$ java -cp target/scala-2.10/epic-assembly-0.2-SNAPSHOT.jar epic.parser.models.ParserTrainer --help
```

You'll get a list of all the available options (so many!). The important ones are:
```
--treebank.path "path/to/treebank"
--cache.path "constraint.cache"
--modelFactory XXX         # the kind of parser to train. See below.
--opt.useStochastic true   # turn on stochastic gradient
--opt.regularization 1.0   # regularization constant. You need to regularize, badly.
```

There are four kinds of base models you can train, and you can tie them together with an EPParserModel if you want. The four base models are:
- epic.parser.models.LatentModelFactory: Latent annotation (like the Berkeley parser)
- epic.parser.models.LexModelFactory: Lexical annotation (kind of like the Collins parser)
- epic.parser.models.StructModelFactory: