MBSP
Memory-based shallow parser for Python
MBSP is a text analysis system based on the TiMBL and MBT memory-based learning applications developed at CLiPS and ILK. It provides tools for tokenization and sentence splitting, part-of-speech tagging, chunking, lemmatization, relation finding and prepositional phrase attachment. The general English version of MBSP has been trained on data from the Wall Street Journal corpus.

Download
<table>
<colgroup> <col style="width: 33%" /> <col style="width: 33%" /> </colgroup>
<tbody>
<tr class="odd">
<td><a href="/media/MBSP_1.4.zip"><img src="/sites/all/themes/clips/g/download.gif" alt="download" /></a></td>
<td><strong>MBSP for Python (1.4)</strong> | <a href="/media/MBSP_1.4.zip">download</a> (.zip, 24MB)<br />
<ul>
<li>Requires: Python 2.5+ on Unix | Mac | Cygwin.</li>
<li>Licensed under <a href="http://www.gnu.org/licenses/gpl.html">GPL</a></li>
<li>Releases: <a href="/media/MBSP_1.4.zip">1.4</a> | <a href="/media/MBSP_1.3.zip">1.3</a> | <a href="/media/MBSP_1.2.zip">1.2</a> | <a href="/media/MBSP_1.1.zip">1.1</a></li>
<li>Authors:<br /> Vincent Van Asch (<em>vincent.vanasch at uantwerpen.be</em>),<br /> Tom De Smedt (<em>tom at organisms.be</em>)</li>
</ul>
<p><span style="text-decoration: underline;">Reference</span>: Daelemans, W., & Van den Bosch, A. (2005).<br /> <em>Memory-based language processing</em>. Cambridge University Press, Cambridge, UK.<br /> ISBN-13: 9780521808903 | ISBN-10: 0521808901</p>
<p><span class="grey" style="text-decoration: underline;">SHA256</span><span class="grey"> checksum of the .zip:</span><br /> <span class="grey">62be10ece640404058b607d9f15493a2a98d55b7e3a3d9becc6ed5854fa800bc</span> </p></td>
</tr>
</tbody>
</table>
Documentation
- Introduction
- Installation instructions
- The parser
- The tokenizer
- The lemmatizer
- The PP-attacher
- Parse trees
- Clients and servers
- Configuration
- Command-line interface
- Extending MBSP
- Exporting to XML, NLTK, GraphViz
- Licensing
<span id="introduction"></span>Introduction
Quick overview
MBSP parses a string of characters into words and sentences, and determines the grammatical structure of the sentence. It is a Python module, so you'll need Python to run it (already installed on Mac OS X).
The module uses a client-server architecture for performance. It includes binaries (TiMBL, MBT and MBLEM) precompiled for Mac OS X, so on Mac it works out-of-the-box. Otherwise, if you're on a Unix system, the module has a setup.py file that should compile everything for you. Go to the terminal and type:
cd MBSP
python setup.py
If that doesn't work you'll need to follow the steps in the installation instructions.
Put the MBSP folder in the same folder as your Python script and import the module. By default, the servers are configured to start automatically. Once they are up and running you can use the parse() command to analyze texts:
import MBSP
print MBSP.parse('cats with hats')
>>> cats/NNS/I-NP/O/O/A1/cat with/IN/I-PP/B-PNP/O/P1/with hats/NNS/I-NP/I-PNP/O/P1/hat
Each word has been tagged with grammatical information. For example, MBSP determined that cats is a plural noun (<span class="postag">NNS</span>). It has a prepositional noun phrase (<span class="postag">PNP</span>) attached to it (<span class="postag">A1</span> is the anchor of <span class="postag">P1</span>), so the hats go with the cats. For a human this might seem pretty straightforward, but consider that without any analysis, for a machine the sentence is just a sequence of characters with no meaning.
The tag codes may seem cryptic at first, but consider that it is more concise to say <span class="postag">NNS</span> than <span class="postag">PLURAL NOUN</span> over and over. The tag codes are common in natural language processing, so it's a good idea to get acquainted with them.
Something went wrong? Probably the servers didn't have enough time to start:
MBSP.start(timeout=120)
print(MBSP.parse('cats with hats'))
The output of the parse() command is a tagged string that can be manipulated in many ways.
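As a minimal sketch of such manipulation, the tagged string can be taken apart with plain Python, without MBSP installed. The field names below are assumptions inferred from the 'cats with hats' output shown earlier (word, part-of-speech, chunk, PNP, relation, anchor, lemma); they are illustrative labels and the helper read_tagged() is hypothetical, not part of the MBSP API.

```python
# Field names are assumptions inferred from the example output;
# they are not taken from the MBSP source code.
FIELDS = ('word', 'pos', 'chunk', 'pnp', 'relation', 'anchor', 'lemma')

def read_tagged(tagged):
    """Split a tagged string into a list of {field: value} dictionaries."""
    tokens = []
    for token in tagged.split():      # tokens are separated by whitespace
        values = token.split('/')     # tags are separated by slashes
        tokens.append(dict(zip(FIELDS, values)))
    return tokens

tagged = ('cats/NNS/I-NP/O/O/A1/cat '
          'with/IN/I-PP/B-PNP/O/P1/with '
          'hats/NNS/I-NP/I-PNP/O/P1/hat')

tokens = read_tagged(tagged)
for t in tokens:
    print('%-5s %-4s %-5s %s' % (t['word'], t['pos'], t['chunk'], t['lemma']))
```

This sketch flattens all sentences into one list and does not handle words that themselves contain a slash.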
With the split() command, the tagged string can be transformed into a tree of linked Python objects:
s = MBSP.parse('black cats with striped hats')
s = MBSP.split(s)
for sentence in s:
    for chunk in sentence.chunks:
        print [word.lemma for word in chunk.words], chunk.attachments
>>> [u'black', u'cat'] [Chunk('with striped hats/PNP')]
>>> [u'with'] []
>>> [u'striped', u'hat'] []
With the xml() command it can be transformed into an XML string for processing outside of Python:
s = MBSP.parse('black cats with striped hats')
print MBSP.xml(s)
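The attachment that split() exposes through chunk.attachments can also be read off the tagged string directly. The following is a rough sketch, not MBSP's own implementation: the helper name attachments_from_tags() is hypothetical, and it assumes the field order of the 'cats with hats' example, with a single A (anchor) or P (attached PNP) tag in the sixth field.

```python
def attachments_from_tags(tagged):
    """Pair each anchor with the prepositional noun phrase attached to it.

    Assumes each token looks like word/POS/chunk/PNP/relation/anchor/lemma
    and that the anchor field holds one tag such as A1 or P1.
    """
    anchors, pnps = {}, {}
    for token in tagged.split():
        fields = token.split('/')
        word, pnp, anchor = fields[0], fields[3], fields[5]
        if anchor.startswith('A'):
            # A1 marks a word in the anchor of attachment number 1.
            anchors.setdefault(anchor[1:], []).append(word)
        elif anchor.startswith('P') and pnp != 'O':
            # P1 marks a word in the PNP that attaches to anchor A1.
            pnps.setdefault(anchor[1:], []).append(word)
    return [(' '.join(anchors[i]), ' '.join(pnps.get(i, [])))
            for i in sorted(anchors)]

tagged = ('cats/NNS/I-NP/O/O/A1/cat '
          'with/IN/I-PP/B-PNP/O/P1/with '
          'hats/NNS/I-NP/I-PNP/O/P1/hat')

pairs = attachments_from_tags(tagged)
print(pairs)
```

For real work the objects returned by split() are the better choice; the sketch only shows how the A and P numbers tie an anchor to its PNP.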
Purpose
MBSP stands for "Memory-Based Shallow Parser". Shallow parsing (i.e. automatic discovery of a sentence's constituents) is an important component of many text analysis systems, in applications such as information extraction and summary generation. The Memory-Based Learning (MBL) approach has the advantage of avoiding the need for manual definition of patterns (for example, using regular expression syntax) and of being reusable across different corpora and sublanguages.
MBSP is a so-called lazy learner: it keeps all the initial training data available (including exceptions which may sometimes be productive). This technique has been shown to achieve higher accuracy than eager (or greedy) methods for many language processing tasks. For the Wall Street Journal corpus (WSJ), accuracy (Fβ=1) is 96.4% for part-of-speech tagging, 93.8% for <span class="postag">NP</span> chunking, 94.7% for <span class="postag">VP</span> chunking, 77.1% for <span class="postag">SBJ</span> detection, 79.0% for <span class="postag">OBJ</span> detection, and 82.7% for <span class="postag">PP</span>-attachment. MBSP is based on the IB1-IG and IGTREE algorithms bundled in our MBL software package, called TiMBL.
<span style="text-decoration: underline;">Reference</span>: Daelemans, W., Buchholz, S., & Veenstra, J. (1999). Memory-Based Shallow Parsing. In: Proceedings of CoNLL, Bergen, Norway.
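To make the idea of a lazy learner concrete, here is a toy sketch of memory-based classification. It is not TiMBL and not how MBSP is implemented: the class name, the three-word feature window and the training examples are all made up for illustration. Training only stores examples; the work happens at classification time, when the stored examples with the fewest mismatching features are looked up. IB1-IG additionally weights features by information gain, which this sketch leaves out.

```python
class MemoryBasedClassifier(object):
    """Toy lazy learner: training only stores the examples in memory."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []  # list of (features, label) pairs

    def train(self, features, label):
        self.memory.append((tuple(features), label))

    def classify(self, features):
        def distance(stored):
            # Overlap distance: number of feature positions that differ.
            return sum(a != b for a, b in zip(stored, features))
        nearest = sorted(self.memory, key=lambda m: distance(m[0]))[:self.k]
        labels = [label for _, label in nearest]
        # Majority vote among the k nearest stored examples.
        return max(set(labels), key=labels.count)

# Hypothetical training data: (previous word, word, next word) -> tag.
tagger = MemoryBasedClassifier(k=1)
tagger.train(('the', 'can', 'is'), 'NN')
tagger.train(('a', 'can', 'of'), 'NN')
tagger.train(('I', 'can', 'see'), 'VB')
tagger.train(('you', 'can', 'go'), 'VB')

noun = tagger.classify(('the', 'can', 'was'))
verb = tagger.classify(('I', 'can', 'go'))
print(noun)
print(verb)
```

Because nothing is abstracted away during training, an unusual stored example can still decide a later case, which is the point made above about exceptions being productive.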
The parser provides functionality for tokenization and sentence splitting, part-of-speech tagging, chunking, relation finding, prepositional phrase attachment and lemmatization.
- Tokenization: splits sentence periods and punctuation marks from words.
- Tagging: assigns part-of-speech tags to words (e.g. cat → noun → <span class="postag">NN</span>, eat → verb → <span class="postag">VB</span>).
- Chunking: assigns chunk tags to groups of words (e.g. the black cat → noun phrase → <span class="postag">NP</span>).
- Relation finder: finds relations between chunks: sentence subject, object and predicate.
- PNP finder: finds prepositional noun phrases (e.g. under the table).
- PP-attachment: finds prepositional noun phrase anchors (e.g. eat pizza → with fork).
- Lemmatization: finds word lemmata (e.g. was → be).
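As a rough illustration of what the chunk tags from the chunking step encode, the sketch below groups words into chunks from their tags. The tags are hand-written for this example, not actual MBSP output, and the function group_chunks() is hypothetical. It assumes the common convention in which an I- prefix means "inside a chunk", a B- prefix forces the start of a new chunk, and O means "outside any chunk".

```python
def group_chunks(pairs):
    """Group (word, chunk tag) pairs into (chunk type, [words]) tuples."""
    chunks = []
    previous_type = None
    for word, tag in pairs:
        if tag == 'O':
            # Words outside any chunk are skipped and end the current chunk.
            previous_type = None
            continue
        prefix, _, chunk_type = tag.partition('-')
        if prefix == 'B' or chunk_type != previous_type:
            chunks.append((chunk_type, [word]))   # start a new chunk
        else:
            chunks[-1][1].append(word)            # continue the current chunk
        previous_type = chunk_type
    return chunks

# Hand-written tags for illustration only.
sentence = [('the', 'I-NP'), ('black', 'I-NP'), ('cat', 'I-NP'),
            ('sits', 'I-VP'), ('on', 'I-PP'),
            ('the', 'I-NP'), ('mat', 'I-NP'), ('.', 'O')]

chunks = group_chunks(sentence)
for chunk_type, words in chunks:
    print('%s: %s' % (chunk_type, ' '.join(words)))
```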
Grammar basics
Sentences are made up of words. Words have a syntactic role (noun, verb, adjective, ...) depending on their location in the sentence. For example, the word "can" is a noun or a verb depending on the context (the can, I can).
- Sentence: the basic unit of writing, expected to have a subject and a predicate.
- Word: a string of characters that expresses a meaningful concept.
- Token: a specific word with grammatical tags: the can/<span class="postag">NN</span>, I can/<span class="postag">VB</span>.
- Chunk: a group of words (phrase) that contains a single thought (e.g. a sumptuous banquet).
- Head: the word that determines the syntactic type of the chunk: the black <span style="text-decoration: underline;">cat</span> → <span class="postag">NP</span>.
- Subject: the person/thing doing or being, usually a noun phrase (<span class="postag">NP</span>): <span style="text-decoration: underline;">the cat</span> is black.
- Predicate: the remainder of the sentence tells us what the subject does: the cat <span style="text-decoration: underline;">sits on the mat</span>.
- Clause: subject + predicate.
- Argument: a chunk that is related to a verb in a clause, i.e. subject and object.
- Object: the person/thing affected by the action: the cat eats <span style="text-decoration: underline;">fish</span>. Poor fish.
- Preposition: temporal, spatial or logical relationship: the cat sits <span style="text-decoration: underline;">on the mat</span>.
- Copula: a word used to link subject and predicate, typically the verb to be.
- Lemma: canonical form of a word: run, runs, running are part of a lexeme, run is the lemma.
- POS: part-of-speech, the syntactic role that a word or phrase plays in a sentence, e.g. adjective = <span class="postag">JJ</span>.
Acknowledgements
This version of MBSP has been developed by the computational linguistics group of CLiPS (Computational Linguistics & Psycholinguistics, department of Linguistics, University of Antwerp, Belgium) on the basis of earlier versions developed at the University of Antwerp and Tilburg University.
Contributing authors: Walter Daelemans, Jakub Zavrel, Sabine
