MBSP is a text analysis system based on the TiMBL and MBT memory-based learning applications developed at CLiPS and ILK. It provides tools for Tokenization and Sentence Splitting, Part of Speech Tagging, Chunking, Lemmatization, Relation Finding and Prepositional Phrase Attachment. The general English version of MBSP has been trained on data from the Wall Street Journal corpus.


Download

MBSP for Python (1.4): /media/MBSP_1.4.zip (.zip, 24MB)

Requires: Python 2.5+ on Unix | Mac | Cygwin.
Licensed under GPL (http://www.gnu.org/licenses/gpl.html).
Releases: 1.4 (/media/MBSP_1.4.zip) | 1.3 (/media/MBSP_1.3.zip) | 1.2 (/media/MBSP_1.2.zip) | 1.1 (/media/MBSP_1.1.zip)
Authors: Vincent Van Asch (vincent.vanasch at uantwerpen.be), Tom De Smedt (tom at organisms.be)

Reference: Daelemans, W., & Van den Bosch, A. (2005). Memory-based language processing. Cambridge University Press, Cambridge, UK. ISBN-13: 9780521808903 | ISBN-10: 0521808901

SHA256 checksum of the .zip: 62be10ece640404058b607d9f15493a2a98d55b7e3a3d9becc6ed5854fa800bc

Introduction

Quick overview

MBSP parses a string of characters into words and sentences, and determines the grammatical structure of the sentence. It is a Python module, so you'll need Python to run it (already installed on Mac OS X).

The module uses a client-server architecture for performance. It includes binaries (TiMBL, MBT and MBLEM) precompiled for Mac OS X, so on Mac it works out-of-the-box. Otherwise, if you're on a Unix system, the module has a setup.py file that should compile everything for you. Go to the terminal and type:

cd MBSP
python setup.py

If that doesn't work you'll need to follow the steps in the installation instructions.

Put the MBSP folder in the same folder as your Python script and import the module. By default, the servers are configured to start automatically. Once they are up and running you can use the parse() command to analyze texts:

import MBSP
print MBSP.parse('cats with hats')
>>> cats/NNS/I-NP/O/O/A1/cat with/IN/I-PP/B-PNP/O/P1/with hats/NNS/I-NP/I-PNP/O/P1/hat

Each word has been tagged with grammatical information. For example, MBSP determined that cats is a plural noun (NNS). It has a prepositional noun phrase (PNP) attached to it (A1 is the anchor of P1), so the hats go with the cats. For a human this might seem pretty straightforward, but consider that without any analysis, for a machine the sentence is just a sequence of characters with no meaning.

The tag codes may seem cryptic at first, but consider that it is more concise to say NNS than PLURAL NOUN over and over. The tag codes are common in natural language processing, so it's a good idea to get acquainted with them.
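To make the slash-delimited format concrete, here is a minimal sketch in plain Python (it does not use MBSP) that splits the example output into named fields. The field names and the read_tokens() helper are assumptions inferred from the example above, not names taken from the MBSP API, and the sketch assumes that words contain no whitespace or slash characters.

```python
# Field names are assumptions inferred from the example output;
# they are not taken from the MBSP source.
FIELDS = ('word', 'pos', 'chunk', 'pnp', 'relation', 'anchor', 'lemma')

def read_tokens(tagged):
    """Turn a slash-delimited tagged string into a list of dictionaries."""
    tokens = []
    for token in tagged.split():
        values = token.split('/')
        tokens.append(dict(zip(FIELDS, values)))
    return tokens

tagged = ('cats/NNS/I-NP/O/O/A1/cat '
          'with/IN/I-PP/B-PNP/O/P1/with '
          'hats/NNS/I-NP/I-PNP/O/P1/hat')
tokens = read_tokens(tagged)

# Select all plural nouns by their part-of-speech tag.
plural_nouns = [t['word'] for t in tokens if t['pos'] == 'NNS']
print(plural_nouns)  # ['cats', 'hats']
```

For anything beyond a quick inspection, the split() command described further on is likely the better choice, since it returns linked objects rather than plain dictionaries.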

Something went wrong? The servers probably didn't have enough time to start. Start them explicitly with a longer timeout:

MBSP.start(timeout=120)
print(MBSP.parse('cats with hats'))

The output of the parse() command is a tagged string that can be manipulated in many ways.
With the split() command it can be transformed into a tree of linked Python objects:

s = MBSP.parse('black cats with striped hats')
s = MBSP.split(s)
for sentence in s:
    for chunk in sentence.chunks:
        print [word.lemma for word in chunk.words], chunk.attachments
>>> [u'black', u'cat'] [Chunk('with striped hats/PNP')]
>>> [u'with'] []
>>> [u'striped', u'hat'] []

With the xml() command it can be transformed into an XML string for processing outside of Python:

s = MBSP.parse('black cats with striped hats')
print MBSP.xml(s)

Purpose

MBSP stands for "Memory-Based Shallow Parser". Shallow parsing (i.e. automatic discovery of a sentence's constituents) is an important component of many text analysis systems, in applications such as information extraction and summary generation. The Memory-Based Learning (MBL) approach has the advantage of avoiding the need for manual definition of patterns (for example, using regular expression syntax) and of being reusable across different corpora and sublanguages.

MBSP is a so-called lazy learner: it keeps all the initial training data available (including exceptions which may sometimes be productive). This technique has been shown to achieve higher accuracy than eager (or greedy) methods for many language processing tasks. For the Wall Street Journal corpus (WSJ), accuracy (F-score, β=1) is 96.4% for part-of-speech tagging, 93.8% for NP chunking, 94.7% for VP chunking, 77.1% for SBJ detection, 79.0% for OBJ detection, and 82.7% for PP-attachment. MBSP is based on the IB1-IG and IGTREE algorithms bundled in our MBL software package, called TiMBL.

Reference: Daelemans, W., Buchholz, S., & Veenstra, J. (1999).
Memory-Based Shallow Parsing. In: Proceedings of CoNLL, Bergen, Norway.
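
The idea of a lazy learner can be illustrated with a toy sketch in plain Python. This is not TiMBL's implementation: it leaves out the information-gain feature weighting of IB1-IG and the tree compression of IGTREE, and the class name, the three-word feature window and the training examples are all invented for illustration. It only shows the core principle: training stores every instance unchanged, and classification looks up the most similar stored instance.

```python
from collections import Counter

class MemoryBasedClassifier(object):
    """Toy lazy learner: training only stores instances in memory."""

    def __init__(self):
        self.memory = []  # list of (features, label) pairs

    def train(self, features, label):
        # No abstraction step: every example is kept, exceptions included.
        self.memory.append((tuple(features), label))

    def classify(self, features, k=1):
        # Overlap distance: the number of feature positions that differ.
        def distance(stored):
            return sum(1 for a, b in zip(stored, features) if a != b)
        nearest = sorted(self.memory, key=lambda item: distance(item[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

# Hypothetical features: (previous word, focus word, next word) -> tag.
tagger = MemoryBasedClassifier()
tagger.train(('the', 'cat', 'sits'), 'NN')
tagger.train(('cat', 'sits', 'on'), 'VBZ')
tagger.train(('the', 'mat', '.'), 'NN')

noun_guess = tagger.classify(('the', 'dog', 'sits'))
verb_guess = tagger.classify(('dog', 'sits', 'on'))
```

Here the unseen word dog is tagged by analogy with the stored context of cat, which differs in only one feature position.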

The parser provides functionality for tokenization and sentence splitting, part-of-speech tagging, chunking, relation finding, prepositional phrase attachment and lemmatization.

Tokenization: splits sentence periods and punctuation marks from words.
Tagging: assigns part-of-speech tags to words (e.g. cat → noun → NN, eat → verb → VB).
Chunking: assigns chunk tags to groups of words (e.g. the black cat → noun phrase → NP).
Relation finder: finds relations between chunks (sentence subject, object and predicate).
PNP finder: finds prepositional noun phrases (e.g. under the table).
PP-attachment: finds prepositional noun phrase anchors (e.g. eat pizza → with fork).
Lemmatization: finds word lemmata (e.g. was → be).
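
As an illustration of how chunk tags relate to groups of words, the sketch below groups tagged tokens into chunks in plain Python. It is not how MBSP works internally. The group_chunks() helper is hypothetical, and the chunk tags for this sentence are an assumption modelled on the cats with hats output shown earlier. The assumed rule is that a new chunk starts whenever the chunk type changes or a tag carries a B- prefix, and that O marks a word outside any chunk.

```python
def group_chunks(tokens):
    """Group (word, chunk_tag) pairs into (chunk_type, [words]) tuples."""
    chunks = []
    previous_type = None
    for word, tag in tokens:
        if tag == 'O':
            # The word is outside any chunk; the current chunk ends here.
            previous_type = None
            continue
        prefix, chunk_type = tag.split('-', 1)
        if prefix == 'B' or chunk_type != previous_type:
            chunks.append((chunk_type, []))
        chunks[-1][1].append(word)
        previous_type = chunk_type
    return chunks

# Assumed chunk tags for 'black cats with striped hats'.
tokens = [('black', 'I-NP'), ('cats', 'I-NP'), ('with', 'I-PP'),
          ('striped', 'I-NP'), ('hats', 'I-NP')]
chunks = group_chunks(tokens)
# chunks == [('NP', ['black', 'cats']), ('PP', ['with']), ('NP', ['striped', 'hats'])]
```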

Grammar basics

Sentences are made up of words. Words have a syntactic role (noun, verb, adjective, ...) depending on their location in the sentence. For example, the word "can" can be a verb or a noun, depending on the context ("the can", "I can").

Sentence: the basic unit of writing, expected to have a subject and a predicate.
Word: a string of characters that expresses a meaningful concept.
Token: a specific word with grammatical tags: the can/NN, I can/VB.
Chunk: a group of words (phrase) that contains a single thought (e.g. a sumptuous banquet).
Head: the word that determines the syntactic type of the chunk: the black cat → NP.
Subject: the person/thing doing or being, usually a noun phrase (NP): the cat is black.
Predicate: the remainder of the sentence, which tells us what the subject does: the cat sits on the mat.
Clause: subject + predicate.
Argument: a chunk that is related to a verb in a clause, i.e. subject and object.
Object: the person/thing affected by the action: the cat eats fish. Poor fish.
Preposition: temporal, spatial or logical relationship: the cat sits on the mat.
Copula: a word used to link subject and predicate, typically the verb to be.
Lemma: canonical form of a word: run, runs, running are part of a lexeme; run is the lemma.
POS: part-of-speech, the syntactic role that a word or phrase plays in a sentence, e.g. adjective = JJ.

Acknowledgements

This version of MBSP has been developed by the computational linguistics group of CLiPS (Computational Linguistics & Psycholinguistics, department of Linguistics, University of Antwerp, Belgium) on the basis of earlier versions developed at the University of Antwerp and Tilburg University.

Contributing authors: Walter Daelemans, Jakub Zavrel, Sabine
