TurboParser

A multilingual dependency parser based on linear programming relaxations.

================================================================================
TurboParser -- Dependency Parser with Linear Programming Relaxations.
Version 2.3.x

Written and maintained by André Martins (afm [at] cs.cmu.edu).

This file is part of TurboParser, a project started at the computational linguistics research group, ARK (http://www.ark.cs.cmu.edu/), at Carnegie Mellon University.

This package contains a C++ implementation of the dependency parsers described in:

[1] André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Concise Integer Linear Programming Formulations for Dependency Parsing. In Annual Meeting of the Association for Computational Linguistics (ACL).

[2] André F. T. Martins, Noah A. Smith, and Eric P. Xing. 2009. Polyhedral Outer Approximations with Application to Natural Language Parsing. In International Conference on Machine Learning (ICML).

[3] André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, and Mário A. T. Figueiredo. 2010. TurboParsers: Dependency Parsing by Approximate Variational Inference. In Empirical Methods in Natural Language Processing (EMNLP).

[4] André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, Pedro M. Q. Aguiar. 2011. Dual Decomposition With Many Overlapping Components. In Empirical Methods in Natural Language Processing (EMNLP).

[5] André F. T. Martins, Miguel B. Almeida, Noah A. Smith. 2013. Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers. In Annual Meeting of the Association for Computational Linguistics (ACL).

[6] André F. T. Martins and Mariana S. C. Almeida. 2014. Priberam: A Turbo Semantic Parser with Second Order Features. In International Workshop on Semantic Evaluation (SemEval), task 8: Broad-Coverage Semantic Dependency Parsing.

[7] Daniel Fernández-González and André F. T. Martins. 2015. Parsing As Reduction. In Annual Meeting of the Association for Computational Linguistics (ACL).

This package allows:

  • learning the parser from a treebank,
  • running the parser on new data,
  • evaluating the results against a gold standard.

This software has the following external dependencies: AD3, a library for approximate MAP inference (http://www.ark.cs.cmu.edu/AD3/); Eigen, a template library for linear algebra; glog, a library for logging; gflags, a library for command-line flag processing. All these libraries are free software and are provided as tarballs in this package.

This package has been tested on several Linux platforms. It has also been successfully compiled on Mac OS X and MS Windows (using MSVC).

Since version 2.2.x, the following is also provided:

  • a Python wrapper for the tagger and parser (requires Cython 0.19);
  • a semantic role labeler (TurboSemanticParser) implementing ref. [6] above.

Since version 2.3.x, we also provide:

  • a named entity recognizer (TurboEntityRecognizer).
  • a coreference resolver (TurboCoreferenceResolver).
  • a constituent parser based on dependency-to-constituent reduction, implementing ref. [7] above.
  • a dependency labeler, TurboDependencyLabeler, that can optionally be applied after the dependency parser.
  • compatibility with MS Windows (using MSVC) and with C++0x.

If there are any problems running the parser, please email afm [at] cs.cmu.edu. I will only respond to questions not answered in this README.

We would like to thank Ryan McDonald and Jason Baldridge for MSTParser (available at http://sourceforge.net/projects/mstparser), on which the code in this package was partly based.

================================================================================

TurboParser is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

TurboParser is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

================================================================================
Contents

  1. Compiling

  2. Example of usage
     a. TurboParser
     b. TurboTagger
     c. Scripts

  3. Running the parser
     a. Input data format
     b. Training the parser
     c. Training the tagger
     d. Running the trained tagger/parser on new data
     e. Additional options

  4. Installing the Python wrapper

  5. Memory/Disk space and performance issues

  6. Reproducing results in the ICML, ACL, and EMNLP papers

  7. Reproducing results in the SemEval 2014 paper (TurboSemanticParser)

================================================================================

  1. Compiling
================================================================================

To compile the code, first unpack the downloaded tarball:

tar -zxvf TurboParser-2.2.0.tar.gz
cd TurboParser-2.2.0

Next, run the following command

./install_deps.sh

This will install all the dependencies (the gflags, glog, Eigen, and AD3 libraries). Finally, type:

./configure && make && make install

After these steps, a file named "TurboParser" and another named "TurboTagger" should have been created under the working folder.

Before starting to use TurboParser and TurboTagger, we need to add our local dependencies to the library path. This can be done via:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(pwd)/deps/local/lib"

================================================================================
  2. Example of usage
================================================================================

The directory data/sample contains small samples of training and testing data. The data format is the one used in the CoNLL-X shared task, which we describe in the next section. The following sample files are provided:

sample_train.conll
sample_test.conll

================================================================================
  2a. TurboParser

Before starting, we need to add our local dependencies to the library path:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(pwd)/deps/local/lib"

These steps will train a parser on the training data, run it on the testing data, and evaluate the output against the gold standard:

mkdir models

./TurboParser --train \
  --file_train=data/sample/sample_train.conll \
  --file_model=models/sample_parser.model \
  --logtostderr

./TurboParser --test \
  --evaluate \
  --file_model=models/sample_parser.model \
  --file_test=data/sample/sample_test.conll \
  --file_prediction=data/sample/sample_test.conll.predicted \
  --logtostderr

The results from running the parser are in the file data/sample/sample_test.conll.predicted and the trained model in models/sample_parser.model.
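The --evaluate flag already reports attachment scores, but the check is easy to reproduce by hand: the 7th column (HEAD) of the predicted file is compared against the 7th column of the gold file. A minimal illustrative sketch (not part of the package; the file names match the example above):

```python
def read_heads(path):
    """Yield the list of head indices (7th CoNLL column) for each sentence."""
    with open(path) as f:
        heads = []
        for line in f:
            cols = line.split()
            if not cols:              # blank line ends a sentence
                if heads:
                    yield heads
                    heads = []
            else:
                heads.append(cols[6])
        if heads:                     # file may not end with a blank line
            yield heads

def uas(gold_path, pred_path):
    """Unlabeled attachment score: fraction of tokens with the correct head."""
    correct = total = 0
    for gold, pred in zip(read_heads(gold_path), read_heads(pred_path)):
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total
```

For example, uas("data/sample/sample_test.conll", "data/sample/sample_test.conll.predicted") should agree with the UAS that TurboParser logs.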

================================================================================
  2b. TurboTagger

If you have not done this yet, add your local dependencies to the library path:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$(pwd)/deps/local/lib"

The input files for TurboTagger are not CoNLL files; they have the same tabular form, but should only have two columns, the first for the words and the second for the part-of-speech tags.
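Conceptually, the conversion just keeps the word and tag columns of each CoNLL line. A rough Python equivalent of that conversion (illustrative only; the provided shell script is the supported tool, and the column choice assumes CoNLL-X positions, where column 2 is FORM and column 5 is POSTAG):

```python
def conll_to_tagging(conll_path, tagging_path):
    """Write a two-column (word TAB tag) file from a CoNLL-X file."""
    with open(conll_path) as src, open(tagging_path, "w") as dst:
        for line in src:
            cols = line.split()
            if not cols:                 # blank line: sentence boundary
                dst.write("\n")
            else:
                # CoNLL-X columns (1-indexed): 2 = FORM, 5 = POSTAG
                dst.write(cols[1] + "\t" + cols[4] + "\n")
```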

To test TurboTagger, first run the following script to convert the sample files to this format:

./scripts/create_tagging_corpus.sh data/sample/sample_train.conll
./scripts/create_tagging_corpus.sh data/sample/sample_test.conll

This will create files sample_train.conll.tagging and sample_test.conll.tagging. Then, run:

mkdir -p models

./TurboTagger --train \
  --file_train=data/sample/sample_train.conll.tagging \
  --file_model=models/sample_tagger.model \
  --form_cutoff=1 \
  --logtostderr

./TurboTagger --test \
  --evaluate \
  --file_model=models/sample_tagger.model \
  --file_test=data/sample/sample_test.conll.tagging \
  --file_prediction=data/sample/sample_test.conll.tagging.predicted \
  --logtostderr

The results from running the tagger are in the file data/sample/sample_test.conll.tagging.predicted and the trained model in models/sample_tagger.model.
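As with the parser, --evaluate prints the accuracy, but it is simple to verify: compare the tag column of the predicted two-column file against the gold one, line by line. An illustrative sketch (not part of the package):

```python
def tagging_accuracy(gold_path, pred_path):
    """Token-level tag accuracy between two word TAB tag files."""
    correct = total = 0
    with open(gold_path) as gold, open(pred_path) as pred:
        for gline, pline in zip(gold, pred):
            gcols, pcols = gline.split(), pline.split()
            if not gcols:             # blank line between sentences
                continue
            correct += gcols[1] == pcols[1]
            total += 1
    return correct / total
```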

================================================================================
  2c. Scripts

The folder ./scripts contains shell scripts that allow you to train, test, and evaluate the parser and the tagger with several options.

If you type:

cd scripts
./train_test_parser.sh sample
./train_test_tagger.sh sample

you will perform all the operations described above (the results are not necessarily the same, since some parameter settings in the scripts may be different).

We suggest looking at these scripts and editing them to suit your needs.

================================================================================
  3. Running the parser

================================================================================
  3a. Input data format

The data format is the same as in the CoNLL-X shared task. Here is a sample of two sentences from the Dutch dataset:

1  Cathy  Cathy  N     N     eigen|ev|neut         2  su    _  _
2  zag    zie    V     V     trans|ovt|1of2of3|ev  0  ROOT  _  _
3  hen    hen    Pron  Pron  per|3|mv|datof
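Each non-blank line is one token with up to ten whitespace-separated fields, and sentences are separated by blank lines. Reading this format into per-sentence token lists can be sketched as follows (the field names follow the CoNLL-X shared task description; this helper is illustrative, not part of TurboParser):

```python
# Field names from the CoNLL-X shared task description
FIELDS = ("id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel")

def read_conll(path):
    """Yield each sentence as a list of token dicts keyed by CoNLL-X field name."""
    with open(path) as f:
        sentence = []
        for line in f:
            cols = line.split()
            if not cols:               # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
            else:
                sentence.append(dict(zip(FIELDS, cols)))
        if sentence:                   # file may not end with a blank line
            yield sentence
```

For the Dutch sample above, the first token of the first sentence would come out as {"id": "1", "form": "Cathy", ..., "head": "2", "deprel": "su"}.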
