QuEst++

An open source tool for pipelined Translation Quality Estimation.

This open source software is aimed at quality estimation (QE) for machine translation. It was developed by Professor Lucia Specia's team at the University of Sheffield and includes contributions from a number of researchers. This particular release was made possible through the EXPERT project and funding from EAMT.

QuEst++ is a new release of QuEst that adds support for word- and document-level QE. QuEst++ has two independent modules: the Feature Extractor Module (developed in Java) and the Machine Learning Module (developed in Python).

Citing QuEst++

Lucia Specia, Gustavo Henrique Paetzold and Carolina Scarton (2015): Multi-level Translation Quality Prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp. 115-120.

System requirements

The required Java and Python software is:

Java 8 (JDK 1.8)
NetBeans 8.1 (recommended) OR
Apache Ant (>= 1.9.3)
Python 2.7.6 (or above - only 2.7 stable distributions)
NumPy and SciPy (NumPy >=1.6.1 and SciPy >=0.9)
scikit-learn (version 0.15.2)
PyYAML
CRFsuite

Please note: on Linux, the Feature Extractor Module should work with both the OpenJDK and Oracle versions of Java (java-8-oracle recommended).

On Ubuntu, it is easier to install the Oracle distribution:

sudo apt-get install oracle-java8-installer

(Check http://ubuntuhandbook.org/index.php/2014/02/install-oracle-java-6-7-or-8-ubuntu-14-04/ if you don't find that version)

NetBeans has issues building on Linux. Install Ant instead to build from the command line:

sudo apt-get install ant

Feature extractor

This module implements a number of feature extractors, for word, sentence and document levels.

Dependencies - tools

Some of the libraries required to compile and run the code are included in the lib directory in the root directory of the distribution. Java libraries should be placed there when possible. However, two libraries (used for word-level features only) are not included in the lib directory due to their size:

Stanford Core NLP 3.5.1 models (place the file stanford-corenlp-3.5.1-models.jar in the lib directory)
Stanford Core NLP Spanish models

Apart from these library files, QuEst++ requires other external tools and scripts to extract the baseline features. The paths to these external tools are set in a configuration file under the config folder:

Perl 5 (or above)
SRILM (for Language Model features only)
Tokenizer (available at lang_resources folder - from Moses toolkit)
Truecaser (available at lang_resources folder - from Moses toolkit)

For advanced features at sentence and document levels, the following tools may be necessary:

TreeTagger
Berkeley Parser (the file BerkeleyParser-1.7.jar is already included in the lib directory)

Please note that the above list is not exhaustive. The advanced feature sets require additional external tools; see the features documentation for details.

Dependencies - resources

The resources required for word, sentence and document-level baseline features are:

corpus for source language
corpus for target language
LM for source language
LM for target language
ngram counts file for source language
ngram counts file for target language

For sentence and document-level features only:

Truecase model for source language
Truecase model for target language
Giza lex file

For word-level only:

POS ngram counts file for source language
POS ngram counts file for target language
corpus with POS information for source language
corpus with POS information for target language
reference translations in the target language
stop words list of the source language
translation probabilities of the source language
Universal WordNet plugin (unzip this file inside the lang_resources folder)

Examples of these resources are provided in the lang_resources folder. Resources for several languages can be downloaded from WMT15. Advanced features may require specific data (please read the documentation of the specific features).

Input files

For word and sentence levels, the input files contain one sentence per line. For document level, the input files contain paths to documents (one document per line). Both source and target files should have the same number of lines.
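
Since both input files must have the same number of lines, it can be worth checking this before running the extractor. The following is a minimal sketch and not part of QuEst++; the function names count_lines and check_parallel are hypothetical, and the files are assumed to be UTF-8 encoded.

```python
import io


def count_lines(path):
    """Return the number of lines in a text file (assumed UTF-8)."""
    with io.open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Raise ValueError if source and target differ in line count.

    Returns the shared line count when the files match.
    """
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            "line count mismatch: %d source vs %d target" % (n_source, n_target)
        )
    return n_source
```

For example, check_parallel("input/source.sent-level.en", "input/target.sent-level.es") would return the number of sentence pairs, or raise an error if the two files are out of step.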

An alignment file should also be provided for word-level feature extraction. This file is generated by Fast Align. Alternatively, you can provide the path to the Fast Align tool in the configuration file and QuEst++ will generate the missing resource.
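
To inspect an alignment file, the sketch below parses one line of it. It assumes the file uses the "i-j" pair format that Fast Align prints by default (for example "0-0 1-2 2-1", one sentence pair per line); this format is an assumption about the tool's output and is not stated in this README. The function name parse_alignment_line is hypothetical.

```python
def parse_alignment_line(line):
    """Parse one line of 'i-j' pairs into (source_index, target_index) tuples.

    Assumes whitespace-separated pairs such as "0-0 1-2 2-1".
    """
    pairs = []
    for token in line.split():
        source_index, target_index = token.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs
```

An empty line (a sentence pair with no alignment points) yields an empty list.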

Output file

The output file contains the extracted features, separated by tabs. Word-level output consists of feature templates for the CRF algorithm. Sentence- and document-level features are real values separated by tabs.
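
As an illustration of the sentence- and document-level format, the following sketch reads such an output file into a list of rows of floats. It is not part of QuEst++: the function name read_feature_rows is hypothetical, and the handling of trailing tabs and blank lines is a defensive assumption rather than documented behaviour. It does not apply to word-level output, which contains CRF feature templates rather than real values.

```python
import io


def read_feature_rows(path):
    """Read tab-separated real-valued features, one row per input line."""
    rows = []
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop empty fields, e.g. those produced by a trailing tab.
            fields = [f for f in line.rstrip("\n").split("\t") if f.strip()]
            if fields:
                rows.append([float(f) for f in fields])
    return rows
```

Each inner list then holds the feature values for one sentence (or one document), in the order in which the features were extracted.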

Build

You can build using NetBeans (version 8.1) - recommended.

Alternatively, you can use Apache Ant (>= 1.9.3):

ant "-Dplatforms.JDK_1.8.home=/usr/lib/jvm/java-8-<<version>>"

The ant command will create all classes needed to use QuEst++ and a QuEst++.jar file.

Basic Usage

Word-Level:

java -cp QuEst++.jar:lib/* shef.mt.WordLevelFeatureExtractor -lang english spanish -input input/source.word-level.en input/target.word-level.es -alignments lang_resources/alignments/alignments.word-level.out -config config/config.word-level.properties

Sentence-level:

java -cp QuEst++.jar shef.mt.SentenceLevelFeatureExtractor -tok -case true -lang english spanish -input input/source.sent-level.en input/target.sent-level.es -config config/config.sentence-level.properties

Document-level:

java -cp QuEst++.jar shef.mt.DocLevelFeatureExtractor -tok -case true -lang english spanish -input input/source.doc-level.en input/target.doc-level.es -config config/config.doc-level.properties

Omit the option -tok if the input files are already tokenised. The option -case can be no (no casing), true (truecase) or lower (lowercase).
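
When the extractor is called from a script, the command line can be assembled programmatically. The sketch below builds the sentence-level command shown above as an argument list; the function name build_sentence_level_command and its parameter names are hypothetical, and only the flags listed in this section are used.

```python
def build_sentence_level_command(source_lang, target_lang,
                                 source_file, target_file, config_file,
                                 tokenise=True, case="true",
                                 jar="QuEst++.jar"):
    """Return the argument list for the sentence-level feature extractor."""
    if case not in ("no", "true", "lower"):
        raise ValueError("case must be 'no', 'true' or 'lower'")
    command = ["java", "-cp", jar, "shef.mt.SentenceLevelFeatureExtractor"]
    if tokenise:
        # Omit -tok when the input files are already tokenised.
        command.append("-tok")
    command += [
        "-case", case,
        "-lang", source_lang, target_lang,
        "-input", source_file, target_file,
        "-config", config_file,
    ]
    return command
```

The resulting list can be passed to subprocess.check_call from the Python standard library, run from the root directory of the distribution.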

Please note:

We provide examples of input and language resources for the basic usage commands.
You need to adapt the configuration file by providing the paths to the tools and scripts as installed on your own system (such as the SRILM and TreeTagger paths).

Configuration File

The QuEst++ configuration file is a structured file (extension .properties) that contains information about the language pair, the feature set and the paths to resources and tools. The parameters for the language pair and features are shown below:

sourceLang.default	= spanish
targetLang.default	= english
output			= output/test
input 			= input/test
resourcesPath 		= ./lang_resources
featureConfig 		= config/features/features_blackbox_17.xml

sourceLang.default - default source language
targetLang.default - default target language
output - output folder
input - input folder (where temporary files will be written)
resourcesPath - language resources path
featureConfig - features configuration file

An example of the parameters related to baseline features (for sentence and document level) is presented below:

source.corpus               = ./lang_resources/english/sample_corpus.en
source.lm		    = ./lang_resources/english/english_lm.lm
source.truecase.model       = ./lang_resources/english/truecase-model.en
source.ngram                = ./lang_resources/english/english_ngram.ngram.clean
source.tokenizer.lang       = en
giza.path                   = ./lang_resources/giza/lex.e2s
tools.ngram.path	    = /export/tools/srilm/bin/i686-m64/

source.corpus - path to a corpus of the source language
source.lm - path to a language model file of the source language
source.truecase.model - path to a truecase model of the source language
source.ngram - path to an ngram counts file of the source language
source.tokenizer.lang - language for the tokenizer
giza.path - path to the Giza++ lex file
tools.ngram.path - path to SRILM

Similarly, the config file contains parameters for the target language and for other resources and tools.
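
To check which paths a configuration file actually sets, the "key = value" lines can be read with a few lines of Python. This is a simplified sketch, not the parser used by QuEst++ (which is a Java program): the function name read_properties is hypothetical, and the sketch ignores parts of the Java .properties format such as line continuations, escape sequences and ":" separators.

```python
import io


def read_properties(path):
    """Read simple 'key = value' lines from a .properties file into a dict."""
    settings = {}
    with io.open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments and lines without a separator.
            if not line or line.startswith(("#", "!")) or "=" not in line:
                continue
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
    return settings
```

For instance, read_properties("config/config.sentence-level.properties") would return a dictionary in which the key "tools.ngram.path" maps to the SRILM path configured on your system.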

Feature Configuration File

This is an XML file listing the features that should be extracted. Its path is given in the configuration file through the 'featureConfig' parameter. An example of this file is shown below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<features>
  <feature class="shef.mt.features.impl.bb.Feature1001
