QuEst++
An open source tool for pipelined Translation Quality Estimation.
This open source software is aimed at quality estimation (QE) for machine translation. It was developed by Professor Lucia Specia's team at the University of Sheffield and includes contributions from a number of researchers. This particular release was made possible through the EXPERT project and funding from EAMT.
QuEst++ is a new release of QuEst, including support for word- and document-level QE. QuEst++ has two independent modules: Feature Extractor Module (developed in Java) and Machine Learning Module (developed in Python).
Citing QuEst++
Lucia Specia, Gustavo Henrique Paetzold and Carolina Scarton (2015): Multi-level Translation Quality Prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp. 115-120. [PDF] [BIBTEX]
System requirements
The required Java and Python components are:
- Java 8 (JDK 1.8)
- NetBeans 8.1 (recommended) OR
- Apache Ant (>= 1.9.3)
- Python 2.7.6 or above (stable 2.7 distributions only)
- NumPy and SciPy (NumPy >=1.6.1 and SciPy >=0.9)
- scikit-learn (version 0.15.2)
- PyYAML
- CRFsuite
Please note: On Linux, the Feature Extractor Module should work with both the OpenJDK and Oracle versions (java-8-oracle recommended).
On Ubuntu, it is easier to install the Oracle distribution:
sudo apt-get install oracle-java8-installer
(Check http://ubuntuhandbook.org/index.php/2014/02/install-oracle-java-6-7-or-8-ubuntu-14-04/ if you don't find that version)
NetBeans has issues building on Linux. Install Ant instead to build from the command line:
sudo apt-get install ant
Feature extractor
This module implements a number of feature extractors, for word, sentence and document levels.
Dependencies - tools
Some of the libraries required to compile and run the code are included in the lib directory in the root directory of the distribution. The Java libraries should be included there when possible. However, two libraries were not included in the lib directory due to their size (they are used for word-level features only):
- Stanford Core NLP 3.5.1 models (place the file stanford-corenlp-3.5.1-models.jar in the lib directory)
- Stanford Core NLP Spanish models
Apart from these library files, QuEst++ requires other external tools and scripts to extract the baseline features. The paths to these external tools are set in a configuration file under the config folder:
- Perl 5 (or above)
- SRILM (for Language Model features only)
- Tokenizer (available in the lang_resources folder - from Moses toolkit)
- Truecaser (available in the lang_resources folder - from Moses toolkit)
For advanced features at sentence and document levels, the following tools may be necessary:
- TreeTagger
- Berkeley Parser (the file BerkeleyParser-1.7.jar is already included in the lib directory)
Please note that the above list is not exhaustive. The advanced feature sets require further external tools; see the features documentation for details.
Dependencies - resources
The resources required for word, sentence and document-level baseline features are:
- corpus for source language
- corpus for target language
- LM for source language
- LM for target language
- ngram counts file for source language
- ngram counts file for target language
For sentence and document-level features only:
- Truecase model for source language
- Truecase model for target language
- Giza lex file
For word-level only:
- POS ngram counts file for source language
- POS ngram counts file for target language
- corpus with POS information for source language
- corpus with POS information for target language
- reference translations in the target language
- stop words list of the source language
- translation probabilities of the source language
- Universal WordNet plugin (unzip this file inside the lang_resources folder)
Examples of these resources are provided in the lang_resources folder.
Resources for several languages can be downloaded from WMT15.
Advanced features may require specific data (please read the documentation of the specific features).
Input files
For word and sentence levels, the input files contain one sentence per line. For document level, the input files contain paths to documents (one document per line). Both source and target files should have the same number of lines.
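The line-count requirement above can be checked before running the extractor. The following is a minimal Python sketch, not part of QuEst++; the function names count_lines and check_parallel are hypothetical, and it assumes the input files are UTF-8 text.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Return the shared line count, or raise ValueError on a mismatch."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            "line count mismatch: %d source vs %d target" % (n_source, n_target)
        )
    return n_source
```

For example, check_parallel("input/source.sent-level.en", "input/target.sent-level.es") would return the number of sentence pairs if the two sample files are aligned line by line.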
An alignment file should also be provided for word-level feature extraction. This file is generated by Fast Align. Alternatively, you can provide the path to the Fast Align tool in the configuration file and QuEst++ will generate the missing resource.
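To inspect an alignment file, a small parser can help. This sketch assumes the file uses the "i-j" index-pair format that Fast Align normally writes (one sentence pair per line, pairs separated by spaces); the README itself does not spell out the format, and parse_alignment_line is a hypothetical helper, not a QuEst++ function.

```python
def parse_alignment_line(line):
    """Parse one line such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for token in line.split():
        # Each token is assumed to be 'sourceIndex-targetIndex'.
        source_index, target_index = token.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs
```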
Output file
The output file contains the extracted features, separated by tabs.
Word-level output consists of feature templates for the CRF algorithm.
Sentence and document-level features are real values separated by tab.
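For sentence- and document-level output, the tab-separated values can be loaded with a few lines of Python. This is a sketch under the assumption that every non-empty field is a real number; read_feature_rows is a hypothetical name, and word-level output (CRF templates) is not covered.

```python
def read_feature_rows(path):
    """Read tab-separated real-valued features, one row per input line."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip empty fields, e.g. those produced by a trailing tab.
            rows.append([float(field) for field in fields if field.strip()])
    return rows
```

Each returned row corresponds to one line of the output file, so the list can be passed on to NumPy or scikit-learn as a feature matrix once all rows have the same length.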
Build
You can build using NetBeans (version 8.1) - recommended.
Alternatively, you can use Apache Ant (>= 1.9.3):
ant "-Dplatforms.JDK_1.8.home=/usr/lib/jvm/java-8-<<version>>"
The ant command will create all classes needed to use QuEst++ and a QuEst++.jar file.
Basic Usage
- Word-Level:
java -cp QuEst++.jar:lib/* shef.mt.WordLevelFeatureExtractor -lang english spanish -input input/source.word-level.en input/target.word-level.es -alignments lang_resources/alignments/alignments.word-level.out -config config/config.word-level.properties
- Sentence-level:
java -cp QuEst++.jar shef.mt.SentenceLevelFeatureExtractor -tok -case true -lang english spanish -input input/source.sent-level.en input/target.sent-level.es -config config/config.sentence-level.properties
- Document-level:
java -cp QuEst++.jar shef.mt.DocLevelFeatureExtractor -tok -case true -lang english spanish -input input/source.doc-level.en input/target.doc-level.es -config config/config.doc-level.properties
Omit the option -tok if the input files are already tokenised.
The option -case can be no (no casing), true (truecase) or lower (lowercase).
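When the extractor is called from a script, the command line can be assembled programmatically. The sketch below only rebuilds the sentence-level command shown above, using the flags documented here (-tok, -case, -lang, -input, -config); the function sentence_level_command and its parameter names are hypothetical.

```python
def sentence_level_command(source_file, target_file, config_file,
                           source_lang="english", target_lang="spanish",
                           tokenize=True, case="true"):
    """Build the argument list for the sentence-level feature extractor."""
    if case not in ("no", "true", "lower"):
        raise ValueError("case must be 'no', 'true' or 'lower'")
    command = ["java", "-cp", "QuEst++.jar",
               "shef.mt.SentenceLevelFeatureExtractor"]
    if tokenize:
        # Omit -tok when the input files are already tokenised.
        command.append("-tok")
    command += ["-case", case,
                "-lang", source_lang, target_lang,
                "-input", source_file, target_file,
                "-config", config_file]
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) from the directory that contains QuEst++.jar.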
Please note:
- We provide examples of input and language resources for the basic usage commands.
- You need to adapt the configuration file by providing the paths to external tools as installed on your own system (such as the SRILM and TreeTagger paths).
Configuration File
The QuEst++ configuration file is a structured file (extension .properties) that contains information about the language pair, the feature set and the paths to resources and tools. The settings for language pair and features are shown below:
sourceLang.default = spanish
targetLang.default = english
output = output/test
input = input/test
resourcesPath = ./lang_resources
featureConfig = config/features/features_blackbox_17.xml
- sourceLang.default - default source language
- targetLang.default - default target language
- output - output folder
- input - input folder (where temporary files will be written)
- resourcesPath - language resources path
- featureConfig - features configuration file
An example of the parameters related to baseline features (for sentence and document level) is presented below:
source.corpus = ./lang_resources/english/sample_corpus.en
source.lm = ./lang_resources/english/english_lm.lm
source.truecase.model = ./lang_resources/english/truecase-model.en
source.ngram = ./lang_resources/english/english_ngram.ngram.clean
source.tokenizer.lang = en
giza.path = ./lang_resources/giza/lex.e2s
tools.ngram.path = /export/tools/srilm/bin/i686-m64/
- source.corpus - path to a corpus of the source language
- source.lm - path to a language model file of the source language
- source.truecase.model - path to a truecase model of the source language
- source.ngram - path to an ngram count file of the source language
- source.tokenizer.lang - language for the tokenizer
- giza.path - path to the Giza++ lex file
- tools.ngram.path - path to SRILM
Similarly, the config file contains parameters for the target language and for other resources and tools.
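Because most extraction failures come from wrong paths, it can be useful to check the configuration before a run. The following Python sketch reads simple "key = value" lines and reports which path parameters do not resolve. It is not part of QuEst++: load_properties and missing_paths are hypothetical helpers, and the parser deliberately ignores the less common .properties syntax (line continuations, ":" separators, escape sequences).

```python
import os


def load_properties(path):
    """Parse simple 'key = value' lines; skip blanks and # or ! comments."""
    properties = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            if not line or line[0] in "#!":
                continue
            key, separator, value = line.partition("=")
            if separator:
                properties[key.strip()] = value.strip()
    return properties


def missing_paths(properties, keys):
    """Return the keys that are unset or whose path does not exist on disk."""
    return [key for key in keys
            if key not in properties or not os.path.exists(properties[key])]
```

For example, missing_paths(load_properties("config/config.sentence-level.properties"), ["source.corpus", "source.lm", "giza.path"]) would list any of those three parameters that still need to be adapted to the local system.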
Feature Configuration File
This is an XML file listing the features that should be extracted. Its path is given in the configuration file through the 'featureConfig' parameter. An example of this file is shown below:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<features>
<feature class="shef.mt.features.impl.bb.Feature1001
