QuEst++

An open source tool for pipelined Translation Quality Estimation.

This open source software is aimed at quality estimation (QE) for machine translation. It was developed by Professor Lucia Specia's team at the University of Sheffield and includes contributions from a number of researchers. This particular release was made possible through the EXPERT project and funding from EAMT.

QuEst++ is a new release of QuEst that adds support for word- and document-level QE. It has two independent modules: a Feature Extractor Module (developed in Java) and a Machine Learning Module (developed in Python).


Citing QuEst++

Lucia Specia, Gustavo Henrique Paetzold and Carolina Scarton (2015): Multi-level Translation Quality Prediction with QuEst++. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp. 115-120.


System requirements

The required Java and Python software is:

  1. Java 8 (JDK 1.8)
  2. NetBeans 8.1 (recommended) OR
  3. Apache Ant (>= 1.9.3)
  4. Python 2.7.6 or above (stable 2.7 distributions only)
  5. NumPy and SciPy (NumPy >=1.6.1 and SciPy >=0.9)
  6. scikit-learn (version 0.15.2)
  7. PyYAML
  8. CRFsuite

Please note: On Linux, the Feature Extractor Module should work with both the OpenJDK and Oracle versions of Java (java-8-oracle recommended).

On Ubuntu, it is easier to install the Oracle distribution:

sudo apt-get install oracle-java8-installer

(Check http://ubuntuhandbook.org/index.php/2014/02/install-oracle-java-6-7-or-8-ubuntu-14-04/ if you don't find that version)

NetBeans has problems building the project on Linux. Install Ant instead and build from the command line:

sudo apt-get install ant
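
Before building, it can help to confirm that the command-line tools named above are reachable. The following is a minimal sketch in Python 3, written as a standalone helper that is not part of QuEst++. The function name missing_tools is hypothetical, and the check only looks for executables on the PATH; it does not verify versions.

```python
import shutil


def missing_tools(names):
    """Return the names that cannot be found as executables on the PATH."""
    return [name for name in names if shutil.which(name) is None]
```

For example, missing_tools(["java", "ant", "perl"]) returns the subset of those three commands that still needs to be installed or added to the PATH.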

Feature extractor

This module implements feature extractors for the word, sentence and document levels.

Dependencies - tools

Some of the libraries required to compile and run the code are included in the lib directory in the root directory of the distribution. The Java libraries are included there when possible. However, two libraries used only for word-level features are not included in the lib directory because of their size.

Apart from these library files, QuEst++ requires other external tools and scripts to extract the baseline features. The paths to these external tools are set in a configuration file in the config folder:

  • Perl 5 (or above)
  • SRILM (for Language Model features only)
  • Tokenizer (available in the lang_resources folder, from the Moses toolkit)
  • Truecaser (available in the lang_resources folder, from the Moses toolkit)

For advanced features at sentence and document levels, the following tools may be necessary:

Please note that the above list is not exhaustive. The advanced feature set requires further external tools; see the features documentation for details.

Dependencies - resources

The resources required for word, sentence and document-level baseline features are:

  • corpus for source language
  • corpus for target language
  • LM for source language
  • LM for target language
  • ngram counts file for source language
  • ngram counts file for target language

For sentence and document-level features only:

  • Truecase model for source language
  • Truecase model for target language
  • Giza lex file

For word-level only:

  • POS ngram counts file for source language
  • POS ngram counts file for target language
  • corpus with POS information for source language
  • corpus with POS information for target language
  • reference translations in the target language
  • stop words list of the source language
  • translation probabilities of the source language
  • Universal WordNet plugin (unzip this file inside the lang_resources folder)

Examples of these resources are provided in the lang_resources folder. Resources for several languages can be downloaded from WMT15. Advanced features may require specific data (please read the documentation of the specific features).

Input files

For word and sentence levels, the input files contain one sentence per line. For document level, the input files contain paths to documents (one document per line). Both source and target files should have the same number of lines.
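
The requirement that source and target files have the same number of lines can be checked before running the extractor. The following is a minimal sketch in Python 3, written as a standalone helper that is not part of QuEst++. The function names count_lines and check_parallel are hypothetical, and UTF-8 encoding is an assumption.

```python
def count_lines(path):
    """Count the lines in a text file (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(source_path, target_path):
    """Return the shared line count, or raise ValueError on a mismatch."""
    n_source = count_lines(source_path)
    n_target = count_lines(target_path)
    if n_source != n_target:
        raise ValueError(
            "line count mismatch: %d source vs %d target" % (n_source, n_target)
        )
    return n_source
```

The same check applies at document level, where each line is a path to a document rather than a sentence.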

An alignment file should also be provided for word-level feature extraction. This file is generated by Fast Align. Alternatively, you can provide the path to the Fast Align tool in the configuration file and QuEst++ will generate the missing resource.
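
To inspect an alignment file, the sketch below parses one line of it in Python 3. It assumes the usual Fast Align output format, with one line per sentence pair and space-separated source-target index pairs such as 0-0 1-2. The function name parse_alignment_line is hypothetical and the code is not part of QuEst++.

```python
def parse_alignment_line(line):
    """Turn a line such as '0-0 1-2 2-1' into a list of (source, target) index pairs."""
    pairs = []
    for token in line.split():
        source_index, target_index = token.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs
```

An empty line, which stands for a sentence pair with no aligned words, yields an empty list.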

Output file

The output file contains the extracted features separated by tabs. Word-level output consists of feature templates for the CRF algorithm. Sentence- and document-level features are real values separated by tabs.
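
A sentence- or document-level output file can be read back with a few lines of Python 3. This is a minimal sketch that assumes one instance per line, UTF-8 encoding, and purely numeric values. The function name read_feature_rows is hypothetical, the code is not part of QuEst++, and it does not apply to the word-level CRF templates.

```python
def read_feature_rows(path):
    """Read tab-separated real-valued features, one list of floats per line."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()  # also drops a trailing tab, if present
            if not line:
                continue  # skip blank lines
            rows.append([float(value) for value in line.split("\t")])
    return rows
```

Each returned row holds the feature values for one sentence or document, in the order they appear in the file.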

Build

The recommended way to build is with NetBeans (version 8.1).

Alternatively, you can use Apache Ant (>= 1.9.3):

ant "-Dplatforms.JDK_1.8.home=/usr/lib/jvm/java-8-<<version>>"

The ant command will create all classes needed to use QuEst++ and a QuEst++.jar file.

Basic Usage

  1. Word-level:
java -cp QuEst++.jar:lib/* shef.mt.WordLevelFeatureExtractor -lang english spanish -input input/source.word-level.en input/target.word-level.es -alignments lang_resources/alignments/alignments.word-level.out -config config/config.word-level.properties
  2. Sentence-level:
java -cp QuEst++.jar shef.mt.SentenceLevelFeatureExtractor -tok -case true -lang english spanish -input input/source.sent-level.en input/target.sent-level.es -config config/config.sentence-level.properties
  3. Document-level:
java -cp QuEst++.jar shef.mt.DocLevelFeatureExtractor -tok -case true -lang english spanish -input input/source.doc-level.en input/target.doc-level.es -config config/config.doc-level.properties

Omit the option -tok if the input files are already tokenised. The option -case can be no (no casing), true (truecase) or lower (lowercase).
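
When the extractor is called from a script, the options above can be assembled programmatically. The following Python 3 sketch builds the argument list for the sentence-level command using only the flags shown in this section. The function name build_sentence_level_command and its parameters are hypothetical, and the code is not part of QuEst++.

```python
def build_sentence_level_command(source_lang, target_lang, source_file,
                                 target_file, config, tokenise=True,
                                 case="true", jar="QuEst++.jar"):
    """Build the argument list for the sentence-level feature extractor."""
    if case not in ("no", "true", "lower"):
        raise ValueError("case must be 'no', 'true' or 'lower'")
    command = ["java", "-cp", jar, "shef.mt.SentenceLevelFeatureExtractor"]
    if tokenise:
        command.append("-tok")  # leave out for input that is already tokenised
    command += ["-case", case]
    command += ["-lang", source_lang, target_lang]
    command += ["-input", source_file, target_file]
    command += ["-config", config]
    return command
```

The returned list can be passed to subprocess.run from the root directory of the distribution. Keeping the arguments as a list avoids shell quoting problems with file names.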

Please note:

  1. We provide examples of input and language resources for the basic usage commands.
  2. You need to adapt the configuration file by providing the paths to the tools and scripts as they are installed on your own system (such as the SRILM and TreeTagger paths).

Configuration File

The QuEst++ configuration file is a structured file (extension .properties) that contains information about the language pair, the feature set, and the paths to resources and tools. The settings for the language pair and features are shown below:

sourceLang.default	= spanish
targetLang.default	= english
output			= output/test
input 			= input/test
resourcesPath 		= ./lang_resources
featureConfig 		= config/features/features_blackbox_17.xml
  • sourceLang.default - default source language
  • targetLang.default - default target language
  • output - output folder
  • input - input folder (where temporary files will be written)
  • resourcesPath - language resources path
  • featureConfig - features configuration file

An example of the parameters related to baseline features (for sentence and document level) is presented below:

source.corpus               = ./lang_resources/english/sample_corpus.en
source.lm		    = ./lang_resources/english/english_lm.lm
source.truecase.model       = ./lang_resources/english/truecase-model.en
source.ngram                = ./lang_resources/english/english_ngram.ngram.clean
source.tokenizer.lang       = en
giza.path                   = ./lang_resources/giza/lex.e2s
tools.ngram.path	    = /export/tools/srilm/bin/i686-m64/
  • source.corpus - path to a corpus of the source language
  • source.lm - path to a language model file of the source language
  • source.truecase.model - path to a truecase model of the source language
  • source.ngram - path to a ngram count file of the source language
  • source.tokenizer.lang - language for the tokenizer
  • giza.path - path to the Giza++ lex file
  • tools.ngram.path - path to SRILM

Similarly, the config file contains parameters for the target language and for other resources and tools.
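
Because most of these parameters are file system paths, a quick check that they exist can catch configuration mistakes before a long extraction run. The following Python 3 sketch reads key = value lines like the ones shown above and reports which of the chosen keys are absent or point to a missing path. It handles only the simple subset of the .properties syntax used in these examples ("=" separators and "#" or "!" comment lines). The function names parse_properties and missing_paths are hypothetical, and the code is not part of QuEst++.

```python
import os


def parse_properties(text):
    """Parse simple 'key = value' lines into a dictionary."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue  # skip blank lines and comments
        key, separator, value = line.partition("=")
        if not separator:
            continue  # not a 'key = value' line
        properties[key.strip()] = value.strip()
    return properties


def missing_paths(properties, keys):
    """Return the keys that are absent or whose path does not exist."""
    missing = []
    for key in keys:
        value = properties.get(key)
        if value is None or not os.path.exists(value):
            missing.append(key)
    return missing
```

For example, after reading the configuration file into a string, missing_paths(parse_properties(text), ["source.corpus", "source.lm", "giza.path"]) lists the entries that still need attention. Relative paths are resolved against the current working directory.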

Feature Configuration File

This is an XML file listing the features that should be extracted. Its path is given in the configuration file through the 'featureConfig' parameter. An example of this file is shown below:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<features>
  <feature class="shef.mt.features.impl.bb.Feature1001