Multeval
Easy Bootstrap Resampling and Approximate Randomization for BLEU, METEOR, and TER using Multiple Optimizer Runs. This implements "Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability" from ACL 2011.
Overview
MultEval takes machine translation hypotheses from several runs of an optimizer and provides three popular metric scores, as well as standard deviations (via bootstrap resampling) and p-values (via approximate randomization). This allows researchers to mitigate some of the risk of using unstable optimizers such as MERT, MIRA, and MCMC. It is intended to help in evaluating the impact of in-house experimental variations on translation quality; it is currently not set up to do bake-off style comparisons (bake-offs can't require multiple optimizer runs nor a standard tokenization).
It is a user-friendly implementation of: Jonathan Clark, Chris Dyer, Alon Lavie, and Noah Smith, "Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability", Proceedings of the Association for Computational Linguistics, 2011. PDF
To keep updated on new versions of this software, subscribe to our low-traffic announcement mailing list: http://groups.google.com/group/multeval-announce. All active users are encouraged to subscribe.
Usage
First, download and unpack the program:
wget http://www.cs.cmu.edu/~jhclark/downloads/multeval-0.5.1.tgz
tar -xvzf multeval-0.5.1.tgz
To evaluate a single system from the example data and get its BLEU, METEOR, and TER scores along with their standard deviations, use:
./multeval.sh eval --refs example/refs.test2010.lc.tok.en.* \
--hyps-baseline example/hyps.lc.tok.en.baseline.opt* \
--meteor.language en
The first time you run this command, METEOR (and its sizable paraphrase tables) will be downloaded. Also, to help the user detect tokenization mismatches, MultEval prints the top out-of-vocabulary (OOV) words according to METEOR.
To compare several systems from the example data and get their BLEU, METEOR, and TER scores along with their standard deviations and p-values, use:
./multeval.sh eval --refs example/refs.test2010.lc.tok.en.* \
--hyps-baseline example/hyps.lc.tok.en.baseline.opt* \
--hyps-sys1 example/hyps.lc.tok.en.sys1.opt* \
--hyps-sys2 example/hyps.lc.tok.en.sys2.opt* \
--meteor.language en
If you'd also like 1) a LaTeX table that you can copy-paste into your paper, 2) the hypotheses from the median optimization run ranked by improvement/decline over your baseline system, and 3) a list of sentence-level metric scores, including submetrics such as BLEU precision and brevity penalty, then run it like this:
./multeval.sh eval --refs example/refs.test2010.lc.tok.en.* \
--hyps-baseline example/hyps.lc.tok.en.baseline.opt* \
--hyps-sys1 example/hyps.lc.tok.en.sys1.opt* \
--hyps-sys2 example/hyps.lc.tok.en.sys2.opt* \
--meteor.language en \
--latex table.tex \
--rankDir rank \
--sentLevelDir sentLevel
All files should contain tokenized, lowercased, space-delimited sentences in UTF-8 encoding, one sentence per line. Unlike many metric implementations, MultEval does no tokenization or segmentation for you (see discussion below).
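Because MultEval reads each file as one sentence per line, mismatched line counts between references and hypotheses are a common source of errors. A minimal pre-flight check (a sketch, not part of MultEval; the function name is hypothetical) could look like:

```python
def check_line_counts(ref_path, hyp_paths):
    """Verify that every hypothesis file has the same number of lines as
    the reference file; MultEval expects one sentence per line, UTF-8."""
    with open(ref_path, encoding="utf-8") as f:
        n_refs = sum(1 for _ in f)
    for path in hyp_paths:
        with open(path, encoding="utf-8") as f:
            n_hyps = sum(1 for _ in f)
        if n_hyps != n_refs:
            raise ValueError(f"{path}: {n_hyps} lines, expected {n_refs}")
    return n_refs
```

Running a check like this before `multeval.sh` makes alignment errors fail fast instead of silently skewing scores.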
Generally, you should evaluate full forms (i.e. without word segmentation). For languages without a canonical notion of words (e.g. Chinese, Japanese), we recommend splitting all non-Latin characters (e.g. each character that is not part of a borrowed Western word, URL, etc. should be evaluated as its own word.)
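For illustration only (this is one reasonable way to perform such splitting, not a tool shipped with MultEval), character splitting that preserves borrowed Latin words can be sketched as:

```python
import re

def split_non_latin(line):
    """Split every character outside Latin letter/digit runs into its own
    token, keeping runs of Latin letters and digits (e.g. borrowed Western
    words, numbers, URLs without spaces) together. Illustrative sketch."""
    tokens = []
    for chunk in line.split():
        # A run of Latin letters/digits stays one token; anything else
        # (e.g. a CJK character) becomes a token by itself.
        tokens.extend(re.findall(r"[A-Za-z0-9]+|\S", chunk))
    return " ".join(tokens)
```

The same splitting scheme must of course be applied to both hypotheses and references.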
For a more detailed description of the various METEOR options, please see http://github.com/mjdenkowski/meteor.
METEOR and its paraphrase tables will automatically be downloaded from the web the first time you run multeval.sh. They are not included in the initial download due to the large size (~200MB) of the paraphrase tables.
The ASCII table produced by multeval looks something like this:
n=3 BLEU (s_sel/s_opt/p) METEOR (s_sel/s_opt/p) TER (s_sel/s_opt/p) Length (s_sel/s_opt/p)
baseline 18.5 (0.3/0.1/-) 29.3 (0.1/0.0/-) 65.7 (0.4/0.2/-) 107.5 (0.4/0.1/-)
system 1 18.8 (0.3/0.3/0.00) 30.3 (0.1/0.1/0.00) 64.8 (0.4/0.6/0.00) 107.7 (0.3/0.7/0.09)
system 2 18.5 (0.3/0.1/0.00) 29.3 (0.1/0.0/0.00) 65.7 (0.4/0.2/0.00) 107.5 (0.4/0.1/0.00)
A quick explanation of these numbers (see paper for details):
- s_sel: The variance due to test set SELection. This is calculated using bootstrap resampling for each optimizer run and this number reports the average variance over all optimizer runs.
- s_opt: The variance due to OPTimizer instability. This is calculated directly as the variance of the aggregate metric score over all optimizer runs.
- p: This is the p-value calculated by approximate randomization. It can roughly be interpreted as the probability of the absolute difference between the baseline system and system i occurring due to chance where random permutations between the two systems are used to simulate chance occurrences. The quality of this measure depends on the n separate optimization runs of your system and is conditioned on your test set. See below for a more in-depth discussion on p-values.
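As a rough sketch of how the s_sel estimate works for a single optimizer run (simplified: here the corpus score is a plain average of per-sentence scores, whereas real metrics like BLEU aggregate sufficient statistics over the resampled test set before scoring):

```python
import random

def bootstrap_stddev(sent_scores, n_replicas=1000, seed=0):
    """Estimate test-set selection variability: resample the test set with
    replacement many times and measure how the corpus-level score (here, an
    average of per-sentence scores) varies across the resampled sets."""
    rng = random.Random(seed)
    n = len(sent_scores)
    replica_scores = []
    for _ in range(n_replicas):
        sample = [sent_scores[rng.randrange(n)] for _ in range(n)]
        replica_scores.append(sum(sample) / n)
    mean = sum(replica_scores) / n_replicas
    var = sum((s - mean) ** 2 for s in replica_scores) / n_replicas
    return var ** 0.5
```

MultEval runs this resampling once per optimizer run and reports the average over runs as s_sel.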
The LaTeX table produced by multeval looks something like this:

[figure: rendered LaTeX table omitted]
To see a full list of options, use:
./multeval.sh eval
which gives:
Usage: program <module_name> [options...]
=== TER ===
-T [--ter.shiftCost] Shift cost for TER
-d [--ter.maxShiftDistance] Maximum shift distance for TER
-P [--ter.punctuation] Use punctuation in TER?
-b [--ter.beamWidth] Beam width for TER
-B [--ter.substituteCost] Substitute cost for TER
-D [--ter.deleteCost] Delete cost for TER
-M [--ter.matchCost] Match cost for TER
-I [--ter.insertCost] Insert cost for TER
=== BLEU ===
=== METEOR ===
-t [--meteor.task] One of: rank adq hter tune (Rank is generally a good choice)
-s [--meteor.synonymDirectory] If default is not desired (NOTE: This option has a different short flag than stand-alone METEOR) [optional]
-x [--meteor.beamSize] Specify beam size (overrides default)
-p [--meteor.params] Custom parameters of the form 'alpha beta gamma' (overrides default) [optional]
-w [--meteor.weights] Specify module weights (overrides default) [optional]
-a [--meteor.paraphraseFile] If default is not desired [optional]
-m [--meteor.modules] Specify modules. (overrides default) Any of: exact stem synonym paraphrase [optional]
-k [--meteor.keepPunctuation] Consider punctuation when aligning sentences (if false, the meteor tokenizer will be run, after which punctuation will be removed)
-l [--meteor.language] Two-letter language code of a supported METEOR language (e.g. 'en')
=== MultEvalModule (for eval module) ===
-b [--boot-samples] Number of bootstrap replicas to draw during bootstrap resampling to estimate standard deviation for each system
-H [--hyps-sys] Space-delimited list of files containing tokenized, fullform hypotheses, one per line
-s [--ar-shuffles] Number of shuffles to perform to estimate p-value during approximate randomization test system *PAIR*
-r [--rankDir] Rank hypotheses of median optimization run of each system with regard to improvement/decline over median baseline system and output to the specified directory for analysis [optional]
-R [--refs] Space-delimited list of files containing tokenized, fullform references, one per line
-o [--metrics] Space-delimited list of metrics to use. Any of: bleu, meteor, ter, length
-F [--fullLatexDoc] Output a fully compilable Latex document instead of just the table alone [optional]
-L [--latex] Latex-formatted table including measures that are commonly (or should be commonly) reported [optional]
-D [--debug] Show debugging output? [optional]
-B [--hyps-baseline] Space-delimited list of files containing tokenized, fullform hypotheses, one per line
-v [--verbosity] Verbosity level
--help help message
What do p-values actually mean?
A p-value is a model's estimate (where the model is a significance test) of the probability that a particular difference in scores arose by chance. MultEval uses approximate randomization, a test that approximates a permutation test by sampling shuffles of aligned hypotheses between the two systems.
The most important points are:
- a p-value does tell you whether a difference of this magnitude is likely to be generated again by some random process (a randomized optimizer)
- a p-value does not tell you whether a difference of this magnitude is meaningful (in terms of translation quality)
So even though larger differences may tend to correspond to smaller p-values, this is not guaranteed. In fact, small differences can be highly significant, and vice versa. For example, if you give MultEval a single optimizer sample with identical hypotheses and tell it that these are actually two different systems (as with the baseline system and system 2 in the example data), there will be zero difference in scores and also a p-value of zero, since shuffling hypotheses between the systems produces no change, indicating that this difference (of zero) is likely to be reproducible. This illustrates, among other things, that the significance test does not account for the user giving it degenerate input, such as the same hypotheses labeled as two different systems.
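The test itself can be sketched in a few lines (an illustration under simplifying assumptions: per-sentence scores that sum to a corpus score, and shuffled differences counted only when strictly larger than the observed one, which is why two identical systems yield a p-value near zero rather than one):

```python
import random

def ar_pvalue(scores_a, scores_b, n_shuffles=10000, seed=0):
    """Approximate randomization: under the null hypothesis that the two
    systems are interchangeable, randomly swapping aligned hypotheses
    should often produce a score difference as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    count = 0
    for _ in range(n_shuffles):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # swap this aligned sentence pair
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) > observed:  # strict: ties don't count
            count += 1
    return (count + 1) / (n_shuffles + 1)
```

The add-one smoothing in the return value keeps the estimate from reporting exactly zero after finitely many shuffles.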