sacreBLEU

SacreBLEU (Post, 2018) provides hassle-free computation of shareable, comparable, and reproducible BLEU scores. Inspired by Rico Sennrich's multi-bleu-detok.perl, it produces the official WMT scores but works with plain text. It also knows all the standard test sets and handles downloading, processing, and tokenization for you.

The official version is hosted at https://github.com/mjpost/sacrebleu.

Motivation

Comparing BLEU scores is harder than it should be. Every decoder has its own implementation, often borrowed from Moses, but maybe with subtle changes. Moses itself has a number of implementations as standalone scripts, with little indication of how they differ (note: they mostly don't, but multi-bleu.perl expects tokenized input). Different flags passed to each of these scripts can produce wide swings in the final score. All of these may handle tokenization in different ways. On top of this, downloading and managing test sets is a moderate annoyance.

Sacre bleu! What a mess.

SacreBLEU aims to solve these problems by wrapping the original reference implementation (Papineni et al., 2002) together with other useful features. The defaults are set the way that BLEU should be computed, and furthermore, the script outputs a short version string that allows others to know exactly what you did. As an added bonus, it automatically downloads and manages test sets for you, so that you can simply tell it to score against wmt14, without having to hunt down a path on your local file system. It is all designed to take BLEU a little more seriously. After all, even with all its problems, BLEU is the default and---admit it---well-loved metric of our entire research community. Sacre BLEU.

Features

  • It automatically downloads common WMT test sets and processes them to plain text
  • It produces a short version string that facilitates cross-paper comparisons
  • It properly computes scores on detokenized outputs, using WMT (Conference on Machine Translation) standard tokenization
  • It produces the same values as the official script (mteval-v13a.pl) used by WMT
  • It outputs the BLEU score without the comma, so you don't have to remove it with sed (Looking at you, multi-bleu.perl)
  • It supports different tokenizers for BLEU, including tokenizers for Japanese and Chinese
  • It supports the chrF, chrF++ and Translation Error Rate (TER) metrics
  • It performs paired bootstrap resampling and paired approximate randomization tests for statistical significance reporting
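
For example, a paired bootstrap resampling test between two systems can be run directly from the command line (a sketch: baseline.txt and contrastive.txt are placeholder system outputs; the first file passed to -i is treated as the baseline):

$ sacrebleu ref.detok.txt -i baseline.txt contrastive.txt --paired-bs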

Breaking Changes

v2.0.0

As of v2.0.0, the default output format has changed to JSON for a less painful parsing experience. This means that software that parses the output of sacreBLEU should be modified to either (i) parse the JSON, using for example the jq utility, or (ii) pass -f text to sacreBLEU to preserve the old textual output. The latter can also be made persistent by exporting SACREBLEU_FORMAT=text in the relevant shell configuration files.

Here's an example of parsing the score key of the JSON output using jq:

$ sacrebleu -i output.detok.txt -t wmt17 -l en-de | jq -r .score
20.8

Installation

Install the official Python module from PyPI (Python>=3.9 only):

pip install sacrebleu

To install Japanese tokenizer support through mecab-python3, run the following command instead; it performs a full installation with the extra dependencies:

pip install "sacrebleu[ja]"

To install Korean tokenizer support through pymecab-ko, run the following command instead; it performs a full installation with the extra dependencies:

pip install "sacrebleu[ko]"

Command-line Usage

You can get a list of available test sets with sacrebleu --list. Please see DATASETS.md for an up-to-date list of supported datasets. You can also list available test sets for a given language pair with sacrebleu --list -l en-fr.
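
For copy-paste convenience, both listing modes look like this:

$ sacrebleu --list             # all available test sets
$ sacrebleu --list -l en-fr    # only test sets covering the en-fr pair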

Basics

Downloading test sets

Downloading is triggered the first time you request a test set: if the dataset is not already available locally, it is downloaded and unpacked.

E.g., you can use the following commands to download the source, pass it through your translation system in translate.sh, and then score it:

$ sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
$ cat wmt17.en-de.en | translate.sh | sacrebleu -t wmt17 -l en-de

Some test sets also include the outputs of systems that were submitted to the task, for example the wmt21/systems test set.

$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans

This provides a convenient way to score a submitted system against the official reference:

$ sacrebleu -t wmt21/systems -l zh-en --echo NiuTrans | sacrebleu -t wmt21/systems -l zh-en

You can see a list of the available outputs by passing an invalid value to --echo.

JSON output

As of version >=2.0.0, sacreBLEU prints the computed scores in JSON format to make parsing less painful:

$ sacrebleu -i output.detok.txt -t wmt17 -l en-de
{
 "name": "BLEU",
 "score": 20.8,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

If you want to keep the old behavior, you can pass -f text or export SACREBLEU_FORMAT=text:

$ sacrebleu -i output.detok.txt -t wmt17 -l en-de -f text
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

Scoring

(All examples below use the old-style text output for a compact representation that saves space.)

Let's say that you just translated the en-de test set of WMT17 with your fancy MT system and the detokenized translations are in a file called output.detok.txt:

# Option 1: Redirect system output to STDIN
$ cat output.detok.txt | sacrebleu -t wmt17 -l en-de
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

# Option 2: Use the --input/-i argument
$ sacrebleu -t wmt17 -l en-de -i output.detok.txt
BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

You can obtain a short version of the signature with --short/-sh:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -sh
BLEU|#:1|c:mixed|e:no|tok:13a|s:exp|v:2.0.0 = 20.8 54.4/26.6/14.9/8.7 (BP = 1.000 ratio = 1.026 hyp_len = 62880 ref_len = 61287)

If you only want the score to be printed, you can use the --score-only/-b flag:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b
20.8

The precision of the scores can be configured via the --width/-w flag:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -b -w 4
20.7965

Using your own reference file

SacreBLEU knows about common test sets (as detailed in the --list example above), but you can also use it to score system outputs against arbitrary references. In this case, do not forget to provide detokenized reference and hypothesis files:

# Let's save the reference to a text file
$ sacrebleu -t wmt17 -l en-de --echo ref > ref.detok.txt

# Option 1: Pass the reference file as a positional argument to sacreBLEU
$ sacrebleu ref.detok.txt -i output.detok.txt -m bleu -b -w 4
20.7965

# Option 2: Redirect the system output into STDIN (compatible with the multi-bleu.perl way of doing things)
$ cat output.detok.txt | sacrebleu ref.detok.txt -m bleu -b -w 4
20.7965
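
SacreBLEU also accepts more than one reference: pass each reference file as an additional positional argument. In the sketch below, ref2.detok.txt is a hypothetical second reference file; with two references, the nrefs field of the signature changes from 1 to 2:

$ sacrebleu ref.detok.txt ref2.detok.txt -i output.detok.txt -m bleu -w 4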

Using multiple metrics

Let's first compute BLEU, chrF and TER with the default settings:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter
        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>
      chrF2|nrefs:1|case:mixed|eff:yes|nc:6|nw:0|space:no|version:2.0.0 = 52.0
TER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0

Let's now enable chrF++, a revised version of chrF that also takes word n-grams into account. Observe how nw:0 changes to nw:2 in the signature:

$ sacrebleu -t wmt17 -l en-de -i output.detok.txt -m bleu chrf ter --chrf-word-order 2
        BLEU|nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 = 20.8 <stripped>
    chrF2++|nrefs:1|case:mixed|eff:yes|nc:6|nw:2|space:no|version:2.0.0 = 49.0
TER|nrefs:1|case:lc|tok:tercom|norm:no|punct:yes|asian:no|version:2.0.0 = 69.0

Metric-specific arguments are detailed in the output of --help:

BLEU related arguments:
  --smooth-method {none,floor,add-k,exp}, -s {none,floor,add-k,exp}
                        Smoothing method: exponential decay, floor (increment zero counts), add-k (increment num/denom by k for n>1), or none. (Default: exp)
  --smooth-value BLEU_SMOOTH_VALUE, -sv BLEU_SMOOTH_VALUE
                        The smoothing value. Only valid for floor and add-k. (Defaults: floor: 0.1, add-k: 1)
  --tokenize {none,zh,13a,char,intl,ja-mecab,ko-mecab}, -tok {none,zh,13a,char,intl,ja-mecab,ko-mecab}
                        Tokenization method to use for BLEU. If not provided, defaults to `zh` for Chinese, `ja-mecab` for Japanese, `ko-mecab` for Korean and `13a` (mteval) otherwise.
  --lowercase, -lc      If True, enables case-insensitivity. (Default: False)
  --force               Insist that your tokenized input is actually detokenized.
