Expose what’s behind your scores for a more insightful and credible evaluation.
For natural language generation tasks, it is common to evaluate several models or systems against each other to identify the best one according to some metrics. In research papers, for instance, we often find tables with a metric score for each of the systems compared. The system with the best score is usually considered the best system.
However, scores alone give very little information on why a given system outperforms the others. It may not be the best for the reasons we would like, or it may not even be the best at all if the metric is inaccurate. Drawing conclusions from scores alone is difficult, so why do it when there are tools that give better insight into what lies behind a score?
In this article, I present compare-mt, a very simple tool that gives the user a high-level and coherent view of the salient differences between systems. It exploits statistics usually computed by automatic metrics such as ROUGE or BLEU. It can be used to generate reports in HTML/LaTeX and to further guide your analysis or system improvements. It is suitable for any language generation task for which we have a reference text for evaluation, such as machine translation and summarization.
I first present an overview of compare-mt’s main features. Then I demonstrate how it works with a concrete example by analyzing the machine translation outputs submitted to the WMT21 de-en translation task. I also show you how to interpret the results and provide some suggestions on how to report them in your work. All the command lines I ran are provided, so you can also use this article as a tutorial.
Beyond Scores
compare-mt is available on GitHub. It was proposed in 2018 by NeuLab at CMU.
Since then, it has been well maintained. There is also a paper describing the original version, which is worth reading if you would like to know the motivation behind the tool.
compare-mt only requires the system outputs to be compared and a reference text: for instance, two or more machine translation outputs and the reference translation produced by a human.
The main feature of compare-mt is that it generates an HTML report with tables and charts comparing the systems according to various criteria, which you can easily add to your reports or scientific papers.
compare-mt will first score each system given the reference text. There are numerous scorers available:
- BLEU: Can be used on tokenized or detokenized texts. The detokenized version is meant to reproduce the score of SacreBLEU (but keep in mind that it does not guarantee that SacreBLEU and compare-mt yield the same scores).
- ROUGE: The usual metric to evaluate a summarization system.
- METEOR: A metric that puts more weight on recall than on precision and also counts matches between stems, synonyms, and paraphrases. It requires getting the METEOR scorer separately. METEOR was originally proposed to evaluate machine translation systems.
- chrF: Computes a score over character n-grams rather than tokens. It is the only metric implemented in compare-mt that is tokenization independent. I recommend it if you wish your evaluation to be reproducible (see the short sketch after this list).
- RIBES: A metric originally proposed for machine translation tasks that is particularly good at evaluating word order.
- GLEU: A metric designed for grammatical error correction tasks.
- WER: The word error rate, an edit-distance-based metric commonly used to evaluate speech recognition.
Note: compare-mt is also supposed to run COMET, but this metric is no longer supported. I’ll update this article in case that changes in the future.
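To make the tokenization point more concrete, here is a minimal sketch of why chrF is less tokenization-sensitive than BLEU. It uses the sacrebleu Python package directly rather than compare-mt, the sentences are invented for the illustration, and the exact numbers will depend on your sacrebleu version.
import sacrebleu
ref = ["The cat doesn't sit on the mat."]
hyp_detok = ["The cat doesn't sit on a mat."]
hyp_tok = ["The cat does n't sit on a mat ."]  # same hypothesis, pre-tokenized
# BLEU changes with the tokenization applied to the text...
print(sacrebleu.corpus_bleu(hyp_detok, [ref], tokenize="13a").score)
print(sacrebleu.corpus_bleu(hyp_tok, [ref], tokenize="none").score)
# ...while chrF works on character n-grams and should barely move.
print(sacrebleu.corpus_chrf(hyp_detok, [ref]).score)
print(sacrebleu.corpus_chrf(hyp_tok, [ref]).score)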
Given the metric you choose, compare-mt can also tell you whether the difference between the systems you are scoring is statistically significant. For this statistical significance testing, it uses bootstrap resampling. Keep in mind that, depending on the metric you choose and the number of systems you score, this bootstrap resampling can be very slow.
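To give you an idea of what bootstrap resampling does, here is a rough sketch of paired bootstrap resampling in Python. It illustrates the general technique, not compare-mt’s actual implementation; the function name and the use of sacrebleu are my own choices.
import random
import sacrebleu
def paired_bootstrap(refs, sys_a, sys_b, num_samples=1000):
    # refs, sys_a, sys_b: lists of sentences of the same length.
    # Returns the fraction of resampled test sets on which system A
    # gets a higher corpus BLEU than system B.
    n = len(refs)
    wins_a = 0
    for _ in range(num_samples):
        # Draw a new test set of the same size, with replacement.
        idx = [random.randrange(n) for _ in range(n)]
        r = [refs[i] for i in idx]
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        if sacrebleu.corpus_bleu(a, [r]).score > sacrebleu.corpus_bleu(b, [r]).score:
            wins_a += 1
    # A fraction close to 1.0 (or 0.0) suggests the difference between the
    # two systems is not just an artifact of the particular test set.
    return wins_a / num_samples
Since the metric is recomputed on every resampled test set and for every pair of systems, this is also why the significance testing step can take a while with many systems or a slow metric.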
Another interesting feature is sentence-level analysis. For instance, compare-mt outputs scores bucketed by sentence length. This is very useful for identifying whether a system performs better on shorter or longer sentences.
It also measures word accuracy as a function of word frequency. By default, frequencies are computed on the reference text, which is not ideal, but you can provide your own frequency counts, computed for instance on the training data of your systems.
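Computing the counts themselves is straightforward. Here is a small sketch; the file names are placeholders, and the output format I write here (word, tab, count) is an assumption, so check compare-mt’s documentation for the exact format and option it expects.
from collections import Counter
# Count word frequencies on the (hypothetical) training data of your systems.
counts = Counter()
with open("train.en", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())
# Write one "word<TAB>count" line per word (assumed format, see compare-mt's docs).
with open("train.en.counts", "w", encoding="utf-8") as f:
    for word, count in counts.most_common():
        f.write(f"{word}\t{count}\n")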
Another feature that I found useful is the indication of which n-grams are better translated by which system. It is very insightful, for instance, if the compared systems used different training data.
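If you are wondering what “better translated n-grams” means in practice, here is a rough sketch of the underlying idea; it is not compare-mt’s actual implementation, and the helper functions are mine.
from collections import Counter
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
def matched_ngrams(refs, hyps, n=2):
    # Count, over the whole test set, how many times each n-gram of the
    # hypotheses also appears in the corresponding reference (clipped counts).
    matches = Counter()
    for ref, hyp in zip(refs, hyps):
        ref_counts = Counter(ngrams(ref.split(), n))
        for gram, count in Counter(ngrams(hyp.split(), n)).items():
            matches[gram] += min(count, ref_counts[gram])
    return matches
def biggest_gaps(refs, hyps_a, hyps_b, n=2, top=10):
    # n-grams that system A matches much more often than system B.
    a = matched_ngrams(refs, hyps_a, n)
    b = matched_ngrams(refs, hyps_b, n)
    gaps = {gram: a[gram] - b[gram] for gram in set(a) | set(b)}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)[:top]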
The generated report also includes example translations that illustrate the differences between the compared systems.
I listed only the features that I found the most useful here, but there are more if you wish to perform a deeper analysis. Most of them are listed on the GitHub page of compare-mt.
Demonstration: An Analysis of the WMT21 de-en Translation Task
For the following analysis, I ran everything on Ubuntu 20.04, with Python 3.8.
You can get compare-mt with git:
git clone https://github.com/neulab/compare-mt.git
Then you only have to follow the installation instructions provided in the README.md:
# Requirements
pip install -r requirements.txt
# Install the package
python setup.py install
For this demonstration, I’m going to compare 6 machine translation outputs submitted to the WMT21 German-to-English translation task. The dataset I use is available here. It is the official dataset publicly released by the organizers of the task. I only used the “txt” version for each file.
The systems I compared are:
- Borderline
- Facebook-AI
- UF
- Online-A
- Online-W
- VolcTrans-AT
You simply have to run the following command to get the HTML report:
compare-mt --output_directory output/ \
    newstest2021.de-en.ref.A.en \
    newstest2021.de-en.hyp.Borderline.en \
    newstest2021.de-en.hyp.Facebook-AI.en \
    newstest2021.de-en.hyp.UF.en \
    newstest2021.de-en.hyp.Online-A.en \
    newstest2021.de-en.hyp.Online-W.en \
    newstest2021.de-en.hyp.VolcTrans-AT.en \
    --compare_scores score_type=sacrebleu,bootstrap=1000,prob_thresh=0.05 \
    --decimals 2 \
    --sys_names Borderline Facebook-AI UF Online-A Online-W VolcTrans-AT \
    --fig_size 10x5
Where:
- “output_directory” is the directory that will store the report and the generated charts.
- “newstest2021.de-en.ref.A.en” is the reference translation followed by the 6 system outputs we want to compare. Note that I arbitrarily chose to compare 6 systems. You can compare as many systems as you want but I found that above 5 or 6 the charts and tables tend to become difficult to read.
- “compare_scores” specifies how you wish to score the systems.
- “score_type” is the metric, chosen among those I listed above. Here I chose SacreBLEU since the translations I evaluate aren’t tokenized.
- “bootstrap” and “prob_thresh” are optional: If you provide them, statistical significance testing will be performed on all possible pairs of systems. 1000 is a reasonable value for “bootstrap”, as is 0.05 for “prob_thresh”, which is the p-value above which the difference between systems won’t be considered significant.
- “decimals” is the number of decimals printed for each score in the report. The default is 4, but I found that it makes the report harder to read and can mislead readers, since such precision is meaningless with BLEU.
- “sys_names” are the names for each system in the same order as the outputs we are comparing.
- “fig_size” defines the figure size. The default value is fine most of the time, but if you compare more than 3 systems, I found it necessary to increase the figure size to obtain better legend positioning in the charts.
The report only takes a few seconds to be generated. The full report for this example is available here.
First, there is a BLEU score for each system:
Actually, in most machine translation research papers, this is all you’ll get: a table with BLEU scores for the machine translation systems. Then, looking only at the highest score, we would conclude that Borderline’s submission is the best or, worse, that it is the new state of the art in machine translation… And this is precisely where compare-mt will help you draw more credible scientific conclusions.
Let’s look at the next table. It presents the statistical significance of the differences between the BLEU scores for each possible pair of systems:
This table is more difficult to read since we have many system pairs.
We have three different cases:
- s1>s2 (p=X): The system on the row is significantly better than the system on the column with a p-value X (roughly, the lower the p-value, the more confident we can be that the difference is not due to chance).
- s2>s1 (p=X): The system on the column is significantly better than the system on the row with a p-value X.
- – (p=X): The system scores aren’t significantly different, i.e., the systems perform similarly.
Now that we know how to read this table, this is striking: Most systems perform similarly! There isn’t a “best system” here. Even when the difference in BLEU is close to, or above, 1 point, for instance between Borderline and Facebook-AI, the difference is not significant.
Note: compare-mt uses a threshold on the p-value to decide whether the difference is significant or not. This is controversial, and there are debates over whether such a threshold should exist at all. When using p-values in your work, I recommend avoiding a hard threshold and always reporting the p-value instead. You can formulate your conclusions as in “system A is better than system B with a p-value of X.” Depending on the task, a p-value of 0.1 or lower may be enough to call a difference significant, while for other tasks you may find that 0.001 is more suitable. Also, keep in mind that not everyone working in natural language processing is convinced that statistical significance testing is meaningful.
Actually, in this case, statistical significance testing leads to the same conclusion as the human evaluation conducted by WMT: These 6 systems perform the same.
I recommend providing this table when you want to claim that a system is better than another one. compare-mt nicely generates the LaTeX code for it, just click on “show/hide LaTeX.”
The following tables and charts are useful for more fine-grained diagnostics and analysis.
The next table in the report shows the word accuracies for each system over predefined frequency buckets. In this example, I didn’t find this table very useful since the numbers are very similar for all systems. A chart is also generated from the numbers in this table.
I found the next table more useful. It provides the BLEU scores given the sentence lengths. It’s interesting in this example since we can see that the systems perform differently depending on the length. For instance, while Borderline performs the best in most buckets, for sentences longer than 50 tokens other systems seem to perform much better.
The next table and chart show the difference in length (number of tokens) between the reference and the machine translation outputs. Again, these differences are provided as buckets. The most populated buckets are around “0” for all systems, meaning that they all match the reference length reasonably well.
The next table and chart use buckets made from sentence-level BLEU scores. In this example, I wouldn’t recommend relying on these statistics. With other metrics this may be fine, but sentence-level BLEU is very sensitive to sentence length: a lower BLEU score may only mean that the sentence to translate was long rather than truly difficult. Intuitively, you will often find very short sentences, i.e., sentences for which your system is less likely to generate translation errors, in the top bucket “>=90.0” (sentences with a sentence-level BLEU score above 90).
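To see how much sentence length alone can matter, here is a small illustration with sacrebleu’s sentence-level BLEU; the sentences are invented and the exact numbers depend on sacrebleu’s version and smoothing.
import sacrebleu
# A trivially short sentence translated perfectly gets the maximum score...
print(sacrebleu.sentence_bleu("Thank you very much.", ["Thank you very much."]).score)
# ...while a long sentence with a single differing word already drops noticeably,
# even though it is arguably the harder and better translation job.
ref = "The committee approved the new budget after a long and heated debate on Tuesday evening."
hyp = "The committee approved the new budget after a long and heated discussion on Tuesday evening."
print(sacrebleu.sentence_bleu(hyp, [ref]).score)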

The next tables provide even more fine-grained information but only compare the first two systems passed to compare-mt, here Borderline and Facebook-AI. First, we have n-gram statistics. These tables tell us which n-grams were better translated by which system. This can be helpful if you want to check whether one system is particularly good at translating some specific terms compared to the other.
Then, the report lists example sentences with their corresponding sentence-level BLEU scores. Here, I found it particularly useful for highlighting the limits of the metric…
For instance, in the following example, can you find out why Borderline doesn’t get a BLEU score of 100 (the maximum)?
Borderline didn’t generate the expected “ ’ ” but another, similar-looking character. Facebook-AI’s translation also looks as good as Borderline’s but got a much lower score.
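Here is a made-up example of the same pitfall, again with sacrebleu’s sentence-level BLEU: the only difference between the two hypotheses is the apostrophe character, yet the second one is penalized as if it had produced wrong words.
import sacrebleu
ref = "I don't think that's a problem."
hyp_ascii = "I don't think that's a problem."   # plain ASCII apostrophes
hyp_curly = "I don’t think that’s a problem."   # typographic apostrophes
print(sacrebleu.sentence_bleu(hyp_ascii, [ref]).score)  # exact match
print(sacrebleu.sentence_bleu(hyp_curly, [ref]).score)  # noticeably lower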
A word for machine translation researchers:
As a reviewer serving top-tier conferences and journals, I can only encourage you to provide this report, or similar information if you have better tools, in your supplementary material. It can be very strong supporting evidence that your system is indeed better. Reviewers can argue that scores alone, especially BLEU, are meaningless, but if you provide and analyze the information in the report generated by compare-mt, it can turn a reject into an accept. I, for one, would be positive and note the effort.
Conclusion
I only presented the main features of compare-mt. I encourage you to take a closer look at this tool.
It doesn’t require much time to learn how to use it. It’s easy to install and fast to generate reports. So why not generate this supplementary information to confirm your observations?
If you have any questions, I’ll be happy to answer them in the comments!