185 systems evaluated in the 21 translation directions of WMT22
Like every year since 2006, the Conference on Machine Translation (WMT) organized extensive machine translation shared tasks. Numerous participants from all over the world submitted their machine translation (MT) outputs to demonstrate their recent advances in the field. WMT is generally recognized as the event of reference to observe and evaluate the state of the art of MT.
The 2022 edition replaced the original news translation task with a “general” translation task covering various domains, including news, social, conversational, and e-commerce, among others. This task alone received 185 submissions for the 21 translation directions prepared by the organizers: Czech↔English (cs-en), Czech↔Ukrainian (cs-uk), German↔English (de-en), French↔German (fr-de), English→Croatian (en-hr), English↔Japanese (en-ja), English↔Livonian (en-liv), English↔Russian (en-ru), Russian↔Yakut (ru-sah), English↔Ukrainian (en-uk), and English↔Chinese (en-zh). These translation directions cover a wide range of scenarios. They are classified as follows by the organizers in terms of the relatedness of the languages and the number of resources available for training an MT system:
With this variety of language pairs combined with the variety of domains, we can draw an accurate picture of the current state of machine translation.
In this article, I report on the automatic evaluation of the 185 submissions, including the online systems added by the organizers. My main observations are as follows:
- MT for low-resource distant language pairs is still an extremely difficult task.
- The best outputs submitted are very far from the translation quality delivered by online systems for some of the translation directions (e.g., de→fr).
- A BLEU score difference between two MT systems that is higher than 0.9 is always statistically significant in this task.
- BLEU poorly correlates with COMET for translation quality evaluation for almost all translation directions but remains useful as a tool for diagnostics and analysis.
- Absolute COMET scores are meaningless.
For this study, I used the reference translations and system outputs publicly released by WMT22’s organizers and could cross-check some of my results thanks to the preliminary report released by Tom Kocmi.
This is not an official evaluation of WMT22. WMT22 is conducting a human evaluation that will be presented in detail at the conference on December 7–8, 2022, which is co-located with EMNLP 2022 in Abu Dhabi.
Note that this article is a more digestible and shorter version of my recent report that you can find on arXiv: An Automatic Evaluation of the WMT22 General Machine Translation Task.
Scoring and Ranking with Metrics
For this evaluation, I used three different automatic metrics:
• chrF (Popović, 2015): A tokenization-independent metric operating at character level with a higher correlation with human judgments than BLEU. This is the metric I usually recommend for evaluating translation quality since it is very cheap to compute, reproducible, and applicable to any language.
• BLEU (Papineni et al., 2002): The standard BLEU.
• COMET (Rei et al., 2020): A state-of-the-art metric based on a pre-trained language model. I used the default model “wmt20-comet-da”.
Note that in this particular study, chrF and BLEU are merely used for diagnostic purposes and to answer the question: How far are we from reaching particular reference translations? I won’t use them to draw conclusions about translation quality. For that purpose, I use COMET to produce rankings of systems that should better correlate with a human evaluation.
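To make the character-level idea behind chrF concrete, here is a minimal sketch of a chrF-like score in plain Python. It is not the implementation I used for the evaluation and it will not reproduce the scores of any reference tool: the function name `simple_chrf`, the choice to drop whitespace, and the averaging of per-order F-scores are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp, ref, max_n=6, beta=2.0):
    """Simplified chrF-like score in [0, 100] for one hypothesis/reference pair."""
    f_scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # one side is too short for this n-gram order
        match = sum((h & r).values())  # multiset intersection = clipped matches
        prec = match / sum(h.values())
        rec = match / sum(r.values())
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            # F-beta: beta > 1 weights recall more heavily than precision
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0


# Different tokenizations of the same text get the same score,
# because the sketch only looks at non-space characters.
print(simple_chrf("Hello , world !", "Hello, world!"))  # 100.0
```

Because nothing in the computation depends on word boundaries, the same code applies unchanged to languages without whitespace-delimited words.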
I ranked the systems for each translation direction given their scores but I assigned a rank only to the systems that have been declared “constrained” by their authors, i.e., systems that only used the data provided by the organizers. In the following tables, the systems with a rank “n/a” are systems that are not constrained.
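The ranking rule described above can be sketched in a few lines of Python. The system names and scores below are made-up placeholders, not WMT22 results, and numbering the constrained systems consecutively while skipping unconstrained ones is my reading of the rule rather than something the tables guarantee.

```python
def rank_constrained(systems):
    """Sort systems by score and rank only the constrained ones.

    `systems` is a list of (name, score, is_constrained) tuples.
    Unconstrained systems stay in the sorted list but get the rank "n/a".
    """
    ordered = sorted(systems, key=lambda s: s[1], reverse=True)
    rows = []
    next_rank = 1
    for name, score, is_constrained in ordered:
        if is_constrained:
            rows.append((name, score, str(next_rank)))
            next_rank += 1
        else:
            rows.append((name, score, "n/a"))
    return rows


# Hypothetical placeholder systems and scores, for illustration only.
toy_systems = [
    ("sys-A", 55.2, True),
    ("online-X", 58.1, False),
    ("sys-B", 53.0, True),
]
for row in rank_constrained(toy_systems):
    print(row)
```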
Having two reference translations for evaluation, we obtain absolute BLEU scores rarely seen in the machine translation research literature with, for instance, 60.9 BLEU points for JDExploreAcademy for cs→en, as follows:
Even higher BLEU scores are observed for en→zh due to the use of smaller tokens that makes the 4-gram matching a much easier task:
Absolute BLEU scores do not inform us of the translation quality itself. Even scores above 60 don’t necessarily mean that the translation is good, since BLEU depends on many parameters. However, BLEU does inform us that these systems produce a lot of 4-grams that are in the reference translations.
While chrF and BLEU directly indicate how well the translation matches the references with a score between 0 and 100 points, COMET scores are not bounded. For instance, at the extremes, AMU obtains 104.9 COMET points for uk→cs and AIST obtains -152.7 COMET points for liv→en. I was surprised by this amplitude and had to recheck how COMET is computed before validating these scores (more details below in the section “The Peculiarity of COMET”).
For 11 of the 21 translation directions, the best system according to COMET is not among the best systems found by BLEU and chrF. Surprisingly, for some translation directions, constrained systems outperform systems that are not constrained. According to COMET, this is the case for cs→uk, uk→cs, de→en, ja→en, and en→ja. For some other directions, online systems seem to be better by a large margin. For instance, for de→fr, Online-W is better than the best constrained system by 18.3 BLEU points.
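The effect of a second reference and of smaller tokens on 4-gram matching can be illustrated with the clipped n-gram precision at the core of BLEU (Papineni et al., 2002). This is only a sketch: the toy sentences are invented, the function covers a single n-gram order on a single segment, and full BLEU additionally combines orders 1 to 4 and applies a brevity penalty. Splitting an English sentence into characters is also just a loose stand-in for the finer-grained tokens used for Chinese.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, refs, n=4):
    """Modified n-gram precision of one hypothesis against one or more references."""
    hyp_counts = ngrams(hyp, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # For each n-gram, keep the maximum count seen in any single reference.
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / total


# Invented toy sentences, for illustration only.
hyp = "the cat sat on the mat today"
ref1 = "the cat sat on the rug today"
ref2 = "a cat sat on the mat today"

print(clipped_precision(hyp.split(), [ref1.split()]))                # 0.5
print(clipped_precision(hyp.split(), [ref1.split(), ref2.split()]))  # 1.0

# Same hypothesis and single reference, but with characters as tokens:
# more 4-grams match than with word tokens.
print(clipped_precision(list(hyp), [list(ref1)]))
```

A second reference can only add matchable 4-grams, never remove any, which is one reason multi-reference BLEU scores are not comparable with single-reference ones.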
My main takeaway from these rankings is that using data not provided by WMT22 is the key to getting the best systems. Of course, this is not surprising, but I hope that the participants will fully describe and analyze their datasets to better understand why they are so important.
Statistical Significance Testing
Now that we have scores for each system, we would like to measure how reliable the conclusion is that one system is better than another according to some metric. In other words, we would like to test whether the difference between systems’ metric scores is statistically significant. There are several tools and techniques to perform statistical significance testing. For this evaluation, I chose the most commonly used: paired bootstrap resampling as originally proposed by Koehn (2004).
A first interesting observation is that a difference in BLEU higher than 0.9 points (cs→uk) is always significant with a p-value < 0.05. Given the relatively high and arguable threshold I used for the p-value, I found 0.9 to be quite high, since most MT research papers would claim that their systems are significantly better for a difference in BLEU higher than 0.5.
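The idea behind paired bootstrap resampling can be sketched as follows: draw many test sets of the same size by sampling segments with replacement, score both systems on each draw, and count how often one system beats the other. The sketch below is a simplified illustration, not the tool I used. The `aggregate` parameter is a hypothetical hook that turns per-segment values into a corpus-level score; it defaults to the mean, which is a reasonable fit for segment-level metrics. For BLEU, which is computed at corpus level, one would instead pass per-segment n-gram statistics and an `aggregate` that recomputes BLEU from them.

```python
import random
from statistics import mean


def paired_bootstrap(seg_a, seg_b, aggregate=mean, n_resamples=1000, seed=0):
    """Approximate p-value for the claim "system A is better than system B".

    seg_a and seg_b hold per-segment values for the SAME segments, in the
    same order. Returns the fraction of resampled test sets on which A does
    not beat B.
    """
    if len(seg_a) != len(seg_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(seg_a)
    a_wins = 0
    for _ in range(n_resamples):
        # Sample segment indices with replacement; reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if aggregate([seg_a[i] for i in idx]) > aggregate([seg_b[i] for i in idx]):
            a_wins += 1
    return 1.0 - a_wins / n_resamples


# Made-up per-segment scores, for illustration only.
system_b = [i / 100 for i in range(100)]
system_a = [s + 0.1 for s in system_b]  # A is better on every segment
print(paired_bootstrap(system_a, system_b))  # 0.0
```

With a threshold of 0.05, system A would be declared significantly better than system B when the returned value is below 0.05. Using the same resampled indices for both systems is what makes the test "paired".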
In chrF, the largest difference that is not significant is 0.6 points (en→zh), while it reaches 2.6 points (liv→en) for COMET. Note that this would highly vary depending on the model used with COMET.
The three metrics only agree on a system that is significantly better than all the others for 5 translation directions among 21: cs→en (Online-W), fr→de (Online-W), en→liv (TAL-SJTU), sah→ru (Online-G), and en→uk (Online-B).
My main takeaway from this statistical significance testing is that it is insightful. This is regularly debated in the MT research community, but I think it is a necessary tool. For very well-known metrics such as BLEU, researchers usually apply the rule of thumb that a difference of 1.0 or more, for instance, is statistically significant. That may be correct, albeit not scientifically credible until tested. But what about new metrics that we don’t know well? Is a 1.0 COMET points difference significant? It depends on the task and the COMET model (as we will see below). This is why statistical significance testing must be performed before claiming that a system is better than another one. The amplitude of the difference between the scores of two systems should be considered meaningless until it has been tested.
I also experimented with normalized translation outputs to observe how BLEU and COMET are sensitive to changes in punctuation marks and encoding issues. It can also highlight whether a system relied on some special post-processing to increase the metric scores. For normalization, I used the following sequence of Moses scripts:
tokenizer/replace-unicode-punctuation.perl | tokenizer/normalize-punctuation.perl -l <target_language> | tokenizer/remove-non-printing-char.perl
As expected, I found that COMET is almost insensitive to this normalization. On the other hand, it has a stronger impact on the BLEU scores, but the impact can vary greatly from one system to another. For instance, for en→cs, it does not affect JDExploreAcademy while the score of Online-Y drops by 1.4 BLEU points. For de→fr, the normalization increases the BLEU score of Online-A by 4.9 points, so that Online-A becomes better than Online-W, whose BLEU score is not affected by the normalization. Nonetheless, Online-W remains around 10 COMET points better than Online-A.
Nothing unexpected here, but a great reminder of why BLEU can be very inaccurate as an evaluation metric for translation quality.
The Peculiarity of COMET
BLEU and chrF absolute scores can be used for diagnostic purposes and answer basic questions: How close are we to the reference with a given tokenization? Has the system likely generated text in the target language? COMET cannot answer such questions, but it is much more reliable for ranking systems, as demonstrated in previous work.
Since I observed large amplitudes between COMET scores, I experimented with several COMET models to observe how scores vary across them.
I could observe that wmt20-comet-da (the default model) scores are quite different from all the other models. While the maximum score obtained by a system with wmt20-comet-da is 104.9 (uk→cs), the scores obtained with the other 4 models never exceed 15.9 for all translation directions. More particularly, with wmt21-comet-da, for ja→en the best system is scored at 1.1, as illustrated in the following table.
Even more peculiar, for zh→en, wmt21-comet-da scores are negative for all the systems:
With wmt21-comet-mqm, the systems’ scores all look very close to each other when rounded.
I conclude that absolute COMET scores are not informative whatever model we use. Negative COMET scores can be assigned to excellent machine translation systems.
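If only the ordering of systems is informative, one way to compare two COMET models (or COMET with BLEU) is to ignore the score values and compare the rankings they induce, for instance with Kendall's tau. The sketch below implements the simple tau-a variant in plain Python. The two score lists are made-up placeholders chosen to sit on very different scales; they are not WMT22 results, and whether real COMET models actually agree on rankings is an empirical question this sketch does not answer.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two score lists for the same systems.

    +1 means both lists order every pair of systems identically,
    -1 means they order every pair in opposite ways.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


# Made-up scores for four hypothetical systems under two hypothetical models.
model_1 = [80.0, 65.0, 40.0, -20.0]  # wide range, mostly positive
model_2 = [-0.1, -0.3, -0.8, -1.5]   # narrow range, all negative
print(kendall_tau(model_1, model_2))  # 1.0
```

In this toy case, the two models disagree completely on the absolute values, including their sign, yet induce exactly the same ranking.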
This evaluation clearly shows that some translation directions are easier than others. However, what I found the most interesting after running all these experiments is that I have no clue how good the systems are! BLEU and chrF will only tell us how close we are to a particular reference translation, but the absolute scores can vary a lot given the tokenization used. COMET is only useful for ranking systems. To the best of my knowledge, in 2022, we still don’t have an automatic evaluation metric for MT that is:
- informative on the translation quality, i.e., not only accurate for ranking systems;
- and that would yield scores comparable across different settings such as domains, language pairs, tokenizations, etc.
Thanks to BLEU and chrF, we can observe that we are somewhat close to the reference translations for some translation directions like cs→en and en→zh, but still very far for others such as en↔liv and ru↔sah. COMET, on the other hand, shows that WMT22 systems are significantly better than the online systems for only 5 among 19 translation directions (I left out en↔liv): cs→uk (AMU), uk→cs (AMU), de→en (JDExploreAcademy), en→ja (JDExploreAcademy, NT5, LanguageX), and en→zh (LanguageX).
It will be interesting to observe whether these findings correlate with the human evaluation conducted by WMT22.
I only highlighted the main findings of my evaluation. There is more, especially an attempt at combining all the submitted systems, in my submission to arXiv.
I would like to thank the WMT organizers for releasing the translations and Tom Kocmi for providing preliminary results as well as insightful comments and suggestions on the first draft of my arXiv report.