Even your state-of-the-art system has flaws that others don’t have
New large language models (BLOOM, OPT, GPT-3, NLLB, …) are released almost every month. They are becoming more and more accessible thanks to hard work on improving the efficiency of inference, for instance with Accelerate and DeepSpeed. It is now possible to run billion-parameter models on Colab or locally.
And if this is still not an option for you, commercial AI platforms with their models and APIs are also flourishing. You don’t even need to set up each API individually thanks to API aggregators, such as Eden AI and Intento, that let you query several APIs simultaneously. With all these open-source and commercial solutions, you have many options for your natural language generation task. Each model will produce a different output since they were all trained on different datasets and with different algorithms or hyperparameters. Making a sound choice can be very difficult, especially if you don’t have the knowledge or the correct method to evaluate the results for your target task.
This is exactly where minimum Bayes risk (MBR) decoding can help you. It is an old method that automatically retrieves one of the best outputs among all the ones that you have. Under some conditions, it can even produce an output that is better than any single one of them.
It sounds convenient, doesn’t it? So let’s have a look at what MBR decoding is.
MBR Decoding: An Old Technique
To the best of my knowledge, MBR decoding was first applied in automatic speech recognition by Stolcke et al. (1997) and Goel and Byrne (2000). It has since been applied to many language generation tasks, and extensively to machine translation (MT), where it is regularly rediscovered to work with recent model architectures and metrics. Very interesting research papers, with scientifically credible evaluations (!), have recently been published on this topic by Müller and Sennrich (2021), Freitag et al. (2022), and Amrhein and Sennrich (2022).
But what is MBR decoding exactly? Well, I won’t go deep into the math behind it but will rather explain it in simple words.
Let’s say that you have several outputs generated by some system(s). MBR decoding will find the output that is on average the most similar to all the other ones, according to a given metric. You can understand it as minimizing the risk of using a bad output by taking instead the most consensual output among all the outputs that you have.
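To make this concrete, here is a minimal sketch of the selection step. It assumes a toy similarity function (token-level F1) as a stand-in for a real metric such as COMET, and the names token_f1 and mbr_select are made up for this illustration.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy similarity: F1 over whitespace tokens (a stand-in for a real metric)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: list[str], metric=token_f1) -> str:
    """Return the candidate that is, on average, the most similar to the others."""
    if len(candidates) == 1:
        return candidates[0]

    def average_similarity(i: int) -> float:
        # Each of the other candidates plays the role of a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(metric(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=average_similarity)
    return candidates[best]


candidates = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "the cat lay on the mat",
    "dogs bark",
]
print(mbr_select(candidates))  # the cat sat on the mat
```

The first candidate wins because it shares the most tokens with the two other plausible outputs, while the outlier "dogs bark" agrees with nothing. Note that the cost grows quadratically with the number of candidates, since every pair is scored.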
This technique is usually applied at inference time in a system using beam search. Indeed, since you have several hypotheses in the beam, it makes sense to rescore them with MBR to get the most consensual hypothesis according to the same metric that you will use for evaluation. Since the beam size is usually small, MBR decoding can be quite cheap.
In practice, it can also work very well when applied to select the most consensual output among the ones generated by various systems. This is what we will do in the next section.
The success of MBR decoding in this situation mainly depends on:
- The homogeneity of the quality of the outputs. If you have some outputs of high quality, but many outputs of low quality, then the most consensual output will likely be of low quality.
- The metric chosen for scoring. You should choose the metric you intend to use for evaluation, ideally the one that correlates best with human judgment on your target task. There is evidence that MBR decoding with a given metric improves the score for that same metric at evaluation time.
Application to Machine Translation
Now it’s time for the demonstration. I will apply MBR decoding for machine translation since this is the application for which it is the most documented. Nonetheless, remember that you can do the same for any language generation task.
I will use the machine translation outputs submitted to WMT21 (publicly available) for the English-to-German (en→de) and German-to-English (de→en) news translation tasks. For the metric, I chose COMET, which is a state-of-the-art metric for the evaluation of translation quality.
My goal is to find the most consensual machine translation output, according to COMET, among all the outputs submitted to these translation tasks. This will be done at the segment level, i.e., for each source sentence, MBR decoding will select the translation that is the most consensual among the segments generated by the machine translation systems. Consequently, I will obtain a new translation of the whole dataset that combines segments retrieved from various outputs. Thanks to this combination, we can expect to obtain a better translation than the ones submitted to WMT21.
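The segment-level combination can be sketched as follows. This is only an illustration under stated assumptions: a toy Jaccard similarity over tokens replaces COMET, and mbr_combine and the three tiny "systems" are invented for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity over token sets; a stand-in for COMET."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def mbr_combine(system_outputs, metric=jaccard):
    """For each segment, keep the translation that agrees most with the others.

    system_outputs holds one list of segments per system, all of equal length.
    Returns (system_index, segment) pairs, one per segment.
    """
    if len({len(output) for output in system_outputs}) != 1:
        raise ValueError("every system must translate the same segments")
    combined = []
    for candidates in zip(*system_outputs):  # all translations of one segment
        def utility(i):
            return sum(metric(candidates[i], other)
                       for j, other in enumerate(candidates) if j != i)
        best = max(range(len(candidates)), key=utility)
        combined.append((best, candidates[best]))
    return combined


# Made-up outputs from three hypothetical systems, two segments each.
systems = [
    ["I like green tea", "He went home"],
    ["I like green tea a lot", "He is going home late"],
    ["I love tea", "He is going home"],
]
print(mbr_combine(systems))
# [(0, 'I like green tea'), (2, 'He is going home')]
```

The selected segments come from two different systems, which is why the combined translation can end up better than any single submission.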
If you want to maximize BLEU, you could use BLEU instead of COMET, but I wouldn’t recommend it. As shown by Freitag et al. (2022), you may get a slightly better BLEU score, but at the cost of a much lower translation quality.
For this experiment, I didn’t implement MBR decoding myself but used comet-mbr instead. Note that implementing your own MBR decoding for your metric isn’t difficult but usually requires some clever tricks and caching to keep the computational cost as low as possible. comet-mbr works as follows:
comet-mbr -s source.txt -t samples.txt --num_samples n -o output.txt
- source.txt: This is the source text translated by each system, with one segment per line.
- samples.txt: This file contains all the translations used by MBR decoding. You can make it by merging all your outputs line by line with:
paste -d '\n' system_1 system_2 ... system_n > samples.txt
- num_samples: The number n of systems combined.
- output.txt: The file that will store the result of the MBR decoding.
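If you prefer to build samples.txt in Python rather than with paste, a sketch could look like the following. The helper names read_lines and write_samples are made up for this example, and the two tiny system files are placeholders. The layout is the same interleaving that paste -d '\n' produces: all candidates for segment 1, then all candidates for segment 2, and so on.

```python
import tempfile
from pathlib import Path


def read_lines(path):
    """Read a file with one segment per line, dropping the trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_samples(system_files, samples_path):
    """Interleave the system outputs line by line; return the number of systems."""
    outputs = [read_lines(path) for path in system_files]
    if len({len(output) for output in outputs}) != 1:
        raise ValueError("all system outputs must have the same number of lines")
    with open(samples_path, "w", encoding="utf-8") as out:
        for candidates in zip(*outputs):  # every translation of one segment
            out.writelines(candidate + "\n" for candidate in candidates)
    return len(outputs)


# Tiny demo with two made-up systems and two segments.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "system_1").write_text("Hallo Welt\nGuten Morgen\n", encoding="utf-8")
    (tmp / "system_2").write_text("Hallo, Welt\nGuten Tag\n", encoding="utf-8")
    num_samples = write_samples([tmp / "system_1", tmp / "system_2"],
                                tmp / "samples.txt")
    interleaved = read_lines(tmp / "samples.txt")

print(num_samples, interleaved)
# 2 ['Hallo Welt', 'Hallo, Welt', 'Guten Morgen', 'Guten Tag']
```

The returned count is the value to pass as --num_samples, and the length check guards against misaligned files, which would silently pair translations with the wrong source segments.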
Note that the translation outputs submitted to these translation tasks come from very diverse and black-box systems, including commercial and online machine translation systems. We can’t have any prior expectation about the homogeneity of the translation quality: some of the outputs may be of very low quality and may pull the consensus toward a lower quality. In total, there are:
- 20 outputs for en→de
- 19 outputs for de→en
Let’s do naive experiments first by using all the outputs. The results are presented in the following tables.
In these two tables, “Baseline” indicates the best COMET, chrF, and BLEU scores obtained by using the single best output according to these metrics. “MBR” indicates the scores obtained by the output resulting from the MBR decoding performed with comet-mbr. Then, “Oracle” is an experiment that shows the best scores attainable by choosing, for each segment, the translation that has the best COMET score against the WMT21 reference translation. This is an oracle, i.e., we are cheating by looking at a reference translation.
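For comparison with MBR, the oracle selection can be sketched as below. Again, a toy token-overlap similarity stands in for COMET, and oracle_select and the example sentences are invented for the illustration. The key difference is that the oracle scores each candidate against the reference, which is not available in a real deployment.

```python
def token_jaccard(hypothesis: str, reference: str) -> float:
    """Toy similarity over token sets; a stand-in for COMET."""
    hyp, ref = set(hypothesis.split()), set(reference.split())
    union = hyp | ref
    return len(hyp & ref) / len(union) if union else 1.0


def oracle_select(system_outputs, references, metric=token_jaccard):
    """For each segment, pick the candidate that scores best against the reference."""
    selected = []
    for candidates, reference in zip(zip(*system_outputs), references):
        selected.append(max(candidates, key=lambda c: metric(c, reference)))
    return selected


# Made-up outputs from two hypothetical systems, plus reference translations.
system_outputs = [
    ["the weather is nice today", "she reads a book"],
    ["weather is good today", "she is reading a book"],
]
references = ["the weather is nice today", "she is reading a book"]
print(oracle_select(system_outputs, references))
# ['the weather is nice today', 'she is reading a book']
```

Since the oracle picks the per-segment maximum, its score is an upper bound on what any selection strategy over the same candidates can reach with this metric.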
For both translation directions, MBR decoding is successful. Despite naively using all the available outputs, the COMET scores increased, achieving the new highest scores for these tasks. It works. Even better, chrF and BLEU scores are also slightly improved for de→en. Nonetheless, the oracle indicates that we are still far, in terms of COMET points, from the best translation that we could attain with perfect decoding.
We can try to improve the results a little. You can expect that among all the translation outputs that you have, some are probably much worse than others and should not be considered for MBR decoding. But how do we identify them?
There are many ways depending on your target task. In machine translation, we have reference-less quality estimation metrics. So we can get an idea of the quality of a particular translation relative to the others.
I used COMET-QE to score the entire translation output of each system, i.e., I didn’t use it at the segment level (which I should have done to get even better results), and ranked the outputs to keep only the 5 that obtained the best scores. Then, using only these 5 outputs, I performed a new MBR decoding. The new results are as follows:
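The filtering step itself is simple once you have one quality estimation score per system. The sketch below assumes the scores were already computed elsewhere (for instance with COMET-QE); the function name keep_top_k, the system names, and the numbers are placeholders made up for the illustration, not actual WMT21 results.

```python
def keep_top_k(system_scores: dict[str, float], k: int = 5) -> list[str]:
    """Return the names of the k systems with the highest scores, best first."""
    ranked = sorted(system_scores, key=system_scores.get, reverse=True)
    return ranked[:k]


# Placeholder system-level scores (higher is better); NOT real results.
system_scores = {
    "system_1": 0.12,
    "system_2": 0.31,
    "system_3": -0.05,
    "system_4": 0.27,
    "system_5": 0.18,
    "system_6": 0.02,
    "system_7": 0.22,
}
print(keep_top_k(system_scores))
# ['system_2', 'system_4', 'system_7', 'system_5', 'system_1']
```

Only the outputs of the retained systems are then merged into the samples file, with the number of samples set to k.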
While there isn’t a significant effect for de→en, for en→de keeping only the best outputs according to COMET-QE significantly improves the MBR COMET scores by 1.9 points compared to the previous naive configuration. Note also that since we consider fewer outputs, the oracle performance mechanically decreases but remains very close to the previous configuration. It means that for most segments, the ones with the best COMET scores are in the 5 outputs we kept.
Note that the results could be further improved with better strategies to select the systems to keep for MBR decoding.
MBR decoding works very well in practice.
But before applying MBR decoding, you should have an idea of the performance of all the systems that generated the outputs and (roughly) keep only the best ones.
I find this method particularly useful if you have access to the outputs of many systems and don’t have the possibility or knowledge to evaluate all of them individually.
You can also find a much larger-scale experiment with MBR decoding in my evaluation of WMT22.