The state of the art of machine translation for users and researchers

Orlando, Florida — Photo by Cody Board on Unsplash

Every two years, the machine translation (MT) community meets to discuss recent advances in the field at the AMTA conference, the North American component of the International Association for Machine Translation (IAMT). It is always a very interesting event for people involved in machine translation: researchers, users, the industry, and even governmental organizations publish research papers or present their work. The 2022 edition of AMTA took place in September in Orlando, Florida.

In this article, I highlight and summarize the papers that I found the most original and interesting. I picked papers from both the users track (see the proceedings) and the research track (see the proceedings).

Picking Out the Best MT Model: On the Methodology of Human Evaluation

by Stepan Korotaev (Effectiff) and Andrey Ryabchikov (Effectiff)

The key assumption in this paper is that two or more translated texts of the same length should take approximately the same effort to post-edit if they are translated from different but homogeneous source documents.

Two documents are considered “homogeneous” if:

  • They are of the same domain and genre.
  • They have similar complexity and/or readability scores computed with some selected metrics.
  • They are close in the density of specialized terminology.
  • They have very few specialized terms in common.

They define the “effort” to post-edit as:

  • time spent
  • edit distance
  • percentage of changed segments

Then, if homogeneous documents are translated by different MT systems and one of the translations requires less effort to post-edit, we can conclude that this translation has been generated by a better MT system.
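To make two of these effort measures concrete, here is a minimal sketch of how edit distance and the percentage of changed segments could be computed from MT output and its post-edited version. This is my own illustration, not the authors' tooling: the function names, the character-level edit distance, and the normalization by post-edited length are all assumptions. Time spent is left out since it has to be logged during post-editing.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def post_editing_effort(mt_segments, post_edited_segments):
    """Return two effort proxies for one MT system (hypothetical helper).

    Both lists must be aligned: segment i of the MT output corresponds to
    segment i of the post-edited text.
    """
    if len(mt_segments) != len(post_edited_segments):
        raise ValueError("MT and post-edited segments must be aligned")
    pairs = list(zip(mt_segments, post_edited_segments))
    total_edits = sum(levenshtein(mt, pe) for mt, pe in pairs)
    total_chars = sum(len(pe) for _, pe in pairs)
    changed = sum(1 for mt, pe in pairs if mt != pe)
    return {
        # Assumption: normalize by the length of the post-edited text.
        "normalized_edit_distance": total_edits / max(total_chars, 1),
        "changed_segments_pct": 100.0 * changed / max(len(pairs), 1),
    }


if __name__ == "__main__":
    # Made-up segments, only to show the call.
    mt = ["The contract is signed by both party.", "Payment is due in 30 days."]
    pe = ["The contract is signed by both parties.", "Payment is due in 30 days."]
    print(post_editing_effort(mt, pe))
```

Under the paper's assumption, running this for each MT system on its own homogeneous document and comparing the resulting numbers would indicate which system needed less post-editing.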

This is very intuitive, and the authors show evidence that their assumption holds on an English-to-Russian translation task.

They also acknowledge the limitations of their work, e.g., “time spent” is never a very reliable metric since the post-editors themselves are responsible for measuring it.

All You Need is Source! A Study on Source-based Quality Estimation for Neural Machine Translation

by Jon Cambra Guinea (Welocalize) and Mara Nunziatini (Welocalize)

This is another original work from the users track of the conference. It proposes a different approach to quality estimation (QE) of MT. QE is the process of automatically evaluating the quality of a translation without using any human reference translation. You could say it is an unsupervised evaluation task. This is a very well-studied problem, but the originality of the proposed approach is that it can perform QE before the translation is even produced!

Indeed, this method only exploits the source text to translate and the training data used to train the MT system. The assumption here is that if we know the training data used by the MT system, we should be able to guess how well it will translate a given source text.
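To give an intuition of this assumption, here is a deliberately naive sketch that scores a source sentence by how many of its n-grams already appear on the source side of the training data. This is not the method of the paper, only my illustration of the idea that a source text far from the training data is likely harder to translate. The function names and the choice of unigrams and bigrams are assumptions.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_index(training_sources, max_n=2):
    """Collect every 1..max_n-gram seen on the source side of the training data."""
    index = set()
    for sentence in training_sources:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for n in range(1, max_n + 1):
            index.update(ngrams(tokens, n))
    return index


def source_coverage(source, index, max_n=2):
    """Fraction of the source n-grams that were seen in training (0.0 to 1.0).

    A low value suggests that the MT system has seen little similar data,
    so the translation may be of lower quality.
    """
    tokens = source.lower().split()
    candidates = [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
    if not candidates:
        return 0.0
    seen = sum(1 for g in candidates if g in index)
    return seen / len(candidates)


if __name__ == "__main__":
    # Toy "training data", made up for the example.
    index = build_ngram_index(["the cat sat on the mat", "the dog barked"])
    for text in ["the cat sat", "the cat barked", "quantum flux"]:
        print(text, "->", source_coverage(text, index))
```

Such a score only needs the source text and the training data, which is why it can be computed before any translation is produced.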

In practice, the paper shows that this approach correlates relatively well with state-of-the-art QE metrics such as COMET-QE. Of course, standard QE metrics remain much more accurate, but the proposed approach has several advantages that make it useful in various situations. For instance, it can be used to evaluate the difficulty of translating a given source text, to prioritize and better plan post-editing before it even starts, etc.

One of the main limitations of this work is that we need to know the training data of the MT system. It does not apply to black-box MT systems.

Boosting Neural Machine Translation with Similar Translations

by Jitao Xu (Systran, LIMSI), Josep Crego (Systran), and Jean Senellart (Systran)

Neural MT requires a lot of training data, i.e., translations created by humans in the target domain and language pair. For most use cases, we don’t have enough training data to train an accurate MT system in the target domain.

One way to mitigate the lack of training data is to exploit a “translation memory”: translations previously produced by humans in the same domain and language pair. Then, when translating a sentence, we can check whether there is already a translation in the memory for this sentence. This is the ideal scenario, but most of the time we translate new texts that are not in the memory. In this situation, we can leverage “fuzzy matches.” A fuzzy match is a sentence in the translation memory that is similar, but not identical, to the new sentence that we want to translate.
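Retrieving a fuzzy match can be sketched in a few lines. The example below uses the token-level similarity ratio of Python's difflib as a stand-in for a proper fuzzy match score; the paper relies on its own retrieval methods, so the similarity function, the 0.6 threshold, and the toy memory are all assumptions of mine.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity between two sentences, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_fuzzy_match(source, memory, threshold=0.6):
    """Return (tm_source, tm_target, score) for the closest memory entry.

    `memory` is a list of (source, target) pairs. Returns None when no entry
    reaches the threshold (an arbitrary value chosen for this sketch).
    """
    best, best_score = None, 0.0
    for tm_source, tm_target in memory:
        score = similarity(source, tm_source)
        if score > best_score:
            best, best_score = (tm_source, tm_target), score
    if best is None or best_score < threshold:
        return None
    return best[0], best[1], best_score


if __name__ == "__main__":
    # Toy English-to-French translation memory, made up for the example.
    memory = [
        ("How long does a flight last ?", "Combien de temps dure un vol ?"),
        ("Where is the station ?", "Où est la gare ?"),
    ]
    print(best_fuzzy_match("How long does a cold last ?", memory))
```

A score of 1.0 corresponds to an exact match, i.e., the ideal scenario where the translation can be reused as is.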

Even though a fuzzy match can be quite different from the actual sentence that we want to translate, this work proposes several methods to exploit fuzzy matches to improve the translation quality. They show how to feed the neural model with information on both the source and target sides of the fuzzy matches. This is illustrated in the following figure from the paper for an English-to-French translation:

Screenshot of Figure 2 by Jitao Xu (Systran, LIMSI), Josep Crego (Systran), and Jean Senellart (Systran).

They propose 3 methods to exploit fuzzy matches. The method FM+ is the one that provides the best results. It keeps the entire fuzzy match unchanged but augments it with tags:

  • S for source words;
  • R for unrelated target words;
  • and T for related target words.

I found that FM* performs surprisingly poorly. There is some similarity with what I proposed at NAACL 2019 in my paper: Unsupervised Extraction of Partial Translations for Neural Machine Translation. In my work, I used the term “partial translations” instead of “fuzzy matches,” and I masked (or dropped) the untranslated tokens. Here, Systran masks them with the token “∥”. I am not sure why they chose this token, since it is also used to separate the source and target sentences. I would expect the model to be confused about whether this token announces a target sentence or masks irrelevant text.

The performance of FM+ looks impressive, even though it has only been evaluated with BLEU. Part of this work is open source:

A Comparison of Data Filtering Methods for Neural Machine Translation

by Fred Bane (Transperfect), Celia Soler Uguet (Transperfect), Wiktor Stribizew (Transperfect), and Anna Zaretskaya (Transperfect)

An MT system trained on noisy data may underperform. Filtering the training data to remove the noisiest sentence pairs is almost always necessary. This paper presents an evaluation of different existing filtering methods that identify the types of noise defined by Khayrallah and Koehn (2018):

  • MUSE: Compute sentence embeddings from the MUSE word embeddings for the source and target sentence and then score the sentence pair with cosine similarity.
  • Marian Scorer: Score the sentence pair with a neural MT model.
  • XLM-R: Compute multilingual sentence embeddings for the source and target sentence and then score the sentence pair with cosine similarity.
  • LASER: Get the multilingual sentence embeddings given by LASER and then score the sentence pair with cosine similarity.
  • COMET: Use the wmt-20-qe-da model for quality estimation to score the sentence pair.
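Three of these methods (MUSE, XLM-R, and LASER) share the same recipe: embed the source and target sentences in a common space, then keep the pair only if the cosine similarity is high enough. Here is a minimal sketch of that recipe. The embedding functions are placeholders that you would replace with a real multilingual encoder; the toy vectors and the 0.7 threshold are made up for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, embed_source, embed_target, threshold=0.7):
    """Keep the sentence pairs whose embeddings are similar enough.

    `embed_source` and `embed_target` are hypothetical callables that map a
    sentence to a vector in a shared multilingual space.
    """
    kept = []
    for src, tgt in pairs:
        score = cosine(embed_source(src), embed_target(tgt))
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


# Toy vectors standing in for real sentence embeddings (made up).
TOY_VECTORS = {
    "the cat sleeps": [0.9, 0.1, 0.0],
    "le chat dort": [0.8, 0.2, 0.1],
    "click here to subscribe": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    pairs = [
        ("the cat sleeps", "le chat dort"),             # good pair
        ("the cat sleeps", "click here to subscribe"),  # noisy pair
    ]
    lookup = lambda sentence: TOY_VECTORS[sentence]
    for src, tgt, score in filter_pairs(pairs, lookup, lookup):
        print(f"{score:.3f}\t{src}\t{tgt}")
```

The Marian scorer and COMET differ in that they score the pair directly with a model, but the final filtering step is the same: sort or threshold on the score.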

They found that the Marian scorer is the best tool to filter the sentence pairs. This is not very surprising to me since this scorer is the only tool that exploits a model trained on their data. Nonetheless, the paper is extremely convincing thanks to an evaluation well above the standard of machine translation research:

  • They used different automatic metrics: BLEU, TER, and chrF.
  • The computed scores can be cited in future work thanks to the use of SacreBLEU.
  • They performed statistical significance testing.
  • They performed a human evaluation with the MQM framework.

Following the scale I proposed in my ACL 2021 paper, their evaluation would get a meta-evaluation score of 4, which is the maximum.

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

by Ali Araabi (University of Amsterdam), Christof Monz (University of Amsterdam), and Vlad Niculae (University of Amsterdam)

This paper presents an overdue study on how well BPE mitigates the difficulty of translating words that are not in the training data (OOV).

Technically, when using BPE, there are no OOV words since the words are decomposed into smaller BPE tokens that are all in the MT model vocabulary. Nonetheless, the sequence of BPE tokens that forms the OOV word remains unseen in the training data.
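A toy example makes this clearer. The sketch below applies a small, made-up list of BPE merges to a word, always merging the highest-priority pair first. With merges that could have been learned from words like “low,” “lower,” and “newest,” the unseen word “lowest” is split into known pieces, but the sequence of pieces itself was never observed. The merge list and the function name are assumptions, and real implementations also handle end-of-word markers, which I omit here.

```python
def apply_bpe(word, merges):
    """Segment a word with an ordered list of BPE merges.

    `merges` is a list of symbol pairs, from highest to lowest priority.
    The word starts as a sequence of characters, and the best-ranked
    adjacent pair is merged repeatedly until no known pair remains.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no adjacent pair is a known merge
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Made-up merges, in the order they would have been learned.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("e", "s"), ("es", "t")]

if __name__ == "__main__":
    print(apply_bpe("lower", TOY_MERGES))   # ['low', 'er']
    print(apply_bpe("lowest", TOY_MERGES))  # ['low', 'est']
```

Both “low” and “est” are in the vocabulary, so the model can process “lowest,” but it has to translate a combination of tokens that it never saw during training.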

Among the various interesting findings, I first note that some types of OOV words are translated better thanks to the use of BPE, especially named entities. For the other types of OOV words, BPE also helps, but not significantly. Moreover, in their attempt to better understand how BPE helps, the authors demonstrated that the translation quality of OOV words is strongly correlated with the amount of Transformer attention they receive.

The paper highlights yet another weakness of BLEU in evaluating translation quality. As demonstrated by Guillou et al. (2018) at WMT18, BLEU is almost insensitive to local errors. Consequently, when an OOV word is mistranslated without affecting the remainder of the translation, the BLEU score barely changes. Instead of BLEU, the authors recommend human evaluation to accurately evaluate the translation of OOV words.

Consistent Human Evaluation of Machine Translation across Language Pairs

by Daniel Licht (META AI), Cynthia Gao (META AI), Janice Lam (META AI), Francisco Guzman (META AI), Mona Diab (META AI), and Philipp Koehn (META AI, Johns Hopkins University)

I highlight this paper for the very thorough and straightforward human evaluation framework it proposes. It is so well designed that it fits on one page, with examples, as follows:

Screenshot of Figure 1 by Daniel Licht (META AI), Cynthia Gao (META AI), Janice Lam (META AI), Francisco Guzman (META AI), Mona Diab (META AI), and Philipp Koehn (META AI, Johns Hopkins University).

More particularly, the scoring obtained with this framework (denoted XSTS) is focused on achieving meaningful scores for ranking MT systems. The framework has been evaluated on a large number of language pairs.


I only highlighted the papers that I found the most original and interesting. I encourage you to have a closer look at the proceedings of the conference. Note also that there were several workshops focused on very particular MT topics that I didn’t cover at all in this article.
