More robust evaluation metrics, language models that don’t understand anything, and better evaluation for grammatical error correction
COLING 2022 was held in mid-October in Gyeongju (Republic of Korea).
This natural language processing (NLP) conference received 2253 submissions from all over the world, of which only 632 (28.1%) were accepted for publication by the 1935 reviewers and 44 senior area chairs of the program committee.
For these highlights, I selected 6 papers that caught my attention.
Layer or Representation Space: What makes BERT-based Evaluation Metrics Robust?
by Doan Nam Long Vu (Technical University of Darmstadt), Nafise Sadat Moosavi (The University of Sheffield), and Steffen Eger (Bielefeld University)
Recent metrics for natural language generation rely on pre-trained language models, for instance, BERTScore, BLEURT, and COMET. These metrics achieve a high correlation with human evaluations on standard benchmarks. However, it is unclear how these metrics perform for styles and domains that aren’t well represented in their training data.
In other words, are these metrics robust?
The authors found that BERTScore isn’t robust to character-level perturbations. For instance, inserting or removing a few characters in the evaluated sentences significantly decreases the correlation with human evaluations.
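A character-level perturbation of this kind is easy to sketch. The snippet below is an illustrative stand-in, not the paper's exact perturbation procedure (the function name and edit scheme are my own):

```python
import random

def perturb_chars(sentence: str, n_edits: int = 2, seed: int = 0) -> str:
    """Randomly insert or delete characters to simulate noisy text.
    Illustrative sketch only; the paper's perturbations may differ."""
    rng = random.Random(seed)
    chars = list(sentence)
    for _ in range(n_edits):
        if rng.random() < 0.5 and len(chars) > 1:
            # delete a random character
            del chars[rng.randrange(len(chars))]
        else:
            # insert a random lowercase letter at a random position
            pos = rng.randrange(len(chars) + 1)
            chars.insert(pos, rng.choice("abcdefghijklmnopqrstuvwxyz"))
    return "".join(chars)

noisy = perturb_chars("The cat sat on the mat.")
```

Feeding such lightly corrupted candidates to a metric and re-measuring its correlation with human judgments is the basic robustness test.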
The authors show that using a model with character embeddings, such as ByT5, instead of a standard BERT model, makes BERTScore more robust, especially if we use the embeddings from the first layer.
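BERTScore's core computation, a greedy cosine-similarity matching between candidate and reference token embeddings, is unchanged when the embeddings come from a character-level model like ByT5 instead of BERT. A minimal numpy sketch of that matching step, with random vectors standing in for model embeddings (the real metric also applies optional IDF weighting and rescaling, omitted here):

```python
import numpy as np

def bertscore_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy-matching F1 over cosine similarities, as in BERTScore.
    cand_emb: (n_cand_tokens, dim), ref_emb: (n_ref_tokens, dim)."""
    # L2-normalize so dot products are cosine similarities
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T                  # (n_cand, n_ref) cosine matrix
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    return 2 * precision * recall / (precision + recall)

rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))
score = bertscore_f1(emb, emb)  # identical embeddings -> score of 1.0
```

The paper's finding amounts to saying that where these embeddings come from, which model and which layer, determines how gracefully `sim` degrades under noisy input.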
In my opinion, this is an outstanding work with possible applications to a wide range of natural language generation tasks.
What I conclude from this paper is that word embeddings-based metrics such as the original BERTScore are probably not well suited to evaluating tasks involving user-generated texts, i.e., the kind of text that can contain a lot of grammatical errors, such as texts from online discussion platforms. In my opinion, this adaptation of BERTScore with ByT5 could improve the evaluation of user-generated texts.
Note: This paper received an outstanding paper award from the conference.
Grammatical Error Correction: Are We There Yet?
by Muhammad Reza Qorib (National University of Singapore) and Hwee Tou Ng (National University of Singapore)
This paper first shows that recent approaches for grammatical error correction (GEC) seem to outperform humans on standard benchmarks.
Interestingly, the authors found that the evaluated GEC systems fail to correct a significant number of sentences in standard GEC benchmarks, which isn’t the case for humans.
GEC systems seem to fail more often at correcting unnatural phrases, long sentences, and complex sentence structures.
The authors conclude that GEC systems are still far from human performance but that current benchmarks are somewhat too easy for GEC systems. They suggest the creation of new benchmarks focusing on the grammatical errors that remain difficult to correct by GEC systems.
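One simple way to surface this gap, in the spirit of the paper's analysis (the exact metric used there may differ), is to measure the fraction of benchmark sentences a system corrects completely, rather than relying only on edit-level scores. A hypothetical sketch:

```python
def fully_corrected_rate(hypotheses, references):
    """Fraction of sentences where the system output exactly matches
    at least one gold-standard correction (illustrative sketch)."""
    assert len(hypotheses) == len(references)
    hits = sum(
        1 for hyp, refs in zip(hypotheses, references)
        if hyp.strip() in {r.strip() for r in refs}
    )
    return hits / len(hypotheses)

rate = fully_corrected_rate(
    ["She goes to school.", "He go to work."],
    [["She goes to school."], ["He goes to work."]],
)  # -> 0.5
```

A system can score well on edit-level metrics while still leaving many sentences with at least one residual error, which is exactly the discrepancy the authors highlight.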
I especially like this work for pointing out some of the actual limits of GEC systems. While recent work praises the super-human performance of GEC systems, this paper helps temper expectations and motivates future research to further improve GEC systems so that they can finally achieve performance comparable to humans.
So to answer the title of the paper: No, GEC systems are not there yet.
Machine Reading, Fast and Slow: When Do Models “Understand” Language?
by Sagnik Ray Choudhury (University of Michigan, University of Copenhagen), Anna Rogers (University of Copenhagen), and Isabelle Augenstein (University of Copenhagen)
This is yet another work showing that large language models don’t understand anything.
They evaluated 5 language models on two linguistic skills: comparison and coreference resolution.
Their results clearly show that all the models rely on specific lexical patterns and not on the information that a human would use to perform well at these tasks.
They show this by exposing the models to out-of-distribution counterfactual perturbations: the models don’t know how to handle them and significantly underperform, suggesting that they merely memorize lexical patterns rather than “understand.”
I find this paper especially interesting for the approach chosen to demonstrate that computers and humans don’t process text the same way.
On the Complementarity between Pre-Training and Random-Initialization for Resource-Rich Machine Translation
by Changtong Zan (China University of Petroleum), Liang Ding (JD Explore Academy), Li Shen (JD Explore Academy), Yu Cao (The University of Sydney), Weifeng Liu (China University of Petroleum), and Dacheng Tao (JD Explore Academy, The University of Sydney)
Resource-rich machine translation is one of the tasks that have not yet benefited from pre-trained language models (LMs).
With this work, the authors propose a study to better understand when and how pre-trained language models can be useful to initialize machine translation systems in resource-rich scenarios.
They first show that, while it has almost no effect on translation accuracy, an initialization with a pre-trained LM leads to flatter loss landscapes and smoother lexical probability distributions.
From these observations, they assume that an initialization with a pre-trained LM may lead to better translation accuracy for out-of-domain datasets, while a random initialization will be better at translating in-domain datasets. They confirmed these assumptions with empirical experiments.
Finally, they propose a harmonization of pre-trained LM and random initialization to get the best of both in the same training run.
I’m not sure whether this work will finally be the one that motivates the integration of pre-trained LMs into resource-rich machine translation. Nonetheless, I think it is worth highlighting to show that researchers are still actively working on it.
Alleviating the Inequality of Attention Heads for Neural Machine Translation
by Zewei Sun (ByteDance AI Lab), Shujian Huang (Nanjing University, Peng Cheng Laboratory), Xin-Yu Dai (Nanjing University), and Jiajun Chen (Nanjing University)
Previous work has shown that all attention heads in machine translation aren’t equally important.
From this observation, this work proposes a novel “head mask” to force the model to better balance training across the attention heads.
Two very simple methods are described for head masking: random masking and masking important heads.
They observed (slight) BLEU improvements for various language pairs with both methods. Random masking seems to perform better, though the difference doesn’t look significant. However, masking important heads is more successful at balancing the importance of the heads during training.
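The masking idea itself is simple enough to sketch: during training, zero out the outputs of a few attention heads so the remaining ones are forced to carry the signal. A minimal numpy sketch of the random variant (function names and shapes are my own; the paper's implementation details may differ):

```python
import numpy as np

def random_head_mask(num_heads: int, num_masked: int, rng) -> np.ndarray:
    """Return a {0, 1} mask over attention heads with `num_masked` heads zeroed."""
    mask = np.ones(num_heads)
    masked = rng.choice(num_heads, size=num_masked, replace=False)
    mask[masked] = 0.0
    return mask

def apply_head_mask(head_outputs: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """head_outputs: (num_heads, seq_len, head_dim); broadcast the mask per head."""
    return head_outputs * mask[:, None, None]

rng = np.random.default_rng(0)
outputs = rng.normal(size=(8, 4, 16))  # 8 heads, toy dimensions
masked = apply_head_mask(outputs, random_head_mask(8, 2, rng))
```

The “important head” variant would replace the random choice with a selection based on a head-importance score; either way, the mask is applied only during training, which is why it slots easily into existing frameworks.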
I particularly like the simplicity of the method. This head masking could be easily implemented in existing machine translation frameworks.
Paraphrase Generation as Unsupervised Machine Translation
by Xiaofei Sun (Zhejiang University, Shannon.AI), Yufei Tian (University of California), Yuxian Meng (Shannon.AI), Nanyun Peng (University of California), Fei Wu (Zhejiang University), Jiwei Li (Zhejiang University, Shannon.AI), and Chun Fan (Peking University)
Most of the training datasets available to train supervised paraphrase generation models are in English, limited to a few domains, and small.
To alleviate these limitations, recent work has proposed unsupervised paraphrase generation. These approaches only require large amounts of text in the language of interest.
In this paper, the authors propose a new approach inspired by unsupervised machine translation (UMT).
UMT requires two monolingual corpora, one in the source language and one in the target language. Word embeddings for both languages are first jointly learned and used to initialize a machine translation system. Then, the model is refined using a combination of auto-encoding and back-translation losses.
For paraphrase generation, there are no source and target languages. The authors propose to work instead at the domain (or topic) level: the source and target languages of UMT become source and target domains. To obtain these domains, they perform LDA and k-means clustering on monolingual datasets, where each cluster is (potentially) a different domain. Then, they train UMT models for multiple domain pairs. Finally, a single MT model is trained on 25 million sentence pairs generated by the UMT models trained earlier.
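The domain-construction step can be sketched with a small k-means over sentence vectors. This is a toy stand-in for the paper's LDA + k-means pipeline (the real system clusters topic distributions of large corpora; the function below and its farthest-point initialization are my own simplification):

```python
import numpy as np

def kmeans_domains(X: np.ndarray, k: int, n_iter: int = 20) -> np.ndarray:
    """Minimal k-means with farthest-point initialization.
    Returns a cluster ("domain") id per row of X."""
    # farthest-point init: deterministic, spread-out starting centers
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # assign each vector to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # recompute centers, keeping old ones for empty clusters
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy "sentence vectors": two well-separated blobs -> two domains
X = np.vstack([np.zeros((5, 4)), np.full((5, 4), 5.0)])
domains = kmeans_domains(X, k=2)
```

Once each sentence carries a domain id, any pair of domains can play the role of a UMT language pair.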
They evaluate their approach with extensive experiments to demonstrate improvements over previous work. The evaluation is very convincing:
- They used 3 different automatic metrics: iBLEU, BLEU, and ROUGE
- They reproduced several baseline systems from previous work
- They experimented on 4 different benchmarks
- They performed a human evaluation
As someone who has worked a lot on unsupervised machine translation, I was expecting it to be applied to paraphrasing at some point but couldn’t see how. In this paper, the clustering of monolingual data into topics/domains seems to be the main reason why it works.
While the improvements are convincing, I can’t fully understand why it works so well. The need for clustering isn’t very well motivated in the paper, yet it seems to be a crucial part of this work. Also, the authors don’t discuss in detail why their approach is better than previous work: what are the limits of previous unsupervised paraphrase generation methods that this work addresses?
I selected here only a few papers among the 632 papers published. I encourage you to have a closer look at the full proceedings and workshops.
If you are interested in recent advances in machine translation, you can also have a look at my AMTA 2022 highlights: