Tunable text sanitization, penalized oversmoothing, and cross-lingual language models for unseen languages
AACL 2022 was held jointly with IJCNLP from the 20th to the 23rd of November.
This was the second edition of AACL, the Asian chapter of the Association for Computational Linguistics (ACL). It was held exclusively online (it was originally planned to take place in Taiwan).
Since AACL is still a young conference, it receives far fewer submissions than its European and North American counterparts. Out of the 554 papers submitted, 147 were accepted.
This is a 26.5% acceptance rate, which is close to the acceptance rates of the other conferences organized by the ACL. In other words, AACL is a very selective conference.
I reviewed the proceedings of the conference and picked several papers that I found particularly interesting.
How do I select my highlights?
My highlights are not necessarily the best papers of the conference.
I mainly select papers that are tackling trending topics, such as fairness in AI and cross-lingual NLP.
I also try to select papers that draw conclusions applicable to a wide range of NLP applications, and/or that I think will have a significant impact on future work.
This selection is also biased by my own expertise in natural language generation and machine translation, i.e., I am more attentive to papers published in these areas.
Neural Text Sanitization with Explicit Measures of Privacy Risk
by Anthi Papadopoulou (University of Oslo), Yunhao Yu (École Polytechnique), Pierre Lison (Norwegian Computing Center), and Lilja Øvrelid (University of Oslo)
NLP models are increasingly data-hungry. To avoid transferring private information to the model, and to prevent privacy violations when using the model, text sanitization should be applied. This is the task of masking any private information in text, such as names and places, i.e., information that could help to identify an individual.
Text sanitization is a particularly challenging task since many clues that may look uninformative when taken individually could lead to personal identification when put together. Identifiers can be categorized as follows:
- Direct: name, social network identifier, car plate number, etc.
- Indirect: date of birth, ethnicity, nationality, geolocalisation data points, etc.
In this work, the authors propose an approach that differs from previous work: it introduces a new parameter to control the trade-off between privacy protection and the usefulness of the data. Indeed, previous approaches usually either “damage” the text by removing a lot of useful information (indirect identifiers) to protect privacy, or are not aggressive enough to guarantee privacy protection.
This paper is very convincing on the need for a trade-off parameter. It is easy to imagine documents that require more or less aggressive sanitization depending, for instance, on the domain (medical records, product reviews, etc.).
Their approach is decomposed into 3 steps, as illustrated below:
They first apply an entity recognizer specially adapted to focus on entities that could be direct or indirect identifiers. Then, the method estimates which entities are the most likely to identify individuals, using 3 different components whose outputs are combined into a score that can be used to tune the aggressiveness of the sanitization.
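To make the tunable trade-off concrete, here is a minimal sketch of such a sanitization step (my illustration, not the authors' implementation; the entity detector and the risk scorer are abstracted into two hypothetical callables standing in for the paper's components):

```python
# Minimal sketch of a tunable sanitization step (illustrative only, not the
# authors' implementation). `detect_identifiers` and `risk_score` stand in
# for the paper's entity recognizer and its risk-assessment components.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Entity:
    text: str
    start: int
    end: int
    label: str  # e.g. "PERSON", "DATE", "LOCATION"

def sanitize(text: str,
             detect_identifiers: Callable[[str], List[Entity]],
             risk_score: Callable[[Entity, str], float],
             threshold: float = 0.5) -> str:
    """Mask every detected entity whose estimated privacy risk exceeds
    `threshold`. Lowering the threshold gives more aggressive sanitization
    (better privacy, less useful text); raising it preserves more content."""
    entities = detect_identifiers(text)
    # Mask from the end of the string so character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if risk_score(ent, text) >= threshold:
            text = text[:ent.start] + "***" + text[ent.end:]
    return text
```

The threshold is exactly the kind of knob the paper argues for: a hospital could set it very low for medical records, while a review platform could keep it higher to preserve the usefulness of the text.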
I think this is an impressive work. It remains simple and intuitive while providing a tunable alternative to standard text sanitization methods.
Is Encoder-Decoder Redundant for Neural Machine Translation?
by Yingbo Gao, Christian Herold, Zijian Yang, and Hermann Ney (RWTH Aachen University)
Standard neural machine translation (NMT) uses an encoder-decoder architecture. In contrast, generative pre-trained language models only use an auto-regressive decoder, without any separate encoder, but are still able to perform translation. For instance, GPT-3 can translate.
This work by Gao et al. investigates whether the encoder and the decoder are redundant in machine translation. In other words, can’t we just simplify the NMT architecture by concatenating the source and target sentences, and then train a model on the resulting sequence with a language modeling objective?
To answer this question, they performed extensive experiments on very diverse language pairs: German-English, English-Romanian, and English-Chinese, as well as a multilingual setting in which several languages (French, German, and Spanish) are mixed in the same training data.
Their main observations are as follows:
- Translation as a language modeling task performs on par with the standard encoder-decoder architecture.
- The auto-encoding and the cross-entropy losses yield similar results.
- A full attention mask over the source is necessary (see the sketch after this list).
- BERT-style noise is marginally helpful to better train the model.
- The language model needs as many parameters as the encoder-decoder version to achieve similar performance.
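As a rough illustration of what this setup can look like (my sketch, not the authors' code; the separator token and masking details are assumptions), the source and target sentences are concatenated into a single sequence, the loss is computed only on the target tokens, and the attention mask lets every position see the full source while keeping the target auto-regressive:

```python
# Sketch of building one training example for "translation as language
# modeling" (illustrative; the exact separator and masking differ in the paper).
import torch

def build_example(src_ids, tgt_ids, sep_id):
    # Concatenate the source and the target into one sequence: [src] <sep> [tgt]
    input_ids = torch.tensor(src_ids + [sep_id] + tgt_ids)
    n_src = len(src_ids) + 1      # source tokens + separator
    n = input_ids.size(0)

    # Prefix-LM style mask: every position may attend to the whole source
    # (full attention over the source), while target positions only attend
    # to previous positions (causal attention).
    # mask[i, j] == True means position i may attend to position j.
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))
    mask[:, :n_src] = True

    # Language modeling loss restricted to the target side.
    labels = input_ids.clone()
    labels[:n_src] = -100         # a common "ignore" value for cross-entropy
    return input_ids, mask, labels
```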
This considerably simplifies the current architecture of NMT systems. We can expect translation to be tackled more often as a language modeling task in the future.
One limitation of this work resides in the evaluation. They only used BLEU and TER, two very limited metrics, to draw their conclusions. Given the small differences between the reported scores and the poor correlation of these two metrics with human judgments, it is possible that the conclusions drawn by the authors don’t hold in other translation tasks.
For more on BLEU, see my overview: BLEU: A Misunderstood Metric from Another Age
Cross-lingual Few-Shot Learning on Unseen Languages
by Genta Winata (Bloomberg), Shijie Wu (Bloomberg), Mayank Kulkarni (Amazon Alexa AI), Thamar Solorio (Bloomberg), and Daniel Preotiuc-Pietro (Bloomberg)
Can large language models be applied to languages unseen in the training data?
While the cross-lingual ability of language models is a trending research topic, their ability to model unseen languages remains largely understudied.
This work investigates how few-shot learning performs for languages unseen during pre-training, using 3 different multilingual models: XLM-R, mT5, and XGLM, applied to a task of sentiment classification (NusaX).
They found that fine-tuning the pre-trained model is the best strategy if we have at least 15 examples for few-shot learning. In-context learning, i.e., prompting the model with examples at inference time, seems to yield lower results.
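To picture the fine-tuning setting, here is a minimal sketch with Hugging Face transformers (my illustration, not the authors' code; the training examples and hyperparameters below are placeholders, not NusaX data):

```python
# Minimal sketch of few-shot fine-tuning of XLM-R for sentiment classification.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3)  # negative / neutral / positive

# Hypothetical few-shot training set (~15 examples per class in practice).
texts = ["placeholder positive sentence", "placeholder negative sentence"]
labels = [2, 0]

class FewShotDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="xlmr-fewshot", num_train_epochs=20,
                         per_device_train_batch_size=4, learning_rate=2e-5)
Trainer(model=model, args=args,
        train_dataset=FewShotDataset(texts, labels)).train()
```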
One caveat: they tried XLM-R, mT5, and XGLM, but also changed the few-shot learning strategy at the same time, e.g., mT5 is never fine-tuned in the experiments but only prompted with a few examples, while XLM-R is fine-tuned but never prompted with examples. The authors conclude that fine-tuning is the best strategy, but this may be due to XLM-R itself rather than to fine-tuning.
Nonetheless, the fact that a language model can model a language unseen during pre-training, using only few-shot learning, is truly remarkable.
Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling
by Ilia Kulikov, Maksim Eremeev, and Kyunghyun Cho (New York University)
Autoregressive neural models may assign a very high probability to unreasonably short sequences. In the extreme case, the sequence may even be empty.
It is possible to quantify how often unreasonably short sequences are generated, i.e., when the probability mass is oversmoothed towards short sequences.
The authors of this paper propose to quantify this with an “oversmoothing rate”.
They first show that models often exhibit a high oversmoothing rate: they generate sequences that are shorter than they should be.
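To give an intuition of what such a rate can measure, here is a rough sketch in my own words (not the paper's exact definition, which is given in section 3.1): for each prefix of a reference sequence, check whether the model assigns at least as much log-probability to stopping immediately (emitting EOS) as to producing the entire remaining reference, and report the fraction of prefixes where this happens.

```python
# Rough sketch of an oversmoothing-rate computation (a paraphrase of the idea,
# not the paper's exact code). `log_probs[t]` is assumed to be the model's
# log-probability distribution over the vocabulary at step t, given the
# reference prefix y_<t (teacher forcing).
import torch

def oversmoothing_rate(log_probs: torch.Tensor,
                       reference: torch.Tensor,
                       eos_id: int) -> float:
    T = reference.size(0)                              # reference ends with EOS
    token_lp = log_probs[torch.arange(T), reference]   # log p(y_t | y_<t)
    # log-probability of the full remaining suffix starting at each position t
    suffix_lp = torch.flip(torch.cumsum(torch.flip(token_lp, [0]), 0), [0])
    eos_lp = log_probs[:, eos_id]                      # log p(EOS | y_<t)
    # count prefixes (before the final position) where stopping early is
    # at least as likely as producing the whole remaining reference
    premature = (eos_lp[:-1] >= suffix_lp[:-1]).float()
    return premature.mean().item()
```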
Then, they propose a method to reduce the oversmoothing rate during training for models using a negative log-likelihood objective.
Concretely, they add a regularization component in the form of an oversmoothing loss. This loss has two terms: the first term pushes down the probability of the EOS token (i.e., the token that stops the generation when generated) at positions where the sequence would be cut too short, while the second term prevents a short sequence from being more likely than the full sequence. For the mathematical details, I invite you to read the paper (section 3.1).
Note that the authors recommend first training the model without the oversmoothing loss, since it tends to promote degenerate sequences. The oversmoothing loss can then be introduced, in addition to the original negative log-likelihood objective, in a second training stage. The contribution of the oversmoothing loss during this second stage can be tuned with a weight α.
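Reading this description literally, the second-stage objective can be written as below; this is my notation, and the exact weighting scheme and the margin used inside the oversmoothing loss are the ones given in section 3.1 of the paper.

```latex
% A plausible way to write the second-stage objective, following the
% description above: the original NLL term plus the oversmoothing
% penalty weighted by alpha.
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{NLL}}(\theta) + \alpha \, \mathcal{L}_{\mathrm{os}}(\theta)
```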
The proposed method is evaluated in machine translation. As we can see in the following figure, the method indeed strongly reduces the oversmoothing rate.
In my opinion, what is most striking in their results is the benefit of the oversmoothing loss when decoding with a large beam size. As illustrated below, BLEU scores improve significantly for higher values of α, especially with a beam size of 100.
They show that the sentence lengths are much closer to the lengths of the reference sentences.
This work is also a significant advance toward better beam search decoding in deep learning.
Systematic Evaluation of Predictive Fairness
by Xudong Han (The University of Melbourne), Aili Shen (Amazon Alexa AI), Trevor Cohn (The University of Melbourne), Timothy Baldwin (The University of Melbourne), and Lea Frermann (The University of Melbourne, MBZUAI)
I found this work particularly interesting for the many questions it answers on the effectiveness of debiasing methods applied to datasets.
In short, their main conclusion is that there is no universally best method for removing bias from a dataset.
The effectiveness of bias removal largely depends on the data conditions, rather than on the debiasing method used.
In this work, the data conditions are a set of properties that quantify how balanced a dataset is in its representation of demographic subgroups, stereotypes, etc. Intuitively, if these conditions are not properly quantified, the evaluation of a debiasing method may lead to wrong conclusions.
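As a toy illustration of what quantifying one such data condition could look like (my simplified example, not the measures used in the paper), one can tabulate how a demographic attribute is distributed within each class label:

```python
# Toy illustration of inspecting one data condition: how balanced a protected
# attribute (e.g. a demographic subgroup) is within each class label.
from collections import Counter

def subgroup_balance(labels, groups):
    """Return, for each label, the proportion of examples in each subgroup."""
    counts = Counter(zip(labels, groups))
    per_label_total = Counter(labels)
    return {(y, g): c / per_label_total[y] for (y, g), c in counts.items()}

# Hypothetical dataset: a sentiment label and a binary demographic attribute.
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
groups = ["A",   "A",   "B",   "B",   "B",   "B"]
print(subgroup_balance(labels, groups))
# -> roughly {("pos", "A"): 0.67, ("pos", "B"): 0.33, ("neg", "B"): 1.0}
```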
The experiments to assess the impact of these conditions are extensive and span multiple tasks: binary and multi-class classification, and regression.
According to the results presented, it is not possible to properly evaluate debiasing methods on the standard benchmarks used in previous work. Standard benchmarks provide an incomplete picture that favors a specific method depending on the data conditions.
This implies that previous work should be re-evaluated with the data conditions better taken into account.
I think this is a very significant advance in the field of fairness in NLP. The conclusion of this paper will encourage better evaluation practices for debiasing methods and the creation of new datasets for benchmarking.
One of the limits of this work is that it mainly focuses on “group fairness”, i.e., whether a model performs equally well regardless of the demographic subgroup. Other types of bias are not studied. This is acknowledged by the authors, who assume that their conclusions should also apply to other types of bias.
The lack of theory is painful: Modeling Harshness in Peer Review Comments
by Rajeev Verma, Rajarshi Roychoudhury (Jadavpur University), and Tirthankar Ghosal (Charles University)
This is a very interesting study that first reminds us of how “traumatic” the peer-review process can be for authors of scientific papers.
Indeed, reviewers of scientific papers are often harsh and hurtful, and may forget that they are possibly reviewing the very first paper of a young researcher.
Believe me, I have been there, as have most research scientists. A harsh review never helps to improve a paper, and it can take some time to find the strength and motivation to submit again.
Even though problematic reviews may be identified by a meta-reviewer or a senior program committee member, this becomes an increasingly difficult task due to the huge volume of reviews to process as AI/ML conferences become more popular.
To help with this task, this work proposes an automatic method to evaluate the “harshness” of a review. The ultimate goal here is to flag and edit the reviews before they are sent to the authors.
This work gives the following example of a harsh review, from NeurIPS 2021:
“I do have experience with social science research, and this paper lacks insightfulness or originality from that perspective, so I recommend rejection,[…]”
“This paper will eventually be published somewhere, but it won’t have a great impact.”
It sounds harsh and it is not constructive, but I have seen much worse.
To train a model for harshness prediction, the authors compiled and publicly released a dataset of reviews annotated for their harshness.
They fine-tuned a BERT model on this dataset to classify the harshness of reviews and obtained an accuracy higher than 70%.
My opinion is that this kind of initiative is very much needed. The peer-reviewing process is well known to have many problems, and automatic methods such as this one will help to fix them. I can even imagine the harshness score being computed on the fly to warn the reviewer that a review is possibly too harsh and should be edited.
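Here is a small sketch of what such an on-the-fly check could look like, assuming a BERT classifier fine-tuned on their harshness dataset is available locally (the model path and the label name are hypothetical):

```python
# Sketch of an on-the-fly harshness warning (not the authors' released code).
# "./harshness-bert" stands for a hypothetical fine-tuned harshness classifier.
from transformers import pipeline

classifier = pipeline("text-classification", model="./harshness-bert")

review = ("This paper lacks insightfulness or originality, "
          "so I recommend rejection.")
result = classifier(review)[0]   # e.g. {"label": "harsh", "score": 0.87}

if result["label"] == "harsh" and result["score"] > 0.8:
    print("Warning: this review may be too harsh; consider rewording it.")
```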
Conclusion
Of course, this is just a glance at the program of AACL-IJCNLP 2022. You will find the full proceedings here:
Asian Chapter of the Association for Computational Linguistics (2022) – ACL Anthology
I haven’t yet had a look at the papers published in “Findings”, but there are always great papers there, as well as in the co-located workshops.