Do Bigger Evaluation Datasets Make Your Results More Significant?
The size of the test set shouldn’t have any impact on the evaluation, provided that it has been correctly created. Increasing its size shouldn’t change the p-value of…
Are we at a turning point? My conclusions from the annotation of 1,000+ scientific papers
Machine translation has seen many breakthroughs in its 70 years of existence. The prospect of machines replacing human translators has been predicted and debated since the inception…
Since 2010, 100+ automatic metrics have been proposed to improve machine translation evaluation. In this article, I present the most popular metrics that are used as alternatives, or in addition,…
BLEU is an extremely popular evaluation metric for AI. It was originally proposed 20 years ago for machine translation evaluation, but it is nowadays commonly used in many natural language processing (NLP)…
But how good is PaLM at translation compared to standard encoder-decoder machine translation systems?
AACL 2022 was held jointly with IJCNLP from the 20th to the 23rd of November. This was the second edition of AACL, the Asian chapter of the Association for…
In this article, we will go back 20 years to examine the main reasons that brought BLEU into existence and made it such a successful metric. We will look…
Whisper is evaluated on 6 tasks (section 3 of the research paper). I demonstrate that the conclusions drawn from 3 of these evaluation tasks are flawed ❌ or misleading ❌.