Do Bigger Evaluation Datasets Make Your Results More Significant?
The size of the test set shouldn’t have any impact on the evaluation, provided that the test set has been correctly created. Increasing its size shouldn’t change the p-value of…
Are we at a turning point? My conclusions from the annotation of 1,000+ scientific papers
Since 2010, 100+ automatic metrics have been proposed to improve machine translation evaluation. In this article, I present the most popular metrics that are used as alternatives, or in addition,…
BLEU is an extremely popular evaluation metric for AI. It was originally proposed 20 years ago for machine translation evaluation, but it is now commonly used in many natural language processing (NLP)…
But how good is PaLM at translation compared to the standard machine translation encoder-decoder approach?
In this article, we will go back 20 years to examine the main reasons that brought BLEU into existence and made it a very successful metric. We will look…
Whisper is evaluated on 6 tasks (section 3 of the research paper). I demonstrate that the conclusions drawn from 3 of these evaluation tasks are flawed ❌ or misleading ❌.
A rule of thumb may yield correct results, but it can't be scientifically credible. Take any research paper or blog post presenting a new method for AI,…
As it has every year since 2006, the Conference on Machine Translation (WMT) organized its machine translation shared tasks. Numerous participants from all over the world submitted their machine translation (MT) outputs…