Do Bigger Evaluation Datasets Make Your Results More Significant?
The size of the test set shouldn’t have any impact on the evaluation, provided that the test set has been correctly created. Increasing its size shouldn’t change the p-value of…
Are we at a turning point? My conclusions from the annotation of 1,000+ scientific papers
In this article, we will go back 20 years to expose the main reasons that brought BLEU into existence and made it a very successful metric. We will look…
Whisper is evaluated on 6 tasks (section 3 of the research paper). I demonstrate that the conclusions drawn from 3 of these evaluation tasks are flawed ❌ or misleading ❌.