12 Critical Flaws of BLEU
Why you shouldn’t trust BLEU, according to 37 studies published over 20 years
BLEU is an extremely popular evaluation metric for AI.
It was originally proposed 20 years ago for machine translation evaluation, but it is now commonly used in many natural language processing (NLP) tasks. BLEU has also recently been used to evaluate large…