Evaluation Metrics in Natural Language Processing — BLEU

Priyanka
7 min read · Nov 14, 2022

In this series of posts, we are going to discuss evaluation metrics that are specific to use-cases in the area of Natural Language Processing. The most common tasks in NLP include automatic summarization, question answering, and machine translation. The goal of these metrics is to quantify the quality of the predicted text given the input text. The predicted text is referred to as the Candidate, and the possible correct or target texts are called References.

These metrics build on basic metrics like Recall, Precision, and F1-Score. If you are not familiar with them, check out this article covering them: Recall, Precision, F1-Score. The concept of n-grams is also essential for calculating and understanding these metrics.
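As a quick refresher, an n-gram is simply a contiguous sequence of n tokens from a text. Here is a minimal sketch of n-gram extraction in plain Python (the `ngrams` helper is just for illustration, not from any particular library):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))
# → [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Metrics like BLEU compare the n-grams of the candidate against those of the references, typically for n from 1 up to 4.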

The metrics below are usually used for Machine Translation or Automatic Summarization, but they can be applied to any other task that involves input and target text pairs. Along with evaluating models, these metrics can also be used for hyperparameter tuning of Machine Learning models. In this first post, we will discuss the BLEU metric, which is often used to evaluate Machine Translation.

Bilingual Evaluation Understudy (BLEU)

The BLEU score measures the quality of the predicted text, referred to as the Candidate, compared to a set…
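To make the idea concrete, here is a minimal from-scratch sketch of sentence-level BLEU: clipped n-gram precision for n = 1…4, combined as a geometric mean and multiplied by a brevity penalty. This is a simplified illustration, not the reference implementation; production toolkits such as NLTK or SacreBLEU also handle smoothing, tokenization, and corpus-level aggregation. It assumes a non-empty, pre-tokenized candidate.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU with uniform weights and no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the closest-length reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)

candidate = "the cat is on the mat".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(round(bleu(candidate, references), 3))  # exact match with a reference → 1.0
```

A candidate that matches one reference exactly scores 1.0, while a candidate sharing no higher-order n-grams with any reference scores 0.0 under this unsmoothed formulation.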
