BERTScore Explained in 5 minutes

Evaluating Text Generation with BERT: An Overview of BERTScore

Abonia Sojasingarayar
5 min read · Jan 15, 2024
BERTScore — Tutorial

BERTScore is a significant metric that has emerged as an alternative to traditional evaluation metrics in Natural Language Processing (NLP). It is particularly useful for evaluating generated text such as summaries, measuring how similar a candidate text is to a reference text. This article will delve into the motivation behind BERTScore, explain its architecture step by step, and provide a sample Colab notebook implementation.

📌 Paper: https://arxiv.org/abs/1904.09675
📌 GitHub: https://github.com/Tiiiger/bert_score

Motivation

With the advancement of NLP and Large Language Models (LLMs), a new problem has arisen:

How reliable are our evaluation metrics?

Traditional evaluation metrics based on n-gram overlap have limitations. They often fail to match paraphrases, because semantically correct expressions can differ from the surface form of the reference text, leading to underestimated performance. Furthermore, n-gram metrics cannot capture long-range dependencies, and they fail to penalize semantically critical reordering.

  1. Inability to detect paraphrases:

For example, with the reference text “people like Western cuisine,” n-gram-based metrics would assign a higher score to “people like global flavors” than to “consumers prefer imported spices.” This leads to underestimated performance whenever a semantically correct phrase deviates from the surface of the reference. In BERTScore, the similarity between two sentences is computed as a sum of cosine similarities between their token embeddings, which gives it the capability to detect paraphrases.
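To make this concrete, here is a minimal sketch using the bert-score library’s functional API that scores both candidates from the example against the reference (the exact numbers depend on the underlying checkpoint, so treat the output as illustrative):

from bert_score import score

reference = ["people like Western cuisine"]
for candidate in ["people like global flavors", "consumers prefer imported spices"]:
    # score() returns per-pair precision, recall, and F1 tensors
    P, R, F1 = score([candidate], reference, lang="en")
    print(f"{candidate}: F1={F1.item():.4f}")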

  2. Failure to capture long-range dependencies and to penalize semantically critical reordering:

For instance, the BLEU score is barely affected when a sentence is changed from “A because B” to “B because A”, especially when A and B are long phrases, even though the causal meaning is reversed. BERTScore’s contextual embeddings, by contrast, are trained to capture ordering and distant dependencies in the text.
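A quick way to see this in practice (a sketch assuming NLTK’s sentence_bleu as the BLEU implementation; the example sentence is invented for illustration):

from nltk.translate.bleu_score import sentence_bleu

# "A because B" versus "B because A": the causal meaning is reversed,
# yet almost all n-grams still overlap with the reference.
reference = "the match was cancelled because heavy rain flooded the pitch".split()
swapped = "heavy rain flooded the pitch because the match was cancelled".split()

print(sentence_bleu([reference], reference))  # 1.0 by construction
print(sentence_bleu([reference], swapped))    # stays high despite the reversed meaning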

BERTScore Architecture

BERTScore addresses these issues by performing similarity calculations using contextualized token embeddings, which are shown to be effective for entailment detection. Here is a step-by-step explanation of the BERTScore architecture:

Figure: the BERTScore pipeline (source: https://arxiv.org/pdf/1904.09675.pdf)
  1. Contextual Embeddings: Reference and candidate sentences are represented using contextual embeddings, computed from the surrounding context by models such as BERT, RoBERTa, XLNet, and XLM.
  2. Cosine Similarity: The similarity between contextual embeddings of reference and candidate sentences is measured using cosine similarity.
  3. Token Matching for Precision and Recall: Each token in the candidate sentence is matched to the most similar token in the reference sentence, and vice versa, to compute precision and recall, which are then combined into an F1 score (see the from-scratch sketch after this list).
  4. Importance Weighting: Rare words’ importance is considered using Inverse Document Frequency (IDF), which can be incorporated into BERTScore equations, though it’s optional and domain-dependent.
  5. Baseline Rescaling: BERTScore values are linearly rescaled to improve human readability, ensuring they fall within a more intuitive range based on Common Crawl monolingual datasets.
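Putting steps 1–3 together, here is a minimal from-scratch sketch of the greedy matching. It assumes bert-base-uncased and, for simplicity, uses the final hidden layer, whereas the bert-score library selects a tuned intermediate layer and also implements the optional IDF weighting and baseline rescaling of steps 4–5:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    # Step 1: contextual embeddings for each token, with [CLS]/[SEP] stripped
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0, 1:-1]  # (num_tokens, dim)
    # Pre-normalize so the dot products below are cosine similarities (step 2)
    return torch.nn.functional.normalize(hidden, dim=-1)

ref = embed("people like Western cuisine")
cand = embed("consumers prefer imported spices")

sim = cand @ ref.T  # pairwise cosine similarities (candidate tokens x reference tokens)
# Step 3: greedy matching, where each token pairs with its most similar counterpart
precision = sim.max(dim=1).values.mean()  # candidate tokens matched against the reference
recall = sim.max(dim=0).values.mean()     # reference tokens matched against the candidate
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision.item():.4f} R={recall.item():.4f} F1={f1.item():.4f}")

Greedy matching, where each token simply takes its best match rather than solving an optimal assignment, keeps the computation cheap while still rewarding semantic overlap.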

⚠️ Limitation and Bias

There are several limitations and potential biases to keep in mind when using BERTScore to evaluate text generation models:

❎ It can be biased towards models that are similar to its own underlying model, since such metrics tend to favor their own outputs and outputs that resemble them. This limitation is inherent in all reference-free metrics and is not unique to BERTScore.

❎ The ability of reference-free metrics to evaluate other models is inherently limited by the quality of their pseudo-references. If a system outputs a translation or a summary of higher quality than the pseudo-reference, it will be incorrectly penalized simply for differing from the pseudo-reference, even though those differences are actually improvements. Scores for systems that are better than the metric’s own underlying model are therefore misleading.

❎ BERTScore doesn’t take into account the syntactic structure of a sentence, which can lead to incorrect evaluations when two sentences have different syntactic structures but convey the same meaning.

❎ It might perform poorly on tasks that require understanding context beyond individual words, such as idiomatic expressions or cultural references.

✅ ADVANTAGES

➡️ Leveraging BERT:

BERTScore uses the power of BERT, a state-of-the-art transformer-based model developed by Google, to capture the semantic meaning of words in context. This leads to a more accurate measure of text similarity than traditional methods that rely on surface-form n-gram overlap.

➡️ Versatility:

BERTScore can handle different types of input, from single sentence pairs to whole lists of texts. This flexibility makes it a practical tool for various NLP tasks, from text classification to information retrieval.

➡️ Comprehensive Evaluation Metrics:

BERTScore provides precision and recall scores alongside the combined F1, giving users a comprehensive picture of their models’ performance. This is particularly useful in tasks where false positives and false negatives matter equally.

➡️ Computational Efficiency:

Despite relying on a large model like BERT, BERTScore still produces results quickly enough for practical use. This efficiency is crucial when dealing with large datasets.

Implementation

To compute BERTScore in a Python environment, we can use the bert-score library, which builds on Hugging Face Transformers. Below is a simple example of how to use it to evaluate the similarity between two pieces of text:

!pip install bert-score  # also installs transformers as a dependency
from bert_score import BERTScorer
# Example texts
reference = "This is a reference text example."
candidate = "This is a candidate text example."
# BERTScore calculation
scorer = BERTScorer(model_type='bert-base-uncased')
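# Optional flags (documented in the bert-score library) implement steps 4-5
# of the architecture above, e.g.:
# scorer = BERTScorer(lang="en", idf=True, idf_sents=[reference], rescale_with_baseline=True)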
P, R, F1 = scorer.score([candidate], [reference])
print(f"BERTScore Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")Conclusion

BERTScore is a powerful evaluation metric that enhances text similarity measurement. By combining precision and recall values, it makes text similarity measurement more accurate and balanced. This offers a significant advantage for many NLP tasks.

Link to complete Colab Notebook

Link — https://gist.github.com/Abonia1/26c13b7034e85ec1dbe29c2fa0d07242

BERTScore can be applied in various domains, including text summarization, translation quality assessment, text generation, and document comparison. The future potential of BERTScore is quite exciting, anticipating improvements like broader language coverage, adaptation for multilingual texts, and enhancements for better performance on diverse text types.

Connect with me on Linkedin

Find me on Github

Visit my technical channel on Youtube

Support: Buy me a Coffee/Chai


Written by Abonia Sojasingarayar

Principal Research Scientist | Machine Learning & Ops Engineer | Data Scientist | NLP Engineer | Computer Vision Engineer | AI Analyst
