Watch Out When Leveraging BERTScore for the Evaluation of Language Models

Lukas Heller
Feb 12, 2024


As a natural language processing enthusiast, I recently embarked on a project where I delved into the realm of language model evaluation. In this journey, I stumbled upon an invaluable tool called BERTScore, which significantly enhanced the assessment of my language model (LM). However, along the way, I ran into a subtle pitfall: the need to set rescale_with_baseline to True for accurate results. In this article, I’ll share my insights on the significance of BERTScore for LM evaluation, the implications of its parameters, and why setting rescale_with_baseline to True is crucial.

1. Introduction to BERTScore

BERTScore is a metric designed to evaluate the quality of text generated by language models. Leveraging contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers), it computes the similarity between a reference sentence and a generated sentence. Unlike traditional evaluation methods such as BLEU or ROUGE, BERTScore considers contextual information, leading to more accurate and human-like evaluations. That is why it has become a popular metric for evaluating the performance of language models.

2. Understanding BERTScore for Language Models

BERTScore operates by computing the similarity between token-level representations of the reference and candidate sentences. By comparing the embeddings in context, it captures nuances that other metrics might miss. This makes it particularly well-suited for evaluating the performance of language models, which strive to generate coherent and contextually relevant text.
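To make this more concrete, here is a minimal sketch of the greedy-matching idea behind BERTScore. It deliberately leaves out the IDF weighting, layer selection, and special-token handling that the actual library performs, and it uses roberta-large (the library's default model for English) purely for illustration:

import torch
from transformers import AutoTokenizer, AutoModel

# Simplified BERTScore: greedy cosine matching of contextual token embeddings
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large")

def token_embeddings(sentence):
    # Encode the sentence and return L2-normalized token embeddings
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

def simple_bertscore(candidate, reference):
    c, r = token_embeddings(candidate), token_embeddings(reference)
    sim = c @ r.T                                     # pairwise cosine similarities
    precision = sim.max(dim=1).values.mean().item()   # best reference match per candidate token
    recall = sim.max(dim=0).values.mean().item()      # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(simple_bertscore(
    "A computer is configured to communicate with a network for data transmission",
    "Data transfer is facilitated by a network-connectable laptop.",
))

Even this toy version hints at something we will return to later: because every token finds some reasonably similar token in the other sentence, the raw scores tend to be high across the board.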

3. The problem with BERTScore

In my project I used Llama-7b and prompt engineering to extract TRIZ concepts from patents. Since I had annotated data, I decided to evaluate the model's performance using BLEU, ROUGE and BERTScore. These were the results I received:

Results of LM-evaluation of my project.

As you can see, I got very different results for BLEU, ROUGE and BERTScore. According to BLEU and ROUGE, Llama-7b performed rather poorly. However, looking at the BERTScore, it seemed to be doing a more than okay job. At first, I explained the results to myself with the different nature of the metrics: since BERTScore is far less literal and much more focused on true semantic similarity than traditional metrics like BLEU and ROUGE, it can produce different results. I even wrote an example calculation to explain this phenomenon. For that, consider the following sentence pairs, which could be taken from a patent:

  1. Literal overlap:
  • Candidate: “The method comprises receiving an input signal from a sensor.”
  • Reference: “The method is about a sensor that recieves an input signal.”

Both sentences convey similar information with overlapping n-grams, which should result in a high BLEU or ROUGE score due to the shared tokens.

  2. Semantic overlap:
  • Candidate: “A computer is configured to communicate with a network for data transmission.”
  • Reference: “Data transfer is facilitated by a network-connectable laptop.”

While the sentences express the same idea, there are no overlapping n-grams. Instead, they use different phrasings and synonyms, which would likely yield a lower BLEU or ROUGE score despite the semantic similarity. As BERTScore focuses on semantic similarity, it should still give a good result for this pair.

  3. No overlap:
  • Candidate: “This seat improves security in cars by providing a raised seat.”
  • Reference: “For the wall either concrete or wood could be used.”

The sentences share no commonalities, thus neither BLEU, ROUGE, nor BERTScore should indicate a high level of similarity between them.

I wrote the following code to calculate the results (github: 01_evaluating_rescale_bertscore.ipynb):

import pandas as pd
from datasets import load_metric
from statistics import mean
from bert_score import BERTScorer

def calc_rouges(rouge_scores, rouge_type):
    """
    Calculate the average ROUGE score for a given ROUGE type from ROUGE scores.

    Parameters:
    - rouge_scores (dict): A dictionary containing ROUGE scores for different ROUGE types.
    - rouge_type (str): The specific ROUGE type for which the average is calculated.

    Returns:
    - float: The average ROUGE score for the specified ROUGE type.
    """
    # Extract the f-measures of the high, mid, and low aggregates
    rouge_h = rouge_scores[rouge_type].high.fmeasure
    rouge_m = rouge_scores[rouge_type].mid.fmeasure
    rouge_l = rouge_scores[rouge_type].low.fmeasure

    # Average the three aggregates
    rouge_score = mean([rouge_h, rouge_m, rouge_l])

    return rouge_score

# Load the BLEU and ROUGE metrics
bleu_metric = load_metric('bleu')
rouge_metric = load_metric('rouge')
# Watch out: with rescale_with_baseline=False (the default) the results change dramatically
bertscore_metric = BERTScorer(lang="en", rescale_with_baseline=True)

# Initialize lists to store evaluation results
patent_nos, bleu, b_prec, b_rec, b_f1 = [], [], [], [], []
# For this example we only need rouge1
rouges = {
    'rouge1': [],
    # 'rouge2': [],
    # 'rougeL': [],
}
# The three example sentence pairs from above
sentence_pairs = [
    {'candidate': 'The method comprises receiving an input signal from a sensor.',
     'ground1': 'The method is about a sensor that recieves an input signal'},
    {'candidate': 'A computer is configured to communicate with a network for data transmission',
     'ground1': 'Data transfer is facilitated by a network-connectable laptop.'},
    {'candidate': 'This seat improves security in cars by providing a raised seat.',
     'ground1': 'For the wall either concrete or wood could be used'},
]

# Iterate over the sentence pairs and collect the scores
for sentence_pair in sentence_pairs:

    candidate = sentence_pair['candidate']
    ground1 = sentence_pair['ground1']

    # Compute the BLEU score (1-gram precision)
    bleu_scores = bleu_metric.compute(predictions=[candidate.split(' ')], references=[[ground1.split(' ')]])
    bleu.append(bleu_scores['precisions'][0])

    # Compute the ROUGE score
    rouge_scores = rouge_metric.compute(predictions=[candidate], references=[ground1])
    for rouge_type, results in rouges.items():
        rouges[rouge_type].append(calc_rouges(rouge_scores, rouge_type))

    # Compute BERTScore
    P, R, F1 = bertscore_metric.score([candidate], [ground1])
    b_prec.append(P.item())
    b_rec.append(R.item())
    b_f1.append(F1.item())

# Create a DataFrame from the collected data
df_dict = {
    "b_prec": b_prec,
    "b_rec": b_rec,
    "b_f1": b_f1,
    "bleu": bleu,
}
df_dict.update(rouges)

df = pd.DataFrame(df_dict)
df.head()

The results were as follows:

Results without rescaling.
  • Sentence Pair 1: BERTScore indicates a strong similarity between the sentences, and both BLEU and ROUGE also yield scores of 60% or higher.
  • Sentence Pair 2: Once more, BERTScore reveals a significant similarity, whereas BLEU and ROUGE suggest minimal resemblance.
  • Sentence Pair 3: As anticipated, neither BLEU nor ROUGE registers any resemblance, scoring 0%. But what is up with BERTScore? Its precision, recall, and F1 are all well above 80%, even though the sentences have nothing in common.

This mirrors the findings of my project. Yet, this raises questions about the reliability of BERTScore as a metric. To delve into this matter, I conducted a closer examination of BERTScore.

4. The Pitfall of rescale_with_baseline = False

In my experimentation with BERTScore, I noticed this peculiar behavior when rescale_with_baseline was set to False (this is the default setting for BERTScore). This parameter controls whether BERTScore rescales the final score with a baseline. Without this rescaling, the scores produced are skewed and inconsistent, leading to misleading evaluations of my LM’s performance.

The crux of the matter lies in setting rescale_with_baseline to True. This activates the following lines of code in the scorer.py script of the BERTScore library.

if self.rescale_with_baseline:
    all_preds = (all_preds - self.baseline_vals) / (1 - self.baseline_vals)

If you look up the value of the variable baseline_vals, you will find a value of 0.8315. Now let's try that out for our example (I am only going to do the calculation for the precision of the third sentence):

Calculation for rescaling BERTScore-Precision for sentence 3.
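In code, the calculation from the figure is just the rescaling formula from scorer.py applied by hand (the raw precision below is the rounded value from the table above; the unrounded score gives the 15.5 % reported next):

baseline = 0.8315        # baseline value used for the default English model
raw_precision = 0.858    # raw BERTScore precision for sentence pair 3 (rounded)

# Apply the same rescaling as scorer.py
rescaled_precision = (raw_precision - baseline) / (1 - baseline)
print(round(rescaled_precision, 3))  # 0.157, close to the 15.5 % from the exact score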

The value for the precision drops from 85.8 % to a much more plausible 15.5 %. For the other values you get the following results:

Results after rescaling.

After rescaling, the BERTScore values are much more plausible. They align closely with the BLEU and ROUGE scores, notably revealing the anticipated differences in similarity among the sentence pairs.

When enabled, BERTScore rescales the final score against a baseline similarity. Because the contextual embeddings of even completely unrelated sentences tend to have fairly high cosine similarity, the raw scores are squeezed into a narrow band near the top of the range; subtracting the baseline and renormalizing spreads them back over a more readable range. The result is scores that are far easier to interpret and evaluations that are more reliable.
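If you want to see the effect directly, a small side-by-side comparison like the following sketch (using the unrelated pair from sentence pair 3) makes the difference obvious; the exact numbers will vary slightly with the model and library version:

from bert_score import BERTScorer

candidate = "This seat improves security in cars by providing a raised seat."
reference = "For the wall either concrete or wood could be used"

# Score the same unrelated pair once without and once with baseline rescaling
for label, rescale in [("raw", False), ("rescaled", True)]:
    scorer = BERTScorer(lang="en", rescale_with_baseline=rescale)
    P, R, F1 = scorer.score([candidate], [reference])
    print(f"{label:9s} P={P.item():.3f} R={R.item():.3f} F1={F1.item():.3f}")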

5. Conclusion

In the realm of language model evaluation, leveraging tools like BERTScore can significantly enhance the accuracy and reliability of assessments. However, it’s crucial to pay attention to the nuances of such metrics and understand the implications of their parameters. Through my journey, I’ve come to appreciate the importance of setting rescale_with_baseline to True for accurate evaluations, ensuring that the assessments of language model performance are both meaningful and trustworthy.

As the field of natural language processing continues to evolve, it’s imperative to stay informed about the latest methodologies and tools available for evaluation. BERTScore represents a significant step forward in this regard, offering a more nuanced and contextually aware approach to language model assessment. By understanding its intricacies and optimizing its parameters, we can unlock deeper insights into the capabilities and limitations of language models, ultimately driving innovation and progress in the field.
