Understanding the SEMBLEU (and BLEU) Metric
SEMBLEU: A Robust Metric for AMR Parsing Evaluation
I have recently begun exploring Abstract Meaning Representation (AMR) for semantic inference. I am familiar with the popular SMATCH metric, but today I will learn about SEMBLEU, a more robust evaluation metric for AMR parsers.
AMR
The whole point of an AMR graph is to encode the semantics of a sentence as a directed graph. Nodes in the graph represent semantic concepts in the sentence, and the edges identify the semantic relations between them. A major goal in semantic inference tasks is to extract these graphs from regular sentences. If we can automatically extract a sentence’s semantics and the corresponding relationships, it will have a huge impact on NLP, as modern language models often struggle with semantics (particularly long-range semantic relations).
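To make this concrete, here is a tiny hand-written illustration of the kind of triples an AMR graph contains, for the sentence "The boy wants to go." (my own sketch; the PropBank sense IDs such as want-01 and go-02 are assumptions rather than gold annotations):

```python
# A hand-written AMR for "The boy wants to go." (illustrative only; the
# sense IDs are assumptions, not taken from a gold corpus).
# Nodes are concepts, edges are relations such as ARG0 / ARG1.
amr_triples = [
    ("w", "instance", "want-01"),  # node w is the concept want-01
    ("b", "instance", "boy"),      # node b is the concept boy
    ("g", "instance", "go-02"),    # node g is the concept go-02
    ("w", "ARG0", "b"),            # the boy is the one who wants
    ("w", "ARG1", "g"),            # the going is what is wanted
    ("g", "ARG0", "b"),            # the boy is also the goer (re-entrancy)
]
```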
To work towards automatically extracting AMR graphs from input sentences, we need proper evaluation metrics that measure the similarity between an inferred graph and a gold-labeled graph. For this, people have historically used the SMATCH metric for AMR evaluation.
SMATCH
The classic way of comparing two AMR graphs is the SMATCH metric, which searches for the best possible node mapping between the two graphs and computes the corresponding F1 score over their triples. There are a few problems with this.
1) Finding the exact best mapping is an NP-complete problem, so SMATCH falls back on an approximate (hill-climbing) search. This approximation is error-prone, which weakens the quality of the metric. The search can be made less error-prone, for example by running it multiple times with different random initializations, but this is very expensive.
2) Another problem with SMATCH is that, when comparing edge labels, it doesn’t consider node content. Thus, if the nodes are incorrect but the edges are correct, SMATCH still rewards the model, which doesn’t really make sense.
Consider the two AMR graphs in Image 1, which receive a SMATCH score of 0.25. One might ask: why would we give ANY points to a pair of graphs with no semantic similarity, just because there is some mapping under which a few of the edge labels align? This is clearly a flawed procedure that SEMBLEU tries to address.
SEMBLEU
Let’s introduce SEMBLEU, which is an adaptation of the popular BLEU metric often used to evaluate the output of machine translation models.
Why use SEMBLEU
SEMBLEU views connected nodes and edges as n-grams, forcing the evaluation process to match node content and edge labels as a single unit.
- This ensures that models are not falsely attributed F1 points for producing the correct edge labels on incorrect node pairs.
SEMBLEU does not suffer from search errors and is significantly faster than SMATCH.
- SEMBLEU works by computing the BLEU score on AMR graphs where each n-gram is composed of nodes and edges
By increasing the size of the n-grams we consider, we can evaluate longer and longer dependencies within the graph.
- For example, if we consider a trigram in a SEMBLEU calculation, then we are checking whether the model correctly identified a path of 3 nodes (and the 2 edges connecting them). The maximum n-gram size considered during evaluation is a hyperparameter of SEMBLEU (see the sketch below).
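As a hand-made sketch of the idea (not the paper's data format), for a toy graph in which want-01 has an :ARG0 edge to boy and an :ARG1 edge to go-02, and go-02 has an :ARG0 edge back to boy, the n-gram units look like this:

```python
# Illustrative SEMBLEU-style n-gram units for a toy graph
# (want-01 --:ARG0--> boy, want-01 --:ARG1--> go-02, go-02 --:ARG0--> boy).
# These tuples are my own sketch of the idea, not the paper's exact format.
unigram = ("want-01",)                                    # 1 node
bigram  = ("want-01", ":ARG0", "boy")                     # 2 nodes + 1 edge
trigram = ("want-01", ":ARG1", "go-02", ":ARG0", "boy")   # 3 nodes + 2 edges
```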
Computing SEMBLEU
Background Knowledge on BLEU
The original BLEU metric measures how well a machine translation system is performing. Here’s what can happen if we don’t use BLEU and instead use a metric like precision:
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
Predicted Translation: The The The The
In the example above, we could evaluate the predicted translation by checking whether each word in it appears in the reference sentences. However, this would give us a precision of 4/4, which is obviously not a good representation of how high-quality the translation is.
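A quick sanity check of that arithmetic (a throwaway snippet, not part of any BLEU library):

```python
# Naive unigram "precision": the fraction of predicted tokens that appear
# anywhere in a reference. Every "the" matches, so we get a perfect 4/4.
references = ["the cat is on the mat", "there is a cat on the mat"]
prediction = "the the the the"

ref_vocab = set(" ".join(references).split())
pred_tokens = prediction.lower().split()
precision = sum(tok in ref_vocab for tok in pred_tokens) / len(pred_tokens)
print(precision)  # 1.0 (i.e. 4/4), even though the "translation" is useless
```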
What BLEU does is smarter! Let's look at the equation for BLEU:

BLEU = BP * exp( Σ_k w_k * log p(k) ),  where  BP = min( 1, exp( 1 − len(reference) / len(prediction) ) )

Here k indexes the n-gram sizes in consideration, w_k is how much we weigh each n-gram size, and p(k) is the clipped n-gram precision (worked through below). The “Brevity Penalty”, BP, is used to penalize predictions that are shorter than the reference, i.e. to encourage longer outputs. len(reference) and len(prediction) refer to the lengths of the reference and predicted sentences respectively.
Let's walk through an example and learn what p(k) means along the way. Suppose we want to compare this predicted translation to our two reference sentences.
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
Predicted Translation: The cat the cat on the mat
Here are the bigram counts for the Predicted Translation:

Bigram     Count   Count Clip
the cat      2         1
cat the      1         0
cat on       1         1
on the       1         1
the mat      1         1

In the table above, Count is how many times each bigram appears in the Predicted Translation. Count Clip (which is the core of BLEU) is that count clipped at the maximum number of times the bigram appears in either reference sentence.
Given this information, we can compute p(2): p(2) = (1 + 0 + 1 + 1 + 1) / (2 + 1 + 1 + 1 + 1) = 4/6
Now again for n=1. The prediction contains “the” three times, “cat” twice, and “on” and “mat” once each; the clipped counts are 2, 1, 1, and 1 (Reference 1 contains “the” twice). We see that p(1) = (2 + 1 + 1 + 1) / 7 = 5/7.
To compute the rest of the BLEU score, we simply plug the sentence lengths into the BP calculation and choose a weight for each n-gram size depending on how much we care about matching unigrams vs. bigrams. We can set each weight to 0.5 if we want to weigh them equally.
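Putting the worked example together in code, here is a minimal from-scratch sketch of the arithmetic above (my own throwaway implementation, not an official BLEU toolkit, which would add smoothing and other details; I assume equal weights of 0.5 and use the reference length closest to the prediction for the brevity penalty):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(prediction, references, n):
    """Clipped n-gram precision: clip each count at the max reference count."""
    pred_counts = Counter(ngrams(prediction, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    clipped = {g: min(c, max_ref_counts[g]) for g, c in pred_counts.items()}
    return sum(clipped.values()) / sum(pred_counts.values())

refs = [r.lower().split() for r in
        ["The cat is on the mat", "There is a cat on the mat"]]
pred = "The cat the cat on the mat".lower().split()

p1 = modified_precision(pred, refs, 1)   # 5/7
p2 = modified_precision(pred, refs, 2)   # 4/6

# Brevity penalty, using the reference length closest to the prediction length.
ref_len = min((abs(len(r) - len(pred)), len(r)) for r in refs)[1]
bp = 1.0 if len(pred) > ref_len else math.exp(1 - ref_len / len(pred))

# Equal weights of 0.5 for unigrams and bigrams.
bleu = bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))
print(p1, p2, bleu)  # 0.714..., 0.666..., ~0.69 (BP is 1 here since lengths match)
```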
Adapting BLEU for Semantic Graphs — SEMBLEU
We can use this same idea to compare two semantic graphs, as introduced in the SEMBLEU paper. The high-level adjustments that need to be made are as follows:
1) Instead of building n-grams from words in a sentence, we build them from nodes and edges in the graph. Specifically, a unigram consists of a single node, a bigram consists of two nodes and the edge between them, a trigram has three nodes and two edges, and so on.
In the graph below, we would find the bigram (ask-01, :ARG1, leave-11).
2) Instead of considering sentence length (in the BP term), we use the size of the graph.
3) We compute the n-grams by starting from all possible root nodes.
Now we can apply the same logic as BLEU to semantic graphs! We would get all of the n-grams from the predicted graph, check to see if they appear in the reference graph, and compute the BLEU score!
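To make this concrete, here is a rough end-to-end sketch of the idea (my own simplification, not the reference implementation released with the SEMBLEU paper). The two toy graphs, their concepts, and the decision to start a path from every node rather than traversing from the root are all assumptions made for illustration, and the brevity penalty is omitted:

```python
import math
from collections import Counter

# Two toy "AMR-like" graphs: {node_concept: [(edge_label, child_concept), ...]}.
# Both graphs are hypothetical examples, not gold annotations.
predicted = {
    "want-01": [(":ARG0", "boy"), (":ARG1", "go-02")],
    "go-02":   [(":ARG0", "boy")],
    "boy":     [],
}
reference = {
    "want-01":  [(":ARG0", "boy"), (":ARG1", "leave-11")],
    "leave-11": [(":ARG0", "boy")],
    "boy":      [],
}

def extract_ngrams(graph, n):
    """Collect paths containing n node concepts and n-1 edge labels.
    Simplification: we start a path from every node; the paper traverses
    the graph from its root, but the resulting units have the same shape."""
    grams = Counter()
    def walk(path, node, depth):
        path = path + (node,)
        if depth == n:
            grams[path] += 1
            return
        for edge, child in graph.get(node, []):
            walk(path + (edge,), child, depth + 1)
    for node in graph:
        walk((), node, 1)
    return grams

def clipped_precision(pred_graph, ref_graph, n):
    pred, ref = extract_ngrams(pred_graph, n), extract_ngrams(ref_graph, n)
    clipped = sum(min(c, ref[g]) for g, c in pred.items())
    return clipped / max(sum(pred.values()), 1)

p1 = clipped_precision(predicted, reference, 1)  # 2/3: want-01 and boy match
p2 = clipped_precision(predicted, reference, 2)  # 1/3: only (want-01, :ARG0, boy)
score = math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))  # ~0.47, before any BP
print(p1, p2, score)
```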
References
SEMBLEU Paper: https://arxiv.org/abs/1905.10726
BLEU Paper: https://www.aclweb.org/anthology/P02-1040/