An Evaluation of a Question Answering Model

Joyce Y.
Apr 26, 2019



Keywords: Question Answering, BERT, SQuAD, Machine Comprehension.

I Background

This article focuses purely on the performance evaluation of a knowledge QA model through concrete examples; we assume readers have heard of, or are broadly familiar with, terms like Question Answering, BERT, and the SQuAD dataset.

1.1 What is a QA model?

To put it simply, a knowledge QA model aims to answer short questions given paragraph content (the model here consumes short paragraphs only; if a given paragraph is too long, another model should first be used to narrow down the answer context, see reference [3]). The answer can usually be found in the context, and the model extracts the original text span that answers the question. The answer span is also fairly short. Moreover, unlike SQuAD 2.0, the model is not required to recognize questions whose answers are not in the given context.

If you are not familiar with QA models, links to learn more about BERT [1] and the SQuAD dataset [2] are provided in the references at the end.

1.2 Our model

The model in this article uses the pre-trained BERT model (uncased_L-24_H-1024_A-16) fine-tuned on SQuAD v1.1. It achieves an F1 score of 89.1 and an exact match of 82.1 on the dev set. The model returns the top-k answers ranked by confidence score, i.e., the probability of each answer.
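
If you want to produce this kind of top-k output yourself, the minimal Python sketch below is one way to do it. It uses the Hugging Face transformers library and a public BERT-large checkpoint fine-tuned on SQuAD 1.1; that checkpoint is my own assumption and not the exact model evaluated in this article, so the scores will differ.

# Minimal sketch (assumption: Hugging Face transformers is installed; the public
# checkpoint below is NOT the exact model evaluated in this article).
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="what is highest mountain in Canada?",
    context="Mount Logan is the highest mountain in Canada and the "
            "second-highest peak in North America, after Denali.",
    top_k=3,  # return the 3 best spans, like the "n_best_predictions" lists below
)
for answer in result:
    print(answer["answer"], answer["score"])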

II Evaluation

All the examples below are either from the open domain or made up by myself; none of them appear in the training dataset. The examples are simply meant to explore what our QA model can and cannot do.

2.1 Example One

Let’s start with the first example. It is a paragraph from Wikipedia.

“Mount Logan is the highest mountain in Canada and the second-highest peak in North America, after Denali. The mountain was named after Sir William Edmond Logan, a Canadian geologist and founder of the Geological Survey of Canada (GSC). Mount Logan is located within Kluane National Park Reserve[4] in southwestern Yukon, less than 40 kilometres (25 mi) north of the Yukon–Alaska border. Mount Logan is the source of the Hubbard and Logan glaciers. Logan is believed to have the largest base circumference of any non-volcanic mountain on Earth (a large number of shield volcanoes are much larger in size and mass), including a massif with eleven peaks over 5,000 metres (16,400 ft).”

Then we ask two questions: “what is highest mountain in Canada?” and “where is Mount Logan?” We let the model give the top 3 answers with the highest confidence. The bold marks in the paragraph above are simply there to help readers spot the true answers.

{
"best_prediction": "Mount Logan",
"id": "01",
"n_best_predictions": [
{
"end_logit": 5.746790409088135,
"probability": 0.9999999250405683,
"start_logit": 5.814833164215088,
"text": "Mount Logan"
},
{
"end_logit": -2.364010810852051,
"probability": 4.6923180635147034e-8,
"start_logit": -2.9491195678710938,
"text": "eleven peaks over 5,000 metres (16,400 ft)"
},
{
"end_logit": -2.8790242671966553,
"probability": 2.8036251062885282e-8,
"start_logit": -2.9491195678710938,
"text": "eleven peaks over 5,000 metres"
}
],
"question": "what is highest mountain in Canada?"
}

For the first question, the model answers clearly and is 99.9% sure about the answer.
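
A side note on how the “probability” field relates to the two logits: judging from the numbers, and assuming the prediction code follows the n-best procedure of the reference run_squad.py script in [1], each candidate span is scored by the sum of its start and end logits, and a softmax over the n-best scores gives the probabilities. The small sketch below reproduces the figures in the output above.

import math

# (start_logit, end_logit) of the three candidates from the output above
candidates = [
    (5.814833164215088, 5.746790409088135),     # "Mount Logan"
    (-2.9491195678710938, -2.364010810852051),  # "eleven peaks over 5,000 metres (16,400 ft)"
    (-2.9491195678710938, -2.8790242671966553), # "eleven peaks over 5,000 metres"
]

scores = [s + e for s, e in candidates]     # span score = start_logit + end_logit
m = max(scores)
exps = [math.exp(x - m) for x in scores]    # shift by the max for numerical stability
probs = [x / sum(exps) for x in exps]       # softmax over the n-best spans
print(probs)  # ~[0.99999993, 4.69e-08, 2.80e-08], matching the "probability" fields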

{
"best_prediction": "Kluane National Park Reserve[4] in southwestern Yukon",
"id": "02",
"n_best_predictions": [
{
"end_logit": 4.906057834625244,
"probability": 0.5532496572648057,
"start_logit": 4.136287689208984,
"text": "Kluane National Park Reserve[4] in southwestern Yukon"
},
{
"end_logit": 4.098901271820068,
"probability": 0.24681838842035841,
"start_logit": 4.136287689208984,
"text": "Kluane National Park Reserve"
},
{
"end_logit": 4.906057834625244,
"probability": 0.19993195431483585,
"start_logit": 3.118455410003662,
"text": "southwestern Yukon"
}
],
"question": "where is Mount Logan?"
}

For the second question, the model also gives the correct answer perfectly. But this time it is not sure how much information it should include, that is, whether the answer should contain the extra detail “southwestern Yukon” or not. Either way, the answer is correct!

2.2 Example Two

Now let’s make the question a little bit more difficult. This example is from chapter 5.3.4 of Danqi Chen’s PhD thesis [4].

“Many aspects of speech recognition were taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997.[51] LSTM RNNs avoid the vanishing gradient problem and can learn “Very Deep Learning” tasks[2] that require memories of events that happened thousands of discrete time steps before, which is important for speech. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks.[52] Later it was combined with connectionist temporal classification (CTC)[53] in stacks of LSTM RNNs.[54] In 2015, Google’s speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.”

And our question is “Who invented LSTM?”. This question is a little harder because 1) the keyword “LSTM” appears several times in the context; 2) the other keyword, “invented”, does not exactly match any text; and 3) the answer sits somewhat far from the keyword “LSTM” (unlike the first example, the answer does not follow it immediately; the phrase “a recurrent neural network” sits in between). Still, let’s look at the top 3 answers.

{
"best_prediction": "Hochreiter and Schmidhuber",
"id": "03",
"n_best_predictions": [
{
"end_logit": 5.893029689788818,
"probability": 0.9980743349387845,
"start_logit": 6.029671669006348,
"text": "Hochreiter and Schmidhuber"
},
{
"end_logit": -0.44507071375846863,
"probability": 0.0017642529692320442,
"start_logit": 6.029671669006348,
"text": "Hochreiter and Schmidhuber in 1997."
},
{
"end_logit": 5.893029689788818,
"probability": 0.00016141209198352368,
"start_logit": -2.699950695037842,
"text": "Schmidhuber"
}
],
"question": "Who invented LSTM?"
}

Score again! The model nails it, giving the two co-inventors as the correct answer with 99% confidence.

2.3 Example Three

But does that mean our model is all-around? Not at all. The questions above are still entry-level. Let’s go further and add some simple “reasoning”. That means the model cannot just “copy” the original text but needs to further understand the context (a.k.a. Machine Comprehension). The context below is short, only three sentences, but it contains two reasons that lead to the final result.

“Bob’s daughter got sick. His wife Mary is out of town therefore cannot take care of their daughter. So he leaves early today.”

Our question asks the model to explain: “why is Bob leaving early today?”.

{
"best_prediction": "His wife Mary is out of town therefore cannot take care of their daughter",
"id": "04",
"n_best_predictions": [
{
"end_logit": 3.2016761302948,
"probability": 0.5883743352379976,
"start_logit": 3.460529327392578,
"text": "His wife Mary is out of town therefore cannot take care of their daughter"
},
{
"end_logit": 3.2016761302948,
"probability": 0.22127723518746942,
"start_logit": 2.4825823307037354,
"text": "Bob's daughter got sick. His wife Mary is out of town therefore cannot take care of their daughter"
},
{
"end_logit": 2.073168992996216,
"probability": 0.19034842957453293,
"start_logit": 3.460529327392578,
"text": "His wife Mary is out of town"
}
],
"question": "why is Bob leaving early today?"
}

We can see the best answer is not the complete one. Among the two reasons, the model first picks the text closest to the question context. But the second-best answer hits the mark. Not too bad!

2.4 Example Four

Now let’s make the “reasoning” more complex. In this example, we fix the question as “which team is the first place?” but change the given context; the context ID tells us which context was used. A small sketch for running this comparison follows the three contexts below.

Context for 11: “Altogether there are 10 teams joined. team C beat all other teams. team A and team B also won the second and third.”

Context for 12: “team A beat team B in this game, but team C beat team A.”

Context for 13: “team A beat team B in this game, but team A losed out to team C.”
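
Before looking at the outputs, here is how this comparison could be run, reusing the hypothetical qa pipeline object from the sketch in Section 1.2 (again, an illustration and not the exact code behind these results):

# Reuses the hypothetical `qa` pipeline from the sketch in Section 1.2.
contexts = {
    "11": "Altogether there are 10 teams joined. team C beat all other teams. "
          "team A and team B also won the second and third.",
    "12": "team A beat team B in this game, but team C beat team A.",
    "13": "team A beat team B in this game, but team A losed out to team C.",
}

question = "which team is the first place?"
for cid, context in contexts.items():
    # Same fixed question against each context; keep the 3 best spans.
    top3 = qa(question=question, context=context, top_k=3)
    print(cid, [(a["answer"], round(a["score"], 3)) for a in top3])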

{
"best_prediction": "team C",
"id": "11",
"n_best_predictions": [
{
"end_logit": 1.0532582998275757,
"probability": 0.6143352942959913,
"start_logit": 0.4422859251499176,
"text": "team C"
},
{
"end_logit": 1.0532582998275757,
"probability": 0.19654012027586923,
"start_logit": -0.6973883509635925,
"text": "C"
},
{
"end_logit": -0.12487658113241196,
"probability": 0.18912458542813956,
"start_logit": 0.4422859251499176,
"text": "team C beated all other teams. team A and team B"
}
],
"question": "which team is the first place?"
}

Given context 11, the answer is correct: the model doesn’t get confused among the three teams, and we can see it also pays more attention to the sentence “team C beat all other teams”.

{
"best_prediction": "team A beated team B in this game, but team C",
"id": "12",
"n_best_predictions": [
{
"end_logit": 1.040236473083496,
"probability": 0.3627309465126682,
"start_logit": 0.3684903085231781,
"text": "team A beated team B in this game, but team C"
},
{
"end_logit": 0.932282567024231,
"probability": 0.3256123160780018,
"start_logit": 0.3684903085231781,
"text": "team A"
},
{
"end_logit": 1.040236473083496,
"probability": 0.31165673740933,
"start_logit": 0.21673132479190826,
"text": "team C"
}
],
"question": "which team is the first place?"
}

But with context 12, the model is confused between “team A” and “team C”, and both get almost equal confidence. The logic of A > B and C > A, therefore C beats both A and B, is not captured at all.

{
"best_prediction": "team A",
"id": "13",
"n_best_predictions": [
{
"end_logit": 0.7608197331428528,
"probability": 0.5055971009292113,
"start_logit": 0.06729216873645782,
"text": "team A"
},
{
"end_logit": 0.09393426775932312,
"probability": 0.25952541690952574,
"start_logit": 0.06729216873645782,
"text": "team A beated team B"
},
{
"end_logit": 0.7608197331428528,
"probability": 0.23487748216126303,
"start_logit": -0.6993839144706726,
"text": "A"
}
],
"question": "which team is the first place?"
}

For context 13, the model gets the logic completely wrong: A > B and A < C, yet it gives higher confidence to “team A”. We can therefore see that the model does not understand the contradictory meanings of “beat” and “lose” and how they relate to “first place”. We also don’t know how the model arrived at the correct answer for the first context: by understanding that “beat all other” means “the first place”, or by excluding team A and team B because “also won the second and third” means “not the first place”, even though the former sentence receives more attention.

2.5 Example Five

Can our model understand “numbers” or even do math? Let’s try some simple examples.

“Alice won 20 scores in the match and Bob won 10 scores”

The two questions are “who get less score?” and “who get more score?”. From the examples above, we have no doubt that the model can extract the two people here, Alice and Bob. But does it understand that 10 is less than 20, or that 20 is more than 10? The answer is disappointing.

{
"best_prediction": "Bob",
"id": "001",
"n_best_predictions": [
{
"end_logit": 2.3199944496154785,
"probability": 0.7744947758773805,
"start_logit": 2.2076666355133057,
"text": "Bob"
},
{
"end_logit": 2.3199944496154785,
"probability": 0.18641455943280172,
"start_logit": 0.7834287285804749,
"text": "Alice won 20 scores in the match and Bob"
},
{
"end_logit": 0.757905125617981,
"probability": 0.03909066468981782,
"start_logit": 0.7834287285804749,
"text": "Alice"
}
],
"question": "who get less score?"
}
{
"best_prediction": "Bob",
"id": "002",
"n_best_predictions": [
{
"end_logit": 2.082188367843628,
"probability": 0.6096566286536986,
"start_logit": 1.9376347064971924,
"text": "Bob"
},
{
"end_logit": 2.082188367843628,
"probability": 0.3006593492090547,
"start_logit": 1.2307167053222656,
"text": "Alice won 20 scores and Bob"
},
{
"end_logit": 0.8725031018257141,
"probability": 0.08968402213724672,
"start_logit": 1.2307167053222656,
"text": "Alice"
}
],
"question": "who get more score?"
}

Even though the first answer is correct, the result of the second question shows that the model doesn’t understand at all! It simply got lucky on the first question. This leads to a broader point: sometimes, even when the answer is correct, without any further explanation from the model we cannot judge whether it really understands the context. This example strongly suggests the model is just making a “random guess”. But is it really random? My opinion is that “Bob” getting much higher confidence than “Alice” is not an accident; the model simply pays more attention to the later match. If we try to explain how the model works here: it is able to detect the people around the keyword “score” and extract the two names “Alice” and “Bob”, but the second person’s attention score overshadows the first person’s, without any further reasoning about which of the two values, 10 and 20, is larger.

2.6 Example Six

Given the failure above, we don’t expect the model to do any math. Let’s give a simple example just to confirm that, asking “how many apples did Alice eat today?” against the context below.

“Alice ate 1 apple in the morning and 2 apples in the evening.”

{
"best_prediction": "2",
"id": "001",
"n_best_predictions": [
{
"end_logit": 3.7769575119018555,
"probability": 0.8634396322473342,
"start_logit": 3.6591522693634033,
"text": "2"
},
{
"end_logit": 1.768803596496582,
"probability": 0.11590490485853358,
"start_logit": 3.6591522693634033,
"text": "2 apples"
},
{
"end_logit": 3.7769575119018555,
"probability": 0.020655462894132334,
"start_logit": -0.07379188388586044,
"text": "1 apple in the morning and 2"
}
],
"question": "how many apples did Alice eat today?"
}

The model forgets about the one apple in the morning and only returns the apples from the evening. This example further supports my assumption above that matches later in the context can overshadow earlier ones. Of course, the model will never answer “3”, since that number does not appear in the original text.

III Conclusion

We understand that the model here is not the best in research, merely a fair one. But we hope the examples still show the abilities and expose the limits of BERT, and more broadly what current NLP models can achieve and what should be improved in the future. The conclusions are summarized as follows:

  1. The QA model is able to extract the original information around question keywords and understands simple synonyms (Examples One and Two).
  2. The QA model is poor at “reasoning” when the relationships are somewhat complex (Examples Three and Four).
  3. The QA model is not able to understand “numbers” and has no mathematical capability at all (Examples Five and Six).

It is remarkable that, with its current performance, we can already apply the QA model to some industry use cases, but we should not be too aggressive about others.

I hope this evaluation can help applied data scientists like me understand what to expect and what not to expect.

IV References

[1] BERT: https://github.com/google-research/bert

[2] SQuAD: https://rajpurkar.github.io/SQuAD-explorer/

[3] DrQA: https://github.com/facebookresearch/DrQA

[4] Danqi Chen, Neural Reading Comprehension and Beyond (PhD thesis): https://stacks.stanford.edu/file/druid:gd576xb1833/thesis-augmented.pdf
