Build a Smart Question Answering System with Fine-Tuned BERT

Deepesh Bhageria
Published in Saarthi.ai · Jun 4, 2020

At the end of 2018, researchers at Google AI Language open-sourced a new technique for Natural Language Processing (NLP) called BERT (Bidirectional Encoder Representations from Transformers).

BERT exhibited unprecedented performance for modelling language-based tasks.

In this blog post, we are going to understand how to apply a fine-tuned BERT model to the question answering task, i.e. given a question and a passage containing the answer, the task is to predict the answer text span in the passage.

This article is structured as follows:

  • How BERT works
  • How to use BERT for question answering
  • Code Walk-through

A Peek inside BERT (Bidirectional Encoder Representations from Transformers)

BERT uses Transformer encoder blocks. The Transformer encoder uses a multi-headed self-attention mechanism that learns contextual relations between words (or sub-words) in text.

BERT alleviates the unidirectionality constraint by using a “masked language model” (MLM) pre-training objective. The MLM objective randomly masks some of the tokens in the input, and the goal is to predict the masked words based on their surrounding context (both to the left and to the right of the word).

As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the MLM objective enables the representation to use both the left and the right context, which allows us to pre-train a deep bidirectional Transformer.


In addition to the masked language model, BERT also uses a “next sentence prediction” task as a pre-training objective. This makes BERT better at modelling relationships between sentences. During training, 50% of the inputs are pairs in which the second sentence is the actual subsequent sentence in the original document; in the other 50%, a random sentence from the corpus is used as the second sentence.

The Architecture of BERT (Bidirectional Encoder Representations from Transformers)

Using BERT in Question Answering Systems

Building a Question Answering System with BERT (SQuAD 1.1)

For the Question Answering task, BERT takes the input question and passage as a single packed sequence. The input embeddings are the sum of the token embeddings and the segment embeddings. The input is processed in the following way before entering the model:

  1. Token embeddings: A [CLS] token is added at the beginning of the question, and a [SEP] token is inserted at the end of both the question and the paragraph.
  2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token, which allows the model to distinguish between the two sequences. All tokens belonging to the question are marked as A, and those belonging to the paragraph are marked as B, as in the sketch below.
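To make this packing concrete, here is a minimal hand-written sketch of what the combined sequence and its segment IDs look like. The question and passage here are made up for illustration; in practice the tokenizer builds all of this for us, as shown in the walkthrough below.

```python
# Illustrative only: how a (hypothetical) question and passage are packed for BERT QA.
question_tokens = ["[CLS]", "who", "wrote", "hamlet", "?", "[SEP]"]
passage_tokens  = ["hamlet", "was", "written", "by", "william", "shakespeare", "[SEP]"]

packed_tokens = question_tokens + passage_tokens

# Segment (token type) IDs: 0 for the question (Sentence A), 1 for the passage (Sentence B).
segment_ids = [0] * len(question_tokens) + [1] * len(passage_tokens)

print(packed_tokens)
print(segment_ids)
```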

To fine-tune BERT for a question answering system, two new vectors are learned during fine-tuning: a start vector and an end vector. The probability of each word being the start word is calculated by taking a dot product between the final embedding of that word and the start vector, followed by a softmax over all the words. The word with the highest probability is taken as the start of the answer span.

A similar process is followed to find the end-word.

The Transformer output layers used to find the start word and the end word

Note: The start vector and the end vector will be the same for all the words.
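As a rough sketch of this scoring step (not the library's internal implementation), the start and end probabilities can be computed as a dot product of each token's final hidden state with the learned start/end vector, followed by a softmax over the sequence:

```python
import torch
import torch.nn.functional as F

# Toy dimensions: a 10-token sequence and hidden size 768 (as in BERT-base).
hidden_states = torch.randn(10, 768)  # final BERT embedding of each token
start_vector = torch.randn(768)       # learned during fine-tuning
end_vector = torch.randn(768)         # learned during fine-tuning

# Dot product of every token embedding with the start/end vector, then softmax.
start_probs = F.softmax(hidden_states @ start_vector, dim=0)
end_probs = F.softmax(hidden_states @ end_vector, dim=0)

start_index = torch.argmax(start_probs).item()
end_index = torch.argmax(end_probs).item()
```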

The Hugging Face Transformers library provides a BertForQuestionAnswering model class, along with checkpoints that are already fine-tuned on the SQuAD dataset. The Stanford Question Answering Dataset (SQuAD) is a collection of 100,000+ crowdsourced question-answer pairs.

Let’s have a walk-through of the code!

Install the transformers library.
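For example, with pip (in a Colab notebook, prefix the command with an exclamation mark):

```
pip install transformers
```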

Load the BertForQuestionAnswering model and the tokenizer.
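A minimal version of this step is sketched below. The checkpoint used here, bert-large-uncased-whole-word-masking-finetuned-squad, is one publicly available BERT-large model fine-tuned on SQuAD 1.1; any SQuAD-fine-tuned BERT checkpoint will do.

```python
from transformers import BertForQuestionAnswering, BertTokenizer

# A publicly available BERT-large checkpoint already fine-tuned on SQuAD 1.1.
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"

model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)
```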

Note: The BertForQuestionAnswering class supports fine-tuning, so we can also fine-tune this model on our own dataset.

Create a QA example and use the encode_plus() function to encode it. encode_plus() returns a dictionary containing input_ids, token_type_ids, and attention_mask, but we only need input_ids and token_type_ids for the QA task.
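A sketch of this step is shown below. The question and passage are made-up stand-ins; the answer printed later in this post comes from a SQuAD passage about the Methodists, not from this toy example.

```python
# A toy QA example; substitute your own question and passage here.
question = "Who wrote Hamlet?"
passage = ("Hamlet is a tragedy written by William Shakespeare "
           "sometime between 1599 and 1601.")

# encode_plus() packs the question and passage into a single sequence:
# [CLS] question tokens [SEP] passage tokens [SEP]
encoding = tokenizer.encode_plus(question, passage)

input_ids = encoding["input_ids"]            # token ids of the packed sequence
token_type_ids = encoding["token_type_ids"]  # 0 for question tokens, 1 for passage tokens

# Keep the token strings around so we can read the predicted span later.
tokens = tokenizer.convert_ids_to_tokens(input_ids)
```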

Note: In the case of multiple QA examples, we’ll need to make all the input sequences the same length by padding shorter ones with the token id 0.

Run the QA example through the loaded model.
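Continuing the sketch above (recent versions of the transformers library return an output object with start_logits and end_logits; older versions returned a plain (start_scores, end_scores) tuple):

```python
import torch

with torch.no_grad():
    outputs = model(
        torch.tensor([input_ids]),                      # batch of one example
        token_type_ids=torch.tensor([token_type_ids]),  # question vs. passage segments
    )

start_scores = outputs.start_logits  # shape: (1, sequence_length)
end_scores = outputs.end_logits      # shape: (1, sequence_length)
```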

Now that we have the start scores and the end scores, we can get the start index and the end index and use them for span prediction.
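A simple way to do this (ignoring, for now, the constraint discussed in the note below) is to take the argmax of each score vector and join the tokens in between. With the toy example above, the printed answer will of course differ from the SQuAD answer shown further down.

```python
start_index = torch.argmax(start_scores).item()
end_index = torch.argmax(end_scores).item()

# Join the tokens in the predicted span; wordpieces still carry their "##" prefix here.
answer = " ".join(tokens[start_index : end_index + 1])
print(answer)
```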

Note: The model may predict an end word that comes before the start word. The correct way is to pick the span for which the total score (start score + end score) is maximum, with end_index ≥ start_index.

Predicted Answer: “method ##ical and exceptionally detailed in their bible study, opinions and disciplined lifestyle”

Ground Truth Answers:
1. being methodical and exceptionally detailed in their Bible study
2. They focused on Bible study, methodical study of scripture and living a holy life
3. being methodical and exceptionally detailed in their Bible study, opinions and disciplined lifestyle.

Note: BERT uses WordPiece tokenization. WordPiece splits tokens like “playing” into “play” and “##ing”. This also helps it cover a wider spectrum of Out-Of-Vocabulary (OOV) words.

We can recover any words that were broken down into subwords with a little bit more work.
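One way to do this (a sketch, not necessarily the exact code behind the output below) is to glue any token that starts with “##” back onto the previous token:

```python
answer_tokens = tokens[start_index : end_index + 1]

# Merge wordpiece continuations ("##...") back into the preceding token.
answer = answer_tokens[0]
for token in answer_tokens[1:]:
    if token.startswith("##"):
        answer += token[2:]          # continuation: append without a space
    else:
        answer += " " + token        # a new word: add a space
print(answer)
```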

Corrected Answer: "methodical and exceptionally detailed in their bible study, opinions and disciplined lifestyle"

Conclusion

BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language Processing. It enables fast fine-tuning and can be used for a wide range of practical applications and downstream tasks.

Here, I’ve tried to give a complete guide to using BERT for the question answering task, with the hope that you will find it useful to do some NLP awesomeness.

If you’ve enjoyed this article and found it useful, please give it a clap and help other NLP enthusiasts find it.

For more articles, disseminations, and hands-on NLP tutorials, follow us on medium.com.

Find us on Facebook, LinkedIn, and Twitter, where we regularly post useful articles for NLP practitioners and Conversational AI enthusiasts.

References

  1. Applying BERT to Question Answering https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw
  2. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) http://jalammar.github.io/illustrated-bert/
  3. BERT paper https://arxiv.org/pdf/1810.04805.pdf
