Question Answering System with BERT

This article explains what BERT is, the advantages of BERT, and how to create a question answering (QA) system with fine-tuned BERT.

Nishanth N
Analytics Vidhya
3 min readJul 27, 2020



What is BERT?

BERT, which stands for Bidirectional Encoder Representations from Transformers developed by researchers at Google in 2018, is based on Transformers, a deep learning model in which every output element is connected to every input element, and the weightings between them are dynamically calculated based upon their connection.

It is designed to pre-train deep bidirectional representations from an unlabeled text by jointly conditioning on both the left and right contexts. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of NLP tasks.

Why BERT?

BERT can better understand longer, more conversational queries and, as a result, surface more relevant results. BERT is applied to both organic search results and featured snippets. While you can optimize content for those queries, you cannot “optimize for BERT.”

To simplify: BERT helps the search engine understand the significance of small connecting words like ‘to’ and ‘for’ in the keywords used.

Question Answering System using BERT

Building a Question Answering System with BERT

For the question answering task, BERT takes the input question and the passage as a single packed sequence. The input embeddings are the sum of the token embeddings, the segment embeddings, and the position embeddings; the first two are what pack the question and passage together.

  1. Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the question and a [SEP] token is inserted at the end of both the question and the paragraph.
  2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token, which allows the model to distinguish between the two pieces of text. All tokens marked as A belong to the question, and those marked as B belong to the paragraph; the two pieces of text are separated by the special [SEP] token.
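The packing scheme above can be sketched in plain Python (the word-level tokens here are illustrative; the real tokenizer uses WordPiece subwords):

```python
# Build the packed sequence [CLS] question [SEP] passage [SEP]
# and the matching segment ids (0 = question side, 1 = passage side).
question_tokens = ["what", "is", "machine", "learning", "?"]
passage_tokens = ["machine", "learning", "is", "the", "scientific",
                  "study", "of", "algorithms", "and", "statistical", "models"]

tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)

# Every token gets exactly one segment marker.
assert len(tokens) == len(segment_ids)
```

Note that [CLS] and the first [SEP] are counted with segment A, while the final [SEP] belongs to segment B.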

BERT uses “Segment Embeddings” to differentiate the question from the reference text. These are simply two embeddings (for segments “A” and “B”) that BERT learned, and which it adds to the token embeddings before feeding them into the input layer.

Start & End Token Classifiers

(Figure: transformer layers feeding the start-token and end-token classifiers.)

For every token in the text, we feed its final embedding into the start token classifier. The start token classifier has only a single set of weights, which it applies to every token.

After taking the dot product between each output embedding and the ‘start’ weights, we apply the softmax activation to produce a probability distribution over all of the tokens. Whichever token has the highest probability of being the start token is the one that we pick. The end token classifier works in exactly the same way, with its own separate set of ‘end’ weights.
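This dot-product-plus-softmax step can be sketched with NumPy (random numbers stand in for BERT's final-layer embeddings; the start and end weight vectors would be learned during fine-tuning):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 10, 768                 # 10 tokens, BERT-base hidden size

embeddings = rng.standard_normal((seq_len, hidden))  # final-layer outputs
start_weights = rng.standard_normal(hidden)          # learned "start" vector
end_weights = rng.standard_normal(hidden)            # learned "end" vector

def token_probs(weights):
    scores = embeddings @ weights         # one dot product per token
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

start_probs = token_probs(start_weights)
end_probs = token_probs(end_weights)

start_index = int(start_probs.argmax())   # predicted answer start position
end_index = int(end_probs.argmax())       # predicted answer end position
```

The same set of weights scores every position, so the classifier's parameter count is independent of the sequence length.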

Let’s get to the code!

First, install the transformers library.
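For example (PyTorch is installed alongside it, since the snippets below use it as the backend):

```shell
pip install transformers torch
```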

Load the BertForQuestionAnswering model and the tokenizer.
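A sketch of the loading step; the original gist is not shown, so the checkpoint name is an assumption — bert-large-uncased-whole-word-masking-finetuned-squad is a publicly available BERT model already fine-tuned on SQuAD:

```python
from transformers import BertForQuestionAnswering, BertTokenizer

# Assumed checkpoint: BERT-large already fine-tuned on SQuAD
MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"

model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
```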

Create a QA example and use the function encode_plus() to encode it. The function encode_plus() returns a dictionary containing input_ids, token_type_ids, and attention_mask, but we only need input_ids and token_type_ids for the QA task.

Run the QA example through the loaded model.
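A sketch of the forward pass, repeating the setup from the previous steps so the snippet is self-contained (the checkpoint name and example text are assumptions, as above):

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

question = "What is Machine Learning?"
passage = ("Machine Learning is the scientific study of algorithms "
           "and statistical models that computer systems use to perform "
           "a specific task without using explicit instructions.")
encoding = tokenizer.encode_plus(question, passage, return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(input_ids=encoding["input_ids"],
                    token_type_ids=encoding["token_type_ids"])

start_scores = outputs.start_logits  # one score per token: answer start
end_scores = outputs.end_logits      # one score per token: answer end
```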

We then take both the start index and the end index and use them for span prediction.
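Putting the pieces together, a self-contained sketch of span prediction (same assumed checkpoint and example text as above):

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

MODEL_NAME = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(MODEL_NAME)
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

question = "What is Machine Learning?"
passage = ("Machine Learning is the scientific study of algorithms "
           "and statistical models that computer systems use to perform "
           "a specific task without using explicit instructions.")
encoding = tokenizer.encode_plus(question, passage, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# The highest-scoring start and end positions define the answer span.
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)

tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
answer = " ".join(tokens[start_index:end_index + 1])
print("The answer is :", answer)
```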

We can recover any words that were broken down into subwords with a little bit more work.
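The recovery step is plain string handling: WordPiece marks word continuations with a leading “##”, so we glue those pieces back onto the previous token. A minimal sketch on a hard-coded token list (the tokens are illustrative):

```python
# Rebuild whole words from WordPiece subword tokens.
answer_tokens = ["statistical", "model", "##s"]

answer = answer_tokens[0]
for token in answer_tokens[1:]:
    if token.startswith("##"):
        answer += token[2:]      # continuation piece: append without "##"
    else:
        answer += " " + token    # new word: append with a space

print(answer)  # -> statistical models
```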

Output:

The answer is : the scientific study of algorithms and statistical models

Conclusion

I hope you have now understood how to create a Question Answering System with fine-tuned BERT. Thanks for reading!
