DistilBERT for a Question-Answering System

Wael Dimassi
Published in Analytics Vidhya · Jul 6, 2020 · 7 min read

In this article, I will show how to create an Open-Domain Question Answering bot using BM25 and DistilBERT.

Anyone searching for a piece of information will typically type a question or a few keywords into Google, click the first links returned, and then face several long paragraphs in which the needed information is buried. That can be a tiring and unappealing task, especially for someone in a hurry. Our project handles all of that for you: it takes your question and runs it through the whole pipeline. Thanks to revolutionary Natural Language Processing techniques and Machine Learning models that now give far better results than ever before, this is possible and can make your life easier.

How does that work?

Let’s get back to the person in a hurry for the information. First, they need to read pretty much every paragraph to locate the one that contains the relevant information; then they focus on understanding it and finally extract the exact answer. Pretty much the same thing happens in our pipeline: the Ranker is responsible for finding the best paragraph, and the Reader is the one that extracts the exact answer. We can clearly see the analogy between how a human finds the right answer and how our pipeline works. But let me explain more.

Question Answering Bot:

The three main components that the Question Answering bot mechanism is built on are the Retriever, the Ranker and the Reader. The architecture of our pipeline is represented in the figure below.

High level architecture

In more explicit words, we need to perform three levels of selection: find the best Wikipedia article, then the best paragraph, and finally the best sentence or paragraph chunk.

So let’s dig deeper into each one:

The retriever: It acts at the earliest stage. When a question is asked and sent to the system, it mainly outsources the search to the Google search engine to find the best Wikipedia article, and this step usually returns the link required for the rest of the pipeline.
Once there, we scrape that web page, extract all paragraphs, and store them in a list that is cleaned afterwards. Because text format is crucial, we really need to focus on removing noise: HTML tags, useless special characters, and Unicode errors.
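A minimal sketch of that scraping and cleaning step might look like this (requests, BeautifulSoup, the example URL, and the fetch_paragraphs helper are illustrative assumptions, not the project's exact code; the Google-search step that finds the article URL is omitted here):

```python
# Download a Wikipedia article and return its cleaned paragraphs.
import re
import requests
from bs4 import BeautifulSoup

def fetch_paragraphs(wikipedia_url):
    html = requests.get(wikipedia_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        # Remove citation markers such as [1], [23] and collapse whitespace.
        text = re.sub(r"\[\d+\]", "", text)
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            paragraphs.append(text)
    return paragraphs

paragraphs = fetch_paragraphs("https://en.wikipedia.org/wiki/Alan_Turing")
```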

The ranker: In this phase, we have the list of paragraphs, which can be enormous in some cases, and the question we need to match against. This is the hardest part, since the final output depends heavily on it. We used a bag-of-words retrieval function, called BM25, that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
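As a sketch of this step, reusing the paragraphs list from the retriever sketch above, the rank_bm25 package could be used like this (the package choice is an assumption; the article does not name a specific BM25 implementation):

```python
# Rank paragraphs against the question with BM25 and keep the best ones.
from rank_bm25 import BM25Okapi

def rank_paragraphs(question, paragraphs, top_k=3):
    """Return the top_k paragraphs most relevant to the question, best first."""
    tokenized_corpus = [p.lower().split() for p in paragraphs]
    bm25 = BM25Okapi(tokenized_corpus)
    scores = bm25.get_scores(question.lower().split())
    ranked = sorted(zip(scores, paragraphs), key=lambda pair: pair[0], reverse=True)
    return [p for _, p in ranked[:top_k]]

best_paragraphs = rank_paragraphs("When was Alan Turing born?", paragraphs)
```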

The reader: The final step is extracting, from the chosen paragraph, the span that answers our question. This kind of task requires a language model with natural language understanding, trained on SQuAD. In fact, even with a big model, skipping the ranking phase is error-prone and will hurt the results enormously, because such models can get lost in large amounts of text but are very powerful on average-sized or small paragraphs.
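As a sketch of that step, the Hugging Face transformers pipeline with a DistilBERT checkpoint fine-tuned on SQuAD could be used like this (the exact checkpoint is an assumption, not necessarily the one used in the project):

```python
# Extract an answer span from a single paragraph with a DistilBERT SQuAD model.
from transformers import pipeline

reader = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")

def extract_answer(question, paragraph):
    """Return the answer span and the model's confidence for one paragraph."""
    result = reader(question=question, context=paragraph)
    return result["answer"], result["score"]

answer, score = extract_answer(
    "When was Alan Turing born?",
    "Alan Mathison Turing was born on 23 June 1912 in Maida Vale, London."
)
print(answer, score)
```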

As you may know, there are many open-domain QA solutions that tackle the same problem. The accuracy of the results depends on the chosen models and NLP techniques. Our motivation for this project was to test DistilBERT as a question answering model and obtain accurate, relevant results. We have also built a web interface for our solution, which we present at the end of the article.

Ranker using BM25: what is it? Why did we use it?

BM25 is a ranking function. It represents the state of the art among TF-IDF-like retrieval functions used in document retrieval. It is not just a term-scoring method, but rather a method for scoring documents in relation to a query.
We chose it because it gives better results than TF-IDF when used for ranking and it is more robust. (2)

Actually BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.

Ranking Function:

The ranking function computes a score between the question and each paragraph, sorts the paragraphs by that score, and returns the highest-scoring one.

For a question Q and a document D, we use the following formula:
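(Reconstructed here in LaTeX, since the original figure is not reproduced; this is the standard BM25 formulation.)

score(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}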

Here IDF(qi) is the inverse document frequency of the query term qi, f(qi, D) is qi's term frequency in document D, |D| is the length of D in words, and avgdl is the average document length in the collection. k1 and b are free parameters, usually chosen, in the absence of an advanced optimization, as k1 ∈ [1.2, 2.0] and b = 0.75.
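To make the terms above concrete, here is a minimal, unoptimized Python sketch of that score for a single document (only an illustration, not the project's code):

```python
import math

def bm25_score(query_terms, document, corpus, k1=1.5, b=0.75):
    """Score one document (a list of terms) against a query, following the formula above."""
    avgdl = sum(len(doc) for doc in corpus) / len(corpus)       # average document length
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        n_containing = sum(1 for doc in corpus if term in doc)  # documents containing the term
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1)
        freq = document.count(term)                              # term frequency f(qi, D)
        denom = freq + k1 * (1 - b + b * len(document) / avgdl)
        score += idf * (freq * (k1 + 1)) / denom
    return score
```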

Don’t get intimidated by the math behind this. What we wanted to show is how this improves on TF-IDF, and in our use it gave promising results. As a side note, this method is used in the O’Reilly library project for searching book chunks.

Why DistilBERT as a QA model?

DistilBERT’s name refers to distillation, a technique for making BERT models smaller and thus faster. Distillation compresses a large model, called the teacher, into a smaller model, called the student, which is trained to reproduce the same behavior. The alternatives are very large; today we talk about models with 8.3 billion parameters: 24 times larger than BERT-large and 5 times larger than GPT-2, while RoBERTa, the latest work from Facebook AI, was trained on 160 GB of text. That is huge, and not a good choice for production: even with a GPU, such a model is hard to deploy given its size and response time.
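To make the size difference concrete, here is a quick sketch that counts parameters for the publicly available BERT base and DistilBERT checkpoints (the checkpoint names are assumptions; the article does not specify which ones the project uses):

```python
# Rough size comparison between BERT base and DistilBERT.
from transformers import AutoModel

def count_parameters(model):
    """Total number of parameters in the model."""
    return sum(p.numel() for p in model.parameters())

bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

print("BERT base:  ", count_parameters(bert))        # roughly 110M parameters
print("DistilBERT: ", count_parameters(distilbert))  # roughly 66M parameters
```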

So the question was: how can we tame these big models?

Several papers have addressed this. The majority of them are not academic ones, since the models have become so big that they require huge computational power, and the environmental cost of training them is high: they consume a lot of energy and thus produce a lot of carbon dioxide emissions.

There are three main approaches to making models smaller:

  • Pruning:

Model pruning, in a nutshell, is about eliminating superfluous values in the weight tensors: some of the network’s parameters are set to zero to remove what are estimated to be unnecessary connections between the layers. Pruning is applied during training so that the network can adapt to the changes. For example, pruning an MNIST model to 90% sparsity can shrink it from about 12 MB to about 2 MB. This technique is applicable to any type of neural network across distinct tasks.
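As an illustration of the idea (not the MNIST experiment mentioned above), here is a minimal sketch using PyTorch’s built-in pruning utilities:

```python
# Zero out the 90% smallest-magnitude weights of a single linear layer.
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(784, 256)
prune.l1_unstructured(layer, name="weight", amount=0.9)

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Sparsity after pruning: {sparsity:.0%}")  # ~90%
```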

  • Quantization:

Quantization is another simple technique: it performs computations and stores tensors at lower bit widths. A quantized model has some or all of its tensor values converted to integers rather than floating-point values. This allows for a more compact model representation and the use of high-performance vectorized operations on many hardware platforms. Compared to the original model, a quantized model can reach a 4x reduction in size and a 4x reduction in memory bandwidth requirements; simpler computation also means lower energy consumption. But with post-training quantization there can be an accuracy loss, particularly for smaller networks, which is the price to pay.
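Here is a minimal post-training dynamic quantization sketch in PyTorch (an illustration of the concept, not necessarily how one would quantize DistilBERT in production):

```python
# Convert the Linear layers of a small model to 8-bit integer weights.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 2),
)

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(quantized_model)  # Linear layers are now dynamically quantized modules
```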

  • Distillation:

Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model. This training setting is sometimes referred to as “teacher-student”, where the large model is the teacher and the small model is the student.

DistilBERT is a good example of distillation. Comparing it against BERT base, which is none other than DistilBERT’s teacher, DistilBERT compares surprisingly well: it retains more than 95% of the performance with 40% fewer parameters. In terms of inference time, DistilBERT is 60% faster and smaller than BERT and 120% faster and smaller than ELMo+BiLSTM.
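For intuition, here is a sketch of the soft-target part of a distillation loss, the temperature-scaled KL divergence between teacher and student predictions (DistilBERT’s actual training objective also combines this with a masked-language-modeling loss and a cosine embedding loss):

```python
# Soft-target distillation loss between teacher and student logits.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```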

Question Answering System interface:

After some extra work, I managed to wrap my Question Answering System into an API. I also created a simple HTML page that consumes that API: it sends the query and receives the answer, the paragraph from which the answer was extracted, and the link to the Wikipedia article.
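As an illustration, a minimal version of such an endpoint could look like this (Flask is an assumption, since the article does not name the framework, and answer_question is a hypothetical stand-in for the retriever → ranker → reader pipeline described above):

```python
# Minimal QA API sketch: POST a question, receive the answer, the supporting
# paragraph, and the source Wikipedia link.
from flask import Flask, jsonify, request

app = Flask(__name__)

def answer_question(question):
    # Hypothetical placeholder: the real system runs the full
    # retriever -> ranker -> reader pipeline here.
    return "stub answer", "stub paragraph", "https://en.wikipedia.org/wiki/Example"

@app.route("/ask", methods=["POST"])
def ask():
    question = request.json["question"]
    answer, paragraph, url = answer_question(question)
    return jsonify({"answer": answer, "paragraph": paragraph, "source": url})

if __name__ == "__main__":
    app.run(port=5000)
```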

Web interface of the QA System

Conclusion:

This project was initially just for fun and to explore the power of DistilBERT. You can check out the whole project here; pull requests are welcome.
Feel free to reach out to me if you have any comments.
