How I build a question answering model

Martin Decombarieu
Published in Analytics Vidhya · 6 min read · Apr 13, 2020

A question answering model is simply a computer program that answers the questions you ask.

To make a program capable of doing this, we will need to train a machine learning algorithm on a series of questions and answers.

We will see how to do this in this article.

The dataset we will use is the Stanford Question Answering Dataset (SQuAD), which contains over 100,000 answers paired with their questions.
The dataset also includes other information that we won't need in this simple case.

Preparing data

Retrieve the data into a dataframe

This is what the data in the JSON file provided to us looks like.
We want to retrieve the answer text for each answer, as well as the information associated with it. To do this, we can proceed as follows:
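The original extraction code isn't reproduced here; a minimal sketch, assuming the standard SQuAD v1.1 JSON layout (data → paragraphs → qas → answers), might look like this:

```python
import json

import pandas as pd


def squad_to_dataframe(path):
    """Flatten the nested SQuAD JSON into one dataframe row per answer."""
    with open(path) as f:
        squad = json.load(f)
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    rows.append({
                        "question": qa["question"],
                        "answer_text": answer["text"],
                        "answer_start": answer["answer_start"],
                        "context": context,
                    })
    return pd.DataFrame(rows)
```

Each row keeps the context and the character offset of the answer, which we will need later to locate the sentence containing the answer.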

By applying this function, we go from JSON to a dataframe we can work with:

Here each row represents an answer to a question. This will help us when we train our machine learning model, as we will give it the question as the feature and the answer as the label.

Training word embedding models

Machine learning models can only be trained on numbers, so we will have to vectorize our questions and answers, following these steps:

  1. We train our word embedding models on the context (the texts to which the questions relate).
  2. We vectorize our questions and answers with the model previously obtained.
  3. We train our machine learning model with these vectorized questions and answers.

So, as a first step, we need to retrieve our contexts, preprocess them, and train our word embedding models on them.

Pre-processing feature and label

For the answers, we'll take the whole sentence containing the answer. This will help us later when we score our model.

This function goes through the context, finds the sentence containing the answer, preprocesses those sentences, and drops answers that are of no use.
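The original function isn't shown; a stdlib-only sketch (with a minimal tokenizer standing in for gensim's simple_preprocess) could look like this:

```python
import re


def tokenize(text):
    # minimal stand-in for gensim's simple_preprocess: lowercase word tokens
    return re.findall(r"[a-z]+", text.lower())


def answer_sentence_tokens(context, answer_start):
    """Find the sentence containing the answer (located by its character
    offset) and preprocess it. Returns None for answers we drop because
    nothing usable is left after preprocessing."""
    pos = 0
    # naive sentence split on ., ! or ? followed by whitespace
    for sentence in re.split(r"(?<=[.!?])\s+", context):
        end = pos + len(sentence)
        if pos <= answer_start <= end:
            tokens = tokenize(sentence)
            return tokens or None
        pos = end + 1
    return None
```

The `answer_start` offset from the SQuAD file tells us exactly which sentence the answer lives in.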

For the questions it's much simpler: we just use gensim's simple_preprocess function.

Embed question and answer

We will use two different methods to embed our questions and answers. We will then compare their performance.

The first method is very simple: we use FastText to embed everything. FastText has the advantage of handling out-of-vocabulary words, so we won't get errors from trying to embed words the model has never seen.

The second method is a little more complex, here is the code:

In fact, all that puzzling code could be summed up with this diagram

Then we can apply those functions to our dataset to embed the questions and answers.

We begin by the mix embedding:

We average each question in order to have a vector per sentence and not per word.
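Averaging the per-word vectors into a single sentence vector can be written like this:

```python
import numpy as np


def sentence_vector(word_vectors):
    """Collapse a (n_words, dim) array of word vectors into one (dim,) vector."""
    return np.asarray(word_vectors).mean(axis=0)
```

For example, averaging the two word vectors [1, 2] and [3, 4] yields the sentence vector [2, 3].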

We can do the same for the FastText embedding:

Just before training our machine learning models, we need to deal with the shape of the input. Indeed, scikit-learn models take a 2D array as input, which is not the case for the pandas Series we just created.

To deal with this problem, we create a function that transforms a pandas Series into a 2D array the machine learning model can use.

  • question_np, answer_np : 2D arrays containing the questions and answers vectorized with the mix method
  • question_fastText_np, answer_fastText_np : 2D arrays containing the questions and answers vectorized with the FastText-only method
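One way to write that conversion, assuming each entry of the Series is a fixed-length 1-D vector:

```python
import numpy as np
import pandas as pd


def series_to_2d(series):
    """Stack a pandas Series of equal-length 1-D vectors into a 2D array."""
    return np.vstack(series.to_numpy())
```

For instance, question_np = series_to_2d(question_series), where question_series is a placeholder name for the Series of averaged question vectors.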

Training and scoring models

Training our Machine learning models

We’ll use two different models from the scikit-learn library: Support Vector Regressor (SVR) and Gradient Boosting Regressor (GBR). We will wrap each in a multi-output regressor which, as its name indicates, lets us predict a set of output values (here, a vectorized answer).
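With scikit-learn this wiring looks like the following sketch, where toy random data stands in for the vectorized questions and answers:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# toy stand-ins: 50 "questions" of dim 10, 50 "answers" of dim 5
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=(50, 5))

# MultiOutputRegressor fits one regressor per output dimension
svr_model = MultiOutputRegressor(SVR())
gbr_model = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=20))

svr_model.fit(X, y)
gbr_model.fit(X, y)

pred = svr_model.predict(X[:3])  # one vectorized answer per question
```

MultiOutputRegressor simply trains one independent regressor per output coordinate, which is why training is slow on high-dimensional answer vectors.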

To test the effectiveness of our word embedding methods, we will train the models on the two different sets of vectorized questions and answers we have.

So we will have 4 different models:

  • An SVR model with questions and answers embedded with the mix method
  • An SVR model with questions and answers embedded with the FastText-only method
  • A GBR model with questions and answers embedded with the mix method
  • A GBR model with questions and answers embedded with the FastText-only method

Scoring our models

Now we need to test our trained models. We can't use scikit-learn's ready-made scoring functions here, because the right measure depends on our use case.
In this article I'll show how to write these functions, but I won't run them, because they take several days to process the whole dataset.

So we will use two main functions:

  • get_vectorized_context takes a context as input and returns an array containing each sentence of this context associated with its vectorization.
  • least_distance takes a vector x and an array of vectors y as input, and returns the vector in y closest to x under Euclidean distance.

If in our context, the closest vector to the one we have predicted is the true answer, we consider the prediction to be good.
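A sketch of least_distance with NumPy:

```python
import numpy as np


def least_distance(x, y):
    """Return the row of y with the smallest Euclidean distance to x."""
    distances = np.linalg.norm(y - x, axis=1)
    return y[np.argmin(distances)]
```

A prediction then counts as correct when least_distance(predicted_vector, context_vectors) is the vector of the true answer sentence.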

In this article we have looked at a classic NLP problem, which let us experiment with the different steps needed to preprocess and analyze textual data.

We also saw the value of cloud computing: with one of the models taking more than 20 minutes to train, we can't try many parameter combinations.
With more power we could have done some cross validation or grid search to find the best parameters.
