Question Answering and Information Retrieval using Neural Networks

Gan Yun Tian
8 min read · Aug 30, 2022


Photo by Mick Haupt on Unsplash

In traditional question answering and retrieval systems, objects are stored in a relational database and retrieved or ranked using classical algorithms. With advances in machine learning, deep neural networks have become a feasible replacement for these traditional question-answering and search algorithms. In fact, neural networks can even outperform them.

In this step-by-step tutorial, you'll build a neural network to perform text-based question answering and search from scratch. By the end of this article, you will have the necessary knowledge and code to implement a neural network on your own data. The approach used in this article can also be applied to other tasks such as recommendation (collaborative or content-based filtering), chatbots (e.g., retrieving similar dialogues for response generation), and cross-modality or cross-language search.

The key components of a question answering/retrieval system

Photo by Naseerali MK on Unsplash

The easiest way to search for an object is via its attributes. For text, we can simply search based on the words and their context. For other modalities such as images or audio, this becomes much harder and requires similarity functions and ways to compress the data, but the underlying idea is similar to the one applied in this article.

Besides being able to represent and search for an object, we need a way to store and hold the data. This is where data structures come in handy. A common approach is to build an array-, tree-, or graph-based index to hold your data; trees and graphs scale well and are widely used. With that said, these are the essential components we need to build any question answering and retrieval system. A minimal sketch of the simplest option is shown below.
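To make the idea of an index concrete, here is a minimal sketch of the simplest option: a flat, array-based index searched by brute-force dot product. The class name and API are illustrative assumptions, not taken from a specific library; tree- or graph-based indexes follow the same add/search pattern but scale better.

```python
import torch

# A minimal flat (array-based) index: store every passage vector in one matrix
# and score a query against all of them with a dot product.
class FlatIndex:
    def __init__(self, dim: int):
        self.dim = dim
        self.vectors = torch.empty(0, dim)
        self.ids = []

    def add(self, ids, vectors):
        # vectors: (n, dim) tensor, ids: list of n passage identifiers
        self.vectors = torch.cat([self.vectors, vectors], dim=0)
        self.ids.extend(ids)

    def search(self, query, k=5):
        # query: (dim,) tensor -> top-k most similar stored passages
        scores = self.vectors @ query                    # dot-product similarity
        top = torch.topk(scores, k=min(k, len(self.ids)))
        return [(self.ids[i], top.values[j].item())
                for j, i in enumerate(top.indices.tolist())]
```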

Clustering your way to an effective and efficient question answering and search system

Photo by Quaritsch Photography on Unsplash

The key idea behind question answering and search systems powered by neural networks is clustering in a high-dimensional space. Typically, such clustering is coupled with a data structure to give the data persistence, hierarchy, and organization. Clustering also helps, to some extent, with real-world data, which typically does not come from a single distribution. You may have noticed that this is very similar to algorithms like k-means, SVM, or GMM, and that is exactly the idea. The difference is that a neural network learns the representation itself, so the resulting clusters capture semantic meaning rather than relying on hand-crafted features. However, before we can do any clustering, we need to represent the data in a multi-dimensional space.

A traditional approach is to represent a collection of texts as a bag of words, where each passage is a row of numbers and each number indicates the frequency of occurrence for a unique word. This works well in practice but has its own drawbacks such as vector sparsity, the curse of dimensionality, and a limited ability to capture the semantic meaning of a text.

The neural network approach is similar, except that texts are represented as dense vectors of continuous values that are learned during training. When text is fed to a model like BERT, the self-attention mechanism allows the model to capture the semantic relationships between words. Specifically, the neural network maps texts into a common vector space, where a similarity measure such as the dot product is then used to compare the resulting vectors. To produce learned vectors that can be clustered, such neural networks are typically trained to maximize the similarity between relevant pairs. A minimal sketch of this encode-then-compare step follows.
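As a rough illustration, the sketch below embeds texts with a pretrained BERT from the Hugging Face transformers library and scores them with a dot product. The model name and the choice of [CLS] pooling are assumptions made for this example, not necessarily the exact setup used later in the article.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative sketch: encode texts with a pretrained BERT, take the [CLS]
# vector as the text embedding, and compare embeddings with a dot product.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]        # (batch, hidden) -- [CLS] pooling

query_vec = encode(["who directed the movie?"])
passage_vecs = encode(["The film was directed by ...", "An unrelated plot summary."])
scores = query_vec @ passage_vecs.T            # dot-product similarity, shape (1, 2)
print(scores)
```

In practice the encoder is fine-tuned first, which is exactly what the training section below covers.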

Hence, we want to learn a relevance function (denoted h) that gives high similarity scores to passages (denoted p) that are relevant to a query (q), and low scores otherwise. This is achieved with a pair of neural networks, the encoders E_q and E_p, with h defined as the transposed output of E_q multiplied with the output of E_p:

h(q, p) = E_q(q)^T E_p(p)

Additionally, as with every neural network, we need a function to optimize. We do so by finding the set of weights (denoted theta) that minimizes the following loss over a query q, its relevant passage p+, and n non-relevant passages p_1-, …, p_n-:

L(q, p+, p_1-, …, p_n-) = -log [ exp(h(q, p+)) / ( exp(h(q, p+)) + Σ_j exp(h(q, p_j-)) ) ]
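Before unpacking what this loss does geometrically, here is a minimal sketch of how it is commonly computed with in-batch negatives, where every other passage in the batch serves as a negative for a given query. The function name is illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, p_emb):
    # q_emb, p_emb: (batch, dim) embeddings where q_emb[i] matches p_emb[i].
    # Every other passage in the batch acts as a negative for query i.
    scores = q_emb @ p_emb.T                            # (batch, batch) similarity matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)             # -log softmax of the positive pair
```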

This loss function is a cross-entropy over similarity scores, and minimizing it pulls relevant texts (those with higher similarity values) closer together, which becomes visible when the embeddings are visualized in a 3-dimensional space. The numerator encourages the model to maximize the score of the positive pair, while the summed term in the denominator pushes incorrect pairs to be scored lower than, and therefore placed further away than, the positive pair. The following images illustrate this:

Image retrieved from link, by Mahmut Kaya

Essentially, a model will seek to maximize the likelihood of predicting the correct result for a query. Next, let's see how you can train such a network. But first, we will need a dataset.

Prerequisites

Before we proceed any further, you will need to install PyTorch and NVIDIA CUDA. You can refer to this link for help setting up the prerequisites. We will also need a GPU; otherwise, you can just use Google Colab. If you are unsure how, you can refer to this guide on how to use it!

Getting and creating a dataset

For this example, I used a movie dataset (a CSV file) from Kaggle. In this dataset, each movie's title and plot are given. Our goal is to create a search system that can find related passages from a movie's plot based on your query.

For that, we first have to create training data. The following code helps to create the dataset: I chunk each movie's plot into passages of length 100 and use the movie title as their label.

Code temporarily removed! — to be updated
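Since the code is omitted above, here is a rough sketch of one way the chunking step could look. The column names ("Title", "Plot"), the word-based splitting, and the file name are assumptions about the Kaggle CSV rather than the exact code.

```python
import pandas as pd

# Rough sketch of the chunking step. Column names ("Title", "Plot"), the
# 100-word chunk size, and the file name are assumptions, not the exact code.
def build_passages(csv_path, chunk_size=100):
    df = pd.read_csv(csv_path)
    records = []
    for _, row in df.iterrows():
        words = str(row["Plot"]).split()
        for start in range(0, len(words), chunk_size):
            chunk = " ".join(words[start:start + chunk_size])
            records.append({"title": row["Title"], "passage": chunk})
    return pd.DataFrame(records)

passages = build_passages("wiki_movie_plots.csv")   # hypothetical file name
print(passages.head())
```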

Next, let's write the code to support our training process. We will need a PyTorch DataLoader to iterate through our training examples, and we will also create a simple BERT-based neural network.

Code temporarily removed!
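With the original code omitted, the following is a hedged sketch of the two pieces described above: a Dataset that yields (title, passage) pairs for the DataLoader, and a simple BERT-based encoder that returns the [CLS] embedding. It assumes the `passages` DataFrame from the earlier chunking sketch; all class, column, and model names are illustrative.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel

class PassageDataset(Dataset):
    def __init__(self, frame):
        self.titles = frame["title"].tolist()       # used as the "query"
        self.passages = frame["passage"].tolist()

    def __len__(self):
        return len(self.passages)

    def __getitem__(self, idx):
        return self.titles[idx], self.passages[idx]

class BertEncoder(torch.nn.Module):
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.bert = AutoModel.from_pretrained(model_name)

    def forward(self, texts):
        batch = self.tokenizer(list(texts), padding=True, truncation=True,
                               max_length=128, return_tensors="pt")
        batch = {k: v.to(self.bert.device) for k, v in batch.items()}
        return self.bert(**batch).last_hidden_state[:, 0]   # [CLS] embeddings

loader = DataLoader(PassageDataset(passages), batch_size=16, shuffle=True)
```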

Now that we have a data object that loops through our training samples and a neural network, we need to define a loss function for the network to optimize, as well as a training loop that repeatedly asks the network to predict which query each plot chunk belongs to. Eventually, the trained network will be able to produce embeddings that can be clustered and then searched.

Code temporarily removed!

Code temporarily removed!
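As a stand-in for the removed code, here is a minimal sketch of what the training loop could look like. It uses two encoders (one for titles acting as queries, one for plot chunks), the in-batch cross-entropy loss sketched earlier, and illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def train(q_encoder, p_encoder, loader, epochs=1, lr=2e-5,
          device="cuda" if torch.cuda.is_available() else "cpu"):
    # One encoder for queries (titles) and one for passages, as described above.
    q_encoder.to(device)
    p_encoder.to(device)
    params = list(q_encoder.parameters()) + list(p_encoder.parameters())
    optimizer = torch.optim.AdamW(params, lr=lr)

    for epoch in range(epochs):
        for step, (titles, chunks) in enumerate(loader):
            q_emb = q_encoder(list(titles))          # (batch, dim)
            p_emb = p_encoder(list(chunks))          # (batch, dim)
            scores = q_emb @ p_emb.T                 # in-batch negatives
            targets = torch.arange(scores.size(0), device=scores.device)
            loss = F.cross_entropy(scores, targets)  # same loss as sketched earlier

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % 100 == 0:
                print(f"epoch {epoch} step {step} loss {loss.item():.4f}")

# e.g. train(BertEncoder(), BertEncoder(), loader)
```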

With that, we can call the train function to start training!

Visualizing the results through clustering

Recall that we talked about clustering. Let's visualize the clusters in 3-dimensional space using an untrained model and a trained one to see how well our network fared.

Code temporarily removed!
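With the visualization code removed, here is a sketch of one way to produce such a plot: embed every passage, reduce the embeddings to 3 dimensions with t-SNE, and color points by movie title. Function and column names follow the earlier sketches and are assumptions.

```python
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_clusters(encoder, frame, device="cuda" if torch.cuda.is_available() else "cpu"):
    encoder.to(device).eval()
    texts = frame["passage"].tolist()
    with torch.no_grad():
        # Embed the passages in small batches, then move everything to CPU.
        embs = torch.cat([encoder(texts[i:i + 32]).cpu()
                          for i in range(0, len(texts), 32)]).numpy()

    # Reduce to 3D for visualization and color each point by its movie title.
    coords = TSNE(n_components=3, init="random").fit_transform(embs)
    labels = frame["title"].astype("category").cat.codes.to_numpy()

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2], c=labels, s=5, cmap="tab20")
    plt.show()
```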

And here are the visualizations. On the left are the clusters obtained from an untrained DeBERTa model; the colors of the dots indicate the movie titles, since we are clustering the chunked plots of each movie. On the right are the clusters produced by a model trained for 4,600 batches. We can see a clear improvement in how the points are clustered!

However, one problem with naive cross-entropy is that it only pulls relevant passages towards the query and pushes non-relevant passages away from it. As a result, points of different colors (titles) can still end up close to each other at times. Additionally, in optimizing this objective, the network does not make full use of the embedding space, which leads to the same issue: not enough distance between unrelated points.

A way around this is a simple change to the loss function: apply the same idea to passage-to-passage similarity as well. Specifically, the problems discussed above can be illustrated as follows:

Extracted from Ren et al.'s paper, PAIR

Graphic a) on the left shows precisely what we discussed: the red square (a non-relevant passage) is clustered close to the blue square (a relevant passage). Graphic b) shows the solution introduced by the authors of the PAIR paper. Essentially, we need a regularization constraint that enforces a distance between passages that are not related. The loss function is as follows:

L_p = -log [ exp(h(q, p+)) / ( exp(h(q, p+)) + Σ_j exp(h(p+, p_j-)) ) ]   (the passage-centric loss)

Notice that this loss is similar to the one we used before, except that the summed term in the denominator now measures similarity between the positive passage and other passages.

L = L_q + α · L_p   (added onto the default loss, with L_q being the cross-entropy loss from before)

The passage-centric loss is added to the cross-entropy loss, with its importance weighted by a hyperparameter alpha; the best results are reported at α = 0.1. We will make this change as follows.

Code temporarily removed — to be updated
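The updated code is omitted above, so the sketch below shows one way the extra passage-centric term could be added, following the article's description (query-centric cross-entropy plus an alpha-weighted passage-centric term) rather than the exact PAIR implementation. The function name is illustrative.

```python
import torch
import torch.nn.functional as F

def pair_loss(q_emb, p_emb, alpha=0.1):
    # Query-centric term: standard in-batch cross-entropy, as before.
    qp_scores = q_emb @ p_emb.T                                  # (batch, batch)
    targets = torch.arange(qp_scores.size(0), device=q_emb.device)
    loss_q = F.cross_entropy(qp_scores, targets)

    # Passage-centric term: compare each positive passage against the other
    # passages in the batch, keeping the query-positive score as the "positive" logit.
    pp_scores = p_emb @ p_emb.T                                  # (batch, batch)
    pos = qp_scores.diagonal().unsqueeze(1)                      # sim(q_i, p_i)
    mask = ~torch.eye(pp_scores.size(0), dtype=torch.bool, device=p_emb.device)
    negatives = pp_scores.masked_select(mask).view(pp_scores.size(0), -1)
    logits = torch.cat([pos, negatives], dim=1)                  # positive logit first
    loss_p = F.cross_entropy(logits, torch.zeros_like(targets))

    return loss_q + alpha * loss_p   # alpha weights the passage-centric term
```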

Now, let's visualize the results using a model trained from scratch for 4600 batches with this new loss.

We immediately see an improvement. Clusters are packed closer together and the problem where points of different colors are close together is also slightly reduced. The space as a whole is also better used and the clusters are better spread out across the area.

Conclusion of part 1

To conclude, in this part 1 of my series on how to build a question answering and search system, I have briefly explained the intuition behind a simple neural network-based system and how it works. I have also introduced a slight modification to the default loss used for training such that the clustering is improved.

Lastly, the bare-minimum training illustrated above will not produce exceptionally robust models. If you would like to know how to further improve the performance of such models, do stay tuned.

If my article has helped you in any way, do give it a clap, and if you have any questions, please feel free to ask. You can also follow me if you would like to see more articles like this.
