Question Answering Systems. Why? What? How?
Why should I care?
Google Assistant, Alexa, Siri, Cortana and Bixby are a collection of feminine names (excluding the first creatively titled entry in the list) that refer to the virtual assistants that are on the market today. Perhaps you’ve asked your assistant to tell you the weather or to sing you a song, asked Google Translate to help you with your French essay, or asked Wolfram Alpha to plot you a graph. All of these technologies are applications of Question Answering (QA) systems.
In most cases, like an engine in a car, the QA system is only a part of the whole product. Just like there are petrol and diesel engines, there are various kinds of QA systems and we will discuss them below. This post is intended to be as accessible to both technical and non-technical audiences as possible, serving as an introduction to the subject. I aim for this to serve as a jumping-off point for those interested in the subject and will share further learning resources at the bottom of this article.
What is ‘Question Answering’?
Question answering is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language. There are two types of systems: closed-domain QA and open-domain QA.
Closed-domain QA is about building systems that answer questions from a specific domain, and the questions are usually restricted to being ‘descriptive’ in nature. So for example, say we used this article as the input to our closed-domain QA system: we could ask it ‘What are some popular names for virtual assistants?’, but we couldn’t ask it something like ‘Why are the author’s jokes so bad?’. The idea is that closed-domain QA systems are exposed to much smaller datasets from which to extract the answer, which naturally restricts the range of questions that can be asked (given that answers have to ‘come from somewhere’).
Open-domain QA, on the other hand, deals with questions about nearly anything and can rely on general ontologies and world knowledge. These systems usually have much more data available from which to extract the answer. They can answer questions like “What is the meaning of life?” and “Will you replace me in my job?” (and other existential types), and they are the types of systems that virtual assistants use.
It’s important to note that QA systems can be created to deal not only with textual data, but audio, images and video as well. It is the implementation of the system that defines its capabilities in dealing with various data types.
How can I build a Question Answering system?
For the sake of brevity let’s consider an implementation of a textual closed-domain Question Answering system using a Dynamic Memory Network (DMN). The reason for choosing the Dynamic Memory Network is twofold: when it was published it achieved state-of-the-art performance in various QA and NLP tasks, and the modular architecture of the DMN allows it to implement various kinds of QA systems. More concretely, it’s a ‘neural network based framework for general question answering tasks’ (more on this later).
What you see above is the high-level overview of the Dynamic Memory Network. It might look confusing, but we will break it all down step by step. There are 5 ‘modules’ to the DMN: Semantic Memory Module, Input Module, Question Module, Episodic Memory Module and Answer Module.
Input data is usually in the form of context-question-answer triplets, where the context is the text about which the question will be asked, the question is the question that is asked and the answer is the correct answer to the question. In the example above the context is the 8-sentence story (that you can see in the Input Module), the question is “Where is the football?” and the answer is “hallway”. Answers are only available during ‘training’ and validating/testing of the model, since the whole idea of the DMN is to generate the right answer for unseen datasets and questions.
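To make the triplet idea concrete, here is a minimal sketch of how one example could be stored in Python. The class name `QAExample` and the three-sentence story are my own illustrative assumptions (the story is not the one from the diagram), so treat this as one possible layout, not the format any particular dataset uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QAExample:
    """One context-question-answer triplet (hypothetical container)."""
    context: List[str]            # the sentences the question is asked about
    question: str                 # the question itself
    answer: Optional[str] = None  # only known for training/validation/test data


# A made-up story in the spirit of the football example.
train_example = QAExample(
    context=[
        "Mary picked up the football.",
        "Mary walked to the hallway.",
        "Mary dropped the football.",
    ],
    question="Where is the football?",
    answer="hallway",
)

# At prediction time the answer is unknown and is what the model must produce.
unseen_example = QAExample(context=train_example.context,
                           question="Where is Mary?")

print(train_example.answer)   # the label the model is trained to produce
print(unseen_example.answer)  # None: nothing to compare against yet
```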
Semantic Memory Module
For the most part machine learning algorithms can only work with numbers, and it is the job of the Semantic Memory Module to provide a mathematical representation of the input data. This is called word vectorization or word embedding. Let’s take the sentence “I ❤ cars, racing and pizza!” for example. To vectorize this sentence we first need to split it up into its parts (this is called ‘tokenization’); these ‘parts’ are called tokens. The tokens in our sentence are: “I”, “❤”, “cars”, “,”, “racing”, “and”, “pizza” and “!” (note this includes punctuation and symbols).
GloVe and word2vec are two of the most prominent word embedding methods that allow us to see the relationships between tokens. They’re called word embeddings since the vectors ‘embed’ the meanings of the words (common vector dimensions include: 50, 100, 200, 300). The dimensions themselves don’t have any interpretable meaning. However, using dimensionality reduction techniques (check out t-SNE and Principal Component Analysis) we can visualise the relationships between words.
King is to queen what a man is to a woman, sweet huh? Fascinating when you think that the main idea behind the unsupervised machine learning algorithms that create these word embeddings is “You shall know a word by the company it keeps” (John Rupert Firth, 1957).
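Tokenization like this can be sketched in a few lines of Python. The regular expression below is just one simple rule I picked for illustration (real tokenizers handle many more edge cases), and the integer ids are an assumed intermediate step before looking up vectors.

```python
import re


def tokenize(text):
    # \w+ grabs runs of letters/digits; [^\w\s] grabs any single character
    # that is neither a word character nor whitespace, so punctuation and
    # symbols such as the heart become tokens of their own.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("I ❤ cars, racing and pizza!")
print(tokens)

# Give every distinct token an integer id, a common first step towards
# turning text into numbers.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
ids = [vocab[token] for token in tokens]
print(ids)
```

Each id would then be used to look up that token's vector in an embedding table.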
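The king/queen relationship can be reproduced with simple vector arithmetic. The sketch below uses tiny hand-made 3-dimensional vectors that I invented purely for illustration; real GloVe or word2vec vectors are learned from text, have far more dimensions, and their dimensions are not hand-assigned like this.

```python
import math

# Made-up toy vectors, NOT real embeddings.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "pizza": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Answer 'a is to b what c is to ?' via vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(toy_vectors[a], toy_vectors[b], toy_vectors[c])]
    candidates = [w for w in toy_vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(toy_vectors[w], target))


print(analogy("man", "woman", "king"))  # queen
```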
Input Module
The purpose of the Input Module is to provide an embedded representation of the context. This means that the meanings of the words and sentences are captured by fixed-length vectors, called ‘facts’. A ‘fact’ is defined as an embedded representation of a sentence in a context. So if our context has 4 sentences, the Input Module will output 4 facts.
Before we get into the implementation of the Input Module, we need to get the basic intuition about neural networks (as mentioned above, this will be a very ‘bare bones introduction’). Neural networks are a subset of Machine Learning models whose architecture is a collection of ‘layers’ of nodes and weights. There are many types of neural networks and what you see below is called a ‘feed-forward neural network’. It is the simplest and most popular class of neural nets. The nodes and weights are numerical values that are multiplied together and propagated (‘fed-forward’) along the arrows in the graph. A function (called an ‘activation function’) is then applied to the sum of values entering a node to compute the final value for that node. The job of the activation function is to introduce the possibility for non-linear relationships to the network.
The input layer is where we enter our raw data points (x1, x2, x3 and so on) and the output layer outputs values from the network. The hidden layer contains values calculated from the prior layers; however, the values in the nodes of the hidden layer have no interpretable meaning. For each ‘sample’ of the dataset a new vector is output from the output layer ([y1, y2] above), and each calculation is independent of any other (relying only on the data from the input layer).
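Here is a rough sketch of that forward pass in plain Python, assuming 3 inputs, one hidden layer of 4 nodes and 2 outputs. The layer sizes, the random (untrained) weights and the choice of tanh as the activation function are all assumptions made for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the made-up weights are reproducible


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def dense(inputs, weights, biases, activation):
    # Each node: multiply inputs by weights, sum them up, add a bias,
    # then apply the activation function.
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


W_hidden, b_hidden = random_matrix(4, 3), [0.0] * 4
W_out, b_out = random_matrix(2, 4), [0.0] * 2


def forward(x):
    hidden = dense(x, W_hidden, b_hidden, math.tanh)   # non-linear hidden layer
    return dense(hidden, W_out, b_out, lambda z: z)    # linear output [y1, y2]


y = forward([0.5, -1.0, 2.0])
print(y)
```

Calling `forward` twice with the same input gives the same output, because nothing is carried over from one sample to the next.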
Recurrent Neural Networks (RNNs) are another class of neural networks (see the diagram below) that are more suited to sequential data. They differ from their ‘feed-forward’ counterparts by including the values output by the hidden state (shown as the green box with the letter A below) in the calculation for the next ‘timestep’ as well as data from the input layer (blue circles below).
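The difference from a feed-forward net is easiest to see in code. Below is a minimal sketch of a plain RNN, assuming tanh as the activation and random untrained weights; the names `W_x` and `W_h` and the sizes are my own choices for illustration.

```python
import math
import random

random.seed(1)
INPUT_SIZE, HIDDEN_SIZE = 3, 4


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


W_x = random_matrix(HIDDEN_SIZE, INPUT_SIZE)   # input -> hidden weights
W_h = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)  # previous hidden -> hidden weights


def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous state.
    return [math.tanh(a + b)
            for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]


def run_rnn(sequence):
    h = [0.0] * HIDDEN_SIZE  # empty state before the first timestep
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h)
        states.append(h)
    return states


a, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
print(run_rnn([a, b])[-1])
print(run_rnn([b, a])[-1])  # same inputs, different order, different final state
```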
For textual QA the Input Module is implemented using a more sophisticated type of RNN called a Gated Recurrent Unit (GRU). I will skip the details of the GRU here.
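For the curious, here is a compact sketch of one common formulation of the GRU update, followed by the way an Input Module could turn sentences into facts by keeping the hidden state at the end of each sentence. The weights and word vectors are random and untrained, bias terms are left out, the whitespace tokenizer is a shortcut, and gate conventions vary slightly between papers and libraries, so this is an illustration under those assumptions and not a reference implementation.

```python
import math
import random

random.seed(2)
EMBED_SIZE, HIDDEN_SIZE = 4, 3


def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical untrained parameters (biases omitted for brevity).
W_z, U_z = random_matrix(HIDDEN_SIZE, EMBED_SIZE), random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)
W_r, U_r = random_matrix(HIDDEN_SIZE, EMBED_SIZE), random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)
W_h, U_h = random_matrix(HIDDEN_SIZE, EMBED_SIZE), random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)


def gru_step(x_t, h_prev):
    # Update gate: how much of the new candidate to let in.
    z = [sigmoid(a + b) for a, b in zip(matvec(W_z, x_t), matvec(U_z, h_prev))]
    # Reset gate: how much of the old state to use when building the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(W_r, x_t), matvec(U_r, h_prev))]
    candidate = [math.tanh(a + r_i * b)
                 for a, r_i, b in zip(matvec(W_h, x_t), r, matvec(U_h, h_prev))]
    # Blend the candidate with the previous state.
    return [z_i * c + (1.0 - z_i) * h for z_i, c, h in zip(z, candidate, h_prev)]


embeddings = {}


def embed(token):
    # Stand-in for a GloVe/word2vec lookup: a random vector per new token.
    if token not in embeddings:
        embeddings[token] = [random.uniform(-1, 1) for _ in range(EMBED_SIZE)]
    return embeddings[token]


def encode_facts(sentences):
    h = [0.0] * HIDDEN_SIZE
    facts = []
    for sentence in sentences:
        for token in sentence.lower().split():
            h = gru_step(embed(token), h)
        facts.append(h)  # hidden state at the end of a sentence = one fact
    return facts


facts = encode_facts(["Mary picked up the football .", "Mary walked to the hallway .",
                      "Mary dropped the football .", "Mary went to the garden ."])
print(len(facts))  # 4 sentences in, 4 facts out
```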
Question Module
Similarly to the Input Module, the Question Module takes in questions and returns a question embedding vector. Given that questions are usually one sentence long, there is only one question embedding vector. Once again, a GRU is used to produce the output.
Episodic Memory Module
This is where things get interesting. The high-level idea of the Episodic Memory Module is to iterate over the facts returned by the Input Module (in light of the question, represented by the output of the Question Module) multiple times, in order to output a memory. This memory will be passed to the Answer Module to produce an answer. The Episodic Memory Module chooses which facts to focus on through the attention mechanism (by computing scalar attention scores, which you can see in the DMN diagram). It produces a “memory” vector representation for each iteration, taking into account the question as well as the previous memory. Only the final memory is passed to the Answer Module.
Attention and memory are huge topics and the magic of the DMN (giving it its name). It is those memory iterations in the network that allow it to exhibit ‘transitive reasoning’-like capabilities. There are various implementations and definitions of attention mechanisms: soft attention, GRU-based attention and so on.
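The loop described above can be sketched with plain soft attention. To keep it short I have swapped in dot-product scoring and a simple tanh memory update; these are simplifications I chose for illustration, not the gating network and GRU-based update described in the DMN paper. Starting the memory at the question vector is likewise an assumption of this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def episodic_memory(facts, question, passes=2):
    memory = list(question)  # assumption: start from the question vector
    weights = []
    for _ in range(passes):
        # One scalar attention score per fact, given question and current memory.
        scores = [dot(f, question) + dot(f, memory) for f in facts]
        weights = softmax(scores)
        # The 'episode' is the attention-weighted sum of the facts.
        episode = [sum(w * f[i] for w, f in zip(weights, facts))
                   for i in range(len(question))]
        # Simplified memory update combining the old memory and the new episode.
        memory = [math.tanh(m + e) for m, e in zip(memory, episode)]
    return memory, weights


toy_facts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up 2-d facts
toy_question = [1.0, 0.0]
final_memory, final_weights = episodic_memory(toy_facts, toy_question)
print(final_weights)  # the first fact, most similar to the question, weighs most
print(final_memory)   # only this vector would go on to the Answer Module
```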
Answer Module
Finally, the Answer Module generates an answer given the final memory output vector from the Episodic Memory Module. Once again there are many kinds of answer module implementations. For single-word answers, a single-layer feed-forward neural net is a viable option; however, a GRU is recommended for multi-word predictions.
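For the single-word case, the Answer Module can be sketched as one layer of weights followed by a softmax over a vocabulary. The four-word vocabulary, the memory size and the random untrained weights below are assumptions made for the example, so the predicted word here is meaningless until the weights are trained.

```python
import math
import random

random.seed(3)
VOCAB = ["hallway", "kitchen", "garden", "bedroom"]  # made-up answer vocabulary
MEMORY_SIZE = 3

# One row of weights per vocabulary word (hypothetical, untrained).
W_answer = [[random.uniform(-1, 1) for _ in range(MEMORY_SIZE)] for _ in VOCAB]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer(memory):
    # Single feed-forward layer: one score per word, turned into probabilities.
    logits = [sum(w * m for w, m in zip(row, memory)) for row in W_answer]
    probs = softmax(logits)
    best = max(range(len(VOCAB)), key=lambda i: probs[i])
    return VOCAB[best], probs


word, probs = answer([0.2, -0.5, 0.9])
print(word, probs)
```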
It is this modularity that allows the DMN to be used as a framework for general Question Answering.
Final Dynamic Memory Network Remarks
As I mentioned above, the modular architecture of the DMN allows you to change out modules (as opposed to building completely new systems) to allow the DMN to solve any NLP task that can be formulated as a Question Answering problem.
A few examples of technologies/problems that can be treated as QA problems:
- Neural Machine Translation (Google Translate)
Q: “What is ‘Hello my name is Lukas’ in French?”
A: “Bonjour, je m’appelle Lukas”
- Sentiment Analysis
Q: “What is the sentiment of ‘DMN is awesome’?”
A: “Positive”
- Named Entity Recognition/Extraction (picture)
Q: “What are the named entities and their groups?”
A: *answer in picture*
Examples are numerous and obviously there are limitations. However, the DMN is a very capable and recent development (2016).
Hopefully you now have an appreciation for the applicability of QA systems in modern technology, and a working understanding of what QA systems are and how they might be implemented. Maybe you’ve become inspired to experiment more with these technologies (ask yourself ‘why am I not yet able to have a continuous discussion with Google Assistant?’) and perhaps even find new applications for them.
Learning materials and other useful links
Stanford NLP and Deep Learning course
Prerequisites: linear algebra and some multivariate calculus
Includes: Word embeddings (word2vec, GloVe), Neural Networks (Feed-forward, various kinds of RNNs, Convolutional neural nets), Neural Machine Translation, Dynamic Memory Networks and much more.
My closed-domain Question Answering system project (GitHub repo)
Includes: Full code (in Jupyter Notebook), detailed documentation and project final report. If you enjoyed reading this and want to get into the details of the implementation, then this is the link for you.
Dynamic Memory Network (research paper, full meat version)
Prerequisites: Linear Algebra, training/testing of Machine Learning Models, knowledge of Neural Networks (Feed-forward, RNNs)
Includes: Everything you might need to know about Dynamic Memory Networks, implementation details, performance figures. Highly recommended; this was one of my key resources when I was building a QA system.
Dynamic Memory Network (YouTube talk from one of the founders)
Prerequisites: This varies; he covers it quite nicely, so there’s a lot to take away even if you know nothing.
Includes: Overview of DMN.