Implementing QANet (Question Answering Network) with CNNs and self-attention

In this post, we will tackle one of the most challenging yet interesting problems in Natural Language Processing: question answering. We will implement Google’s QANet in TensorFlow. Just like its machine translation counterpart, the Transformer network, QANet doesn’t use RNNs at all, which makes it faster to train and evaluate.

I’m assuming that you already have some knowledge of Python and TensorFlow.

Question answering is a field in computer science that has seen rapid progress in the past few years. A classic example is IBM’s Watson competing on the famous quiz show Jeopardy! in 2011, facing off against the legendary champions Brad Rutter and Ken Jennings and taking the first place prize.

In this post, we will focus on open-domain reading comprehension, where the questions can come from any domain, ranging from American pop stars to abstract concepts. Reading comprehension is a type of question answering where we are given a paragraph and asked questions specifically chosen to be answerable from that paragraph.

IBM Watson competing against Ken Jennings (left) and Brad Rutter (right) at Jeopardy! in 2011. Source: https://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/

Dataset (SQuAD)

The dataset we will be using for this post is the Stanford Question Answering Dataset (SQuAD). SQuAD has some problems, which we will come back to; it is arguably not the best dataset for machine reading comprehension, but it is the most widely studied one. If you are curious about other datasets available for reading comprehension, also check out this Awesome list of NLP datasets.

One of the differentiating factors of SQuAD is that the answer to the question lies within the paragraph itself. Here is an example of the SQuAD format.

An example of Stanford Question Answering Dataset. Source: https://rajpurkar.github.io/SQuAD-explorer/

As we can see from the example above, no matter how easy or difficult the question may be, the answer to that question always lies within the paragraph itself.

But first, let’s have a look at the kinds of questions and answers we expect to solve. More often than not, the questions themselves are paraphrased versions of segments of the paragraph. For example,

P: “Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.”
Q: “What is the term for a task that generally lends itself to being solved by a computer?”
A: “computational problem”

From the overlap between the question and the paragraph (“a task that is in principle amenable to being solved by a computer” versus “a task that generally lends itself to being solved by a computer”), it is clear that the question is paraphrased from the paragraph. This property makes the SQuAD task inherently easier than open-domain question answering: all we need to do is find the sentence in the paragraph that semantically matches the question and extract the common semantic or syntactic factor from the context. Although we still need to solve that semantic and syntactic matching, this is much easier than deducing the answer from a vocabulary that may have tens of thousands of words.

Downside of SQuAD

The property mentioned above lets us use a few tricks to easily predict answers from a given paragraph. However, it also introduces some problems for models trained on SQuAD. Because the models rely heavily on finding the correct sentence in the paragraph, they are vulnerable to adversarial sentences inserted into the paragraph that resemble the question but are designed to fool the network. Here is an example,

Adversarial example of SQuAD. Source: https://arxiv.org/pdf/1707.07328.pdf

The sentence highlighted in blue is the adversarial example inserted to fool the network. To human readers, it doesn’t change the answer to the question “What is the name of the quarterback who was 38 in Super Bowl XXXIII?”, since the adversarial sentence is talking about Champ Bowl XXXIV. To the network, however, the adversarial sentence aligns better with the question than the ground-truth sentence does.

The Model Network (QANet)

The reason behind our choice of QANet is simple: thanks to its straightforward architecture, it is easy to implement and trains faster than most other networks for the same task. The QANet architecture is shown in the figure below:

The network architecture overview. Source: https://openreview.net/pdf?id=B14TlG-RW

The model network can be separated into roughly three sections:

  1. Embedding
  2. Encoder
  3. Attention

Embedding is where the text inputs (paragraphs and questions) are converted into representations in the form of dense, low-dimensional vectors. This is done with an approach similar to the character-aware language modelling paper shown below.

Character aware language modelling. Source: https://arxiv.org/pdf/1508.06615.pdf

Our approach is very similar. The only difference is that we use a fixed kernel size of 5 for our convolutional filter. We also concatenate the word representation with the max-pooled character representation before feeding them to the highway network.

The encoder is the basic building block of the model. The details of the encoder block can be seen on the right side of the figure above. The encoder consists of positional encoding, layer normalization, depthwise separable 1D convolution, self-attention, and feed-forward layers.

Finally, the attention layer is the core building block of the network, where the fusion between question and paragraph occurs. QANet uses the trilinear attention function introduced in the BiDAF paper.

Let’s get started!

Implementation

For simplicity, we skip the data processing step and jump straight into the neural network.

Embedding

First, we define the input placeholders. Once the placeholders are defined, we embed the word inputs with word embeddings and the character inputs with character embeddings.
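The code gists from the original post are not reproduced here, so below is a minimal sketch of what the placeholders and embedding lookups might look like in TensorFlow 1.x. All of the sizes (batch size, paragraph/question/character limits, vocabulary and embedding dimensions) are illustrative assumptions, not the exact values used in the paper.

```python
import tensorflow as tf

# Illustrative sizes -- the real values depend on your preprocessing.
batch_size, para_limit, ques_limit, char_limit = 32, 400, 50, 16
word_vocab, char_vocab, word_dim, char_dim = 90000, 1400, 300, 64

# Word- and character-level inputs for the paragraph (context) and the question.
context_words = tf.placeholder(tf.int32, [batch_size, para_limit], name="context_words")
question_words = tf.placeholder(tf.int32, [batch_size, ques_limit], name="question_words")
context_chars = tf.placeholder(tf.int32, [batch_size, para_limit, char_limit], name="context_chars")
question_chars = tf.placeholder(tf.int32, [batch_size, ques_limit, char_limit], name="question_chars")

# Ground-truth start / end indices of the answer span, used later for the loss.
y_start = tf.placeholder(tf.int32, [batch_size], name="answer_start")
y_end = tf.placeholder(tf.int32, [batch_size], name="answer_end")

# Embedding matrices. In practice the word matrix would be initialised from
# pre-trained GloVe vectors and kept frozen; the character matrix is trained.
word_mat = tf.get_variable("word_mat", [word_vocab, word_dim], trainable=False)
char_mat = tf.get_variable("char_mat", [char_vocab, char_dim])

c_word = tf.nn.embedding_lookup(word_mat, context_words)    # [B, P, word_dim]
q_word = tf.nn.embedding_lookup(word_mat, question_words)   # [B, Q, word_dim]
c_char = tf.nn.embedding_lookup(char_mat, context_chars)    # [B, P, C, char_dim]
q_char = tf.nn.embedding_lookup(char_mat, question_chars)   # [B, Q, C, char_dim]
```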

Then we run the character embeddings through a one-layer 1-dimensional convolutional neural network followed by max-pooling, concatenate the word and character representations, and finally pass them through a two-layer highway network. The reason we pass a “reuse” argument to the “conv” and “highway” functions is that we want to use the same network for BOTH the paragraph and the question. “conv” and “highway” are our TensorFlow implementations of a convolutional layer and a highway network. (The source code for the conv and highway functions will be made available soon.)
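In the meantime, here is a rough sketch of how “conv” and “highway” might look, building on the placeholders above. The kernel size of 5 matches the text; the hidden size of 128 and the exact layer shapes are my assumptions.

```python
def conv(inputs, filters, kernel_size=5, scope="conv", reuse=None):
    """One 1-D convolution over the time axis."""
    with tf.variable_scope(scope, reuse=reuse):
        return tf.layers.conv1d(inputs, filters, kernel_size, padding="same",
                                activation=tf.nn.relu, name="conv1d")

def highway(x, num_layers=2, scope="highway", reuse=None):
    """Highway network: y = g * H(x) + (1 - g) * x, applied num_layers times."""
    with tf.variable_scope(scope, reuse=reuse):
        size = x.get_shape().as_list()[-1]
        for i in range(num_layers):
            gate = tf.layers.dense(x, size, activation=tf.sigmoid, name="gate_%d" % i)
            hidden = tf.layers.dense(x, size, activation=tf.nn.relu, name="hidden_%d" % i)
            x = gate * hidden + (1.0 - gate) * x
        return x

d = 128  # hidden size used throughout the model

# Character representation: convolve over the characters of each word (kernel size 5),
# then max-pool over the character axis to get one vector per word.
c_char_flat = tf.reshape(c_char, [batch_size * para_limit, char_limit, char_dim])
q_char_flat = tf.reshape(q_char, [batch_size * ques_limit, char_limit, char_dim])
c_char_conv = conv(c_char_flat, d, scope="char_conv")
q_char_conv = conv(q_char_flat, d, scope="char_conv", reuse=True)  # same weights for question
c_char_rep = tf.reshape(tf.reduce_max(c_char_conv, axis=1), [batch_size, para_limit, d])
q_char_rep = tf.reshape(tf.reduce_max(q_char_conv, axis=1), [batch_size, ques_limit, d])

# Concatenate word + char representations and run the shared two-layer highway network.
c_emb = highway(tf.concat([c_word, c_char_rep], axis=-1), scope="highway")
q_emb = highway(tf.concat([q_word, q_char_rep], axis=-1), scope="highway", reuse=True)
```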

We feed the outputs of the embedding layer into the encoder layer to generate the corresponding context and question representations. “residual_block” implements positional encoding -> layer normalization -> depthwise separable convolution -> self-attention -> feed-forward network. (The source code for the residual block will be available soon.)
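In the same spirit, here is a simplified, single-head sketch of what “residual_block” might contain. The real implementation uses multi-head attention and stacks several blocks; this version only illustrates the positional encoding -> layer norm -> depthwise separable convolution -> self-attention -> feed-forward pattern, with a residual connection around each sub-layer.

```python
layer_norm = tf.contrib.layers.layer_norm  # layer normalization from tf.contrib

def add_timing_signal(x):
    """Sinusoidal positional encoding, as in the Transformer."""
    length = tf.shape(x)[1]
    channels = x.get_shape().as_list()[-1]          # assumed even in this sketch
    position = tf.cast(tf.range(length), tf.float32)
    num_timescales = channels // 2
    log_inc = tf.log(10000.0) / (num_timescales - 1)
    inv_timescales = tf.exp(tf.cast(tf.range(num_timescales), tf.float32) * -log_inc)
    scaled = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
    signal = tf.concat([tf.sin(scaled), tf.cos(scaled)], axis=1)
    return x + tf.expand_dims(signal, 0)

def depthwise_separable_conv(x, filters, kernel_size=7, scope="dsc", reuse=None):
    """Depthwise separable 1-D convolution, built from 2-D primitives."""
    with tf.variable_scope(scope, reuse=reuse):
        in_ch = x.get_shape().as_list()[-1]
        x = tf.expand_dims(x, axis=2)                                    # [B, T, 1, C]
        depth_filter = tf.get_variable("depth_filter", [kernel_size, 1, in_ch, 1])
        point_filter = tf.get_variable("point_filter", [1, 1, in_ch, filters])
        x = tf.nn.depthwise_conv2d(x, depth_filter, [1, 1, 1, 1], "SAME")
        x = tf.nn.conv2d(x, point_filter, [1, 1, 1, 1], "SAME")
        return tf.nn.relu(tf.squeeze(x, axis=2))

def self_attention(x, scope="self_attention", reuse=None):
    """Single-head scaled dot-product self-attention (the paper uses multi-head)."""
    with tf.variable_scope(scope, reuse=reuse):
        dim = x.get_shape().as_list()[-1]
        q = tf.layers.dense(x, dim, name="q")
        k = tf.layers.dense(x, dim, name="k")
        v = tf.layers.dense(x, dim, name="v")
        weights = tf.nn.softmax(tf.matmul(q, k, transpose_b=True) / (dim ** 0.5))
        return tf.matmul(weights, v)

def residual_block(x, num_conv=4, kernel_size=7, scope="encoder_block", reuse=None):
    """Positional encoding -> [layer norm -> depthwise separable conv] x num_conv
    -> layer norm -> self-attention -> layer norm -> feed-forward, each residual."""
    with tf.variable_scope(scope, reuse=reuse):
        x = add_timing_signal(x)
        dim = x.get_shape().as_list()[-1]
        for i in range(num_conv):
            x = x + depthwise_separable_conv(layer_norm(x, scope="ln_conv_%d" % i),
                                             dim, kernel_size, scope="conv_%d" % i)
        x = x + self_attention(layer_norm(x, scope="ln_att"))
        y = tf.layers.dense(layer_norm(x, scope="ln_ffn"), dim,
                            activation=tf.nn.relu, name="ffn_1")
        return x + tf.layers.dense(y, dim, name="ffn_2")

# Project the embeddings down to d, then encode; the same block (shared weights)
# is used for both the context and the question.
c = residual_block(conv(c_emb, d, kernel_size=1, scope="input_proj"),
                   scope="embedding_encoder")
q = residual_block(conv(q_emb, d, kernel_size=1, scope="input_proj", reuse=True),
                   scope="embedding_encoder", reuse=True)
```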

Now that we have the context and question representations, we fuse them together using an attention function called trilinear attention. The fused output carries rich information about the context with respect to the question. We concatenate the context-to-question and question-to-context information along with the context itself and pass it on as input to the next encoder layer.
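A possible implementation of this context-query attention is sketched below: it builds the trilinear similarity f(c_i, q_j) = w_c·c_i + w_q·q_j + w_cq·(c_i ⊙ q_j) from the BiDAF paper without materialising the full concatenated tensor, then forms the context-to-question and question-to-context summaries. The variable names are mine, not the original code’s.

```python
def trilinear_attention(c, q, scope="context_query_attention", reuse=None):
    """Context-query attention with the trilinear similarity function from BiDAF."""
    with tf.variable_scope(scope, reuse=reuse):
        dim = c.get_shape().as_list()[-1]
        n = c.get_shape().as_list()[1]   # context length
        m = q.get_shape().as_list()[1]   # question length

        w_c = tf.get_variable("w_c", [dim, 1])
        w_q = tf.get_variable("w_q", [dim, 1])
        w_cq = tf.get_variable("w_cq", [1, 1, dim])

        # Similarity matrix S of shape [B, n, m], assembled term by term instead of
        # building the full [B, n, m, 3 * dim] concatenation.
        part_c = tf.tile(tf.tensordot(c, w_c, axes=[[2], [0]]), [1, 1, m])
        part_q = tf.transpose(tf.tile(tf.tensordot(q, w_q, axes=[[2], [0]]), [1, 1, n]),
                              [0, 2, 1])
        part_cq = tf.matmul(c * w_cq, q, transpose_b=True)
        S = part_c + part_q + part_cq

        S_row = tf.nn.softmax(S)                               # attend over question words
        S_colT = tf.nn.softmax(tf.transpose(S, [0, 2, 1]))     # attend over context words
        c2q = tf.matmul(S_row, q)                              # context-to-question, [B, n, dim]
        q2c = tf.matmul(tf.matmul(S_row, S_colT), c)           # question-to-context, [B, n, dim]

        # Concatenate the context with both attention signals, as described in the paper.
        return tf.concat([c, c2q, c * c2q, c * q2c], axis=-1)
```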

Finally, we have the output layer, which takes the attention output and encodes it into dense vectors. This is where the SQuAD property comes in useful. Because we know the answer lies somewhere inside the paragraph, for each word in the paragraph we only need to calculate the probability of it being part of the answer. In practice, we calculate two probabilities: the probability that the word is the start of the answer span and the probability that it is the end of the answer span. This way, we don’t need to figure out what the answer might be from a large vocabulary, and the probabilities can be computed efficiently.
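Here is a sketch of that output layer, continuing from the code above. The paper runs the attention output through three stacks of seven model-encoder blocks each; a single shared block per stack is used here to keep the sketch short, so treat the structure rather than the hyperparameters as the takeaway.

```python
# Project the attention output (dimension 4 * d) back to d and pass it through
# the model-encoder stacks to get M0, M1, M2.
attention_out = trilinear_attention(c, q)
m0 = residual_block(conv(attention_out, d, kernel_size=1, scope="model_proj"),
                    scope="model_encoder")
m1 = residual_block(m0, scope="model_encoder", reuse=True)
m2 = residual_block(m1, scope="model_encoder", reuse=True)

# Start / end pointers: one linear layer over [M0; M1] and one over [M0; M2].
start_logits = tf.squeeze(
    tf.layers.dense(tf.concat([m0, m1], axis=-1), 1, name="start_pointer"), -1)
end_logits = tf.squeeze(
    tf.layers.dense(tf.concat([m0, m2], axis=-1), 1, name="end_pointer"), -1)

p_start = tf.nn.softmax(start_logits)  # [B, n]: probability of each word starting the answer
p_end = tf.nn.softmax(end_logits)      # [B, n]: probability of each word ending the answer

# At inference time we pick the span (i, j) with i <= j maximising p_start[i] * p_end[j];
# an outer product plus a band mask (here limiting answers to 15 words) does the trick.
span_scores = tf.matmul(tf.expand_dims(p_start, 2), tf.expand_dims(p_end, 1))  # [B, n, n]
span_scores = tf.matrix_band_part(span_scores, 0, 15)
pred_start = tf.argmax(tf.reduce_max(span_scores, axis=2), axis=1)
pred_end = tf.argmax(tf.reduce_max(span_scores, axis=1), axis=1)
```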

That’s it!

Training and demo

QANet trains relatively quickly compared to other RNN-based models. Compared to the popular BiDAF network, QANet trains roughly 5 to 6 times faster with better performance. We train the network for 60,000 global steps, which takes around 6 hours on a GTX 1080 GPU.
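For completeness, here is a minimal training sketch built on the graph above: the loss is the sum of the cross-entropies of the start and end pointers, optimised with Adam. The batching function `next_batch()` is a stand-in for your own data pipeline, and the learning-rate schedule is simplified compared to the paper’s warm-up scheme.

```python
# Loss: start- and end-pointer cross-entropies. The Adam settings (beta1 = 0.8,
# beta2 = 0.999, epsilon = 1e-7, learning rate 1e-3) follow the paper, but the
# warm-up over the first 1000 steps is omitted here.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=start_logits, labels=y_start) +
    tf.nn.sparse_softmax_cross_entropy_with_logits(logits=end_logits, labels=y_end))

global_step = tf.train.get_or_create_global_step()
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3, beta1=0.8, beta2=0.999, epsilon=1e-7)
train_op = optimizer.minimize(loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(60000):
        # next_batch() is hypothetical: it should return a feed_dict for the
        # placeholders defined earlier from your preprocessed SQuAD data.
        _, loss_val = sess.run([train_op, loss], feed_dict=next_batch())
        if step % 1000 == 0:
            print("step %d, loss %.4f" % (step, loss_val))
```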

Visualizing results in Tensorboard. The top plots are the devset results and the bottom are the training results. “em” is exact match, “f1” is F1 score.

That was quite a minimalistic walkthrough, but I hope it helped you understand question answering with neural networks!

Thanks for reading, and please leave questions or feedback in the comments!