AI Outperforms Humans in Question Answering

Review of three winning SQuAD systems

During the past few months, you may have come across news and media coverage such as the following:

In this article, we look at three systems that beat humans on the SQuAD dataset and explore the machine learning technologies that enabled their improved performance in automatic question answering and machine comprehension. Based on a detailed comparison of these models, we evaluate how well they address real-world consumer needs in the context of online self-service support and digital personal assistance.

We conclude that the field of automatic question answering has matured enough to handle factual questions. However, further improvements are needed to address consumer needs for descriptive answers, how-to guides, troubleshooting, and other types of requests that require complex reasoning rather than simple single-word or span answers.

What is Automated Question Answering?

Automated Question Answering (AQA) and machine comprehension (MC) have recently gained strong momentum with advances in Deep Learning, which has become an essential tool for NLP (Natural Language Processing) and NLU (Natural Language Understanding). Intelligent personal assistants like Apple’s Siri and Google Assistant are becoming an indispensable part of the user experience, with natural language interfaces enabling users to get answers to their questions and delegate various tasks to AI-powered software.

To support the development of state-of-the-art Machine Learning (ML) models for AQA and MC, a number of large datasets were created. These include Stanford’s SQuAD for automated question answering, MS MARCO for real-world question answering, TriviaQA for complex compositional answers and multi-sentence reasoning, the CNN/Daily Mail and Children’s Book Test datasets for cloze-style reading comprehension, and many more. Until very recently, however, the existing models failed to outperform human benchmarks in reading comprehension and question answering.

Then, at the beginning of 2018, we witnessed a dramatic breakthrough: ML models independently developed by Microsoft and Alibaba managed, by a small margin, to beat human performance on the SQuAD dataset developed by Stanford NLP Lab.

Within a month, others followed suit. Models from the Joint Laboratory of HIT and iFLYTEK Research and from Microsoft Research Asia & NUDT reached new heights in AQA. Although these models have not yet beaten human performance on all metrics (they still trail on F1), the pace of innovation in AQA is stunning.

SQuAD Leaderboard as of March 25, 2018

How does Automated Question Answering work?

Technically speaking, AQA models predict the best answer for a query (Q), given a passage (P) or a set of passages that contain the answer to that query. The task of an AQA model is to predict the best candidate answers by studying the passage and query interactively and evaluating various contextual relationships between them. With the exponential growth of Big Data and web documents, open-domain AQA that answers questions based on a large collection of documents is becoming the de facto standard in the field. This replaces earlier models, which used closed-domain question answering through custom-built ontologies dealing with a narrow segment of knowledge. Recently, a number of AI and ML technologies have powered rapid advances in open-domain AQA, the most important among them being RNNs and attention-based neural networks.


AQA models use various memory-based neural frameworks like RNNs (Recurrent Neural Networks) and their variants, such as LSTMs. RNNs are networks designed to deal with sequential information, such as sentences where inputs are tightly coupled and/or depend on each other. By storing various parts of the sequence in the network’s memory, RNNs can model the contextual relationship between words, phrases, and sentences to enable better translation, information retrieval, and machine comprehension. RNNs and their variants have powered the ‘encoder-interaction-pointer’ framework underlying most contemporary AQA models. In this framework, word sequences of both the query (question) and the context (passage) are projected into distributed representations and encoded by recurrent neural mechanisms. The attention mechanism is then used to model the complex interaction between the query and the context. A pointer network may then be employed to predict the boundary of the answer.

Attention-Based Neural Networks

To infer relationships from words that form a sequence, an ML model must convert words and their characters into embeddings, which are vectors of real numbers. These vectors capture lexical proximity between words and phrases in the multi-dimensional language space. This approach, however, struggles with long sentences, which are hard to compress into a single vector of fixed length. An alternative approach to natural language representation was proposed by Bahdanau, Cho, and Bengio (2016) for machine translation. In this approach, each time the model generates a word in a translation, it soft-searches for (attends to) a set of positions in the source sentence where the most relevant information might be concentrated.

For example, in Figure 1, we see that while translating from French to English, the model attends sequentially to each input state, but it attends to two words simultaneously when translating “la Syrie” to “Syria”.

Figure 1: Attention Mechanism for Machine Translation. From the original paper by Bahdanau, Cho, and Bengio (2016).

This idea is roughly analogous to how humans scan a text to find an answer. Human attention, essentially, focuses on certain parts of the input sentence and context one at a time. For example, when we read a text, we focus on the relevant paragraph and then on the relevant sentence, continuously refining the results of our search.

More formally, after searching for the most relevant places in the text, the model predicts a target word based on the context vectors associated with these source positions and all previously generated target words. In this way, there is no need to encode a whole input sentence into a single fixed-length vector. Instead, the model just encodes the input sentence into a sequence of vectors and then selects a subset of these vectors adaptively while decoding the answer. This allows the model to better handle long sentences, and even passages, while retrieving better contextual information that can be relevant to the question.
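The soft-search step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the model from the paper: relevance is scored with a plain dot product (the original work learns its scoring function), and the toy vectors and the names `softmax` and `attend` are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step.

    Assumption: relevance is the dot product between the decoder state
    and each encoder state; the original paper learns this score.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context vector: attention-weighted sum of the encoder states.
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

# Toy source sentence: three positions, each encoded as a 2-d vector.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 2.0], encoder_states)
```

The weights are recomputed at every decoding step, so the model keeps the whole sequence of source vectors and picks a different mixture of them for each output word.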

Overview of the SQuAD Dataset and the Task

The three winning AQA models were trained on the Stanford Question Answering Dataset (SQuAD) proposed by Rajpurkar et al. (2016) at the Stanford NLP Lab. The dataset is a high-quality collection of 100,000+ questions posed by crowd-workers on a set of randomly collected Wikipedia articles, where the answer to each question is a word or a text span from the corresponding reading passage.

Human and machine performance on the dataset is assessed by two metrics: Exact Match (EM) and F1 score. EM measures the percentage of predictions that match one of the ground truth answers exactly. F1 measures the average overlap between the prediction and the ground truth answer: both are treated as bags of tokens, and the best F1 over all ground truth answers for a question is kept. An F1 of 100% therefore means that every prediction contains exactly the tokens of one of the ground truth answers. As of March 25, 2018, the benchmark human performance on the dataset was 82.304 for EM and 91.221 for F1.
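To make the two metrics concrete, here is a simplified sketch of how they can be computed for a single question. It is written for illustration and is not the official evaluation script; the normalization rules (lowercasing, dropping punctuation and English articles) and the helper names are assumptions of this sketch.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def f1(prediction, truth):
    """Token-level F1 between a prediction and one ground truth answer."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

def best_over_truths(metric, prediction, truths):
    """Score against every ground truth answer and keep the best score."""
    return max(metric(prediction, t) for t in truths)

truths = ["Denver Broncos", "the Broncos"]
em = best_over_truths(exact_match, "The Denver Broncos", truths)
f1_partial = best_over_truths(f1, "Broncos team", truths)
```

In this toy case the first prediction is an exact match after normalization, while the second earns partial F1 credit for the shared token. The dataset-level scores are the averages of these per-question values.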

One of the main benefits of SQuAD is that it is large enough for data-intensive model training. Also, unlike many other datasets, which are semi-synthetic and do not share the same characteristics as explicit reading comprehension questions, SQuAD captures a variety of question types that can be posed. Additional distinctive features of the dataset that might have affected the architecture of the winning models include:

  • SQuAD questions do not require commonsense reasoning or reasoning across multiple sentences.
  • SQuAD involves a span constraint that limits the scope of the answer to a single word or phrase in the passage. This is beneficial because span-based answers are easier to evaluate than free-form answers.
  • Crowd-workers were encouraged to ask questions in their own words, without copying word phrases from the paragraph, to allow for the syntactic diversity of questions regarding paragraph sentences.
  • The authors of SQuAD hypothesized that model performance will worsen with increasing complexity of answer types and with the growing syntactic divergence between the question and the sentence containing the answer.

In general, the SQuAD questions require competing models to account for a number of complex comprehension tasks and contexts, such as the difficulty of questions asked in terms of the type of reasoning required to answer them and the degree of syntactic divergence between the question and answer sentences. The proposed models thus need to account for lexical variations, such as synonymy and world knowledge, and syntactic variations. All three models managed to successfully cope with these challenges. We discuss them in more detail below.

I Hybrid AoA (Attention-over-Attention) Reader

Institution: Joint Laboratory of HIT and iFLYTEK, iFLYTEK Research
Authors: Yiming Cui et al.
Performance: EM (82.482), F1 (89.281)
SQuAD Rank: 1

AoA Reader was originally designed to solve cloze-style reading comprehension tasks, which differ from automated question answering. The original model was trained on the CNN/Daily Mail dataset, a compilation of news articles crawled from the web. The main body of each news article is regarded as the Document, and the Query is formed from the summary of the article, with one entity word replaced by a special placeholder that indicates the missing word. This initial setup was extended to address AQA and to be compatible with the SQuAD dataset.

The main idea of the AoA Reader is to create an additional layer of ‘attended attention’ over standard individual attentions, one that involves not only query-to-document attentions but also document-to-query attentions. By merging these attentions, the model introduces a deeper layer of contextual understanding that secures better performance on all metrics. Thus, the discussed model extends a standard attention-based neural network model with the attention-over-attention mechanism (AoA). According to the authors of the Reader, this solution has the following benefits:

  • In contrast to models with complex architectures or many non-trainable hyper-parameters, the AoA Reader is much simpler while also being more efficient.
  • The AoA model performs well on large documents, especially when the document’s length exceeds 700 words. In fact, the improvements of the model become more tangible on longer documents.
  • The model performs well even if the right answer is less frequent in the document than other candidates.
  • The model includes an N-best ranking strategy that allows re-scoring candidates and further improves the performance of the model.

The AoA Reader Architecture

AoA Reader is implemented as an ensemble made up of the four best models, which are trained using different random seeds. The model’s architecture implements tight coupling of the query and the document to ensure the exchange of necessary information and contexts between them.

Figure 2: Architecture of the AoA Reader. From the original paper by Yiming Cui et al. (6 June 2017).

The first step toward creating layered attentions that flow from the query to the document and vice versa was to design a shared embedding matrix W_e that includes continuous representations from the query and the document. According to the authors, by sharing word embeddings, both the query and the document can participate in the learning of mutual embeddings and benefit from the information contained in them.

In the augmented Hybrid AoA Reader, the authors also used a character embedding layer, as in the Bi-Directional Attention Flow (BiDAF) model for Machine Comprehension, which currently ranks fourth on the SQuAD dataset. The character embedding layer maps each word to a vector space using character-level CNNs.

To create contextual representations of the document and the query individually, the authors opted for two bi-directional RNNs with the GRU (Gated Recurrent Unit) as recurrent unit implementation. Bi-directional RNNs with GRU are very efficient in forming representations of each word by concatenating the forward and backward hidden states (in other words, each word vector is supplemented with information about other hidden states both before and after this word).

In the second step, to represent the query and document in terms of each other, the authors calculated a pair-wise matching score, which indicates the degree of matching between one document word and one query word. The matching score was obtained by simply calculating the dot product of the representations of the document word and the query word. Then, to create individual query-to-document attentions, the authors applied a column-wise softmax function to each column of the pair-wise matching matrix M, where each column is an individual document-level attention when considering a single query word. However, the authors did not stop there. They wanted to represent the query in terms of the document and to combine query-level and document-level attentions into another attention layer that accounts for contextual relationships between the document and the query.

To implement this, the authors first needed to construct document-to-query attention that defines the importance of each query word in terms of the document. To retrieve the query-level attention, the authors calculated a reversed attention: that is, for every document word at time t, they calculated the “importance” distribution on the query, to indicate which query words are more important given a single document word. In technical terms, this was achieved by applying a row-wise softmax function to the pair-wise matching matrix M. Thus, by applying the softmax function both to columns and rows of the pair-wise matching matrix, the authors obtained both query-to-document attention α and document-to-query attention β.

In the augmented model for the Hybrid AoA Reader, the authors improved this approach by refining query-aware and doc-aware representations multiple times. Also, the improved model proposes to use historical attentions to enhance long-term memory of the model.

Finally, to get attention-over-attention or “attended document-level attention”, as the authors call it, they calculated the dot product of α (query-to-document attention) and β (document-to-query attention). In a nutshell, this operation is the same as calculating a weighted sum of the individual document-level attentions, one per query word at time t. In this way, the model learns the contributions of each query word explicitly, and the final decision (document-level attention) is made as a vote weighted by the importance of each query word.
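The attention-over-attention computation described above can be sketched on toy data as follows. The code is an illustration, not the authors’ implementation: the input vectors stand in for the contextual representations produced by the bi-directional GRUs, and the function and variable names are invented here. One step is taken from the original AoA paper rather than from the description above: the row-wise attentions β are averaged over all document words before being combined with α.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_over_attention(doc, query):
    """doc: |D| contextual vectors, query: |Q| contextual vectors."""
    n_doc, n_query = len(doc), len(query)
    # Pair-wise matching matrix: M[i][j] = doc_i . query_j
    M = [[sum(d * q for d, q in zip(doc[i], query[j]))
          for j in range(n_query)] for i in range(n_doc)]
    # Column-wise softmax: one document-level attention per query word.
    alpha = [softmax([M[i][j] for i in range(n_doc)])
             for j in range(n_query)]
    # Row-wise softmax: one query-level attention per document word.
    beta_rows = [softmax(M[i]) for i in range(n_doc)]
    # Average the query-level attentions over all document words.
    beta = [sum(row[j] for row in beta_rows) / n_doc
            for j in range(n_query)]
    # Attended document-level attention: the alpha columns, weighted by
    # the averaged importance of each query word.
    return [sum(alpha[j][i] * beta[j] for j in range(n_query))
            for i in range(n_doc)]

doc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0]]
query = [[1.0, 0.0], [1.0, 1.0]]
s = attention_over_attention(doc, query)
```

The result `s` is a probability distribution over document positions; in this toy input the last document word matches both query words best and therefore receives the most attention.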

II Reinforced Mnemonic Reader + A2D (ensemble model)

Institution: Microsoft Research Asia & NUDT (A2D model: public information is not available)
Reinforced Mnemonic Reader described in Hu, Peng, and Qiu (2017)
Performance: EM: 82.849, F1: 88.76
SQuAD Rank: 1

This winning model consists of two components. However, public technical information is only available for the Reinforced Mnemonic Reader (RMR) designed by the National University of Defense Technology and Fudan University. RMR differs from other winning SQuAD models and state-of-the-art approaches in the NLU field in several ways:

1. Unlike models such as R-Net (discussed later), which represent each word only with word-level and character-level embeddings, Reinforced Mnemonic Reader also incorporates additional syntactic and linguistic information, such as part-of-speech (POS) tags, query category, and named-entity (NER) tags, all of which are important for the MC task.

2. RMR attempts to capture the long-distance contextual interaction between parts of the context by extending LSTM and gated recurrent unit (GRU) models with additional layers.

3. RMR modifies the standard usage of pointer networks, which calculate probability distributions for the start and end positions of the answer. According to the authors, the conventional approach ignores cases of multi-sentence reasoning and cases where several candidate answers may exist. Also, the boundary detection method used in most pointer network implementations may fail when the answer boundary is fuzzy or the answer is too long, such as the answer to a “why” query.

4. To obtain a fully-aware contextual representation, the authors iteratively align the context with the query as well as the context itself and then efficiently fuse relevant semantic information into each context word. Based on such representation, they then design a memory-based answer pointing mechanism that allows the model to gradually increase its reading knowledge and continuously refine the answer span.

5. To directly optimize both the F1 score and the EM metric, the authors introduced a new objective function combining the maximum-likelihood cross-entropy loss with rewards from reinforcement learning. Taking the F1 score as a reward, the authors use the REINFORCE algorithm to maximize the model’s expected reward. To stabilize training and prevent the model from overwriting its earlier training, the authors combine the maximum-likelihood estimation with the reinforcement learning through a linear interpolation.
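The combined objective from the last point can be written down as a short numerical sketch. Everything specific in it is an assumption made for illustration: the mixing weight `lam`, the optional `baseline` subtracted from the reward, and the function name are not taken from the paper, and a real implementation would compute gradients through the log-probabilities rather than just the loss value.

```python
import math

def mixed_loss(p_start, p_end, gold_span, sampled_span, reward,
               baseline=0.0, lam=0.5):
    """Linear interpolation of a maximum-likelihood and a REINFORCE loss.

    p_start / p_end: predicted probability of each passage position being
    the start / end of the answer. reward: F1 of the sampled span.
    baseline and lam are illustrative knobs, not values from the paper.
    """
    gold_start, gold_end = gold_span
    sampled_start, sampled_end = sampled_span
    # Maximum-likelihood term: cross-entropy of the true boundaries.
    ml_loss = -(math.log(p_start[gold_start]) + math.log(p_end[gold_end]))
    # REINFORCE term: the log-probability of the sampled span, scaled by
    # how much its reward exceeds the baseline.
    log_p_sample = (math.log(p_start[sampled_start])
                    + math.log(p_end[sampled_end]))
    rl_loss = -(reward - baseline) * log_p_sample
    return lam * ml_loss + (1.0 - lam) * rl_loss

p_start = [0.1, 0.7, 0.2]
p_end = [0.1, 0.2, 0.7]
loss = mixed_loss(p_start, p_end, gold_span=(1, 2),
                  sampled_span=(1, 1), reward=0.5)
```

With `lam=1.0` the objective reduces to plain maximum-likelihood training; lowering it gives more weight to the F1 reward of sampled spans.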

Like the Hybrid AoA Reader, Reinforced Mnemonic Reader models co-attention and self-attention mechanisms. However, it implements them in a different way to account for long-term dependencies of the context. As in the Hybrid AoA Reader, the co-attention in the RMR model is computed as an alignment matrix corresponding to all pairs of context and query words, which can model complex interactions between the query and the context. Unlike the AoA Reader, however, in RMR, self-attention is computed as a self-referential representation that allows aligning the sequence with itself. Also, in addition to standard contextual representations that may be found in other models, RMR uses a reasoning mechanism inspired by the observation that human beings improve their understanding of a passage by rereading the context and the query. To implement this mechanism, the authors use multi-hop reasoning in the form of a memory-based answer pointer, which is able to continuously refine the answer span.

Figure 3: Architecture of the Reinforced Mnemonic Reader. From the original paper by Hu, Peng, and Qiu (2017).

In general, RMR consists of three basic modules: feature-rich encoder, iterative aligner, and memory-based answer pointer, which are depicted in Figure 3 above. Below we discuss them in more detail.

The feature-rich encoder maps word sequences from the query and the context to their corresponding word embeddings and encodes these embeddings for further processing. To enhance the capacity of the encoder, both lexical and syntactic features are used. In particular, the authors enrich their model with a simple yet effective binary feature of exact matching (EM) that indicates whether a word in the context can be exactly matched to one query word, and vice versa. In addition, the authors created look-up based embedding matrices for named-entity tags and part-of-speech tags.
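The binary exact-matching feature is simple enough to show directly. The sketch below assumes whitespace tokenization and case-insensitive comparison, neither of which is specified in the text above; the function name is hypothetical.

```python
def exact_match_features(context_tokens, query_tokens):
    """Flag each token that also occurs on the other side (case-insensitive)."""
    query_vocab = {t.lower() for t in query_tokens}
    context_vocab = {t.lower() for t in context_tokens}
    context_flags = [int(t.lower() in query_vocab) for t in context_tokens]
    query_flags = [int(t.lower() in context_vocab) for t in query_tokens]
    return context_flags, query_flags

context = "The Broncos won Super Bowl 50".split()
query = "Who won Super Bowl 50 ?".split()
context_flags, query_flags = exact_match_features(context, query)
# context_flags -> [0, 0, 1, 1, 1, 1]
# query_flags   -> [0, 1, 1, 1, 1, 0]
```

Each flag would then be appended to the embedding of its word, giving the encoder a direct hint about which context words the question mentions.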

Another layer of the model is the iterative aligner, which completely reads the context and the query over T hops. In the t-th hop, the aligner first attends to both the query and the context at the same time to capture the interaction between them and generate the query-aware context representation (interactive aligning). In the next step (self-aligning), the iterative aligner further aligns the query-aware context representation with itself to synthesize the contextual information among context words. This approach overcomes the limitation of recurrent neural networks in modeling long-term dependencies of contexts: in the standard implementation, each word is only aware of its surrounding neighbors and has no cues about the entire context.
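The self-aligning step can be illustrated with a small sketch in which every context word attends to all other context words. The details are assumptions made for the example: scores are plain dot products, a word is not allowed to attend to itself, and the attended summary is simply concatenated to the word vector as a stand-in for the fusion step used in the actual model.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_align(states):
    """Augment each context vector with a summary of all other positions."""
    aligned = []
    for i, s_i in enumerate(states):
        others = [j for j in range(len(states)) if j != i]
        weights = softmax([dot(s_i, states[j]) for j in others])
        summary = [sum(w * states[j][k] for w, j in zip(weights, others))
                   for k in range(len(s_i))]
        # Concatenation stands in for the model's fusion of the two vectors.
        aligned.append(s_i + summary)
    return aligned

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned = self_align(states)
```

Because the summary is built from every other position regardless of distance, each word gains a view of the whole context that a recurrent encoder alone would struggle to provide.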

Finally, the third layer of the model is the memory-based answer pointer, which maintains a memory vector that records the necessary reading knowledge and continuously refines the predicted answer span using the reinforcement learning algorithm mentioned above.

III R-Net+ (ensemble) Microsoft Research Asia

Institution: Microsoft Research Asia
Authors: Wang et al. (2017)
Performance: EM: 82.650, F1: 88.493
SQuAD Rank: 2

R-Net+ was designed to work with the SQuAD and Microsoft Machine Reading Comprehension (MS-MARCO) datasets, both of which were manually collected through crowdsourcing. As in the case of other winning models, R-Net+ is designed as an ensemble model that consists of 18 training runs with identical architecture and hyper-parameters. At test time, the model selects the answer with the highest sum of confidence scores amongst the 18 runs for each question.
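The test-time selection rule can be sketched as follows. The data layout (one answer string and one confidence score per run) and the function name are assumptions made for this example; the toy input uses three runs instead of 18.

```python
from collections import defaultdict

def ensemble_answer(run_predictions):
    """run_predictions: one (answer_text, confidence) pair per training run."""
    totals = defaultdict(float)
    for answer, confidence in run_predictions:
        # Identical answers from different runs pool their confidence.
        totals[answer] += confidence
    return max(totals, key=totals.get)

runs = [("Denver Broncos", 0.6), ("Broncos", 0.9), ("Denver Broncos", 0.5)]
best = ensemble_answer(runs)
```

Here "Denver Broncos" wins with a summed confidence of 1.1, even though the single most confident run proposed a different answer.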

The proposed model consists of four main parts: (1) The recurrent network encoder for building representations for questions and documents individually. (2) The gated matching layer to match the question and passage (in contrast to the pair-wise matching matrix used in the Hybrid AoA Reader). (3) The self-matching layer for aggregating information from the whole passage similarly to attention-over-attention and self-referential mechanisms used in the above-discussed models. (4) The pointer-network based answer boundary prediction layer.

Similarly to the approach used in the Hybrid AoA Reader, the R-Net authors created a gated attention-based recurrent network with an added gate to account for the differential importance of the words in the passage to answer a particular question. Through the additional gating mechanism, the gated attention-based network can assign different levels of importance to passage fragments depending on their relevance to the question, masking out irrelevant passage parts and emphasizing the important ones. The approach of masking out irrelevant passage parts is, however, slightly different from the self-attention and self-matching mechanisms in the Hybrid AoA Reader and RMR, in which each word of a passage is represented in terms of other parts of the passage and related to the answer.

Similarly to the Hybrid AoA Reader and RMR, R-Net+ uses a self-matching mechanism that effectively aggregates evidence from the whole passage to infer the answer. Via a gated matching layer, the resulting question-aware passage representation effectively encodes each question’s information for each passage word. This is equivalent to the Hybrid AoA Reader’s query-to-document attention layer. R-Net+ also accounts for the fact that recurrent networks can only memorize limited passage context in practice. To solve this issue, the model employs a gated attention-based recurrent network on the passage against the passage itself, aggregating evidence relevant to the current word in a passage from every word in this passage. As a result, the gated attention-based recurrent network layer and the self-matching layer dynamically add information aggregated both from the question and passage parts to the passage representation, enabling better prediction of right answers.

Figure 4: R-Net Architecture. From the original paper by Wang et al. (2017).

R-Net+ Architecture

The first layer of R-Net+ is the Question & Passage Encoding Layer. Similarly to RMR, the question and passage words are first converted to their word-level and character-level embeddings (using GloVe embeddings) and, as in the Hybrid AoA Reader, are encoded by a bi-directional recurrent network to produce new representations. For the gated mechanism, similarly to the Hybrid AoA Reader, R-Net+ uses a Gated Recurrent Unit (GRU) implementation, which is computationally cheaper than LSTM. This gate is important because it makes it possible to determine the importance of information in the passage with regard to a question. At the same time, however, the authors use another gated mechanism different from LSTM and GRU. It is based on the current passage word and its attention-pooling vector of the question, which models the relationship between the question and the current word. This additional gate effectively models the phenomenon that only parts of the passage are of importance to the question.

The second layer performs question-passage matching with a gated attention-based recurrent network, obtaining the question-aware representation of the passage.
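The additional gate can be sketched as follows. The sketch reflects one reading of the mechanism: the passage word is concatenated with its attention-pooling vector of the question, and a sigmoid gate computed from that concatenation scales it element-wise before it would be fed to the recurrent unit. The dot-product attention scores, the weight matrix name `W_g`, and the toy values are assumptions of this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_input(passage_word, question_states, W_g):
    """Return the gated [passage word; pooled question] vector."""
    # Attention-pooling vector of the question for this passage word
    # (assumption: dot-product scores instead of a learned scorer).
    scores = [sum(p * q for p, q in zip(passage_word, state))
              for state in question_states]
    weights = softmax(scores)
    pooled = [sum(w * state[k] for w, state in zip(weights, question_states))
              for k in range(len(question_states[0]))]
    combined = passage_word + pooled
    # One gate value in (0, 1) per dimension of the combined vector.
    gate = [sigmoid(sum(w * x for w, x in zip(row, combined))) for row in W_g]
    return [g * x for g, x in zip(gate, combined)]

passage_word = [1.0, 0.0]
question_states = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[0.0] * 4 for _ in range(4)]  # untrained gate: every value is 0.5
gated = gated_input(passage_word, question_states, W_g)
```

With trained weights, gate values close to 0 would mask out passage words that are irrelevant to the question, while values close to 1 would let relevant words pass through unchanged.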

On top of that, in the third layer, the model applies self-matching attention to aggregate evidence from the entire passage and refine the passage representation. Similarly to the self-matching approach in RMR, the self-matching mechanism in R-Net+ addresses the fact that the question-aware passage representation so far has very limited knowledge of the context. Moreover, there is some lexical or syntactic divergence between the question and the passage in the majority of the SQuAD dataset. To solve this issue, the authors proposed directly matching the question-aware passage representation against itself. The mechanism dynamically collects evidence from the whole passage for words in the passage and encodes the evidence relevant to the current passage word and its matching question information into the passage representation. At the same time, however, the model does not propose a document-to-query representation that defines the importance of query words in terms of the passage, as the Hybrid AoA Reader does.

Finally, in the output layer, R-Net+ uses pointer networks to predict the start and end positions of the answer. To generate the initial hidden vector for the pointer network, the authors used attention-pooling over the question representation. To train the pointer network, the authors minimize the sum of the negative log probabilities of the ground truth start and end positions under the predicted distributions. This approach is different from the combination of maximum-likelihood estimation and reinforcement learning proposed in the RMR model.
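The training objective of the output layer, together with one common way of reading an answer span off the two predicted distributions, can be sketched as follows. The span search (highest product of start and end probability with the end not before the start) and the `max_len` limit are assumptions of this example and are not described in the text above.

```python
import math

def boundary_loss(p_start, p_end, gold_start, gold_end):
    """Sum of negative log probabilities of the true start and end."""
    return -(math.log(p_start[gold_start]) + math.log(p_end[gold_end]))

def best_span(p_start, p_end, max_len=15):
    """Most probable (start, end) pair with start <= end."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

p_start = [0.1, 0.7, 0.2]
p_end = [0.6, 0.1, 0.3]
loss = boundary_loss(p_start, p_end, gold_start=1, gold_end=2)
span = best_span(p_start, p_end)
```

In this toy case, picking the most probable start and end independently would give an end position before the start; the constrained search returns the valid span (1, 2) instead.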

Strengths and Limitations of the SQuAD Winning Systems

All three models discussed in this article dramatically improve state-of-the-art AQA by increasing the rate of correct answers, enhancing understanding of the context, and managing long-term contextual memory. The machine comprehension models for single-word and span answers represented by these systems have matured enough to be used in commercial products. However, the use cases addressed by the SQuAD dataset and the winning models do not cover all possible scenarios and requirements of automated question answering in the context of consumer support, online search, and digital personal assistance. These limitations include:

  1. All three winning systems are “ensemble” systems, not single-model systems. Their focus was on getting the right answers, not on real-time performance or cost of deployment, neither of which they address.
  2. All three models are adapted to a situation where the answer always falls into one sentence. In the real world, however, the answer can span beyond one sentence. Dealing with such cases is much more difficult, since the existing models for sentence ranking significantly underperform “answer span” ranking. This indicates that the exact span information is, in fact, critical in selecting the correct answer sentence. This limitation was explicitly mentioned in the R-Net+ authors’ directions for future research, and it constitutes a tangible problem for multi-sentence reasoning in AQA.
  3. The questions in SQuAD are quite diverse in terms of the reasoning required and the syntactic divergence between the question and the answer. For example, the dataset involves question-answer pairs with lexical variation (synonymy and world knowledge), syntactic variation from the passage sentences, and partial multi-sentence reasoning. Still, most questions in the dataset are factual questions, which actually constitute a small part of the questions asked by consumers and Internet users. In particular, when it comes to support services, people tend to ask descriptive questions and questions asking for the steps to solve a problem. Such types of questions are relevant for a number of use cases, such as:
  • Support deflection: When a user is opening a ticket, AQA software must be able to render solutions from manuals, user guides, discussion forums, and existing tickets.
  • Guided troubleshooting: When a user describes a problem, the system then guides the user through trying various steps one-by-one, with the user entering findings after each step.

Answering such questions requires deep world knowledge, multi-sentence reasoning, and other approaches that are not currently addressed in the SQuAD dataset and described models. To create even better AQA models, we need a combination of what these winning systems have shown, along with automated question/answer generation and models where answers are constructed from several sentences and/or paragraphs.


  1. What has improved is one tool in the toolkit of production-ready, natural language understanding systems.
  2. Building a question answering system solely using deep learning techniques is still out of the question.
  3. Do not expect question answering systems to be 100% accurate. Even on the presumably simpler SQuAD dataset, humans scored only just above 82% on exact match.