Eddie: A Knowledge Backed Question Answering Agent — Part 1

Mitesh Kumar Singh · Published in Analytics Vidhya · 13 min read · Feb 23, 2020

Recently, I finished my Master’s in Computer Science at Stony Brook University. I worked with the COMPAS Lab under the supervision of Prof. Mike Ferdman. I spent 2 beautiful semesters researching and developing “Eddie: A knowledge backed question answering agent”. This post sheds some light on what I learned during that journey.

This post is Part 1 of a series of posts (Part 2). It is intended for anyone interested in natural language processing, especially in knowing about the internals of an open domain question answering system. Such a system often forms a core component of personal assistants like Siri, Cortana, Alexa, etc.

What’s the Goal?

To develop an intelligent question answering agent that can address the complex information needs of users through interactive and knowledge-backed dialog adapted to each user. We call this system ‘Eddie’.

We envision a dialogue agent that is more human-like, primarily by targeting the following goals.

Primary Goals

  1. Provide an exact answer
    We expect the agent to scan through the whole web and provide an exact answer to the user’s query, instead of dumping a set of search results.
  2. Answer contextual/interconnected questions
    Such an agent should be able to understand the dialogue context and engage in a series of interconnected questions and answers with the user. For example, if a user asks —
    i. What is the population of California?
    ii. How is the weather there?
    The agent should be intelligent enough to understand that ‘there’ refers to the word ‘California’ in the previous question.
Eddie Demo: Topic-Based Open Domain Conversational Question Answering

Secondary Goals

  1. Understand diverse user queries
    Natural language is ambiguous. User queries are often short and lack context. Moreover, the same query can be asked in multiple ways. Eddie should be able to reformulate user queries or ask for more context when in doubt.
  2. Respond in meaningful human sentences
    The agent should reply in meaningful sentences instead of answering with single words or reading out phrases from internet documents.
  3. Provide personalized response
    The responses generated by such a conversational interface should be personalized. It should consider the user’s personality traits like age, gender, interests, and past queries before replying. Thus, the agent’s reply to the question “What’s Cancer?” for a 5-year-old child who is interested in constellations should be different from its reply to a doctor.

It is important to note that each of these goals has been achieved by researchers/companies to some extent in an independent setting. However, there has been no successful system yet that has all the primary features of Eddie. Each of the subproblems is a research topic in its own right, and we will briefly go over each of them in the subsequent posts.

How is Eddie different from existing Chatbots?

Existing Chatbots

Existing voice-based chatbots are broadly classified into 2 categories —

Task-Oriented Bots
Personal digital assistants like Alexa, Siri or Cortana are primarily task-oriented bots. They are closed domain bots, i.e. they are designed to perform specific tasks. For example, they can easily make a restaurant reservation, play a video or book an Uber. They can do limited social chit chat by telling you a joke or answering a few questions wittily. They can also answer some information-seeking questions by identifying relevant documents from the internet and reading through them. Internally, these assistants model social chit chat and information seeking as two separate tasks among thousands of other tasks. When posed a question, the assistant first classifies the utterance into one of the task categories. This is framed as an intent classification and slot-filling problem and is often solved using neural networks. Once the task is identified, the relevant API associated with the task performs the work.
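As a toy illustration of the intent classification step, the sketch below routes utterances to a handful of made-up intents with a bag-of-words classifier. The intents and training examples are hypothetical; real assistants use neural models over thousands of intents and also fill slots (e.g. the restaurant name and time).

```python
# Minimal intent-classification sketch (hypothetical intents and examples,
# not any assistant's real taxonomy). A production system would use a neural
# model and also perform slot filling.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_utterances = [
    "book a table for two at an italian place",   # intent: book_restaurant
    "play the latest taylor swift album",         # intent: play_media
    "tell me a joke",                             # intent: chit_chat
    "who is the president of france",             # intent: information_seeking
]
train_intents = ["book_restaurant", "play_media", "chit_chat", "information_seeking"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_utterances, train_intents)

# The predicted intent decides which task pipeline handles the utterance;
# Eddie focuses only on the information-seeking one.
print(clf.predict(["what is the population of california"]))
```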

Social Chit Chat Bots
Such bots are open domain bots, i.e. they can discuss any topic in general. They are either retrieval-based or generative. Given a user sentence and the dialogue so far, retrieval-based bots retrieve the next response from a repository or a knowledge graph. Generative bots use deep neural networks (trained on existing social chit chats) to generate answers to a user query. Common examples of chit-chat bots are XiaoIce, Meena, and Cleverbot.

Eddie

Eddie aims to improve a single task of task-oriented bots: seeking information in an open domain. It aims to answer open domain, multi-turn and contextual questions from users.

The existing voice assistants often end up providing canned responses to many questions, i.e. their accuracy is very low. Even when they find the answers, their responses are span based: they just read a span from a Wikipedia document, so their responses are not conversational. We want Eddie to respond in meaningful human sentences. Moreover, the existing assistants are not able to respond to contextual questions in a conversation. We want Eddie to retain context of at least the last question-answer pair.

We build Eddie in a very constrained environment. For example, the existing assistants/search engines have a rich context like —

1. What are the most common questions related to this question?
2. What link did the user click when given a set of links?
3. What are the user’s personality, ethnicity, and location?

They use the answers to these questions while finding the answer to the user’s question. As the datasets of such assistants are not publicly available, Eddie does not have all this information. Furthermore, the assistants use both knowledge graph-based QA and neural RC model-based QA to answer questions. Our research is focused on the neural RC model-based QA only.

Goal 1 — Provide An Exact Answer

The most important aspect of our goal is to enable Eddie to answer information-seeking questions. This can be achieved by building an open domain question answering (QA) system. We use the design of a popular QA system, DrQA, as our base. We first build a pipeline which deals with non-contextual questions only. Our pipeline primarily has 3 components.

Overall Architecture of a Non Conversational Open Domain Question Answering System, based on DrQA

Note that we show the answer ranker as a separate module just for clarity. The logic for answer ranker sits inside the document reader itself.

Document Retriever

We use DrQA’s Wikipedia index as our knowledge base.

Given a question Q and a set of documents {d1, d2, …, dn}, find the top K documents that might contain the answer to Q.

We use a TFIDF model to generate a score for each document. The score denotes the similarity between the user question and the document. Based on the score, we retrieve a set of Top 5 documents that might contain the answer.
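As a toy illustration of the scoring step, the sketch below ranks a handful of documents against a question with scikit-learn’s TfidfVectorizer. DrQA’s actual retriever builds a hashed bigram TF-IDF index over the full Wikipedia dump, which this does not reproduce.

```python
# Toy TF-IDF retriever sketch: score each document against the question and
# keep the top K. The documents here are made up for illustration only.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "California is a state in the western United States. Its population is about 39 million.",
    "The Golden Gate Bridge spans the Golden Gate strait in San Francisco.",
    "Sacramento is the capital city of California.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)        # |docs| x |vocab|

def retrieve(question: str, k: int = 5):
    q_vec = vectorizer.transform([question])
    scores = (doc_matrix @ q_vec.T).toarray().ravel()   # cosine similarity (vectors are L2-normalized)
    top = np.argsort(-scores)[:k]
    return [(documents[i], scores[i]) for i in top]

for doc, score in retrieve("What is the population of California?", k=2):
    print(f"{score:.3f}  {doc}")
```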

Document Reader

After retrieving the documents, we chunk each document into paragraphs, say a total of X paragraphs. For each paragraph, the problem now reduces to a standard reading comprehension problem in NLP.

Given a paragraph P and a question Q, find the span in the paragraph which has the highest probability of being the answer.

A span in a paragraph is determined by 2 tokens from the paragraph, a start token and an end token. We use the SQuAD v1.1 dataset (a span-based reading comprehension dataset) to train a neural model to find the start and end tokens.

Specifically, we encode the question Q and paragraph P tokens using GloVe embeddings. We pass the question and paragraph representations through a 3 layer stacked biLSTM model and use attention between question and paragraph (Stanford Attentive Reader Model) to determine the probability of each token being the start/end token.

During inference, the model provides the paragraph span which has the highest joint probability of start and end token as the answer span.
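A minimal, simplified sketch of this start/end scoring idea in PyTorch is shown below. It collapses the question into a single vector with self-attention and uses a bilinear match against each paragraph token; the GloVe embeddings, the 3-layer stacked biLSTM, and the alignment features of the actual SAR model are omitted, and all dimensions are illustrative.

```python
# Simplified SAR-style span scorer: not the real DrQA/SAR model, just the
# start/end token scoring idea with toy dimensions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySpanReader(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.p_rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.q_rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.q_attn = nn.Linear(2 * hidden, 1)              # self-attention over question tokens
        self.start_bilinear = nn.Linear(2 * hidden, 2 * hidden)
        self.end_bilinear = nn.Linear(2 * hidden, 2 * hidden)

    def forward(self, paragraph_ids, question_ids):
        p, _ = self.p_rnn(self.embed(paragraph_ids))         # (B, Lp, 2H)
        q, _ = self.q_rnn(self.embed(question_ids))          # (B, Lq, 2H)

        # Collapse the question into a single vector with self-attention.
        q_weights = F.softmax(self.q_attn(q).squeeze(-1), dim=-1)   # (B, Lq)
        q_vec = torch.bmm(q_weights.unsqueeze(1), q).squeeze(1)     # (B, 2H)

        # Bilinear match between each paragraph token and the question vector.
        start_scores = torch.bmm(p, self.start_bilinear(q_vec).unsqueeze(-1)).squeeze(-1)
        end_scores = torch.bmm(p, self.end_bilinear(q_vec).unsqueeze(-1)).squeeze(-1)
        return start_scores, end_scores   # a softmax over these gives start/end token probabilities

reader = TinySpanReader()
start, end = reader(torch.randint(0, 10000, (1, 120)), torch.randint(0, 10000, (1, 12)))
print(start.shape, end.shape)             # torch.Size([1, 120]) torch.Size([1, 120])
```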

To see the architecture of our neural model in detail, please read here. Note that fellow students in the COMPAS lab worked towards building an LSTM accelerator. They used my Stanford Attentive Reader (SAR) code to test their accelerator. To help them in their development, I did a detailed parameter analysis of the SAR model, the details of which can be found at the above link as well.

Answer Ranker

After running the neural reader on X paragraphs, we obtain a set of X answers. The existing DrQA model provides a score for each answer.

Score of an answer = max (Pr(start token) x Pr(end token))
DrQA uses the span having the maximum score as the best span.

However, we found that these answer scores are not comparable across paragraphs. In an open domain setting, most of the paragraphs do not contain the answer. The existing DrQA model is trained on SQuAD v1.1, which does not contain unanswerable questions. Thus the model is not trained to produce low confidence scores for paragraphs that do not contain the answer. To fix this, we used the approach taken by Clark et al. while developing Document QA. We modified the existing DrQA to use shared normalization in the training objective. This generated well-calibrated scores across all the paragraphs.
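A sketch of what the shared-normalization objective looks like is below, assuming the reader has already produced per-paragraph start scores for one question; the end-token term is handled analogously. This is only an illustration of the idea from Clark et al., not the modified DrQA code.

```python
# Shared normalization sketch: instead of a per-paragraph softmax, the softmax
# over start scores is taken jointly across all paragraphs retrieved for the
# same question, so irrelevant paragraphs are pushed towards low probabilities.
import torch
import torch.nn.functional as F

def shared_norm_start_loss(start_scores_per_para, gold_para_idx, gold_start_idx):
    """start_scores_per_para: list of 1-D tensors, one per retrieved paragraph."""
    lengths = [s.numel() for s in start_scores_per_para]
    all_scores = torch.cat(start_scores_per_para)      # one score vector across paragraphs
    log_probs = F.log_softmax(all_scores, dim=0)        # normalized across all paragraphs
    flat_gold = sum(lengths[:gold_para_idx]) + gold_start_idx
    return -log_probs[flat_gold]                        # negative log-likelihood of the gold start

# Hypothetical scores for 3 paragraphs; the answer starts at token 2 of paragraph 0.
scores = [torch.randn(5), torch.randn(8), torch.randn(6)]
print(shared_norm_start_loss(scores, gold_para_idx=0, gold_start_idx=2))
```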

For our non-contextual pipeline, we used SQuAD 2.0 to train and evaluate the model, as it contains unanswerable questions as well.

Goal 2 — Answering Contextual Questions

Our next goal was to enable Eddie to have engaging conversations with the user. It should understand the dialogue context and be capable of answering a series of interconnected questions. For example, consider the queries in the 2nd column of the table below.

Credits — Conversational Query Understanding Using Sequence to Sequence Modeling

To correctly answer such queries, Eddie must keep a context of past conversations and reason through them. We thus propose to include the context in both the retriever and the reader component of our pipeline.

Step 1 — Defining the context

In a reading comprehension based conversational QA, a fixed paragraph is given. The answer to a question either exists in the given paragraph or the question is unanswerable. Popular examples of RC based conversational QA datasets are QuAC and CoQA.

Given a paragraph P and a conversation history (Q1, A1, Q2, A2, …), the goal in RC conversational QA is to answer the n-th question Qn using the whole conversation history as context. However, in open domain conversational QA, the user can switch between topics. Each question can be from an entirely different domain (say, from politics to sports), making the whole conversation history irrelevant.

The above problem can be solved if Eddie has a “Session Identifier/Topic Detector” module. The module can detect a topic change and then determine how many of the previous QA turns are relevant to the topic of the current question. However, due to time constraints, we skipped the Session Identifier module in our research. We narrow our research to building a Topic-Based Open Domain QA system, i.e. we require users to declare the topic beforehand whenever a new conversation starts.

Our Approach — Topic-Based Open Domain QA

Before starting the conversation, Eddie and the user agree on a topic of discussion. Whenever the user wishes to discuss a new topic, they can start a new session with Eddie. Thus, all questions and answers in the current session act as the context for Eddie. The user must ask questions related to the agreed topic only. If the user poses a question from a different topic, Eddie might return garbage answers because the current question and the context no longer match.

Step 2 — Choosing a conversational dataset

To train and test our pipeline, we need a dataset that will satisfy both our primary goals and secondary goals. We surveyed a large number of datasets looking for the following features —

Primary Features

  1. Information Seeking — Reading comprehension datasets are by default information seeking.
  2. Topic Availability — To evaluate Eddie during inference, we need to pass the topic as context to the retriever and reader. Thus the dataset must have the topic of each conversation available.
  3. Conversational — Dataset should have a series of interconnected questions.
  4. Paragraph Independent Queries — In most datasets, questions are based on a given paragraph. However, in the real world, the questioner does not have a paragraph. Thus, the dataset should have real user questions asked to search engines and QA machines, independent of any paragraph. Such queries are often ambiguous and open-ended. We also want to avoid datasets that have questions created by annotators after seeing the passage.

Secondary Features

  1. Free Form Answer — Dataset should have answers in natural language. It should not be span based. For example,
    Paragraph: Ram went to the market. He bought milk.
    Question: Did Ram buy milk?
    Span Answer: He bought milk.
    Natural Answer: Yes
    Answering with ‘Yes’ is much more human-like than answering with a span.
  2. Personalized Answers — It would be nice to have answers based on personality context as well.

Dataset Survey Results

SQuAD 2.0 and MS MARCO are not conversational. MS MARCO has the advantage that it contains real Bing user queries and free-form answers. However, it poses the added challenge of answers spread across multiple passages (multi-hop reasoning), which was out of scope for our research. CoQA and QuAC are conversational, but they are not based on real user queries. None of the datasets incorporates the user’s personality while answering. We realized that there is no dataset that qualifies for all our needs. We, thus, narrowed our research to focus on solving the primary goals of Eddie.

Eddie Dataset Survey

In the reduced setting, QuAC was the only dataset that fulfilled 3 out of 4 primary requirements. It did not fulfill the “Paragraph Independent Queries” criterion. To deal with this, we use the training set of the QuAC dataset as-is to train our document reader, but we modify the QuAC validation set by removing all the paragraphs. During inference, we index all QuAC validation paragraphs in our Wikipedia index and pose just the questions, without the paragraphs, to Eddie (Retriever + Reader). We call this modified dataset Open Domain QuAC, similar to Open Domain SQuAD.
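A rough sketch of how the modified validation set can be assembled is below: every validation paragraph goes into the index, while only (topic, question, gold answer) triples are kept for inference. The field names follow the public QuAC JSON layout, but the exact keys and the file path should be treated as assumptions.

```python
# Rough sketch of building "Open Domain QuAC". Field names follow the public
# QuAC JSON layout; treat the exact keys and the local path as assumptions.
import json

with open("quac_val_v0.2.json") as f:     # hypothetical local path
    quac = json.load(f)

paragraphs_to_index = []    # goes into the Wikipedia/TF-IDF index
open_domain_questions = []  # what Eddie sees at inference time, without paragraphs

for article in quac["data"]:
    topic = article.get("title", "")
    for para in article["paragraphs"]:
        paragraphs_to_index.append(para["context"])
        for qa in para["qas"]:
            open_domain_questions.append({
                "topic": topic,
                "question": qa["question"],
                "gold_answers": [a["text"] for a in qa["answers"]],
            })

print(len(paragraphs_to_index), "paragraphs,", len(open_domain_questions), "questions")
```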

QuAC Dataset

Short Experiment with CoQA

We briefly experimented with the CoQA dataset by building a pipeline and always considering the past 2 queries as the context. We observed that CoQA performance was quite low during inference, primarily for the following reasons —

  1. Questions in CoQA are often too short (What?, Where?, When?) for the retriever to return significant results. The average question length in QuAC is 6.5 tokens, while the average question length in CoQA is just 5.5 tokens.
  2. Questions in CoQA were not factoid and often did not have any meaning when posed independently to the retriever. This might be because a questioner in CoQA is allowed to see the paragraph and then ask questions. On the other hand, a questioner in QuAC asks about a Wikipedia section without seeing its text, so its questions are more factoid, closely resembling the information-seeking nature of users interacting with voice assistants or the web.
CoQA

3. Another reason for its low accuracy lay in the way we created context. In the experiment, we asked all questions one after the other, using the previous 2 questions as context. For example, consider a dataset with just 2 paragraphs, P1 (with questions Q1 and Q2) and P2 (with questions Q3 and Q4). If we remove the paragraphs and pose the questions (Q1, Q2, Q3, Q4) to Eddie, we end up using Q1 and Q2 as the context for Q3. We thus provide the wrong context to Eddie’s retriever and reader, as Q1 and Q2 were based on an entirely different paragraph and possibly on a different topic.

Step 3 — Incorporating Context

We use the same non-contextual pipeline as mentioned above and just change the inputs provided to the retriever and the reader.

Adding context to the Retriever

Our document retriever is not a machine-learning-based component, so it does not need to be trained. We index all the paragraphs in the QuAC training and validation datasets in our Wikipedia index. Building the Wikipedia index required a system with 120GB of RAM. We perform several experiments on the retriever by passing the topic and QA pairs as context.

Eddie Document Retriever Evaluation Results On QuAC Validation Dataset

We observed that passing the topic drastically improved the retriever’s accuracy from 11.37% to 64.44%. Similarly, passing all the previous QA pairs along with the topic as context increased the retriever’s accuracy from 11.37% to 66.24%. We also measured the time taken by the retriever to return results on all the questions of the validation set. The least time is taken when we pass just the topic as context.

Adding context to the reader

Given a question Q, paragraph P, topic T, and a conversation history {Q1, A1, Q2, A2, … }, our goal is to find the answer A.

To fulfill this goal, we modify the input to the Stanford Attentive Reader (SAR) model on QuAC by passing a contextual question. The contextual question is the concatenation of the context (the topic and previous QA pairs) and the original question. It has the following structure —
<Topic> <Q1><A1><Q2><A2> … <Q>
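A minimal sketch of this concatenation is below. The separator and the number of history turns kept are implementation details, not fixed by the model; plain spaces and a single previous turn are used here for illustration.

```python
# Sketch of building the contextual question fed to the retriever and reader:
# topic + previous QA pairs + current question, joined into one string.
def build_contextual_question(topic, history, question, max_turns=1):
    """history: list of (question, answer) tuples from the current session."""
    parts = [topic]
    for prev_q, prev_a in history[-max_turns:]:
        parts.extend([prev_q, prev_a])
    parts.append(question)
    return " ".join(parts)

history = [("What is the population of California?", "About 39 million.")]
print(build_contextual_question("California", history, "How is the weather there?"))
# -> "California What is the population of California? About 39 million. How is the weather there?"
```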

We retrain the SAR model with the new inputs on a K40 GPU for 26 hours.
The details of adapting DrQA to QuAC can be seen here.

Evaluation results on QuAC Validation Dataset
Exact Match Accuracy — 29.61
F1 Accuracy — 45.87

Recap — Eddie Pipeline

Eddie Pipeline

Pipeline Evaluation

During the evaluation, we use the modified validation set of QuAC. We pass the contextual question, without the original paragraph, as the query to the retriever. The retriever retrieves a list of documents from the index, which may or may not contain the correct paragraph. The reader extracts multiple answer spans from the retrieved paragraphs. The ranker then provides the span with the highest score. We calculate the exact match evaluation metric by comparing Eddie’s answer span with the gold span.
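For reference, a sketch of the exact-match computation is below, following the convention of the standard SQuAD-style evaluation scripts (lowercasing and stripping punctuation and articles before comparison); it is not the exact evaluation code we ran.

```python
# Exact-match sketch: normalize both spans, then compare strings exactly.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
    return " ".join(text.split())                  # collapse whitespace

def exact_match(prediction: str, gold_answers) -> bool:
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Golden Gate Bridge", ["golden gate bridge"]))   # True
```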

In an open domain conversational setting, Eddie has an exact match accuracy of 3.03%. DrQA has an overall exact match accuracy of 30% on open domain SQuAD. However, DrQA cannot handle interconnected questions, i.e. it is not conversational. Though our pipeline’s accuracy is low, we believe it can act as a baseline for building an open domain conversational system.

In a nutshell …

To the best of our knowledge, Eddie is one of the first attempts at question answering in an open domain conversational setting. We built an end-to-end pipeline which can provide exact answers to a series of interconnected user questions. We also propose a modified dataset, Open Domain QuAC, and evaluate our pipeline on it. We analyze the reasons for the low accuracy of our pipeline in Part 2 of this series.
