BazingaBot

Francesco Fabbri
12 min read · Mar 16, 2018


I present BazingaBot: a simple but effective Knowledge Chatbot

(link to GitHub repository: https://github.com/FraFabbri/bazingabot)

Harnessing the potential of a Knowledge Base Server (KBS), our dialog system not only adeptly responds to inquiries but also refines its expertise through interactive questioning.

Abstract

In the Querying step, the system leverages a Deep Learning framework. It begins by predicting the question’s type using a basic LSTM architecture. Subsequently, based on this type, a specific neural network (NN) model is activated. Simultaneously, during the Enriching phase, our BazingaBot generates questions by examining the gaps in the KBS data. To accomplish this, it utilizes the Babelfy framework to extract concepts from the generated Q/A pairs.

In our Deep Learning context, a distinct NN architecture has been developed for each question type. This selection is guided by the data’s structure, which exhibits an imbalanced distribution of question types. Three primary question types have been identified: (a) the generic open-ended type, where each question has at most one corresponding answer; (b) the multiple-choice open-ended type, wherein several answers can be associated with the same question; and finally, (c) the closed-ended type, encompassing Y/N questions, which constitute the majority of Q/A pairs in our dataset.

For the first two question types, we employ a straightforward yet effective Seq2Seq model, while for the remaining type we implement a dedicated LSTM architecture. The latter addresses a binary classification task, namely predicting Y/N responses.

1 Introduction

Interacting with machines through natural language is undeniably one of the most dynamic fields in Deep Learning. Dialogue systems, including chatbots, are designed to offer users informative responses. BazingaBot can be classified as goal-oriented, as it focuses on providing answers related to a specific topic after that topic has been designated.

Main Features. In the Querying step, when presented with a specific pair (domain, question), our BazingaBot demonstrates the ability to:

  • Predict the relation
  • Identify the question type
  • Utilize all available information within the dataset.

In the Enriching phase:

  • It generates questions aimed at enhancing its knowledge.
  • It analyzes the answer using the Babelfy method.
  • It retains the acquired information to enhance the existing dataset.

User check. In both scenarios, BazingaBot consistently engages with the user to enhance the quality of its knowledge and learning process. The user is prompted to review the coherence and logic of the obtained Q/A pair during the conversation. This functionality empowers the user to rectify misspelled and nonsensical Q/A pairs, contributing to a refined learning experience.

Garbage in, garbage out. It’s essential to highlight that the KBS data is derived from extracting information and concepts from Wikipedia pages, primarily consisting of meta-data. While typical chatbot training data often involves annotated corpora, this particular setup — though somewhat limiting — requires a rigorous cleaning process for the Q/A pairs prior to training our NN architectures. This involves the elimination of all “bad” pairs that lack meaning or relevance to our intended goal.

2 Data

In this section, we delve into the processed dataset, where our primary focus is to construct a straightforward yet substantial collection of data. Throughout this process, we grapple with a pivotal trade-off: meticulously cleaning the dataset to eliminate all nonsensical sentences while striving to retain crucial information. Given that analyzing the structure of each individual sentence is infeasible, we turned our attention to the key patterns identified through statistical analysis applied to the data.

2.1 Cleaning KBS entries

After obtaining over 1 million entries from the KBS, we retained those that include BabelNet IDs linked to the relevant concepts in the observations. By examining the distribution of the number of concepts present in each question, we opted to exclude pairs with more than 2 involved concepts, comprising around 12% of the dataset. Subsequently, we established connections between the concepts and domains using edges derived from the application of BabelDomains, a method that enhances lexical items with domain-related information [1]. It’s important to note that during this stage, we experience a reduction of approximately 28% in the dataset size. Despite these operations, the dataset ultimately encompasses 658,946 entries.
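As a concrete illustration, here is a minimal pandas sketch of this filtering step; the column names, concept IDs, and rows are placeholders, not the actual KBS schema.

```python
import pandas as pd

# Toy rows standing in for KBS entries; the real data comes from the
# Knowledge Base Server, so column names and values here are placeholders.
kbs = pd.DataFrame({
    "question": ["Where is located Enonkoski?", "What is a mass?"],
    "answer": ["a municipality of Finland", "a property of a physical body"],
    "concepts": [["bn:concept_a"], ["bn:concept_b", "bn:concept_c", "bn:concept_d"]],
})

# Keep only the pairs whose questions involve at most 2 concepts
# (around 12% of the entries are dropped by this filter).
kbs = kbs[kbs["concepts"].map(len) <= 2].reset_index(drop=True)
```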

2.2 Building Dataset

Next, we identify and eliminate various redundant and nonsensical patterns prevalent in the Q/A pairs.

Special Characters. To begin, we filter out pairs containing special characters, as these sentences are generally uninformative. This culling process only results in a marginal dataset reduction of around 2.6 percent.

Analyzing Distributions. In an effort to characterize the dataset, we segregate the pairs into two categories: closed-ended questions (Y/N) and open-ended questions. Concentrating on the open-ended type, we explore the distribution of sentence lengths in terms of word count, seeking to streamline the resource-intensive Seq2Seq pre-processing step that stems from the One-hot Encoding phase during NN architecture training. As illustrated in Figure 1, a significant portion of Q/A pairs align with the initial values of these distributions. This insight enables us to selectively filter pairs without significant information loss. Opting to retain pairs containing a maximum of 10 words for questions and 16 words for answers, we experience a mere 3% reduction in the entry count.

Figure 1
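A minimal sketch of this length filter, using the thresholds above over a hypothetical DataFrame of open-ended pairs (column names and rows are illustrative):

```python
import pandas as pd

# Placeholder open-ended Q/A pairs; the real input is the cleaned KBS data.
open_ended = pd.DataFrame({
    "question": ["What is Religious clothing?", "Where is Ossipee Lake located?"],
    "answer": ["clothing worn during religious ceremonies", "in the state of New Hampshire"],
})

# Word counts per question/answer, the same quantities plotted in Figure 1.
q_len = open_ended["question"].str.split().str.len()
a_len = open_ended["answer"].str.split().str.len()

# Keep only short pairs: at most 10 words per question and 16 per answer.
open_ended = open_ended[(q_len <= 10) & (a_len <= 16)]
```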

Redundancy and Stopwords. At this point there are still some bad patterns that characterize meaningless answers, such as the following:

  • How can I use IAST? — It can be
  • What is Hardt Railway? — line in
  • Where is located Dachau concentration camp? — state of
  • What is a mass? — property of
  • Where is City of Armagh High School? — city of
  • Where is located Enonkoski? — province of

We employ a straightforward heuristic to eliminate such pairs: by analyzing the Part-of-Speech tags (POS tags) [2] of the answers, we filter out those whose last word is a coordinating conjunction (CC), a preposition or subordinating conjunction (IN), or a modal verb (MD). Additionally, we exclude answers ending with the article ‘the’ or a form of the verb ‘to be’. Furthermore, we implement another heuristic to remove answers that fail to provide meaningful information to the user: specifically, we discard answers entirely contained within the questions themselves. This process effectively eliminates pairs like the examples below (a sketch of the filtering heuristic follows the list):

  • What is an example of a television broadcasting? — television
  • Where is Ossipee Lake located? — Ossipee
  • What is Religious clothing? — clothing
  • Where is Linden Railway Station placed? — Linden
  • Where can Perimeter Center be found? — Perimeter
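A minimal sketch of these two heuristics, assuming NLTK's tokenizer and default POS tagger (the tag set referenced in [2]); the helper name and word list are ours, not the repository's:

```python
import nltk

# POS tagger and tokenizer models (newer NLTK releases may use the
# "averaged_perceptron_tagger_eng" / "punkt_tab" resource names instead).
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("punkt", quiet=True)

# Tags and words that make the end of an answer uninformative.
BAD_FINAL_TAGS = {"CC", "IN", "MD"}
BAD_FINAL_WORDS = {"the", "be", "is", "are", "was", "were"}

def is_meaningless(question: str, answer: str) -> bool:
    """Flag answers ending with an uninformative word or fully contained in the question."""
    tokens = nltk.word_tokenize(answer)
    if not tokens:
        return True
    last_word, last_tag = nltk.pos_tag(tokens)[-1]
    if last_tag in BAD_FINAL_TAGS or last_word.lower() in BAD_FINAL_WORDS:
        return True
    # Discard answers entirely contained in the question.
    return answer.lower() in question.lower()

print(is_meaningless("Where is located Enonkoski?", "province of"))  # True
print(is_meaningless("What is Religious clothing?", "clothing"))     # True
```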

Answer Selection. When considering open-ended questions, we encounter a specific scenario involving entries associated with the same tuples (question, domain-relation) but differing in their answers. This gives rise to two primary situations. In the first scenario, it’s plausible that all provided answers are correct, and none holds an advantage over the others. Conversely, the second scenario entails that only one answer, among the available options, is accurate. In essence, determining which answer is the most coherent or meaningful becomes a challenge. Considering the training phase of our model with the available data, we must address this dilemma to prevent ambiguity.

The efficacy of the model hinges on a fundamental principle: each question should be associated with at most one answer. Consequently, while multiple tuples may share the same answer, we avoid associating identical tuples with distinct answers.

2.3 1st Scenario

Focusing on the patterns within the questions, we compute the n-grams for n equal to 2, 3 and 4. Looking at the most frequent patterns in these three settings, it is possible to identify a specific type of question included in the 1st scenario mentioned earlier: the multiple-choice open-ended questions, characterized by the following patterns (a sketch of this n-gram analysis is given below):

  • example
  • part of
  • instance of
  • can be.

The answers associated with this kind of question are all equally valid, so for each tuple (question, domain-relation) the model used in this case has to predict a subset of answers.
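A minimal sketch of the n-gram analysis used to spot these patterns, assuming the questions are available as a list of strings (the sample questions are placeholders):

```python
from collections import Counter

def top_ngrams(questions, n, k=10):
    """Count the k most frequent word n-grams across all questions."""
    counts = Counter()
    for q in questions:
        words = q.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

# Placeholder questions; the real input is the cleaned open-ended set.
questions = [
    "what is an example of a television broadcasting",
    "what can be part of a computer",
]
for n in (2, 3, 4):
    print(n, top_ngrams(questions, n, k=3))
```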

Figure 2: Wordclouds of 3-grams and 4-grams

2.4 2nd Scenario

Open-ended questions. The remaining tuples (question, domain-relation) of the open-ended questions, which belong to the 2nd Scenario, i.e. the generic ones, can be classified by looking at the distribution of the related answers (a sketch of this split follows the list):

  • Unique. All the tuples that have only one associated answer. Assuming that, at this point, we have already discarded all the meaningless sentences, this whole set can be used in our prediction model.
  • Most frequent. This set includes all the tuples associated with more than one answer, where one answer is more frequent than the others. Assuming that the most recurrent answer is the correct one, we keep for each tuple only the answer with the highest frequency.
  • Annoying. These tuples are quite tricky: each one is associated with more than one answer, but no answer is more frequent than the others. This means that we cannot make any assumptions about the data; nevertheless, this data can be used in the Enriching part, as shown below in Section 3.
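A minimal sketch of this three-way split, grouping a hypothetical DataFrame of open-ended pairs by (question, domain-relation) and inspecting the answer frequencies (column names and rows are illustrative):

```python
import pandas as pd

def classify_tuple(answers: pd.Series) -> str:
    """Label a (question, domain-relation) group as unique, most_frequent or annoying."""
    counts = answers.value_counts()
    if len(counts) == 1:
        return "unique"
    # One answer strictly more frequent than the rest -> keep only that one.
    if counts.iloc[0] > counts.iloc[1]:
        return "most_frequent"
    return "annoying"

pairs = pd.DataFrame({
    "question": ["q1", "q2", "q2", "q2", "q3", "q3"],
    "domain_relation": ["d1", "d2", "d2", "d2", "d3", "d3"],
    "answer": ["a", "b", "b", "c", "d", "e"],
})
labels = pairs.groupby(["question", "domain_relation"])["answer"].apply(classify_tuple)
print(labels)  # q1: unique, q2: most_frequent, q3: annoying
```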

Closed-ended questions. Analyzing the Y/N observations, we apply the same classification approach, keeping only the pairs in the Unique set (the “Most frequent” set turns out to be empty). Looking at the distribution of the total number of Yes and No answers, the affirmative answers are twice the negative ones. More specifically, the main concern about the quality of the data regards the answers’ distribution over the possible domain-relation pairs. Given a tuple (question, domain-relation) stored in the training set, the prediction could be biased by the distribution of the Y/N answers for that same domain-relation pair. Looking at the distributions over all the possible domain-relation couples, we point out that the affirmative answer for the association Geography and places — PLACE is over-represented. Since we already have more than 330,000 distinct observations for the closed-ended questions, we simply drop half of these affirmative answers in order to balance the dataset.

Figure 3: Y/N related to Domain-Relation pairs
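A minimal sketch of this balancing step, dropping half of the affirmative answers of one over-represented domain-relation pair (column names and the key string are placeholders, not the repository's actual encoding):

```python
import pandas as pd

def balance_pair(df: pd.DataFrame, pair: str, seed: int = 0) -> pd.DataFrame:
    """Randomly drop half of the 'Yes' answers of one over-represented domain-relation pair."""
    mask = (df["domain_relation"] == pair) & (df["answer"] == "Yes")
    dropped = df[mask].sample(frac=0.5, random_state=seed).index
    return df.drop(index=dropped)

# The key string is illustrative; in our data it corresponds to the
# over-represented "Geography and places" / PLACE association.
# closed = balance_pair(closed, "Geography and places|PLACE")
```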

3 Workflow

In this section, we describe the core workflow and explain how users engage with the bot. The user initiates the conversation and is presented with the choice of a domain from a set of five randomly offered topics. Subsequently, the user can pose questions pertinent to the chosen topic or respond to questions posed by the bot. Irrespective of the scenario, whether the bot provides answers or the user responds, our dialogue system prompts the user to verify the overall coherence of the preceding conversation. This approach serves to evaluate the quality of all generated conversations and to identify potentially nonsensical exchanges stemming from the system.

In the enriching aspect, we generate questions by extracting them from pairs previously labeled as “Annoying.” In such cases where assigning a singular, unique answer is challenging, we seek to enhance our data by leveraging user knowledge. We request users to provide a correct answer for the query, thereby refining the information.

Additionally, to ascertain the entities or concepts embedded in user answers, we rely on the methodology and findings derived from Babelfy — a graph-based technique for addressing Entity Linking and Word Sense Disambiguation [5]. By adopting this approach, we extract concepts from answers, facilitating their inclusion as observations in the Knowledge Base Server (KBS).
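To make this step concrete, here is a minimal sketch of a call to Babelfy's public REST service; the endpoint, parameter names, and response field reflect our understanding of the Babelfy API and should be checked against its documentation, and the helper name is ours:

```python
import requests

BABELFY_URL = "https://babelfy.io/v1/disambiguate"  # public REST endpoint

def extract_concepts(text: str, api_key: str, lang: str = "EN"):
    """Return the BabelNet synset IDs that Babelfy links to the input text."""
    params = {"text": text, "lang": lang, "key": api_key}
    response = requests.get(BABELFY_URL, params=params, timeout=10)
    response.raise_for_status()
    # Each annotation carries (among other fields) the linked BabelNet synset ID.
    return [ann["babelSynsetID"] for ann in response.json()]

# concepts = extract_concepts("Enonkoski is a municipality of Finland", api_key="YOUR_KEY")
```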

Figure 4: Workflow

4 Model

Double prediction. Given a pair consisting of a topic and a question, our bot predicts the relation between the two components. Furthermore, it categorizes the type of question: open-ended multiple choice, open-ended generic, or closed-ended. Subsequently, the bot predicts a specific answer. The first prediction is carried out by a simple yet effective recurrent neural network, a Long Short-Term Memory (LSTM), an architecture renowned for its proficiency in sequence classification tasks.

Upon predicting the question type, the system proceeds to forecast the answer. This prediction hinges on the type of question that was predicted. Notably, a distinct neural network was trained for each question type. For closed-ended questions, the same model as the initial prediction was employed, trained solely on Y/N pairs. In contrast, the models for the other two types both adopt a Sequence-to-Sequence (Seq2Seq) architecture, although each employs a unique approach.

This layered approach and tailored neural network selection enable our bot to navigate the complexities of various question types, providing accurate predictions and fostering an enhanced conversational experience.

4.1 1st Prediction and Closed-ended Questions

Predicting the relation-type. LSTM networks are a special kind of RNN, capable of learning long-term dependencies. In our case, the core of the model consists of an LSTM cell that processes one sentence at a time and computes the probabilities of the possible pairs (type of question, relation) related to the input. To represent the sentences in our NN architecture we apply word embeddings, which provide a dense representation of words and their relative meanings: each word is mapped to a 100-dimensional vector. The output space consists of all the combinations of the three question types described before and all the possible relations. Moreover, to enable the bot to recognize the relation and type of question without knowing the concepts in the question, we use as the training-set dictionary only the top 50% of the word distribution over the data. After only 5 epochs of training, the model reaches an accuracy of more than 98%.
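A minimal Keras sketch of this classifier, assuming the questions have already been tokenized and padded; the vocabulary size, hidden size, and number of (type, relation) classes are placeholders, not the values used in the repository:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 20000  # top-50% of the word distribution (placeholder size)
EMBED_DIM = 100     # each word mapped to a 100-dimensional vector
N_CLASSES = 300     # all (question type, relation) combinations (placeholder)

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM),   # dense word representations
    LSTM(128),                          # sequence classification core
    Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(padded_questions, type_relation_labels, epochs=5, batch_size=128)
```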

Closed-ended questions model. The model is the same as the one used before, but trained on the closed-ended questions; the accuracy reached after only a few epochs is more than 96%.

4.2 Seq2Seq

The Seq2Seq model, widely used in Neural Machine Translation (NMT) frameworks, combines two recurrent neural networks (RNNs), which in our case are two LSTMs. The first one, called the encoder, encodes a sequence of characters or words into a fixed-length vector representation, and the other, the decoder, decodes that representation into another sequence of characters or words. The encoder and the decoder are jointly trained to maximize the conditional probability of a target sequence given a source sequence: applying this NMT approach to our model, we aim to ‘translate’ a sentence given by a specific (question, domain-relation) tuple into a target answer. We choose LSTMs because of their ability to store sequential information over extended time intervals (Fig. 5).

Figure 5: LSTM applied in a Seq2Seq Model
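A minimal Keras sketch of this encoder-decoder pair, in the spirit of the tutorial referenced in [4]; the token counts and latent dimension are placeholders, and the actual models in the repository may differ:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

NUM_ENC_TOKENS = 80  # size of the input token set (placeholder)
NUM_DEC_TOKENS = 80  # size of the output token set (placeholder)
LATENT_DIM = 256     # size of the LSTM state (placeholder)

# Encoder: reads the (question, domain-relation) sequence and keeps only its final states.
encoder_inputs = Input(shape=(None, NUM_ENC_TOKENS))
_, state_h, state_c = LSTM(LATENT_DIM, return_state=True)(encoder_inputs)

# Decoder: generates the answer sequence, conditioned on the encoder states.
decoder_inputs = Input(shape=(None, NUM_DEC_TOKENS))
decoder_lstm = LSTM(LATENT_DIM, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(NUM_DEC_TOKENS, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit([encoder_input_data, decoder_input_data], decoder_target_data, epochs=200)
```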

Multiple choice Model. This model differs from the one applied to the generic open-ended questions in three main ways:

  • the sentences are mapped as sequences of words;
  • the encoder LSTM reads each input sequence in reverse, which helps the learning algorithm establish a connection between the two sequences [3];
  • each tuple (question, domain-relation) is associated with a set of answers which are all equally valid; this means that we are predicting not a single answer, but an ID associated with a set of answers specific to that tuple. In other words, given a tuple (topic, domain-relation, question), the Seq2Seq model predicts a single word which stands for the ID of a specific set. The final answer is randomly selected from the predicted set, and the accuracy reached by the model after 30 epochs is more than 99% (a sketch of the set-ID mapping follows this list).
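A minimal sketch of the set-ID trick mentioned in the last point: the equally valid answers of each tuple are grouped and replaced by a synthetic token that becomes the Seq2Seq target (all names and sample rows are illustrative):

```python
import random
from collections import defaultdict

# Toy rows: (question, domain-relation, answer); values are placeholders.
rows = [
    ("what can be part of a computer", "Computing - PART", "keyboard"),
    ("what can be part of a computer", "Computing - PART", "monitor"),
    ("what is an example of a metal", "Chemistry - EXAMPLE", "iron"),
]

# Group the equally valid answers of each (question, domain-relation) tuple
# and replace every set by a synthetic ID token used as the Seq2Seq target.
answers_by_tuple = defaultdict(set)
for question, domain_relation, answer in rows:
    answers_by_tuple[(question, domain_relation)].add(answer)
set_id = {key: f"SET_{i}" for i, key in enumerate(answers_by_tuple)}

# At inference time the model predicts a set ID and the final answer
# is sampled at random from the corresponding set.
predicted_key = ("what can be part of a computer", "Computing - PART")
final_answer = random.choice(sorted(answers_by_tuple[predicted_key]))
print(set_id[predicted_key], final_answer)
```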

Generic Questions Model. Since it would be computationally expensive to apply the embedding layer to all the possible words in the open-ended generic questions, our Seq2Seq model is trained at character level [4]. After more than 200 epochs, we evaluate the model on a sample of the training data (10%), where the accuracy reached is over 70%. The model could perform even better if trained longer on a well-defined corpus.

5 Conclusions

We have developed a knowledge-bot, a dialogue system able to ask questions and predict answers on many topics, where both tasks, querying and enriching, have been designed to exploit as much as possible the structure of the metadata stored in the KBS. Unlike a conventional bot, our dialogue system needs to improve the quality of its dataset: for this reason, widespread use of it, through the enriching part, would improve the dataset as well as the quality of the predictions.

References

[1] J. Camacho-Collados, R. Navigli. BabelDomains: Large-Scale Domain Labeling of Lexical Resources.

[2] https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

[3] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to Sequence Learning with Neural Networks.

[4] blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

[5] A. Moro, A. Raganato, R. Navigli. Entity Linking meets Word Sense Disambiguation: a Unified Approach.


Francesco Fabbri

Ph.D. Student at UPF, Barcelona. Working on Fairness and Algorithmic Bias