Virtual Assistant ‘Chatbot’ based on Open Domain Question Answering using Haystack Framework.

Zeeshan Haque — Fri, 24 Dec 2021 10:48:23 GMT

Introduction

Chatbots, or virtual assistants, have been around for a while, but with the arrival of pre-trained language models such as BERT and RoBERTa, the field of Open Domain Question Answering (ODQA) has advanced rapidly in recent years. Chatbots aim to support the digital transformation of companies. Alcatel-Lucent Enterprise (ALE), a French business communications company that provides a cloud-based collaboration application (Rainbow) and hybrid communication systems (Omni PCX Enterprise, OXE), uses chatbots to automate certain processes related to OXE system maintenance. More precisely, ALE operates a business website that gives frontline workers a collection of technical documents to help them troubleshoot OXE systems when they hit a technical problem. Technicians can also get support from the customer care team to resolve a technical query. The aim of the Virtual Assistant ‘Chatbot’ is to minimize the time needed to find the right solution to a technical problem with OXE. It is intended for professionals seeking technical help with troubleshooting OXE.

The ‘Virtual Assistant’ chatbot is built on state-of-the-art Open Domain Question Answering. While working on ODQA, we found an uber cool framework called Haystack (https://haystack.deepset.ai/), developed by Berlin-based Deepset AI. Our use of the Haystack framework revolved around its three main components: Storage, Retriever, and Reader.

Answering a query accurately combines a Retriever (which finds the relevant passages within the collection of documents) and a Reader (which finds the relevant text span within the passages selected by the Retriever). The Haystack framework provides several types of Retriever and supports many language models as Readers. The Retriever uses algorithms such as TF-IDF or BM25, custom Elasticsearch queries, or embedding-based approaches to find candidate paragraphs, while the Reader uses pre-trained HuggingFace models such as BERT and RoBERTa to find the relevant answer within a paragraph. Our contributions in developing ODQA with the Haystack framework on ALE's technical documents are: preprocessing the documents; annotating them to prepare a set of questions with short and long answers; and finding a suitable Retriever-Reader combination, along with optimal parameters (top-k Retriever and top-k Reader), based on both performance and computation time.
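To make the Retriever step more concrete, here is a minimal sketch of BM25 ranking in plain Python. It is not Haystack's implementation: the class name, the naive whitespace tokenizer, the sample passages, and the values k1=1.2 and b=0.75 (commonly used defaults) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class ToyBM25Retriever:
    """Hypothetical sparse retriever that ranks passages with Okapi BM25."""

    def __init__(self, passages, k1=1.2, b=0.75):
        self.passages = passages
        self.k1 = k1
        self.b = b
        self.tokens = [tokenize(p) for p in passages]
        self.avg_len = sum(len(t) for t in self.tokens) / len(self.tokens)
        # Document frequency: number of passages that contain each term.
        self.doc_freq = Counter(term for t in self.tokens for term in set(t))

    def _idf(self, term):
        n = len(self.passages)
        df = self.doc_freq.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def _score(self, query_terms, doc_tokens):
        counts = Counter(doc_tokens)
        # Length normalization: longer passages are penalized when b > 0.
        norm = 1 - self.b + self.b * len(doc_tokens) / self.avg_len
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf:
                score += self._idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return score

    def retrieve(self, query, top_k=10):
        """Return the top_k (score, passage) pairs, best first."""
        query_terms = tokenize(query)
        scored = [(self._score(query_terms, t), p)
                  for t, p in zip(self.tokens, self.passages)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


# Made-up passages, only to show the call pattern.
passages = [
    "The document store holds the indexed passages.",
    "The reader extracts an answer span from a passage.",
    "The retriever selects candidate passages for the reader.",
]
retriever = ToyBM25Retriever(passages)
top = retriever.retrieve("what does the document store hold", top_k=2)
for score, passage in top:
    print(f"{score:.3f}  {passage}")
```

The top_k argument plays the role of the top-k Retriever parameter: only that many passages would be handed to the Reader.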

Haystack Reader Finetune

Like other pre-trained machine learning models, the Haystack Reader can be fine-tuned to improve its performance. Before fine-tuning the pre-trained Reader models, there are two preliminary steps: data pre-processing and data annotation.

Data pre-processing

The dataset for the ALE project consists of several PDF documents that describe the technical specifications of a product. Before using the Haystack framework, we need to pre-process the data to improve the performance of the Retriever and Reader. We implemented the following pre-processing steps:

The PDFs need to be converted to .txt files. One can use an online tool or a Python package such as pdfminer.six. Haystack also includes its own PDF converter.
All text files need to be cleaned. This consists of removing the table of contents, headers and footers, and section and subsection numbers.
All tables need to be converted to sentences. For example, a table row with several columns can be rewritten as a sentence in simple English. The same conversion has to be applied to images and diagrams.
Irrelevant special characters and extra white space also need to be removed.
A large text file can be split into several parts. This helps reduce computation time and improve performance.
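The cleaning, table-conversion, and splitting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, section numbers are assumed to look like "3.2.1" at the start of a line, and the set of characters to keep and the chunk size are arbitrary choices that would need tuning for real documents.

```python
import re


def clean_text(raw):
    """Drop leading section numbers, unwanted symbols, and extra whitespace."""
    cleaned_lines = []
    for line in raw.splitlines():
        # Remove a leading section number such as "3.2.1 " (assumed style).
        # Caveat: this also strips a number that legitimately starts a line.
        line = re.sub(r"^\s*\d+(\.\d+)*\.?\s+", "", line)
        # Keep word characters, whitespace, and basic punctuation only.
        line = re.sub(r"[^\w\s.,;:()/%'-]", "", line)
        # Collapse runs of whitespace into single spaces.
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


def row_to_sentence(header, row):
    """Turn one table row into a plain sentence, one clause per column."""
    parts = [f"{name} is {value}" for name, value in zip(header, row)]
    return "; ".join(parts) + "."


def split_into_chunks(text, max_words=200):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


# Made-up inputs, only to show the behavior of each helper.
print(clean_text("1.2 Board   reset\n# Press the * reset   button."))
print(row_to_sentence(["Parameter", "Default"], ["timeout", "30 s"]))
print(split_into_chunks("a b c d e", max_words=2))
```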

Data Annotation

After the extensive data preprocessing work, we proceed to the annotation step. Annotation is one of the key components of state-of-the-art Open Domain Question Answering. It involves creating the question-answer pairs used to fine-tune pre-trained HuggingFace models.

During the annotation phase, we learned more about our dataset (e.g. what additional data preprocessing we might need, what kinds of questions one can prepare to challenge the ODQA pipeline, etc.).

Deepset AI provides a web-based annotation tool (https://haystack.deepset.ai/guides/annotation) to label the data. The tool supports structuring the workflow with organizations, projects, and users. The labels can be exported in SQuAD format, which is compatible with Haystack training.
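For readers unfamiliar with SQuAD format, the sketch below builds a SQuAD-style record for one passage by hand. The helper name, the identifier scheme, and the sample passage are assumptions made for illustration; a file exported by the annotation tool may carry additional fields.

```python
import json


def to_squad(title, context, qa_pairs):
    """Build a SQuAD v2-style dict for a single passage.

    qa_pairs is a list of (question, answer_text) tuples. Each answer must
    appear verbatim in the context so its character offset can be located.
    """
    qas = []
    for index, (question, answer_text) in enumerate(qa_pairs):
        start = context.find(answer_text)
        if start == -1:
            raise ValueError(f"Answer not found in context: {answer_text!r}")
        qas.append({
            "id": f"{title}-{index}",  # hypothetical id scheme
            "question": question,
            "answers": [{"text": answer_text, "answer_start": start}],
            "is_impossible": False,
        })
    return {
        "version": "v2.0",
        "data": [{"title": title,
                  "paragraphs": [{"context": context, "qas": qas}]}],
    }


# Made-up passage and answer, only to show the structure.
context = "A Document Store holds the documents that the Retriever searches."
answer = "holds the documents that the Retriever searches"
squad = to_squad("demo", context, [
    ("What is meant by Document Store?", answer),
    ("What is the meaning of Document Store?", answer),
])
print(json.dumps(squad, indent=2))
```

Note that several questions can point at the same answer span, which is how the paraphrased questions described below would be stored.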

For our ALE project, we defined guidelines for framing questions and tagging their answers in Deepset AI's annotation web interface. The guidelines are the following:

First and foremost, the questions need to be specific: the more specific the question, the better the performance of the Retriever-Reader.
Associate one or several questions with one answer. These questions must use different words but have the same meaning.
The same method can be applied to different paragraphs that share a similar context.

Annotation example 1:

For Training purposes -

Q1_1) What is meant by Document Store?

Q1_2) What is the meaning of Document Store?

For Evaluation -

Q1_3) What is the definition of Document Store?

Annotation example 2:

For Training purposes -

Q2_1) What is meant by F1 Score?

Q2_2) What is the definition of F1 Score?

For Evaluation -

Q2_3) What is the meaning of F1 Score?

In the two examples above, the training and evaluation questions use different words (‘meant’, ‘meaning’ and ‘definition’) but have the same meaning. We framed questions in the same way throughout when preparing our training and evaluation datasets.
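One way to organize such paraphrase groups is to hold out one wording per answer for evaluation and keep the rest for training. The sketch below is a hypothetical helper, not the procedure used in the project; the answer strings are placeholders standing in for the annotated spans.

```python
def split_paraphrases(groups, n_eval=1):
    """Hold out the last n_eval paraphrases of each answer for evaluation.

    groups maps an answer to its list of paraphrased questions.
    Returns (train, evaluation), each a list of (question, answer) pairs.
    """
    if n_eval < 1:
        raise ValueError("n_eval must be at least 1")
    train, evaluation = [], []
    for answer, questions in groups.items():
        if len(questions) <= n_eval:
            raise ValueError(f"Not enough paraphrases for answer: {answer!r}")
        cut = len(questions) - n_eval
        train.extend((q, answer) for q in questions[:cut])
        evaluation.extend((q, answer) for q in questions[cut:])
    return train, evaluation


# Questions taken from the two annotation examples; answers are placeholders.
groups = {
    "<answer span for Document Store>": [
        "What is meant by Document Store?",
        "What is the meaning of Document Store?",
        "What is the definition of Document Store?",
    ],
    "<answer span for F1 Score>": [
        "What is meant by F1 Score?",
        "What is the definition of F1 Score?",
        "What is the meaning of F1 Score?",
    ],
}
train_pairs, eval_pairs = split_paraphrases(groups)
print(len(train_pairs), "training pairs,", len(eval_pairs), "evaluation pairs")
```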

The annotation phase produced 1246 question-answer pairs for training and 522 question-answer pairs for evaluation. Once annotation is finished, the resulting data is used to fine-tune the pre-trained Reader models.

Choosing the Best Retriever-Reader Combination

We selected the following pre-trained models from the HuggingFace platform, based on which Question Answering models were downloaded most often:

1. deepset/RoBERTa-base-squad2

2. distilbert-base-cased-distilled-squad

3. minilm-uncased-squad2

4. deepset/bert-large-uncased-whole-word-masking-finetuned-squad

5. bert-large-uncased-whole-word-masking-squad2

6. bert-large-cased-whole-word-masking-finetuned-squad

Haystack also offers different kinds of Retriever. We distinguish dense Retrievers, such as the Passage Retriever and the Embedded Retriever, from sparse Retrievers, such as TF-IDF and Elasticsearch. With six Readers and four Retrievers, there are 24 possible Retriever-Reader combinations. Because choosing the best combination is crucial, we ran an evaluation study to pick the best one in the context of ALE troubleshooting documents. Among all these combinations, we evaluated only the following ones.

BERT Large Uncased + ElasticSearch Retriever (ES)
Deepset BERT Large + ElasticSearch Retriever (ES)
BERT Large cased + ElasticSearch Retriever (ES)
RoBERTa Base squad2 + ElasticSearch Retriever (ES)
miniLM uncased + ElasticSearch Retriever (ES)

In a Retriever-Reader pipeline, when a query arrives at the Haystack framework, the Retriever finds relevant paragraphs within the collection of documents. The parameter top-k Retriever specifies how many paragraphs are picked and passed to the Reader. The Reader then selects text spans from this set of paragraphs as answers. The number of top-ranked answers returned by the Reader is given by the parameter top-k Reader. Based on these top-k parameters, one can count the answers missed by the Retriever and by the Reader as an evaluation criterion. The five Reader-Retriever combinations listed above were selected because they missed the fewest answers; the remaining combinations are not included in the rest of the evaluation study.
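The missed-answer count can be sketched as follows. This is an assumption about how such a count might be computed, not the evaluation code used in the study: the function and its two callables are hypothetical, a Retriever miss is taken to mean that no retrieved passage contains the gold answer, and a Reader miss means the gold answer is absent from the returned answers. The stand-in Retriever and Reader are toys.

```python
def count_missed(examples, retrieve, read, top_k_retriever=10, top_k_reader=5):
    """Count questions missed by the Retriever and by the Reader.

    examples: list of (question, gold_answer) pairs.
    retrieve(question, top_k) -> list of passages.
    read(question, passages, top_k) -> list of answer strings.
    """
    missed_by_retriever = 0
    missed_by_reader = 0
    for question, gold_answer in examples:
        passages = retrieve(question, top_k_retriever)
        if not any(gold_answer in passage for passage in passages):
            # The Reader never gets a chance if the passage is not retrieved.
            missed_by_retriever += 1
            continue
        answers = read(question, passages, top_k_reader)
        if gold_answer not in answers:
            missed_by_reader += 1
    return missed_by_retriever, missed_by_reader


# Toy stand-ins so the sketch runs without any model or search engine.
corpus = ["alpha is first", "beta is second", "gamma is third"]


def toy_retrieve(question, top_k):
    # Returns the corpus in a fixed order, truncated to top_k.
    return corpus[:top_k]


def toy_read(question, passages, top_k):
    # Proposes the first word of each passage as a candidate answer.
    return [p.split()[0] for p in passages][:top_k]


examples = [("q1", "alpha"), ("q2", "beta"), ("q3", "gamma")]
print(count_missed(examples, toy_retrieve, toy_read, 3, 3))
print(count_missed(examples, toy_retrieve, toy_read, 2, 1))
```

With the toy data, lowering both top-k values makes one question fail at the Retriever and another at the Reader, which is the kind of effect the top-k study below measures.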

Evaluation Study

Evaluation criteria

In the evaluation process, the primary metric is Answer Missed-Out, defined as the number of questions that could not be mapped to their corresponding answer during evaluation. Next comes Accuracy, which checks whether the predicted answer's text span is included in the actual answer text; Exact Match, which checks whether the predicted answer is identical to the actual answer; and lastly the F1 score, which measures the average token overlap between the predicted and actual answers.
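As a rough illustration of these metrics, the sketch below follows the normalization commonly used for SQuAD-style scoring (lowercasing, stripping punctuation and English articles). The function names are hypothetical, and answer_contained is only one possible reading of the Accuracy definition above; Haystack's own evaluation code may differ in detail.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized prediction equals the normalized truth."""
    return float(normalize(prediction) == normalize(truth))


def answer_contained(prediction, truth):
    """True if the predicted span appears, as whole words, inside the truth."""
    return f" {normalize(prediction)} " in f" {normalize(truth)} "


def f1_score(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up prediction and reference answer.
prediction = "restart the board"
truth = "Restart the board now."
print(exact_match(prediction, truth))
print(answer_contained(prediction, truth))
print(f1_score(prediction, truth))
```

In this example the prediction is a correct sub-span of the reference, so it scores zero on Exact Match while still earning partial F1 credit. This is why Exact Match is the stricter criterion for long answers.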

Evaluation results

Below is a summary of the evaluation of the 5 pre-trained models with the ElasticSearch Retriever. They were evaluated on the separate set of 522 question-answer pairs, with top-k Retriever equal to 10 and top-k Reader equal to 5. The table compares the different Reader-Retriever combinations in the context of ALE troubleshooting Question Answering.

Table 1: Evaluation summary for Fine-Tuned Reader Models with ElasticSearch Retriever

From Table 1, we can deduce that the combination of BERT Large Uncased with Elasticsearch performs best among its counterparts in both top-1 and top-k evaluations, especially in terms of Exact Match. In the ALE dataset, most of the answers are rather long. Because we aim to provide an excellent troubleshooting experience, we chose Exact Match as the most important evaluation metric and used it to benchmark the Retriever-Reader combinations.

Taking the BERT Large Uncased and Elasticsearch pair, let's see how the evaluation performance changes as the top-k parameters vary.

Table 2: Effect of different top-k parameters on answer-miss out by BERT Large Uncased + ElasticSearch Retriever

From Table 2, we can observe in the first two rows that reducing top-k Reader from 5 to 3 while keeping top-k Retriever at 10 increases the number of missed answers. If we decrease top-k Retriever from 10 to 5 and top-k Reader from 5 to 3, a significant number of answers are missed by both the Reader and the Retriever. This directly impacts the evaluation metrics presented in Table 3. The number of candidate passages that the Retriever passes to the Reader also affects the evaluation metrics, as shown in Table 3.

Table 3: Evaluation summary for BERT Large Uncased + ElasticSearch Retriever w.r.t different top-k parameters

From Table 3, we can see that BERT Large Uncased with Elasticsearch, with top-k Retriever equal to 10 and top-k Reader equal to 5, has the edge over smaller top-k settings in terms of evaluation metrics, but this configuration comes at the cost of high computation time. On constrained infrastructure, the top-k parameters can be decreased for medium-sized or smaller pre-trained models such as RoBERTa-base-squad2 and miniLM, but the resulting evaluation metrics may be unconvincing. Therefore, we need to trade off evaluation quality against computation time depending on our infrastructure.
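To weigh this trade-off on a given machine, one could time the pipeline for each top-k setting. The sketch below is a generic timing harness under stated assumptions: run_query is a hypothetical callable wrapping whatever pipeline is in use, and the stand-in pipeline here does no real work, so its timings are meaningless beyond showing the call pattern.

```python
import itertools
import time


def time_grid(run_query, queries, retriever_ks, reader_ks):
    """Measure the mean seconds per query for every top-k combination.

    run_query(query, top_k_retriever, top_k_reader) is called once per
    query and setting; its return value is ignored.
    """
    timings = {}
    for k_ret, k_read in itertools.product(retriever_ks, reader_ks):
        start = time.perf_counter()
        for query in queries:
            run_query(query, k_ret, k_read)
        elapsed = time.perf_counter() - start
        timings[(k_ret, k_read)] = elapsed / len(queries)
    return timings


calls = []


def fake_pipeline(query, top_k_retriever, top_k_reader):
    # Stand-in for a real Retriever-Reader pipeline call.
    calls.append((query, top_k_retriever, top_k_reader))


timings = time_grid(fake_pipeline, ["q1", "q2"], [5, 10], [3, 5])
for (k_ret, k_read), seconds in sorted(timings.items()):
    print(f"top-k Retriever={k_ret}, top-k Reader={k_read}: {seconds:.6f} s/query")
```

Pairing these timings with the evaluation metrics for the same settings would show which configuration fits a given latency budget.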

Conclusion

A virtual assistant chatbot based on Open Domain Question Answering with the Haystack framework is a powerful solution that enables professionals to seek technical help on ALE's Omni PCX Enterprise business website. The high point of our experimentation came when our paper was published in KES International Journal in September 2021 (https://www.sciencedirect.com/science/article/pii/S1877050921015854).

The efficiency of ODQA with the Haystack framework depends on the types of Retriever and Reader employed, on fine-tuning with data that follows the question-framing guidelines, and on the top-k parameters for both the Reader and the Retriever.

Acknowledgment