Question Answering help desk applications and related tasks (part I)

Stefan Minica
Published in DevJam · 13 min read · Aug 21, 2021

Question Answering (QA) is an interesting natural language processing (henceforth, NLP) task. It is equally interesting from both a theoretical and a practical perspective. From a theoretical point of view (POV), it is interesting because it opens a wide area of interaction and cross-pollination between symbolic and numeric approaches to artificial intelligence (henceforth, AI). From a practical POV, it is interesting because of the many applications and use-cases that have a transformative potential to provide business value in a variety of domains, ranging from education and e-learning to business automation, health care, robotics, human-machine interaction, general AI, and many others.

In this blog, we will attempt to give a high-level overview of the QA-NLP task while providing pointers to (minimalistic) proof of concept (POC) practical applications. The main business case we will consider is that of a help desk app that will make information available to customers in an intuitive flow, starting from the semantic content of the users’ epistemic intents. We will also consider several related NLP tasks along the way.

This blog series comprises three parts and is structured as follows:

  • Part 1 presents the first stage of a QA system, namely, sparse information retrieval. In this stage, a (set of) relevant document(s) or smaller passage(s) is (are) retrieved from a larger body of knowledge.
  • Part 2 presents the second stage of a QA system, called dense paragraph reading (DPR). This step uses deep learning (DL) models based on attention mechanisms; we will focus on the transformers architecture.
  • Part 3 explores combinations between statistical, transformers-based, approaches to QA and symbolic, graph-based, approaches, which augment representational encodings with knowledge and relational information.

Without further ado, let us start with a simple example.

A simple starting example question

Let us consider a simple example question. Suppose that you want to find out: How many languages does Bert understand? One way a human would do this is by opening the Wikipedia article about the topic and reading the text in order to find the needed piece of information. Another way is to just google it. Let us start with the first (we will come back to the second). A human would probably achieve this by opening a browser and reading through the article while keeping the initial question in mind. It can be useful, at this stage, to try to find the answer on your own before going further.

Programmatically, we can do this in two steps. First, we can retrieve the text of the Wikipedia article, using the following snippet:

Get the text from the Wikipedia article about the BERT model
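A minimal sketch of how this could be done with the wikipedia Python package (the exact page title below is an assumption and may need adjusting):

import wikipedia

# Fetch the plain-text content of the Wikipedia article about the BERT model.
article = wikipedia.page("BERT (language model)").content
print(article[:500])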

Second, once we have the text article, we can search for a shorter span inside it that contains the answer:

Get the span size answer to the question inside the Wikipedia text
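One way this step could look, as a minimal sketch that calls the hosted inference endpoint discussed later in this post (the API token environment variable is a placeholder you would need to provide):

import os
import requests

API_URL = "https://api-inference.huggingface.co/models/deepset/roberta-base-squad2"
headers = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}  # placeholder token variable

payload = {
    "inputs": {
        "question": "How many languages does Bert understand?",
        "context": article,  # the Wikipedia text retrieved above
    }
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())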

At this stage, it can be useful to compare the answer you found by reading the article and what we obtained programmatically by running the QA-NLP task:

{'score': 0.0672, 'start': 2456, 'end': 2463, 'answer': 'over 70'}

When we run a QA task with our starting question and the Wikipedia article as inputs, the output answer is over 70 (the output also contains an answer score and the span’s position; we will return to the output arguments and their meaning later, in part two). The passage from which the answer span is extracted is:

On December 9, 2019, it was reported that BERT had been adopted by Google Search for over 70 languages.

Let us now also consider the second option; googling the same question gives us the following result (as of June 11, 2021):

Search result answer for the example question using the Google search interface

The answer is different: 104 languages. If we inspect the featured snippet, we can see that the passage inside which the answer is highlighted is also different. It references a blog on this very platform (Medium) that quotes the results of a research paper from arXiv. Both of these can be found online and are included in the Git repo for reference (see the src/corpus folder).

Let us try to understand why the answers are different.

Before going to the next section, it may be interesting to jump to the companion demo app and experiment with various combinations of text articles and questions to get a feeling for how things look at a very concrete level. Inside the app, you can edit both the paragraph of text that constitutes the body of knowledge and the specific question you want answered, to match your own topics of interest. We will return to this companion app for more examples and illustrations as we progress through the coming sections.

Information retrieval or syntax-based sparse search

Solving a QA task is, generally, done in two stages. In the first step, a (list of) document(s) is retrieved from a corpus of general background knowledge.

The second step is a dense paragraph search, which identifies the shorter, span-size, answer for the question (we will return to analyze this second step in detail later, in part two).

Let us focus on the first step: retrieving relevant documents or passages from a corpus containing a body of knowledge. This body of knowledge usually contains a very large number of documents; when googling, for example, this is a very large number of web pages, while in a concrete business application it will be a smaller, but still large, collection of documents containing help desk information and the like (in our examples, this number is kept minimally small, just to illustrate the main points). This sparse search stage can be performed using a syntactic or a semantic variant (or a combination of both). When a syntax-based method is used, key terms from the query are matched with terms contained in the documents to determine the best candidates for further processing. When a semantic-based method is used, the words are first encoded as vectors, and a measure of similarity in the resulting vector space is used to identify the documents that are most conceptually similar to the intent of the question. Syntax-based techniques are generally more efficient; semantic-based ones are more accurate. There is always a trade-off between these two aspects when deciding which approach to use.

Several syntax-based metrics can be used to decide which document is the most similar to the question. Term frequency inverse document frequency (henceforth, TFIDF) is a frequently used metric for this purpose. Formally, it is defined by:

Definition of term frequency inverse document frequency (TFIDF) metrics
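For reference, one standard formulation (among the many variants discussed below) reads:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D), \qquad \mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}

where t is a term, d a document, D the collection of N documents, tf(t, d) counts the occurrences of t in d, and the idf denominator counts the documents in which t appears.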

There are many variations in defining what the first term (TF) of the definition means, and, depending on the concrete application context, there are pros and cons for each of them. A frequently used quantity is the raw count of the term’s occurrences, and alternatives include various ways of normalizing the raw count. The second term (IDF) also has many smoothed and weighted variations, which fit better or worse in various application contexts. Intuitively, the combined TFIDF value favours documents in which terms related to the query appear frequently, while down-weighting terms that occur in many documents across the entire collection; documents that score low on this measure are less likely to contain relevant information for the question to be answered.

Many machine learning (ML) libraries provide functionality for computing TFIDF; here is how that can be done using sklearn:

TFIDF computation for similarity to the question using sklearn vectorizer
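A minimal sketch of this computation, assuming a tiny two-document corpus (the document variables below are placeholders for the article texts):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus: in our example, the Wikipedia and Medium article texts.
documents = [wiki_bert_text, medium_bert_text]
question = "How many languages does Bert understand?"

# Fit the TFIDF vocabulary on the documents and project the question into the same space.
vectorizer = TfidfVectorizer(stop_words="english")
document_vectors = vectorizer.fit_transform(documents)
question_vector = vectorizer.transform([question])

# Rank the documents by their similarity to the question.
scores = cosine_similarity(question_vector, document_vectors).flatten()
print(scores)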

Continuing with our example, we will apply the TFIDF measurement to compare the two text articles inside a (micro) corpus. In applications that go beyond this POC illustration, this step usually includes setting up a large corpus of documents for the knowledge base, indexing the terms and/or their vectorized representations, assigning additional weight to the terms present in the query, iterating the stages of relevance filtering using different measures, and various other optimizations.

To illustrate this for our simple example, we will use a data layer for the knowledge corpus managed by an Elasticsearch (ES) cluster. There are various providers of hosted ES services that can be easily integrated within an app architecture; Bonsai and Searchly are two popular examples, alongside all other major cloud providers. We will continue our POC illustration with a (micro)corpus that uses the first option (the other ones are very similar). Here are the main steps: sign up for Bonsai using the (free) hobby tier, pick a region close to your location, and choose a name for your cluster. This will provision a cluster with the community edition of Elasticsearch 7.10.2 (June 2021) and tier limits (125Mb data, 35k docs) that are more than enough for this POC. In your Bonsai cluster dashboard, go to Access > Credentials and copy the provided URL, then save it in your own .env file (see the .env.example file in the repo). Next, go to the Kibana console in your Bonsai dashboard (this will open in a new tab; choose explore my own when asked for data, and then open the Dev Tools section, under Management in the main burger menu). Running the PUT command in the first line below should create the index, and the JSON response that follows confirms it:

PUT bert

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "bert"
}

Now we can populate the Elasticsearch index with the documents in our mini-corpus:

Populate the Elasticsearch Bonsai corpus with four sample documents about Bert
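A minimal sketch of this step, using the official elasticsearch Python client (the environment variable name and the document field names are assumptions, and apart from wiki_bert the document names are placeholders):

import os
from elasticsearch import Elasticsearch, helpers

# Connect to the Bonsai cluster using the URL saved in the .env file;
# the environment variable name is an assumption.
es = Elasticsearch(os.environ["BONSAI_ELASTICSEARCH_URL"])

# The four sample documents about Bert.
documents = [
    {"name": "wiki_bert", "text": wiki_bert_text},
    {"name": "medium_bert", "text": medium_bert_text},
    {"name": "arxiv_bert", "text": arxiv_bert_text},
    {"name": "extra_bert", "text": extra_bert_text},
]

actions = [{"_index": "bert", "_source": doc} for doc in documents]

print("Populating the corpus ...")
success, errors = helpers.bulk(es, actions)
print(f"Elasticsearch SUCCESS: {(success, errors)}")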

If everything goes well and the index is successfully populated, you should see a message similar to the following one, confirming that you now have an index with four documents inside:

Populating the corpus ...
Elasticsearch SUCCESS: (4, [])

You can also confirm the existence of the bert index by running the following command in the Kibana console (available from the Bonsai dashboard):

GET /_cat/indices/

green open .kibana_1 Z9gs-cXCRYu05DhZ1N2oig 1 1 30 10 129.1kb 64.5kb
green open bert      Xtd399dmTq2d0II6ZEHLOw 1 1  4  0  28.4kb 14.2kb

The output will also indicate that there are four documents in the index (the seventh column is the number of docs).

We are now ready to check which passage is the most similar to our starting question, using a more_like_this Elasticsearch query type, as follows:

Elasticsearch more like this query for question similarity within the documents corpus
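A minimal sketch of such a query, reusing the Python client from the previous snippet (the text and name field names are assumptions about the index mapping):

# Ask Elasticsearch which documents are most similar to the question text.
mlt_query = {
    "query": {
        "more_like_this": {
            "fields": ["text"],
            "like": "How many languages does Bert understand?",
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }
}

response = es.search(index="bert", body=mlt_query)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["name"])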

At this stage, it might be a useful exercise to experiment with the newly created index and try to come up with a query for which the wiki_bert document is the most relevant. The companion demo app also provides a GUI for this purpose. Once you have found such a query, you can also explore what happens when a new document is added to the corpus, and how its relation to the terms in the query, and their meaning, influences its relative relevance score. You will also find such an example in the companion app, but it is probably even more important to customize it using your own topic(s) of epistemic interest.

Semantic retrieval or vector similarity search

In our use case of a help desk app that can provide answers to users’ questions, it is often the case that the query formulated by the customer does not contain the exact keywords present in the documents that make up the knowledge base. More often, users will formulate a query that expresses an intent emerging from a concrete problem they face; it contains a semantically relevant utterance, but not necessarily the exact keywords. The knowledge base may contain technical keywords, while the users’ questions are formulated mainly in non-technical natural language. In such cases, semantic search can outperform syntax-based solutions, because answer relevance is based not on keyword occurrence but on a similarity metric inside a vector-space semantic representation.

The first step toward a semantic solution is to enrich the documents with an embedding field, which will contain a vectorized representation of the text content. This vector will then be used to compute a similarity metric between the documents and the embedding of the question. In the following snippet, the Universal Sentence Encoder is imported from tf-hub and used to vectorize the documents.

Using the tf-hub universal sentence encoder to add vector embeddings to documents
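A minimal sketch of this step, reusing the documents list from the earlier snippets:

import tensorflow_hub as hub

# Load the Universal Sentence Encoder from tf-hub (it produces 512-dimensional embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Vectorize the document texts and the question.
doc_texts = [doc["text"] for doc in documents]
document_embeddings = embed(doc_texts).numpy()  # shape: (n_docs, 512)
question_embedding = embed(["How many languages does Bert understand?"]).numpy()[0]

# Store each vector in the corresponding document's embedding field.
for doc, vector in zip(documents, document_embeddings):
    doc["embedding"] = vector.tolist()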

Once the documents’ vector embeddings have been added to the corpus, their semantic similarity with the query can capture relevance beyond the simple presence of the same syntactic key terms. The level of granularity at which the vector similarity is applied can differ, ranging from word meanings to the span, sentence, or even paragraph level, and these choices will produce various measures of relevance, depending on the concrete application needs. Formally, the cosine similarity of two vectors is defined at a general level by:

Definition of semantic similarity using cosine metrics between vector embeddings
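For reference, the standard definition reads:

\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}

where u and v are, for example, the embedding vectors of the query and of a document.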

Many ML libraries contain functionality for computing cosine similarity. The following snippet illustrates how semantic relevance based on vector similarity can be computed with numpy functions:

Computing cosine similarity between query and document vectors using functions from the numpy library
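A minimal sketch, assuming the question and document embeddings computed above:

import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Rank the documents by their semantic similarity to the question.
scores = [cosine_similarity(question_embedding, vec) for vec in document_embeddings]
best_index = int(np.argmax(scores))
print(scores, documents[best_index]["name"])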

The enterprise edition of ES supports cosine-similarity script queries on dense_vector field types defined in the index mappings. However, this is not supported in the Open Distro for Elasticsearch (ODES), which is the version available on Bonsai (and also on Searchly and AWS). The community edition of ES supports similarity search based on the approximate k-nearest neighbours algorithm on knn_vector field types defined in the index mappings; nevertheless, this also requires a plugin installation. It is beyond the scope of this blog post to go into the details of these implementations; the following snippet only illustrates the approach, in a minimalistic setting. Useful pointers to further information are these blog posts and the ODES documentation.

Vector embeddings similarity search between question and documents using elasticsearch queries
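As a rough sketch only (the embedding field name is an assumption about the index mapping), the two query styles look like this:

# Cosine-similarity script query (enterprise ES, dense_vector field type).
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.question_vector, 'embedding') + 1.0",
            "params": {"question_vector": question_embedding.tolist()},
        },
    }
}
response = es.search(index="bert", body={"size": 2, "query": script_query})

# Approximate k-NN query (open distro ES, knn_vector field type, k-NN plugin installed).
knn_query = {
    "size": 2,
    "query": {"knn": {"embedding": {"vector": question_embedding.tolist(), "k": 2}}},
}
response = es.search(index="bert", body=knn_query)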

In order to run queries returning the k-nearest neighbours over the knn_vector type, a running Open Distro ES cluster is needed. This can be run locally in a Docker container, starting from the Open Distro ES Docker image, as follows:

docker pull amazon/opendistro-for-elasticsearch:1.13.2
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.13.2
ELASTICSEARCH_LOCAL_OPEN_SOURCE=https://admin:admin@0.0.0.0:9200
curl -XGET https://localhost:9200 -u 'admin:admin' --insecure

In order to run queries containing cosine similarity over the dense_vector type, a running enterprise-version ES cluster is needed. This can be run in a Docker container, starting from the enterprise ES Docker image, as follows:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.13.3
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.13.3
export ELASTICSEARCH_LOCAL_ENTERPRISE=0.0.0.0:9200

A useful open-source alternative that deserves to be mentioned here, and which provides built-in vector similarity search in the data layer, is Milvus DB. Running a Milvus server locally is also possible using Docker; however, this is not a one-liner, as it requires orchestrating three containers. The companion repo contains an example docker-compose.yml file; further details can be found in the official Milvus documentation. The following snippet illustrates, in a minimalistic setting, how question-corpus relevance can be computed using vector similarity inside a collection of document embeddings:

Computing question relevance in a minimalistic corpus using vector similarity inside a Milvus collection
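A rough sketch of such a flow, assuming a local Milvus 2.x server and the pymilvus client (the collection name, field names, and index parameters are all assumptions):

from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Connect to the local Milvus server started via docker-compose.
connections.connect(host="localhost", port="19530")

# A collection holding the 512-dimensional document embeddings; names are placeholders.
fields = [
    FieldSchema(name="doc_id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
]
collection = Collection("bert_corpus", CollectionSchema(fields))

# Insert the document embeddings and build a simple index for similarity search.
collection.insert([document_embeddings.tolist()])
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 16}},
)
collection.load()

# Search for the documents most similar to the question embedding.
results = collection.search(
    [question_embedding.tolist()],
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"nprobe": 8}},
    limit=2,
)
print(results)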

Another advantage of using embeddings and semantic search based on vector similarity is that this approach also applies in other applications, beyond the text-based format; for example, recommender systems for music use similarity between vector embeddings of audio files. This can also be generalized to search tasks that bridge multimodal file formats: text and images, or text and audio or video formats. It can even bridge natural and formal languages, for example, by connecting a natural language description with the most relevant code snippet inside a large corpus of code repositories, as GitHub Copilot does. These applications are also beyond the scope of this blog post; a very useful pointer for further information on working with multi-modal neural search pipelines is the jina.ai library.

Further steps: dense paragraph reading & infrastructure deployment options

Up until now, we have focused only on the first step in a QA task. The second step will perform a dense paragraph reading inside the most relevant document paragraph(s) to identify the shorter, span-sized, answer to the question. This stage can use various deep learning (DL) modelling architectures. A very successful approach to solving NLP tasks in general, and question answering in particular, with ML and DL is the modelling architecture based on transformers.

We have already used an inference endpoint based on this architecture in the second snippet of this blog:

api-inference.huggingface.co/models/deepset/roberta-base-squad2

If we look at the URL used, we can see that this was a roberta model trained with a squad2 dataset, developed by deepset, and hosted on huggingface infrastructure. This is a state-of-the-art (henceforth, SOTA) model; however, because it uses a complex DL architecture, it is not efficient in practice to run it on all document paragraphs inside a large corpus of background knowledge. Doing this for more than one document or paragraph is theoretically possible, and can be practically useful; the trade-off is that, from a pragmatic perspective, it would also make the waiting time for receiving answers longer. As we have seen in the minimalistic examples considered in this blog post, the accuracy of a transformer prediction depends on preselecting the most relevant document(s) or paragraph(s) before the dense paragraph reading stage is performed.

In the second part of this blog, we will go into detail regarding the prerequisites for such an inference endpoint for QA: pretraining a transformer model, fine-tuning a pretrained transformer model for the QA task, creating, using, and evaluating a QA-specific dataset, monitoring predictions and iterating to improve them, etc. Before that, let us briefly point out the resources needed to develop and deploy a complete QA solution:

  • a cluster to manage and host the knowledge corpus (e.g., ES on Bonsai, ES on Searchly, ES on AWS, hosted Milvus DB docker container, etc.)
  • a docker container hub to deploy and host the trained ML model for the task at hand (e.g., SageMaker JumpStart, Azure ML inference, etc.)
  • an API endpoint to host the business logic interfacing the deployed QA models (e.g., Qovery app, Deta micro, AWS Lambda, Azure functions, etc.)
  • a graphical, or voice, UI to allow the users to input queries (and other intents) and receive the output answers (e.g., Streamlit share, a web chatbot, an Alexa skill, etc.)
  • to make this architecture concrete, try the companion demo app in a browser here; the code repository of the demo app here; and the mini-corpus data and Bonsai cluster scripts here.
Graphical user interface for running a question answering task for a given paragraph and query

References

  • Jina AI is a great open-source information retrieval and multi-modal neural search deployment pipeline library.
  • Haystack is a great semantic QA open-source framework linked with a commercial offering for QA inference deployment hub.
  • Bonsai and Searchly are ES hosting providers that have generous free tiers.
  • Milvus DB is a great open source vector database with embedded vector similarity search.
  • Huggingface is a great open-source library for virtually any NLP task (including QA) using the transformers architecture. Huggingface-hub is a great resource for NLP models.
  • Rasa is a great open-source library for developing conversational agents and training related NLP models.
I work at https://sytac.io. We are a consulting company in the Netherlands, employing around 120 developers across the country at A-grade companies like KPN, KLM, ING, ABN-AMRO, Ahold Delhaize, and KPMG. We run this Medium space, DevJam; check it out and subscribe if you want to read more stories like this one. Alternatively, take a look at our job offers if you are seeking to bring your dev career to the next level 💪
