Semantic Embeddings for Question Answering in MindMeld

Bring the power of state-of-the-art models like BERT to answer questions in your conversational interfaces

Arushi Raghuvanshi
MindMeld Blog
5 min read · Sep 21, 2020


Image Credit: V. Konovalov

There are many ways of saying the same thing. In some cases, language variation may be reordered words, pluralization, misspellings, or some minor alterations to the text. In other cases, the text looks completely different but means the same thing.

Consider the following examples:

  • "spaghetti marinara" vs "pasta with tomato sauce"
  • "attendees" vs "everyone in the meeting"
  • "There are many ways of saying the same thing." vs "You can provide identical information in multiple different ways."

Until recently, the MindMeld question answering capabilities relied purely on text-based retrieval. You could capture language variation by explicitly uploading synonyms, but that requires data collection resources; without them, the system can appear unintelligent.

Deep learning-based dense embeddings (character, word, or sentence) can capture semantic information. Pre-trained or fine-tuned embeddings can be used to find the best match in the knowledge base even if the exact search token isn’t present in the uploaded data.

In this post, we describe how you can now leverage these semantic embeddings for question answering in MindMeld.

Preparing your knowledge base

To leverage semantic embeddings in search, the first step is to generate the embeddings for your desired knowledge base fields. You can use one of the provided embedders or use your own. If your app mainly consists of standard English vocabulary, one of the provided embedders may work well enough. But if the text you are searching against has quite a bit of domain-specific vocabulary, you may benefit from training or fine-tuning your own embedder on your data.

The provided embedders are:

  1. Sentence transformers, which are pre-trained multilingual sentence embeddings using BERT, RoBERTa, XLM-RoBERTa, etc.
  2. GloVe, which generates global vectors for word representations

BERT-based models have skyrocketed in popularity and are now used to solve various NLP tasks. These models provide good context-dependent representations of word-level semantics, but they don’t produce very semantically meaningful sentence or paragraph level embeddings out of the box. The sentence transformer package fine-tunes BERT, RoBERTa, DistilBERT, ALBERT, and XLNet with a siamese or triplet network structure. It’s trained with the objective that two sentences with similar semantics should generate embeddings that are close in cosine similarity. We found that by using these embeddings and cosine similarity scoring for ranking, we achieved good results on our question answering datasets that outperformed the previous approach of scoring based on text-relevance only.
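The cosine-similarity scoring described above can be sketched in a few lines. The embeddings below are tiny, hand-picked toy vectors for illustration only; a real sentence-transformer model produces vectors with hundreds of dimensions, but the scoring works the same way:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "sentence embeddings" for illustration only.
emb_query    = np.array([0.9, 0.1, 0.0, 0.4])
emb_match    = np.array([0.8, 0.2, 0.1, 0.5])  # semantically close to the query
emb_offtopic = np.array([0.0, 0.9, 0.8, 0.1])  # unrelated

# The semantically similar pair scores much higher.
assert cosine_similarity(emb_query, emb_match) > cosine_similarity(emb_query, emb_offtopic)
```

Because a well-trained sentence embedder places paraphrases near each other in vector space, ranking by this score surfaces semantically matching documents even when they share no tokens with the query.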

GloVe embeddings have remained a popular baseline for semantic word representations. For paragraph and sentence-level embeddings, we average the word vectors of the passage. While this performs reasonably well for smaller phrases or sentences, it breaks down for large documents. We recommend trying out GloVe embeddings only when searching against shorter words or phrases.
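The averaging scheme is straightforward: look up each in-vocabulary token's vector and take the element-wise mean. The three-dimensional vectors below are made up for illustration (real GloVe vectors have 50 to 300 dimensions):

```python
import numpy as np

# Toy GloVe-style word vectors for illustration only.
word_vectors = {
    "hard": np.array([0.1, 0.5, 0.2]),
    "copy": np.array([0.4, 0.1, 0.3]),
    "pay":  np.array([0.7, 0.2, 0.6]),
}

def phrase_embedding(phrase, vectors):
    """Average the word vectors of all in-vocabulary tokens in the phrase."""
    tokens = [vectors[t] for t in phrase.lower().split() if t in vectors]
    return np.mean(tokens, axis=0)

emb = phrase_embedding("hard copy", word_vectors)
# mean of [0.1, 0.5, 0.2] and [0.4, 0.1, 0.3] -> [0.25, 0.3, 0.25]
```

Averaging discards word order and dilutes the signal as the passage grows, which is why this approach degrades on long documents.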

If you would like to build your own embedder, you can import the abstract Embedder class from MindMeld and define the load and encode methods as follows:

from mindmeld.models import Embedder, register_embedder

class MyEmbedder(Embedder):
    def load(self, **kwargs):
        # return the loaded model
        ...

    def encode(self, text_list):
        # encode each query in the list
        ...

register_embedder('my_embedder_name', MyEmbedder)

Once you’ve decided which embedder you’d like to use, you can specify the details in your application’s config file. The following configuration is an example of using the default BERT embedder type and generating embeddings for the question and answer fields of the FAQ data index:

QUESTION_ANSWERER_CONFIG = {
    "model_type": "embedder",
    "model_settings": {
        "embedder_type": "bert",
        "embedding_fields": {
            "faq_data": ["question", "answer"]
        }
    }
}

To load the data into your knowledge base and prepare the optimized index, you simply have to run the load-kb command, providing your app name, index name, and the path to your data file:

mindmeld load-kb hr_assistant faq_data hr_faq_data.json --app-path .

Querying via the question answerer

Once your knowledge base has been created, MindMeld has a simple, easy-to-use API that allows you to query it. Behind the scenes, the query takes advantage of advanced information retrieval techniques to generate a ranked list of results that best match the search term.

Different query types are available, which are optimized for different use cases. The three query types that leverage the semantic embedding signal are embedder, embedder_keyword, and embedder_text, described in more detail in the table below. Using one of these query types will automatically find results for which the embedded search fields are close in cosine similarity to the embedded query.

Altogether, a query to search against questions in your index would look like this:

answers = qa.get(index=index_name, query_type='embedder',
                 question=query)

You can also search against multiple fields, and use a combination of embedder and text-based search:

answers = qa.get(index=index_name, query_type='embedder_text',
                 question=query,
                 answer=query)

Evaluation of the semantic embedder

We evaluated the semantic embedder on an internal dataset. Leveraging the embedding signal, in addition to text signals, improved the accuracy of the top returned result by 7.1%.

We also performed a qualitative evaluation on multiple datasets in different domains. Let’s consider some examples from a Human Resources Assistant. The code for this application can be downloaded via the command below, and the documentation can be found here.

mindmeld blueprint hr_assistant 

MindMeld’s question answering capabilities are heavily used in the FAQ functionality within the HR Assistant. In the following queries, we match the user’s text against the FAQ documents’ question fields in the knowledge base, using different query types. In the real application, we would display the answer field of the top document to the user. Here, we are printing the question field of the top document for easier comparison.

In the example below, the embedder signal provides an improvement over text signals. You can see that less frequent terms like "don't" may overpower the ranking when relying solely on text relevance, causing the system to return an off-topic question. With the embedder query type, the system is able to better leverage semantics and retrieve the correct document as the top result:

>>> answers = qa.get(index='faq_data', question="don't have my w2",
                     query_type='keyword')
>>> print(answers[0]['question'])
"What if employees meet goals, but don't do well in other responsibilities?"

>>> answers = qa.get(index='faq_data', question="don't have my w2",
                     query_type='embedder')
>>> print(answers[0]['question'])
'What if I did not receive a form W2?'

In the next example, we need a combination of text and embedder signals to get the correct response. With only text signals, it appears that the phrase "copy of" contributed to the mis-ranking. Using only the embedder signal, the result was related to the correct topic of "pay information," but it still wasn't the best match. A combination of text and embedder signals was able to retrieve the best result:

>>> answers = qa.get(index='faq_data',
                     question="hard copy of my pay statement",
                     query_type='keyword')
>>> print(answers[0]['question'])
'Should we charge the employee for a copy of the personnel file?'

>>> answers = qa.get(index='faq_data',
                     question="hard copy of my pay statement",
                     query_type='embedder')
>>> print(answers[0]['question'])
'How do I find out my pay information?'

>>> answers = qa.get(index='faq_data',
                     question="hard copy of my pay statement",
                     query_type='embedder_text')
>>> print(answers[0]['question'])
'How do I print my online Pay Statement?'

For more details, see our documentation. For more background on Question Answering for Conversational Interfaces see this post.

We welcome every active contribution to our platform. Check us out on GitHub, and send us any questions or suggestions at mindmeld@cisco.com.
