Enhancing FAQ Search Engines: Harnessing the Power of KNN in Elasticsearch

Satish Silveri · Published in Nerd For Tech · Oct 20, 2023 · 8 min read

In an era where quick and accurate information retrieval is paramount, developing robust search engines is of utmost importance. With the advent of Large Language Models and information-retrieval architectures like RAG, leveraging text representations (vectors/embeddings) and vector databases in modern software systems has gained much popularity. In this article, we delve into the details of how to combine Elasticsearch’s k-nearest neighbours (KNN) search with text embeddings from powerful language models, a potent combination that promises to revolutionize the way we access frequently asked questions (FAQs). Through a comprehensive exploration of Elasticsearch’s KNN capabilities, we’ll uncover how this integration enables us to create a cutting-edge FAQ search engine that enhances the user experience by understanding the semantic context of queries with lightning-fast latency.

Before we begin designing our solution, let’s understand a few basic concepts in information retrieval systems.

Text Representation (Embeddings)

One of the clearest descriptions of an embedding comes from this article:

An embedding is a numerical representation of a piece of information, for example, text, documents, images, audio, etc. The representation captures the semantic meaning of what is being embedded, making it robust for many industry applications.

Semantic Search

Traditional search systems use lexical matching to retrieve documents for a given query. Semantic search aims to understand the context of the query using text representations (embeddings) to improve search accuracy.

Types of Semantic Search

Symmetric Semantic Search: A search use case where the query and the search text are of similar length, e.g. finding similar questions in the dataset.

Asymmetric Semantic Search: A search use case where the query and the search text are of different lengths, e.g. finding relevant passage(s) for a given query.

Vector Search Engine (Vector Database)

source: http://surl.li/lccjl

Vector search engines are specialized databases that can be used to store unstructured information such as images, text, audio or video as embeddings or vectors. We will be using Elasticsearch’s vector search capabilities for this article.

Now that we understand the search system's building blocks, let’s dive into the solution architecture and implementation.

Solution Architecture

source: http://surl.li/lccjl
  1. The first step for the search solution is to index the question-answer pairs into Elasticsearch. We will create one index and store both question and answer embeddings in it. We will use two separate models for embedding questions and answers, chosen based on the retrieval characteristics of each field.
  2. At query time, we will embed the user's query with the same models used in step 1 and form a search query with three parts (a lexical match, a KNN search against the question embeddings, and a KNN search against the answer embeddings), mapping the query embeddings to their respective question and answer embedding fields.
  3. We will also provide a boost value for each part of the query to denote its importance in the combination. The final results are ranked by the sum of the individual scores, each multiplied by its respective boost value (see the short sketch after this list).
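
To make the ranking concrete, here is a minimal illustration of the weighted-sum logic described in step 3. Elasticsearch computes this combination internally when a request contains both a query and knn clauses; the boost values below are the ones used later in the article, and the scores are made-up numbers purely for illustration.

def combined_score(bm25_score, question_knn_score, answer_knn_score,
                   bm25_boost=0.2, question_boost=0.3, answer_boost=0.5):
    # each sub-query contributes its score multiplied by its boost
    return (bm25_boost * bm25_score
            + question_boost * question_knn_score
            + answer_boost * answer_knn_score)

# e.g. a document with a BM25 score of 1.2 and KNN similarities of 0.82 / 0.77
print(combined_score(1.2, 0.82, 0.77))  # 0.871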

Environment Setup

  1. To install Elasticsearch using Docker, refer to this detailed article on how to set up a single-node cluster. If you already have a cluster, skip this step.
  2. Set up your index. You can use the following mapping as a starting point.
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "question": { "type": "text" },
      "answer": { "type": "text" },
      "question_emb": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "dot_product"
      },
      "answer_emb": {
        "type": "dense_vector",
        "dims": 1024,
        "index": true,
        "similarity": "dot_product"
      }
    }
  }
}
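
As a minimal sketch, assuming the Elasticsearch Python client (`es`) instantiated in the Implementation section below, the index can be created with this mapping like so:

index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "question": {"type": "text"},
            "answer": {"type": "text"},
            "question_emb": {"type": "dense_vector", "dims": 768,
                             "index": True, "similarity": "dot_product"},
            "answer_emb": {"type": "dense_vector", "dims": 1024,
                           "index": True, "similarity": "dot_product"},
        }
    },
}

# "faq-index" is the index name used throughout the rest of the article
es.indices.create(index="faq-index", body=index_body)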

Model Selection

As we are dealing with data written in fairly common language, for the sake of this experiment I have selected top-performing models from the MTEB leaderboard: one from the Retrieval section (for answers) and one from the STS section (for questions).

Selected Models:

  1. For Answers: BAAI/bge-large-en-v1.5 (you can use the quantized version for faster inference)
  2. For Questions: thenlper/gte-base

If you have domain-specific FAQs and want to check which model performs best, you can use BEIR. Check out the section that describes how to load a custom dataset for evaluation.

I have written an article on how to choose the right SBERT model for your data using the BEIR tool. Check it out!

Implementation

For the purpose of this experiment, I am going to use a Mental Health FAQ dataset from Kaggle.

  1. Load the dataset
import pandas as pd
data = pd.read_csv('Mental_Health_FAQ.csv')

2. Generate Embeddings

Questions:

from sentence_transformers import SentenceTransformer
question_emb_model = SentenceTransformer('thenlper/gte-base')

data['question_emb'] = data['Questions'].apply(lambda x: question_emb_model.encode(x, normalize_embeddings=True))

Note:

We normalize the embeddings so that we can use the dot product as the similarity measure instead of cosine similarity. This computation is faster and is recommended in the Elasticsearch dense_vector field documentation.
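
As a quick standalone sanity check (not part of the indexing pipeline), the dot product of unit-normalized vectors is exactly their cosine similarity:

import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 1.0, 2.0])

# unit-normalize, which is what normalize_embeddings=True does
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cosine, 6), round(np.dot(a_n, b_n), 6))  # 0.888889 0.888889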

Answers:

answer_emb_model = SentenceTransformer('BAAI/bge-large-en-v1.5')

data['answer_emb'] = data['Answers'].apply(lambda x: answer_emb_model.encode(x, normalize_embeddings=True))

3. Index Documents

We are going to use the Elasticsearch helper functions. Specifically, we are going to use the streaming_bulk helper to index our documents.

First, let’s instantiate the Elasticsearch Python client.

from elasticsearch import Elasticsearch
from ssl import create_default_context

context = create_default_context(cafile=r"path\to\certs\http_ca.crt")

es = Elasticsearch(
    'https://localhost:9200',
    http_auth=('elastic', 'elastic_generated_password'),
    ssl_context=context,
)

Next, we need to create a document generator that can be fed into the streaming bulk API.

index_name = "faq-index"

def generate_docs():
    for index, row in data.iterrows():
        doc = {
            "_index": index_name,
            "_source": {
                "faq_id": row['Question_ID'],
                "question": row['Questions'],
                "answer": row['Answers'],
                "question_emb": row['question_emb'],
                "answer_emb": row['answer_emb']
            },
        }

        yield doc

Finally, we can index the documents.

import tqdm
from elasticsearch.helpers import streaming_bulk

number_of_docs = len(data)
progress = tqdm.tqdm(unit="docs", total=number_of_docs)
successes = 0

for ok, action in streaming_bulk(client=es, index=index_name, actions=generate_docs()):
    progress.update(1)
    successes += ok

print("Indexed %d/%d documents" % (successes, number_of_docs))

4. Query Documents

def faq_search(query="", k=10, num_candidates=10):
    if not query:
        print('Query cannot be empty')
        return None
    else:
        query_question_emb = question_emb_model.encode(query, normalize_embeddings=True)

        # the bge model expects this instruction to be prepended to the search query
        instruction = "Represent this sentence for searching relevant passages: "
        query_answer_emb = answer_emb_model.encode(instruction + query, normalize_embeddings=True)

        payload = {
            "query": {
                "match": {
                    "question": {
                        "query": query,
                        "boost": 0.2
                    }
                }
            },
            "knn": [
                {
                    "field": "question_emb",
                    "query_vector": query_question_emb,
                    "k": k,
                    "num_candidates": num_candidates,
                    "boost": 0.3
                },
                {
                    "field": "answer_emb",
                    "query_vector": query_answer_emb,
                    "k": k,
                    "num_candidates": num_candidates,
                    "boost": 0.5
                }
            ],
            "size": 10,
            "_source": ["faq_id", "question", "answer"]
        }

        response = es.search(index=index_name, body=payload)['hits']['hits']

        return response
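
As a quick usage example (the query string here is just an illustration):

results = faq_search(query="What does it mean to have a mental illness?")
for hit in results:
    print(hit['_score'], hit['_source']['question'])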

Note:

As instructed on the model page, we need to prepend the instruction to the query before converting it to an embedding. Also, we are using v1.5 of the model as it has a better similarity distribution. Check the FAQ on the model page for more details.

Evaluation

To understand whether the proposed methodology works, it is important to evaluate it against a traditional KNN search system. Let's define both systems and evaluate the proposed one.

System 1: Asymmetric KNN search (Query and Answer vectors).

System 2: Combination of Query(BM25), Asymmetric KNN search(Query and Answer vectors) and Symmetric KNN search(Query and Question vectors).

To evaluate the system, we must mimic how a user would consume the search. In simpler words, we need to generate paraphrased questions that are similar in complexity to the source questions. We will use the fine-tuned t5-small-finetuned-quora-for-paraphrasing model to paraphrase the questions.

Let’s define a function that can generate paraphrased questions.

from transformers import AutoModelWithLMHead, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-small-finetuned-quora-for-paraphrasing")
model = AutoModelWithLMHead.from_pretrained("mrm8488/t5-small-finetuned-quora-for-paraphrasing")

def paraphrase(question, number_of_questions=3, max_length=128):
    input_ids = tokenizer.encode(question, return_tensors="pt", add_special_tokens=True)

    generated_ids = model.generate(
        input_ids=input_ids,
        num_return_sequences=number_of_questions,
        num_beams=5,
        max_length=max_length,
        no_repeat_ngram_size=2,
        repetition_penalty=3.5,
        length_penalty=1.0,
        early_stopping=True
    )

    preds = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in generated_ids]

    return preds

Now that we have our paraphrase function ready, let’s create an evaluation dataset that we will use to measure systems’ accuracy.

temp_data = data[['Question_ID', 'Questions']]

eval_data = []

for index, row in temp_data.iterrows():
    preds = paraphrase("paraphrase: {}".format(row['Questions']))

    for pred in preds:
        temp = {}
        temp['Question'] = pred
        temp['FAQ_ID'] = row['Question_ID']
        eval_data.append(temp)

eval_data = pd.DataFrame(eval_data)

# shuffle the evaluation dataset
eval_data = eval_data.sample(frac=1).reset_index(drop=True)

Finally, we will modify the `faq_search` function so that it returns only the top hit's faq_id for each system.

For System 1:

def get_faq_id_s1(query="", k=5, num_candidates=10):
    if not query:
        print('Query cannot be empty')
        return None
    else:
        instruction = "Represent this sentence for searching relevant passages: "
        query_answer_emb = answer_emb_model.encode(instruction + query, normalize_embeddings=True)

        payload = {
            "knn": [
                {
                    "field": "answer_emb",
                    "query_vector": query_answer_emb,
                    "k": k,
                    "num_candidates": num_candidates,
                }
            ],
            "size": 1,
            "_source": ["faq_id"]
        }

        response = es.search(index=index_name, body=payload)['hits']['hits']

        return response[0]['_source']['faq_id']

For System 2:

def get_faq_id_s2(query="", k=5, num_candidates=10):
    if not query:
        print('Query cannot be empty')
        return None
    else:
        query_question_emb = question_emb_model.encode(query, normalize_embeddings=True)

        instruction = "Represent this sentence for searching relevant passages: "
        query_answer_emb = answer_emb_model.encode(instruction + query, normalize_embeddings=True)

        payload = {
            "query": {
                "match": {
                    "question": {
                        "query": query,
                        "boost": 0.2
                    }
                }
            },
            "knn": [
                {
                    "field": "question_emb",
                    "query_vector": query_question_emb,
                    "k": k,
                    "num_candidates": num_candidates,
                    "boost": 0.3
                },
                {
                    "field": "answer_emb",
                    "query_vector": query_answer_emb,
                    "k": k,
                    "num_candidates": num_candidates,
                    "boost": 0.5
                }
            ],
            "size": 1,
            "_source": ["faq_id"]
        }

        response = es.search(index=index_name, body=payload)['hits']['hits']

        return response[0]['_source']['faq_id']

Note:

The boost values are experimental. For the sake of this experiment, I have assigned them based on the importance of each field in the combination. The importance of each field in the search is ultimately subjective and may be defined by the business itself; if not, a general rule of thumb for this system is: answer vector > question vector > lexical query.

Okay! We are all set to begin our evaluation. We will generate a prediction column for both systems and compare it with the original faq_id.

eval_data['PRED_FAQ_ID_S1'] = eval_data['Question'].apply(get_faq_id_s1)

from sklearn.metrics import accuracy_score

ground_truth = eval_data["FAQ_ID"].values
predictions_s1 = eval_data["PRED_FAQ_ID_S1"].values

s1_accuracy = accuracy_score(ground_truth, predictions_s1)

print('System 1 Accuracy: {}'.format(s1_accuracy))
# System 1 Accuracy: 0.7312925170068028

eval_data['PRED_FAQ_ID_S2'] = eval_data['Question'].apply(get_faq_id_s2)

predictions_s2 = eval_data["PRED_FAQ_ID_S2"].values

s2_accuracy = accuracy_score(ground_truth, predictions_s2)

print('System 2 Accuracy: {}'.format(s2_accuracy))
# System 2 Accuracy: 0.8401360544217688

With the proposed system, we see an accuracy improvement of roughly 7-11 percentage points (about 73% to 84% in this run) compared to the plain asymmetric KNN search.

I have also experimented with ramsrigouthamg/t5_paraphraser, but the questions generated by that model are a bit more complex and verbose (though still in context).

You can also use an LLM to generate the evaluation dataset and check how the system performs.

The magnitude of the accuracy gain depends on the quality of the queries (i.e. how contextually rich they are), the quality of the embeddings, and/or the type of users consuming the search. To understand this better, let’s consider two kinds of end users:

  1. Generic users who want to understand some facts about your products and services: In this case, the above system would do a good job, as the questions are simple, intuitive and carry sufficient context.
  2. Domain- or product-specific users, e.g. engineers who want to understand intricate details about a product in order to set up a system or resolve an issue: In this case, the queries are more domain-specific in their lexical composition, and hence out-of-the-box model embeddings would not capture all the context. So, how can we solve this problem? The system's architecture would remain the same, but the overall accuracy of the search could improve by fine-tuning these models on domain-specific data (or by using pre-trained domain-specific models).

Conclusion

In this article, we proposed and implemented an FAQ search that combines several search types. We looked at how Elasticsearch enabled us to combine lexical, symmetric and asymmetric semantic search, which improved the search system's accuracy by up to 11 percentage points. We also touched on the system and resource requirements for the proposed architecture, which would be a major deciding factor when considering adopting this approach.

Thank you so much for reading! Do reach out to me if you want to discuss some specifics or comment if you think I have missed out on something!

You can find the source notebook in my GitHub repository.

References:

  1. https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html#_search_multiple_knn_fields
  2. https://www.sbert.net/examples/applications/semantic-search/README.html
  3. https://huggingface.co/blog/getting-started-with-embeddings
  4. https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
  5. https://www.elastic.co/guide/en/elasticsearch/reference/8.8/dense-vector.html
  6. https://www.kaggle.com/datasets/narendrageek/mental-health-faq-for-chatbot
  7. https://huggingface.co/spaces/mteb/leaderboard
  8. https://huggingface.co/BAAI/bge-large-en-v1.5
  9. https://huggingface.co/thenlper/gte-base
  10. https://huggingface.co/mrm8488/t5-small-finetuned-quora-for-paraphrasing
