Vector-Based Semantic Search using Elasticsearch

Sharanya Shenoy · Version 1 · Jun 4, 2020

Semantic search, a form of search commonly used in search engines, serves content to users by understanding the intent and meaning behind their search query. It is a step ahead of traditional text and keyword-match search, which does not account for lexical variants or conceptual matches to the user’s search phrase: if the exact combination of words in the user’s query is not present in the content, irrelevant results are returned. Semantic search also lets users ask questions in natural language rather than with exact keywords.

Semantic Search is primarily based on two concepts:

  • Search intent of the user: This means understanding why the user has asked a particular query. The intent could be anything from wanting to learn, find, or buy something. If the intent is understood well, search engines can provide the most relevant results to the user.
  • Relationship between the words in the search phrase: It is important to understand the meaning of the search phrase as a whole rather than of its individual words. This means understanding the relationship between those words, and thus displaying results that are conceptually similar to the user’s query.

Use-cases of Semantic Similarity Search:

  • Question-answering system: Given a collection of frequently asked questions, it can find stored questions that mean the same as the user’s new query and return the answers associated with those similar questions.
  • Document content search: Imagine a scenario where an organization has several documents and a user wishes to find the answer to a question in those documents. It is much quicker to locate the right document by calculating the similarity between the user’s question and the content of each document than by reading through irrelevant documents before reaching the right one.

There are many different approaches to implementing semantic search. The NLP community has given us a capability called text embeddings. Text embedding is a technique for converting words and sentences into fixed-size dense numeric vectors; in short, unstructured text can be converted to vectors. These vectors capture the semantics of the text, i.e. its contextual meaning, which can be used to measure the similarity between the user’s query and the indexed content. If the embeddings of two texts are similar, the two texts are semantically similar. These vectors can be indexed in Elasticsearch to perform semantic similarity searches.
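To make “similar vectors” concrete, here is a minimal sketch (not part of the original workflow) of how the similarity between two embedding vectors is typically measured with cosine similarity; the tiny 2-D vectors below are illustrative only:

import numpy as np

# Cosine similarity: close to 1.0 means the vectors (and hence texts) are similar.
def cosine_similarity(vector_a, vector_b):
    a, b = np.asarray(vector_a), np.asarray(vector_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative vectors; real text embeddings would be e.g. 512-dimensional.
print(cosine_similarity([0.1, 0.9], [0.2, 0.8]))  # near 1.0: semantically close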

Many techniques are available in Python today to convert text to vectors, such as bag-of-words, Latent Dirichlet Allocation (LDA), n-gram embeddings, Doc2Vec, etc. In this article, we will use an open-source pre-trained model called the Universal Sentence Encoder to easily convert text to vectors.

What is Universal Sentence Encoder?

The Universal Sentence Encoder converts text into dense numeric vectors that can be used for NLP tasks. The model is publicly available on TensorFlow Hub. It takes English text of any length (sentences, phrases, or paragraphs) as input and outputs a 512-dimensional vector. The model was trained on several data sources to perform a variety of NLP tasks. Some important applications of the Universal Sentence Encoder are:

  • Text embedding that can be used in the pre-processing stage of any NLP-based machine/deep learning project.
  • To detect similar paragraphs, sentences, etc.
  • To identify clusters of semantically similar text.

Let’s now see how to utilize this model for performing text embeddings. Ensure you have Python installed, along with TensorFlow and TensorFlow Hub. (The snippets below use TF1-style APIs, so an installation along the lines of pip install "tensorflow<2.0" tensorflow-hub is assumed.)

1. Import the essential libraries.

import tensorflow as tf
import tensorflow_hub as hub

2. Download the model to your local system. The model is approximately 1 GB in size, so depending on your internet connectivity, it can take time to download. It is therefore advisable to download the model only once and reuse it for as many embeddings as you wish.

graph = tf.Graph()
with graph.as_default():
    print("Downloading pre-trained embeddings from tensorflow hub...")
    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    text_ph = tf.placeholder(tf.string)
    embeddings = embed(text_ph)
    print("Done.")

print("Creating tensorflow session...")
# Create the session against the graph that holds the module, then
# initialize its variables and lookup tables.
session = tf.Session(graph=graph)
session.run(tf.global_variables_initializer())
session.run(tf.tables_initializer())
print("Done.")


3. Define a function to use the model.

# Function to convert text to vectors
def embed_text(text):
    vectors = session.run(embeddings, feed_dict={text_ph: text})
    return [vector.tolist() for vector in vectors]

4. Call the above function and you are good to go.

text = "Oranges have a lot of Vitamin C."
text_vector = embed_text([text])[0]
print("Text to be embedded: {}".format(text))
print("Embedding size: {}".format(len(text_vector)))
print("Obtained Embedding[{},...]\n".format(text_vector[:5]))

This is how you can use the Universal Sentence Encoder model to obtain text embeddings. Let us now see how these embeddings can be integrated with Elasticsearch. Elasticsearch 7.3+ supports a data type called dense_vector, over which metrics such as cosine similarity and Euclidean distance can be calculated using a script_score query. A dense_vector field stores the text embeddings, i.e. the numeric vectors. These vectors can be indexed in Elasticsearch, and the similarity between the user’s query vector and each indexed content vector can then be computed. We will use Elasticsearch’s cosineSimilarity function to score the documents.

The steps to achieve this are:

  1. Have a set of documents ready. Obtain the text embeddings of these documents using the Universal Sentence Encoder as explained above.
  2. Index these embeddings into Elasticsearch. Please refer to my previous article to know more about Elasticsearch setup and indexing.

To index the vectors, it is important to define the mapping for the index, as shown below.

{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "_source": {"enabled": "true"},
    "properties": {
      "Document_name": {
        "type": "text"
      },
      "Doc_vector": {
        "type": "dense_vector",
        "dims": 512
      }
    }
  }
}

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, mappings define the data types of the fields that will be stored in the index. If we don’t provide any mapping, Elasticsearch automatically detects the data types of the fields and creates the index. However, for dense_vector, it is important to explicitly provide the mapping with the dimensions. Here, we have set the vector dimension to 512, as the Universal Sentence Encoder outputs 512-dimensional vectors.

When indexing a Doc_vector field, Elasticsearch checks that it has the same number of dimensions as specified in the mapping. Here, we have indexed all these vectors in an index called “documents”, as sketched below.
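As a rough sketch of this indexing step (assumptions: a local Elasticsearch instance, the mapping above stored in a variable called index_mapping, and an illustrative docs list), indexing with the official Python client could look like this:

from elasticsearch import Elasticsearch

# Assumed local Elasticsearch instance; adjust the host as needed.
ESclient = Elasticsearch("http://localhost:9200")

# Create the "documents" index using the dense_vector mapping shown above.
ESclient.indices.create(index="documents", body=index_mapping)

# Illustrative documents: embed each text and index the name and vector together.
docs = [
    {"name": "fruit_facts.txt", "text": "Oranges have a lot of Vitamin C."},
    {"name": "exercise_tips.txt", "text": "Regular jogging improves stamina."},
]
for i, doc in enumerate(docs):
    ESclient.index(index="documents", id=i, body={
        "Document_name": doc["name"],
        "Doc_vector": embed_text([doc["text"]])[0],
    })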

Once all the vectors are indexed, hitting the URL http://localhost:9200/documents/_search?pretty in your browser should return the indexed vectors.

3. Take a user query and convert it to a vector. Obtain the User_Query_Vector using the same Universal Sentence Encoder approach explained above, for example:
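Reusing the embed_text helper defined earlier (the query text here is only an illustration):

user_query = "Which fruit is rich in Vitamin C?"
User_Query_Vector = embed_text([user_query])[0]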

4. Calculate the Cosine similarity between the User_Query_Vector and the Doc_vector indexed in Elasticsearch as below.

script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            # cosineSimilarity ranges from -1 to 1, and script_score must not
            # return negative scores, so 1.0 is added to keep scores positive.
            "source": "cosineSimilarity(params.query_vector, doc['Doc_vector']) + 1.0",
            "params": {"query_vector": User_Query_Vector}
        }
    }
}

response = ESclient.search(
    index=INDEX_NAME,
    body={
        "size": 10,
        "query": script_query,
        "_source": {"includes": ["Document_name"]}
    }
)

The above search returns the list of documents in decreasing order of similarity score, with the most similar ones at the top.
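For example, the matches can be read from the response using the standard Elasticsearch response shape:

# Print each matching document's name alongside its similarity score.
for hit in response["hits"]["hits"]:
    print(hit["_source"]["Document_name"], hit["_score"])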

This was a simple example of using Elasticsearch for performing semantic similarity search.

Thanks for reading this article.

If you have any feedback, please let me know in the comments or get in touch on LinkedIn.

About the Author

Sharanya Shenoy is an associate consultant at Version 1 who has been working in the Innovation Labs since March 2019, innovating with several disruptive technologies. A postgraduate in Data Science, Sharanya focuses mainly on machine learning and AI.

