Ranking Documents with Elasticsearch for Machine Learning
Do you want to build a chatbot that draws on an extensive knowledge base and quickly returns relevant documents? Or perhaps you want to search a database contextually? Let’s learn how to use the power of Elasticsearch 8.3 to get exactly what we want.
As you know, for NLP tasks we use word embeddings to convert text to vectors. Depending on the model used to generate them, these vectors capture the semantic or contextual meaning of the document.
Configuring Elasticsearch
For this tutorial, download the latest version of Elasticsearch for your OS from here.
Start Elasticsearch using the executable, or the .bat file on Windows. For local experimentation, you can turn off security in the Elasticsearch config so the client does not have to verify certificates.
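Assuming a default install (and only for a local sandbox, never production), one way to do this is to set the following in config/elasticsearch.yml:

xpack.security.enabled: false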
Installing Required Libraries
To generate word embeddings, we are going to use the sentence-transformers library. Let’s install the required libraries in your virtual environment:
pip install sentence-transformers
pip install elasticsearch
To learn about the Elasticsearch client for Python, check out their documentation.
Time to Code
Now that our required libraries are installed, import them:
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch
Initialize both our Elasticsearch client and our embedding generator:
sentence_transformer = SentenceTransformer("all-mpnet-base-v2")
es_client = Elasticsearch("http://localhost:9200", verify_certs=False, request_timeout=60)
By default, Elasticsearch runs locally on port 9200, so that is what we connect to.
First of all, we have to define an index, which is similar to creating a schema. We set the embedding dimension to 768, as that is the size of the vectors all-mpnet-base-v2 outputs.
# The index name is up to you; EMBEDDING_DIMS must match the model's output size
INDEX_NAME = "documents"
EMBEDDING_DIMS = 768

# Delete the index if it already exists (ignore the 404 if it does not)
es_client.options(ignore_status=404).indices.delete(index=INDEX_NAME)

# Create the index with a dense_vector field for the embedding
# and a text field for the raw document (ignore the 400 if it already exists)
es_client.options(ignore_status=400).indices.create(
    index=INDEX_NAME,
    mappings={
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": EMBEDDING_DIMS,
            },
            "document": {
                "type": "text",
            },
        },
    },
)
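If you want to sanity-check that the mapping was applied, you can read it back with the client’s get_mapping helper:

print(es_client.indices.get_mapping(index=INDEX_NAME))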
Now, we can add data to our Elasticsearch index:
# Encode one document (a string) into a 768-dimensional vector;
# .tolist() converts the NumPy array into a JSON-serializable list
embedding = sentence_transformer.encode(doc).tolist()

data = {
    "document": doc,
    "embedding": embedding,
}

es_client.options(max_retries=0).index(index=INDEX_NAME, document=data)
In the above snippet, we convert our doc (one sample data point) into an embedding, then index that embedding along with the raw document in Elasticsearch. You can set INDEX_NAME to any string value. Run the above code in a loop to add all your data points to Elasticsearch, as sketched below.
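As a minimal sketch of that loop (the docs list below is a hypothetical stand-in for your own corpus):

# Hypothetical example corpus; replace with your own documents
docs = [
    "You can reset your password from the account settings page.",
    "Refunds are processed within 5 business days.",
    "Contact support via the in-app chat for billing issues.",
]

for doc in docs:
    embedding = sentence_transformer.encode(doc).tolist()
    es_client.options(max_retries=0).index(
        index=INDEX_NAME,
        document={"document": doc, "embedding": embedding},
    )

# Refresh so the newly indexed documents are immediately searchable
es_client.indices.refresh(index=INDEX_NAME)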
Here comes the crux of all the hard work we have done so far: the search. We query the index with a natural-language query (the query string below is just an example value):
# The user's natural-language query (example value)
query = "How do I reset my password?"

# Encode the query with the same model used at indexing time
embedding = sentence_transformer.encode(query).tolist()

es_result = es_client.search(
    index=INDEX_NAME,
    size=3,
    from_=0,
    source=["document"],
    query={
        "script_score": {
            # match selects candidate documents using BM25 relevance
            "query": {
                "match": {
                    "document": query,
                }
            },
            # the script rescores candidates by cosine similarity;
            # adding 1 keeps the score non-negative, as script_score requires
            "script": {
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {
                    "query_vector": embedding,
                },
            },
        }
    },
)
We create an embedding of our query, then compare it against the stored document embeddings to find the most contextually similar ones, using the cosineSimilarity function built into Elasticsearch. The inner match query applies Elasticsearch’s BM25 scoring, which takes into account how relevant the query is to the document text, and the script then rescores those candidates by cosine similarity. Make sure to use the same index name that we used when adding the data. The value size=3 means we get the top 3 results for our query. The results land in the es_result variable, from which you can access your desired values, as shown below.
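For instance, a minimal way to print the hits:

# Each hit carries its relevance score and the stored document text
for hit in es_result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["document"])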
Where can we use it?
Take FAQs, for example: a user asks a question in their own wording, and we want to return the top document (the answer, in this case) that resolves their query. This technique is a good fit for quickly getting that result.
That’s all folks!
The possibilities of this technique are endless. You can retrieve the n top-ranked documents for your query and use them for various problems such as QA, FAQs, chatbots, and whatnot. Let me know down in the comments which problem you are tackling with this technique.