NLP: Semantic Search Using Elasticsearch and Embeddings
Semantic Search
Semantic search matches on the meaning of a query rather than its syntax or exact keywords.
Elasticsearch
Elasticsearch is a distributed, open-source search engine built in Java. Recent versions support dense_vector fields, i.e. embeddings. We are going to use this to build a semantic search engine. I’ve used their Python client.
Installation
Download the Elasticsearch archive, unzip it, and run the binary:
C:\elasticsearch-8.5.0\bin\elasticsearch
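Then open http://localhost:9200 in your browser; it should return JSON like the following. Note that Elasticsearch 8.x enables TLS and authentication by default, so plain http works only if security has been disabled (for example with xpack.security.enabled: false in config/elasticsearch.yml).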
{
"name" : "LAPTOP-DC91R22O",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "m2IFZgIvTYOclqFe8pyGng",
"version" : {
"number" : "8.5.0",
"build_flavor" : "default",
"build_type" : "zip",
"build_hash" : "c94b4700cda13820dad5aa74fae6db185ca5c304",
"build_date" : "2022-10-24T16:54:16.433628434Z",
"build_snapshot" : false,
"lucene_version" : "9.4.1",
"minimum_wire_compatibility_version" : "7.17.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "You Know, for Search"
}
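The same check can be scripted. Below is a minimal sketch that uses only the Python standard library; the function names are illustrative (not part of Elasticsearch), and it assumes the node answers over plain HTTP without authentication, as in this walkthrough.

```python
import json
from urllib.request import urlopen


def fetch_cluster_info(url="http://localhost:9200", timeout=5.0):
    """Fetch the root endpoint of the node and parse the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def describe(info):
    """Summarize the fields of the root response that matter for this article."""
    version = info["version"]
    return (
        f'{info["cluster_name"]} running Elasticsearch {version["number"]} '
        f'(Lucene {version["lucene_version"]})'
    )


# Usage, with a running local node:
#   print(describe(fetch_cluster_info()))
```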
Install the Python client
python -m pip install elasticsearch
Search Engine Flow
I’ve used news articles as the knowledge base to search against.
1. Create embeddings of the knowledge base (79 news articles) using a sentence transformer. See my other repo for how to collect the news articles.
2. Store the knowledge base in Elasticsearch.
For this, we first have to create an Elasticsearch index (roughly analogous to a table in an RDBMS).
settings = {
    "number_of_shards": 1,
}
mappings = {
    "properties": {
        "embeddings": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
        "paragraph": {"type": "text"},
        "url": {"type": "text"},
    }
}
Here, settings holds index-level configuration, such as how the data is distributed across shards.
mappings is the schema: it describes the fields of each document.
In our case, we have 3 fields (columns):
embeddings: the embedding vector of a news article, created with a sentence transformer (384 dimensions) and compared using cosine similarity at search time
paragraph: the actual news article text
url: the URL of the news article
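To make the "similarity": "cosine" setting concrete, here is a small pure-Python sketch of cosine similarity on toy vectors. The function names are illustrative. The second helper reflects my understanding of how Elasticsearch turns the cosine value into a non-negative _score for this similarity, (1 + cosine) / 2; treat that as an assumption and check the dense_vector documentation for your version.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def es_cosine_score(a, b):
    """Assumed Elasticsearch scoring for similarity=cosine: maps [-1, 1] to [0, 1]."""
    return (1 + cosine_similarity(a, b)) / 2
```

Vectors pointing in the same direction score 1 regardless of length, orthogonal vectors score 0, and opposite vectors score -1, which is why cosine similarity suits comparing sentence embeddings of different texts.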
Using these settings and mappings, we’ll create the index:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.indices.create(index='articles', settings=settings, mappings=mappings)
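With the index in place, steps 1 and 2 come down to embedding each article and writing it to the index. The sketch below is one way to do that, not necessarily how the linked project does it. Assumptions: each article is a dict with "paragraph" and "url" keys, and the names build_actions and index_articles are hypothetical. The article does not name the embedding model; all-MiniLM-L6-v2 appears in the usage comment only because it produces 384-dimensional vectors, matching dims in the mapping.

```python
INDEX_NAME = "articles"  # index name used in the article


def build_actions(articles, encode):
    """Yield one bulk-index action per article.

    `encode` is any callable mapping a text to a sequence of floats,
    e.g. the encode method of a SentenceTransformer model.
    """
    for article in articles:
        yield {
            "_index": INDEX_NAME,
            "_source": {
                # Convert to plain floats so numpy values serialize cleanly.
                "embeddings": [float(x) for x in encode(article["paragraph"])],
                "paragraph": article["paragraph"],
                "url": article["url"],
            },
        }


def index_articles(es, articles, encode):
    """Bulk-index the articles; returns (success_count, errors)."""
    from elasticsearch import helpers  # imported lazily; needs the Python client

    return helpers.bulk(es, build_actions(articles, encode))


# Usage (assumes `pip install sentence-transformers` and a running node):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, 384 dims
#   index_articles(es, articles, model.encode)
```

Passing the encoder in as a callable keeps the indexing code independent of the embedding library, so the model can be swapped without touching the Elasticsearch side.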
3. When the user types in a query, convert it to an embedding with the same model and compare it with the knowledge base vectors using cosine similarity.
token_vector = get_embeddinngs(query_text)
es_query = {
    "size": 5,
    "knn": {
        "field": "embeddings",
        "query_vector": token_vector,
        "k": 10,
        "num_candidates": 100,
    },
}
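In this Elasticsearch query, size asks for the top 5 results from the knowledge base. The knn clause runs an approximate k-nearest-neighbor search on the embeddings field: it examines num_candidates (100) candidate vectors per shard, keeps the k (10) nearest to the query vector, and ranks them by the cosine similarity defined earlier at index creation.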
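The article stops at building the query, so here is a hedged sketch of sending it and reading the results. It assumes a version 8.4 or later of the Python client, where es.search accepts knn and size keyword arguments; the helper names are hypothetical.

```python
def build_knn_clause(query_vector, k=10, num_candidates=100):
    """Build the knn clause with the same values as the query above."""
    return {
        "field": "embeddings",
        "query_vector": [float(x) for x in query_vector],
        "k": k,
        "num_candidates": num_candidates,
    }


def summarize_hits(response):
    """Reduce a search response to score, url and text per hit."""
    return [
        {
            "score": hit["_score"],
            "url": hit["_source"]["url"],
            "paragraph": hit["_source"]["paragraph"],
        }
        for hit in response["hits"]["hits"]
    ]


def search_articles(es, query_vector, size=5):
    """Run the kNN search against the 'articles' index and summarize the hits."""
    response = es.search(
        index="articles",
        knn=build_knn_clause(query_vector),
        size=size,
    )
    return summarize_hits(response)


# Usage, with the client and query embedding from the article:
#   for hit in search_articles(es, token_vector):
#       print(round(hit["score"], 3), hit["url"])
```

Hits come back sorted by _score in descending order, so the first entry is the article closest in meaning to the query.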
I’ve created a complete end-to-end project for semantic search. You can refer to it here. I’ve used news articles as the search space; you can easily replace them with any knowledge base that fits your requirements, such as project documents, enterprise documents, or web data.
Please note that this is just a demo. A production-ready search engine would use many of Elasticsearch’s more advanced features, which are beyond the scope of this article.
If you liked the article or have any suggestions/comments, please share them below!
Let’s connect and discuss on LinkedIn