NLP: Semantic Search Using Elasticsearch and Embeddings

Sarang Mete
3 min read · Nov 14, 2022


Photo by Markus Winkler on Unsplash

Semantic Search

Semantic search matches on the meaning of a query rather than on its syntax or exact keywords.

Elasticsearch

Elasticsearch is a distributed, open-source search engine built in Java. Recently, it added support for dense_vector fields, i.e. embeddings. We are going to use this to build a semantic search engine. I've used its Python client.

Installation

You can download the archive, unzip it, and run the binary:

C:\elasticsearch-8.5.0\bin\elasticsearch

Open http://localhost:9200 in your browser. You should see a response like this:

{
  "name" : "LAPTOP-DC91R22O",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "m2IFZgIvTYOclqFe8pyGng",
  "version" : {
    "number" : "8.5.0",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "c94b4700cda13820dad5aa74fae6db185ca5c304",
    "build_date" : "2022-10-24T16:54:16.433628434Z",
    "build_snapshot" : false,
    "lucene_version" : "9.4.1",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}
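
The same check can be scripted. The sketch below only parses a response of the shape shown above and extracts the server version. The helper name cluster_version is hypothetical, and the 8.4 threshold is an assumption about when the knn option became available in the search API, so verify it against the Elasticsearch release notes.

```python
import json


def cluster_version(info: dict) -> tuple:
    """Return the server version as a tuple of ints, e.g. (8, 5, 0)."""
    number = info["version"]["number"]
    # Keep only the numeric core; ignore suffixes such as "-SNAPSHOT".
    core = number.split("-")[0]
    return tuple(int(part) for part in core.split("."))


# Trimmed copy of the response shown above.
sample = json.loads('{"name": "LAPTOP-DC91R22O", "version": {"number": "8.5.0"}}')

version = cluster_version(sample)
# Assumption: the knn search option used in this article needs 8.4 or newer.
supports_knn_option = version >= (8, 4)
print(version, supports_knn_option)
```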

Install the Python client

python -m pip install elasticsearch

Search Engine Flow

I've used news articles as the knowledge base to search against.

1. Create embeddings of the knowledge base (79 news articles) using a sentence transformer and store them in Elasticsearch. You can refer to my other repo for how to collect news articles.
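
The embedding part of step 1 could look roughly like the sketch below. The helper embed_paragraphs is hypothetical and is not taken from the project; the model is passed in as an argument, so any object with an encode method that returns one vector per input text will work.

```python
from typing import List, Sequence


def embed_paragraphs(model, paragraphs: Sequence[str]) -> List[List[float]]:
    """Encode each paragraph into a plain list of floats.

    `model` is any object with an `encode(list_of_str)` method that
    returns one vector per input, such as a SentenceTransformer instance.
    """
    vectors = model.encode(list(paragraphs))
    # Convert to built-in floats so the vectors serialise cleanly to JSON.
    return [[float(x) for x in vector] for vector in vectors]
```

With the sentence-transformers package, model could be SentenceTransformer("all-MiniLM-L6-v2"), which produces 384-dimensional vectors. That model choice is an assumption; the article does not name the model used.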

2. Store the knowledge base in Elasticsearch.

To do this, we first have to create an Elasticsearch index (roughly analogous to a table in an RDBMS).

settings = {
    "number_of_shards": 1,
}
mappings = {
    "properties": {
        "embeddings": {
            "type": "dense_vector",
            "dims": 384,  # must match the embedding model's output size
            "index": True,  # index the vectors so kNN search can use them
            "similarity": "cosine",
        },
        "paragraph": {"type": "text"},
        "url": {"type": "text"},
    }
}

Here, settings holds index-level configuration, such as how the data is distributed across shards.

mappings is the schema: a description of the fields and their types.

In our case, the index has three fields:

embeddings: the embedding vector of a news article, created with a sentence transformer. It has 384 dimensions and is compared using cosine similarity at search time.

paragraph: the actual news article text

url: the URL of the news article
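
To make the cosine comparison concrete, here is a minimal pure-Python sketch of the measure on toy three-dimensional vectors. It is for illustration only: Elasticsearch computes the similarity internally, and the function name and sample vectors are made up for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_similarity(query, same_direction))  # close to 1
print(cosine_similarity(query, orthogonal))      # 0.0
```

Because only the direction of the vectors matters, a long article and a short query can still score as similar when they point the same way in embedding space.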

Using these settings and mappings, we'll create the index:

from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
es.indices.create(index='articles', settings=settings, mappings=mappings)
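
The article does not show how the documents are loaded into the index, so the following is only a sketch of one way to do it. The helper build_actions and the constant EMBEDDING_DIMS are hypothetical; each input article is assumed to be a dict with embeddings, paragraph and url keys matching the mapping.

```python
from typing import Dict, Iterable, List

EMBEDDING_DIMS = 384  # must match "dims" in the mapping


def build_actions(articles: Iterable[Dict], index: str = "articles") -> List[Dict]:
    """Turn article dicts into bulk-indexing actions for the given index."""
    actions = []
    for article in articles:
        vector = article["embeddings"]
        # Catch a mismatched embedding model before sending anything.
        if len(vector) != EMBEDDING_DIMS:
            raise ValueError(f"expected {EMBEDDING_DIMS} dims, got {len(vector)}")
        actions.append({
            "_index": index,
            "_source": {
                "embeddings": vector,
                "paragraph": article["paragraph"],
                "url": article["url"],
            },
        })
    return actions
```

The resulting list can be passed to elasticsearch.helpers.bulk(es, actions), where es is the client created above.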

3. When the user types a query, convert it to an embedding and compare it with the knowledge base vectors using cosine similarity.

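# get_embeddinngs is the author's helper that encodes text into a vector
token_vector = get_embeddinngs(query_text)
es_query = {
    "size": 5,  # number of hits to return
    "knn": {
        "field": "embeddings",
        "query_vector": token_vector,
        "k": 10,  # nearest neighbors to retrieve
        "num_candidates": 100,  # candidates considered per shard
    },
}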

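This query asks Elasticsearch for the top 5 results from the knowledge base. The kNN search uses the cosine similarity defined earlier, at index creation, to find the articles most similar to the input query: it considers 100 candidates per shard (num_candidates), keeps the 10 nearest neighbors (k), and returns the best 5 (size).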
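
Running the query and reading the results could be sketched as follows. The function search_articles is hypothetical, and it assumes an 8.x Python client recent enough to accept knn as a keyword argument to search; the client is passed in as es rather than created here.

```python
from typing import Dict, List


def search_articles(es, query_vector: List[float], index: str = "articles",
                    size: int = 5, k: int = 10,
                    num_candidates: int = 100) -> List[Dict]:
    """Run a kNN search and return score, url and paragraph for each hit."""
    response = es.search(
        index=index,
        size=size,
        knn={
            "field": "embeddings",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
    )
    # Each hit carries its similarity score and the stored document.
    return [
        {
            "score": hit["_score"],
            "url": hit["_source"]["url"],
            "paragraph": hit["_source"]["paragraph"],
        }
        for hit in response["hits"]["hits"]
    ]
```

Keeping the client as a parameter makes the function easy to try against a stub object before pointing it at a real cluster.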

Image by Author

I've created a complete end-to-end project for semantic search. You can refer to it here. I've used news articles as the search space; you can easily replace them with any knowledge base that fits your requirements, such as project documents, enterprise documents, or web data.

Please note that this is just a demo. A production-ready search engine can use many more advanced Elasticsearch features, which are beyond the scope of this article.

If you liked the article or have any suggestions/comments, please share them below!

Let’s connect and discuss on LinkedIn
