Published in Analytics Vidhya

Leaping into Semantic/Neural Search with ElasticSearch, Faiss using Haystack

In this tutorial, we’re gonna implement a rudimentary semantic search engine using Haystack. We’ll use Elasticsearch and FAISS (Facebook AI Similarity Search) as DocumentStores.

Photo by Gozha Net on Unsplash

Below are the segments I’m gonna talk about:

  1. Intro to Semantic Search & Terminologies
  2. Implementation nit&grit
  • Environment Setup
  • Dataset preparation
  • Indexing & Searching

Intro to Semantic Search & Terminologies

In recent times, with advances in NLP (natural language processing) and the availability of vast computing power (GPUs, TPUs, etc.), semantic search is making its place in the search industry. In contrast to lexical or syntactic search, semantic/neural search focuses on the intent and semantics of the query. The crux of semantic search is representing your queries and documents as n-dimensional vectors (embeddings) using a neural network, either pretrained or trained on your custom data.
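To make the embedding idea concrete, here is a minimal sketch in plain Python. The 3-dim vectors and document names are made up for illustration (a real model produces vectors with hundreds of dimensions), and ranking by cosine similarity is just one common choice:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-written for this example only.
doc_embeddings = {
    "banana bread": [0.9, 0.1, 0.0],
    "pineapple cake": [0.8, 0.3, 0.1],
    "motor oil": [0.0, 0.1, 0.9],
}
query_embedding = [0.7, 0.4, 0.1]  # pretend this encodes the user's query

# Rank documents by how closely their vector points in the query's direction.
ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
    reverse=True,
)
print(ranked)
```

The unrelated document lands at the bottom because its vector points in a different direction, even though no keyword matching is involved.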

  • Haystack: Haystack is an open-source framework for building end-to-end question-answering systems for large document collections. You can read more about it here.
  • FAISS: FAISS is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. If you want to read in-depth about it, I suggest you read this amazing blog.

Implementation nit&grit

Now we’ll go into the technical implementation details. If you’re more interested in the code, you can skip this part and jump directly into the Colab notebook.

  • Environment Setup: I’ve used a Google Colab notebook (GPU runtime) because creating embeddings is computationally expensive. First, install the required libs:
!pip install git+https://github.com/deepset-ai/haystack.git
# OR
!pip install farm-haystack
!pip install sentence-transformers

The first package installs the Haystack Python library, and the second one, sentence-transformers, is what we’ll use to create embeddings. Sentence Transformers is very handy because it provides various pretrained transformer-based models for embedding a sentence or document. To check out these models (use-case wise), click here.

  • Dataset preparation: For this setup, I’ve downloaded the FooDB dataset. FooDB is the world’s largest and most comprehensive resource on food constituents, chemistry, and biology, and it is offered to the public as a freely available resource. I’ve used only Content.json. Below is the dataframe I got after processing:
dataframe.head()

We’ll use only three columns in indexing, i.e. code, url, and product_name. Haystack provides a handy method to index a List[Dict], so I’ve converted the above dataframe to the format below (as mentioned in the Haystack docs):

Sample format for Haystack indexing
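The conversion to List[Dict] could look roughly like the sketch below. The function name rows_to_haystack_docs and the sample rows are hypothetical, and the 'text'/'meta' layout is an assumption based on the Haystack docs of that time, so check it against the version you have installed:

```python
def rows_to_haystack_docs(rows):
    """Convert row dicts (code, url, product_name) to an assumed Haystack format.

    Assumption: each document is {'text': ..., 'meta': {...}}, where 'text'
    is the field that gets embedded and 'meta' carries everything else.
    """
    docs = []
    for row in rows:
        name = row.get("product_name")
        if not name:  # skip rows with nothing to embed
            continue
        docs.append({
            "text": name,
            "meta": {"code": row["code"], "url": row["url"]},
        })
    return docs

# Hypothetical rows; with pandas you could get them via dataframe.to_dict("records").
sample_rows = [
    {"code": "001", "url": "https://example.com/001", "product_name": "Banana cake"},
    {"code": "002", "url": "https://example.com/002", "product_name": ""},
    {"code": "003", "url": "https://example.com/003", "product_name": "Pineapple juice"},
]

food_data_to_upload = rows_to_haystack_docs(sample_rows)
print(food_data_to_upload)
```

Keeping url in the metadata is what later lets a search result return both the product name and its link.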
  • Indexing & Searching: Haystack provides the three building blocks for indexing and searching:

a. DocumentStore: The database in which you want to store your data. Haystack supports different kinds of databases like Elasticsearch, FAISS, InMemory, SQL, etc. For now, we’ll load the data into both Elasticsearch and FAISS and compare them later.

#FAISS Indexing
from haystack.document_store.faiss import FAISSDocumentStore
document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat", return_embedding=True)
#ES Indexing
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
document_store_es = ElasticsearchDocumentStore(host="localhost", index="food_haystack_embedding", similarity='cosine')

Note that for FAISS indexing the similarity metric is currently ‘dot_product’, whereas for ES ‘cosine’ similarity is available. The FAISS index takes various args for optimization that you can play around with. In my case, I’m sticking to ‘Flat’ indexing because my dataset isn’t large enough to need anything else.
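To see why the choice of metric matters, here is a small plain-Python sketch of an exhaustive ("flat") search, which simply scores every stored vector against the query. This illustrates the idea only, not how FAISS implements it, and the vectors are made up. It also shows that the dot product on L2-normalized vectors gives the cosine similarity:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def flat_search(query, vectors, top_k=2):
    """Exhaustive search: score every stored vector, return (index, score) pairs."""
    scored = [(dot(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

# Hypothetical 2-dim vectors; index 0 is much longer than the others.
vectors = [[10.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
query = [2.0, 1.0]

# Raw dot product rewards vector length, so the long vector wins.
raw_hits = flat_search(query, vectors)

# After normalization, the dot product equals cosine similarity,
# and the vector pointing closest to the query's direction wins.
cosine_hits = flat_search(l2_normalize(query), [l2_normalize(v) for v in vectors])

print("dot product:", raw_hits)
print("cosine:", cosine_hits)
```

So the two stores can rank the same documents differently unless the embeddings are normalized before indexing.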

b. Retriever: A filter for extracting potential candidates for the query. Currently, Haystack supports BM25, TF-IDF, Embedding, and DPR (Dense Passage Retrieval) retrievers. We’re using the Embedding Retriever. For creating embeddings, we’re using the distilroberta-base-msmarco-v2 model (pretrained on the Microsoft MS MARCO dataset).

#Retriever import (path assumed to follow the same module layout as the document store imports above)
from haystack.retriever.dense import EmbeddingRetriever
#FAISS Retriever initialization
retriever_faiss = EmbeddingRetriever(document_store_faiss, embedding_model='distilroberta-base-msmarco-v2', model_format='sentence_transformers')
#ES Retriever initialization
retriever_es = EmbeddingRetriever(document_store_es, embedding_model='distilroberta-base-msmarco-v2', model_format='sentence_transformers')
#Running the process for indexing
# Delete existing documents in documents store
document_store_faiss.delete_all_documents()
# Write documents to document store
document_store_faiss.write_documents(food_data_to_upload)
# Add documents embeddings to index (using the FAISS retriever defined above)
document_store_faiss.update_embeddings(retriever=retriever_faiss)

So first, the write_documents() method indexes all the data. Then the update_embeddings() method creates an embedding of each doc (doc[‘text’]) and stores it in place alongside the corresponding document. To create the embeddings, it uses the model you specified in the retriever initialization, i.e. distilroberta-base-msmarco-v2 here. Haystack also processes documents in batches during bulk indexing.
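Batch processing here just means embedding documents in fixed-size chunks instead of one by one or all at once. The sketch below shows the general pattern in plain Python. It is not Haystack’s internal code, and iter_batches and fake_embed are hypothetical names used only for illustration:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts):
    """Hypothetical stand-in for a model call: one tiny 'vector' per text."""
    return [[float(len(text))] for text in texts]

docs = [f"product {i}" for i in range(10)]

embeddings = []
batch_sizes = []
for batch in iter_batches(docs, batch_size=4):
    batch_sizes.append(len(batch))
    embeddings.extend(fake_embed(batch))  # one model call per batch

print(batch_sizes, len(embeddings))
```

On a GPU, a larger batch usually means fewer, better-utilized model calls, at the cost of more memory per call.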

%time taken in indexing

#docs: 338,487

FAISS: 386.73 sec

ElasticSearch: 2329.32 sec

Note: As you can see, FAISS indexing is approximately 6x faster than ES.
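The ratio follows directly from the timings above. A quick check, which also derives indexing throughput (docs per second) from the same numbers:

```python
n_docs = 338_487
faiss_seconds = 386.73
es_seconds = 2329.32

speedup = es_seconds / faiss_seconds   # how many times faster FAISS indexing was
faiss_rate = n_docs / faiss_seconds    # documents indexed per second with FAISS
es_rate = n_docs / es_seconds          # documents indexed per second with ES

print(f"FAISS speedup over ES: {speedup:.2f}x")
print(f"FAISS: {faiss_rate:.0f} docs/sec, ES: {es_rate:.0f} docs/sec")
```

Keep in mind these timings come from a single run on one setup, so treat the ratio as indicative rather than a benchmark.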

c. Reader: Though we’re not using this component in our task, it is a core component of the QA systems provided by Haystack. It takes the output of the Retriever (potential candidates) and tries to give you the best match for your query. It harnesses the power of transformer-based language models to pick the best candidate.

d. Results: Let’s dive in to see how our neural search is performing 😋

q = 'pipneapple banana cake'
print('-'*100)
print('FAISS')
print(get_search_result(q, retriever_faiss))
print('-'*100)
print('ElasticSearch Dense')
print(get_search_result(q, retriever_es))
print('-'*100)
#return the (text, url) tuple.
Output
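The get_search_result helper isn’t shown above, so here is one way it might look. This is a guess, not the original code: it assumes the retriever exposes retrieve(query=..., top_k=...) and that each returned document has .text and .meta attributes, which may differ between Haystack versions. StubRetriever is a hypothetical stand-in so the sketch runs without Haystack installed:

```python
from types import SimpleNamespace

def get_search_result(query, retriever, top_k=5):
    """Return a list of (text, url) tuples for the top_k candidates."""
    candidates = retriever.retrieve(query=query, top_k=top_k)
    return [(doc.text, doc.meta.get("url")) for doc in candidates]

class StubRetriever:
    """Hypothetical retriever that returns canned documents in a fixed order."""
    def __init__(self, docs):
        self._docs = docs

    def retrieve(self, query, top_k=10):
        return self._docs[:top_k]

# Made-up documents that mimic the assumed .text / .meta shape.
stub = StubRetriever([
    SimpleNamespace(text="Pineapple banana cake", meta={"url": "https://example.com/1"}),
    SimpleNamespace(text="Banana bread", meta={"url": "https://example.com/2"}),
])

results = get_search_result("pineapple banana cake", stub, top_k=1)
print(results)
```

With the real retrievers, you would pass retriever_faiss or retriever_es in place of the stub.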

Conclusion

So that’s it. You’ve leveraged a big transformer-based deep neural network to build a neural search engine with just a few lines of code. This is possible because of the abstraction provided by the Haystack lib (Deepset.ai). I found it easy to use and very handy to integrate into a data science workflow. It also provides a Pipeline method with which you can plug and chain these components (storing, updating, retrieving) smoothly.

I haven’t talked about inference time or accuracy in this blog, but I’ll be covering these things in the next part. Stay tuned.

Thanks for the read. Anywho, if you have any feedback, I’m all ears. Below are my social media links:

Instagram || LinkedIn || Twitter

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Prateek Yadav

NLP Engineer @LexisNexis India || www.impyadav.com
