Semantic Search using NLP
A starter guide to building an intelligent search engine using semantic understanding of search queries
Introduction
Search engines have been with us for several decades as an integral part of our digital lives. We casually search across billions of web pages to retrieve and share information from many sources. While humans are very good at using conversational context and background knowledge to deal with the intrinsic ambiguity of words, the same is not true of search engines, especially when it comes to out-of-vocabulary searches.
The answer to this problem is semantic search. Using the latest insights from NLP research, it is possible to train a language model on a large corpus of documents. Afterwards, the model is able to represent documents based on their "semantic" content. In particular, this makes it possible to search for documents with semantically similar content.
Semantic search means understanding the intent behind the query and representing the “knowledge in a way suitable for meaningful retrieval,” according to Towards Data Science.
In this work, we will retrieve relevant movie titles using semantic search based on concepts from Natural Language Processing (NLP).
For those who want to jump directly to the code, here’s the link to my Kaggle Notebook.
Keyword Search vs. Semantic Search
At first, search engines were lexical: the engine looked for literal matches of the query words, without understanding the query's meaning, and only returned links containing the exact query. With plain keyword search, a document either contains the given word or it doesn't; there is no middle ground.
"Semantic search", on the other hand, can simplify query building because it is supported by automated natural language processing, e.g. Latent Semantic Indexing, a concept that search engines use to discover how a keyword and content work together to mean the same thing.
According to Wikipedia,
LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), is a technique in natural language processing for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.
LSI adds an important step to the document indexing process: it examines a collection of documents to see which of them contain some of the same words. LSI considers documents with many words in common to be semantically close, and those with fewer words in common to be less close.
In brief, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the keyword at all.
Load the data
We will now load the movies data CSV into a dataframe and take a quick peek at the columns and the data provided.
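Since the notebook's dataset isn't reproduced here, the snippet below sketches this step with a tiny hand-made DataFrame as a stand-in; the column names `title` and `wiki_plot` are assumptions based on the columns mentioned later in the article. In practice you would call `pd.read_csv` on the movies CSV.

```python
import pandas as pd

# Stand-in for `movies = pd.read_csv("movies.csv")`; the column names
# are assumptions based on the article, not the real dataset schema
movies = pd.DataFrame({
    "title": ["The Godfather", "Gandhi"],
    "wiki_plot": [
        "A mafia crime family is drawn into drugs and violence.",
        "The story of a non-violent protest march for independence.",
    ],
})
print(movies.shape)
print(movies.columns.tolist())
```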
Data Cleaning and Pre-processing
Data pre-processing is one of the most significant steps in text analytics. The purpose is to remove any unwanted words or characters that are written for human readability but won't contribute to topic modelling in any way.
The following function applies regular expressions to match patterns of unwanted text and remove or replace them.
Now let us apply the data-cleaning and pre-processing function to our movies "wiki_plot" column and store the cleaned, tokenized data in a new column.
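As a rough illustration (not the notebook's exact function), a minimal cleaning routine along these lines might look like this:

```python
import re

def clean_text(text):
    """Lowercase, strip bracketed references, drop everything except
    letters, then tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)   # remove [1]-style references
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return text.split()

print(clean_text("The Mafia's rise to power [1] in 1945!"))
```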
Building Word Dictionary
In the next step we will build the vocabulary of the corpus, in which all the unique words are given IDs and their frequency counts are stored. Note that we are using the gensim library to build the dictionary. In gensim, words are referred to as "tokens" and the index of each word in the dictionary is called its ID.
You can see that two additional steps are performed after creating the dictionary:
- All tokens that occur in fewer than four articles, or in more than 20% of the articles, are removed from the dictionary, as such words contribute little to the various themes or topics.
- Content-neutral words and additional stop-words are removed from the dictionary.
Feature Extraction (Bag of Words)
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modelling, such as with machine learning algorithms. It is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words
- A measure of the presence of known words
The doc2bow method of the dictionary iterates over all the words in the text, mapping each known word to its ID and counting its occurrences; the result is a list of (token_id, frequency) pairs. Words that are not in the dictionary are ignored, unless allow_update=True is passed, in which case they are added to the dictionary.
Build Tf-Idf and LSI Model
Tf-Idf stands for Term Frequency-Inverse Document Frequency. It is a commonly used NLP model that helps determine the most important words in each document of the corpus. Once the Tf-Idf model is built, we pass its output to the LSI model and specify the number of features (topics) to build.
Time for Semantic Search
Now comes the fun part. With the index of movies initialized and loaded, we can use it to find movies similar to a given search query.
We will input a search query, and the model will return relevant movie titles along with a "Relevance %", which is the similarity score. The higher the similarity score, the more similar the query is to the document at that index.
Below is the helper function to search the index, sort and return the results
# search for movie titles related to the search parameters below
search_similar_movies('crime and drugs ')
The model returns movie titles with "Relevance %". The top-ranked movies are indeed related to crime and drugs.
# search for movie titles related to the search parameters below
search_similar_movies('violence protest march')
Here the top-ranked title, "Gandhi", is clearly related to non-violent protests.
Closing Notes
In general, computing semantic relationships between textual data makes it possible to recommend articles or products related to a given query, to follow trends, and to explore a specific subject in more detail.
In this article we saw a basic version of how semantic search can be implemented. There are many ways to enhance it further using newer deep learning models.
This was just an attempt to showcase my learning journey into the field of NLP and Machine Learning. Your valuable feedback, comments, and suggestions are welcome!
I hope you all like it, and do vote for it.