Semantic Search using NLP
A starter guide to building an intelligent search engine using semantic understanding of search queries
Introduction
Search engines have been with us for several decades as an integral part of our digital lives. We casually search across billions of web pages to retrieve and share information from many sources. While humans are very good at using conversational context and background knowledge to deal with the intrinsic ambiguity of words, the same is not true of search engines, especially when it comes to out-of-vocabulary searches.
The answer to this problem is semantic search. Using the latest insights from NLP research, it is possible to train a language model on a large corpus of documents. Afterwards, the model is able to represent documents based on their "semantic" content. In particular, this makes it possible to search for documents with semantically similar content.
Semantic search means understanding the intent behind the query and representing the “knowledge in a way suitable for meaningful retrieval,” according to Towards Data Science.
In this work, we will retrieve relevant movie titles using semantic search based on concepts from Natural Language Processing (NLP).
For those who want to jump directly to the code, here’s the link to my Kaggle Notebook.
Keyword Search vs. Semantic Search
At first, search engines were lexical: the engine looked for literal matches of the query words, without understanding the query's meaning, and only returned links containing the exact query. With plain keyword search, a document either contains the given word or it doesn't; there is no middle ground.
"Semantic search", on the other hand, can simplify query building because it is supported by automated natural language processing, e.g. Latent Semantic Indexing, a concept that search engines use to discover how a keyword and content work together to mean the same thing.
According to Wikipedia,
LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), is a technique in natural language processing for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSI is based on the principle that words used in the same contexts tend to have similar meanings. A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.
LSI adds an important step to the document indexing process: it examines a collection of documents to see which of them contain some of the same words. LSI considers documents with many words in common to be semantically close, and those with fewer words in common to be less close.
In brief, LSI does not require an exact match to return useful results. Where a plain keyword search will fail if there is no exact match, LSI will often return relevant documents that don’t contain the keyword at all.
Load the data
We will now load the movies data CSV into a dataframe and take a quick peek at the columns and the data provided.
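Since the notebook's dataset isn't reproduced here, the snippet below sketches this step with a tiny hand-made DataFrame as a stand-in; the column names `title` and `wiki_plot` are assumptions based on the columns mentioned later in the article. In practice you would call `pd.read_csv` on the movies CSV.

```python
import pandas as pd

# Stand-in for `movies = pd.read_csv("movies.csv")`; the column names
# are assumptions based on the article, not the real dataset schema
movies = pd.DataFrame({
    "title": ["The Godfather", "Gandhi"],
    "wiki_plot": [
        "A mafia crime family is drawn into drugs and violence.",
        "The story of a non-violent protest march for independence.",
    ],
})
print(movies.shape)
print(movies.columns.tolist())
```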
Data Cleaning and Pre-processing
Data pre-processing is one of the most significant steps in text analytics. The purpose is to remove any unwanted words or characters that are written for human readability but won't contribute to topic modelling in any way.
The following function applies regular expressions to match patterns of unwanted text and remove or replace them.
Now let us apply the data-cleaning and pre-processing function to our movies "wiki_plot" column and store the cleaned, tokenized data in a new column.
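As a rough illustration (not the notebook's exact function), a minimal cleaning routine along these lines might look like this:

```python
import re

def clean_text(text):
    """Lowercase, strip bracketed references, drop everything except
    letters, then tokenize on whitespace."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)   # remove [1]-style references
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return text.split()

print(clean_text("The Mafia's rise to power [1] in 1945!"))
```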
Building Word Dictionary
In the next step we will build the vocabulary of the corpus, in which all the unique words are given IDs and their frequency counts are stored. Note that we are using the gensim library to build the dictionary. In gensim, words are referred to as "tokens" and the index of each word in the dictionary is called its ID.
You can see that two additional steps are performed after creating the dictionary:
- All tokens that occur in fewer than four articles, or in more than 20% of the articles, are removed from the dictionary, as such words contribute little to the various themes or topics.
- Content-neutral words and additional stop-words are removed from the dictionary.
Feature Extraction (Bag of Words)
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modelling, such as with machine learning algorithms. It is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words
- A measure of the presence of known words
The doc2bow method of the dictionary iterates over all the words in the text, mapping each known word to its ID and counting its occurrences; the result is a list of (token_id, frequency) pairs. Words that are not in the dictionary are ignored, unless allow_update=True is passed, in which case they are added to the dictionary.
Build Tf-Idf and LSI Model
Tf-Idf stands for Term Frequency-Inverse Document Frequency. It is a commonly used NLP model that helps determine the most important words in each document of the corpus. Once the Tf-Idf model is built, we pass its output to the LSI model and specify the number of features (topics) to build.
Time for Semantic Search
Now comes the fun part. With the index of movies initialized and loaded, we can use it to find movies similar to a given search query.
We will input a search query, and the model will return relevant movie titles along with a "Relevance %", which is the similarity score. The higher the similarity score, the more similar the query is to the document at that index.
Below is the helper function to search the index, sort and return the results
# search for movie titles related to the search parameters below
search_similar_movies('crime and drugs ')
The model returns movie titles with "Relevance %". The top-ranked movies are indeed related to crime and drugs.
# search for movie titles related to the search parameters below
search_similar_movies('violence protest march')
Here the top-ranked title, "Gandhi", is clearly related to non-violent protests.
Closing Notes
In general, computing semantic relationships between textual data makes it possible to recommend articles or products related to a given query, to follow trends, and to explore a specific subject in more detail.
In this article we saw a basic version of how semantic search can be implemented. There are many ways to enhance it further using newer deep learning models.
This was just an attempt to showcase my learning journey into the field of NLP and Machine Learning. Your valuable feedback, comments, and suggestions are welcome!
I hope you all like it, and do vote for it.