Building a Semantic Search Engine with Machine Learning and Jupyter Notebooks

Abhishek Ranjan
3 min read · Apr 15, 2023


Introduction

Semantic search is an advanced technique that aims to improve search accuracy by understanding the user’s intent and the contextual meaning of terms. By leveraging machine learning algorithms, we can build a semantic search engine that returns more relevant results, enhancing the user experience. In this article, we will guide you through the process of creating a semantic search engine using Python, machine learning, and Jupyter Notebooks. We will also provide examples of data and sample outputs.

Prerequisites

Before diving into the code, make sure you have the following:

  • Basic knowledge of Python programming
  • Familiarity with Jupyter Notebooks
  • Understanding of machine learning concepts
  • Python 3.x and Jupyter Notebook installed

Setting Up the Environment

First, create a virtual environment and install the required libraries:

python -m venv semantic_search
source semantic_search/bin/activate
pip install numpy pandas scikit-learn sentence-transformers

Next, open Jupyter Notebooks and create a new notebook called “Semantic_Search.ipynb”.

Data Preparation and Preprocessing

We will use a dataset of articles or documents as our search corpus. For demonstration purposes, let’s create a small dataset with five articles. In practice, you would use a larger dataset.

Create the dataset:

import pandas as pd

data = pd.DataFrame({
    'title': [
        'Introduction to Natural Language Processing',
        'Deep Learning for Computer Vision',
        'Reinforcement Learning: An Overview',
        'A Comprehensive Guide to Convolutional Neural Networks',
        'Text Classification with Machine Learning'],
    'text': [
        'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language...',
        'Deep learning has revolutionized the field of computer vision, enabling computers to interpret and understand visual information with unprecedented accuracy...',
        'Reinforcement learning is an area of machine learning where an agent learns to make decisions by interacting with its environment...',
        'Convolutional Neural Networks (CNNs) are a class of deep learning algorithms that have shown great success in various computer vision tasks...',
        'Text classification is a common task in natural language processing, which involves assigning predefined categories to a given text...']
})

Clean and preprocess the text data:

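def preprocess_text(text):
    # Minimal cleaning: lowercase the text and collapse repeated whitespace.
    # Add further steps here if you need them (e.g., removing stopwords, stemming).
    preprocessed_text = ' '.join(text.lower().split())
    return preprocessed_text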

data['cleaned_text'] = data['text'].apply(preprocess_text)
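
The optional steps named in the comment above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `preprocess_text_extended` and the tiny `STOPWORDS` set are hypothetical and not part of the original notebook, and a real project would likely use a fuller stopword list. Pre-trained sentence encoders are trained on natural text, so heavy cleaning like this may not be needed and is worth testing before adopting.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to", "with"}

def preprocess_text_extended(text, remove_stopwords=False):
    """Lowercase, strip punctuation, collapse whitespace, optionally drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)

sample = "Natural language processing (NLP) is a subfield..."
print(preprocess_text_extended(sample))
# natural language processing nlp is a subfield
print(preprocess_text_extended(sample, remove_stopwords=True))
# natural language processing nlp subfield
```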

Creating the Semantic Model

We will use a pre-trained Sentence-BERT model to generate document embeddings. Load the model and create embeddings:

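from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-distilroberta-base-v2')

# encode() returns a 2D NumPy array by default, which scikit-learn can use directly
embeddings = model.encode(data['cleaned_text'].tolist())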

Implementing the Search Function

Define the search function that takes a query, preprocesses it, calculates the cosine similarity between the query and document embeddings, and returns the top results:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search(query, top_n=5):
    query_preprocessed = preprocess_text(query)
    # Encode a one-item list so the result is a 2D NumPy array, as cosine_similarity expects
    query_embedding = model.encode([query_preprocessed])
    similarities = cosine_similarity(query_embedding, embeddings)
    # Sort by descending similarity and keep the top_n document indices
    top_indices = np.argsort(-similarities[0])[:top_n]
    top_results = data.iloc[top_indices].reset_index(drop=True)
    top_results['similarity'] = similarities[0][top_indices]
    return top_results
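
To make the ranking step concrete, here is a small sketch of what cosine similarity and the `argsort` call compute, written with NumPy only. The helper name `cosine_scores` and the 3-dimensional vectors are made up for illustration; real sentence embeddings have hundreds of dimensions and come from the model.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    # Dot product of the query with every document, divided by the product of norms.
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return doc_matrix @ query_vec / norms

# Toy "embeddings" (made-up values, not real model output).
toy_docs = np.array([
    [1.0, 0.0, 0.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # 45 degrees away from the query
    [0.0, 0.0, 1.0],   # orthogonal to the query
])
toy_query = np.array([1.0, 0.0, 0.0])

scores = cosine_scores(toy_query, toy_docs)
ranking = np.argsort(-scores)  # document indices, best match first
print(scores)
print(ranking)
```

The first document scores 1.0, the second about 0.707, and the third 0.0, so the ranking is 0, 1, 2. Because the score depends only on the angle between vectors, a long document and a short one pointing in the same direction are treated as equally similar.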

Evaluating the Model

Test the search function with a sample query:

query = "natural language processing"
results = search(query, top_n=3)
print(results)

Sample output:

title  \
0 Introduction to Natural Language Processing
1 Text Classification with Machine Learning
2 Reinforcement Learning: An Overview
text \
0 Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language...
1 Text classification is a common task in natural language processing, which involves assigning predefined categories to a given text...
2 Reinforcement learning is an area of machine learning where an agent learns to make decisions by interacting with its environment...
similarity
0 0.921572
1 0.687315
2 0.642183

Fine-tune the model or adjust the search function as needed to improve performance.
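
One way to check whether a change actually helps is to score the rankings against a few queries whose best answer you already know. The sketch below computes mean reciprocal rank (MRR), a common ranking metric. The helper names and the two hand-written result lists are hypothetical and stand in for real search output; they are not results from the model above.

```python
def reciprocal_rank(ranked_titles, relevant_title):
    """Return 1/rank of the relevant title, or 0.0 if it is not in the list."""
    for rank, title in enumerate(ranked_titles, start=1):
        if title == relevant_title:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(ranked_lists, relevant_titles):
    """Average the reciprocal rank over several queries."""
    scores = [reciprocal_rank(ranked, relevant)
              for ranked, relevant in zip(ranked_lists, relevant_titles)]
    return sum(scores) / len(scores)

# Hypothetical rankings for two queries (illustrative, not real model output).
ranked_lists = [
    ["Introduction to Natural Language Processing",
     "Text Classification with Machine Learning"],
    ["Deep Learning for Computer Vision",
     "A Comprehensive Guide to Convolutional Neural Networks"],
]
# The title a human judged most relevant for each query.
relevant_titles = [
    "Introduction to Natural Language Processing",
    "A Comprehensive Guide to Convolutional Neural Networks",
]

mrr = mean_reciprocal_rank(ranked_lists, relevant_titles)
print(mrr)  # 0.75
```

With the search function from this article, each ranked list could be built from `search(q)['title'].tolist()`. An MRR closer to 1.0 means the expected document tends to appear nearer the top.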

Conclusion

In this article, we showed how to create a semantic search engine using Python, machine learning, and Jupyter Notebooks. By following these steps, you can build a powerful search engine that understands user intent and returns more relevant results. Experiment with different pre-trained models or fine-tune them on your data to further enhance search performance.
