Build your own semantic search web app with Sentence Transformers and FAISS

Semantic search has transformed information retrieval by letting anyone search for and retrieve documents based on context and meaning instead of exact keyword matching.

Sumit
Operations Research Bit
5 min read · Jan 3, 2024


This can be chained with an open-source or closed-source large language model to create a RAG (Retrieval-Augmented Generation) pipeline as well. Implementing this adds one more layer of intelligence, letting anyone query the documents and get a personalized, contextualized answer.

Fig 1: Flowchart of Document Search

In this article, we learn to create a web application capable of semantic search over a document, which will be fed in as a CSV with columns such as submission_name, date, page_no, text_embedded, and text_ocr. I encourage you to experiment with different datasets; if you can grasp this, implementing a more complex dataset is a no-brainer. As the headline suggests, we are going to harness the power of two methods: Sentence Transformers for encoding sentences into semantic vectors and FAISS for indexing and searching those vectors. You are free to use any available open-source sentence encoder; in our use case, we will be using “all-MiniLM-L6-v2”.

Setting Up the Environment

Establishing a virtual environment is a key practice to ensure code reproducibility and facilitate the recreation of the environment in the future.

conda create -p venv python=3.8 -y
conda activate venv/
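
With the environment active, install the packages used throughout this post. The list is inferred from the imports below; faiss-cpu is the CPU-only FAISS build on PyPI.

pip install flask pandas numpy sentence-transformers faiss-cpu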

Data Preparation

Below is a snapshot of the data frame of the document I will be using to test the engine:

Fig 2: DataFrame

If you look carefully at the data frame above, there are two text columns: text_embedded and text_ocr. We will use both of them for semantic search.

The decision to use both arises from the presence of NaN values in text_embedded and the desire not to discard additional text features. To optimize for efficiency, I fall back to text_ocr only for rows where text_embedded is NaN.

Import all the necessary libraries and set up the Flask application.

from flask import Flask, render_template, request
import pandas as pd
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
from sentence_transformers import InputExample
import os
import pickle
app = Flask(__name__, template_folder='path to your templates folder')

cache_dir = "./cache"
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

file_path = "path to the csv file"
pdf = pd.read_csv(file_path)
pdf["id"] = pdf.index
pdf_subset = pdf.head(1000)
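
If you don't have the dataset handy, a tiny synthetic frame with the same column names (values here are purely hypothetical) is enough to smoke-test the rest of the pipeline:

# Hypothetical stand-in data for a quick smoke test.
sample = pd.DataFrame({
    "submission_name": ["doc_a", "doc_b"],
    "date": ["2020-01-01", "2020-01-02"],
    "page_no": [1, 2],
    "text_embedded": ["The committee met to discuss land reform.", None],
    "text_ocr": ["OCR text of the first page.", "Minutes of the second session."],
})
# pdf_subset = sample.assign(id=sample.index)  # swap in for experiments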

Encoding Texts with Sentence Transformers

Write the function example_create_fn, which takes a Pandas Series named doc1 as input and returns an instance of InputExample from the sentence-transformers library. We apply this function to the text that will be encoded later.

def example_create_fn(doc1: pd.Series) -> InputExample:
    # Fall back to the OCR text when the embedded text is missing.
    if pd.isnull(doc1["text_embedded"]):
        return InputExample(texts=[doc1["text_ocr"]])
    else:
        return InputExample(texts=[doc1["text_embedded"]])

# Build InputExamples for every row (kept for parity with the walkthrough; not used again below).
faiss_train_examples = pdf_subset.apply(lambda x: example_create_fn(x), axis=1).tolist()
model = SentenceTransformer("all-MiniLM-L6-v2", cache_folder="./cache")

def encode_title_or_text(row):
    # Encode whichever text column is available for this row.
    if pd.isnull(row['text_embedded']):
        return model.encode(row['text_ocr'])
    else:
        return model.encode(row['text_embedded'])
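
As a quick sanity check, encode a single row and inspect the embedding shape; all-MiniLM-L6-v2 produces 384-dimensional vectors:

vec = encode_title_or_text(pdf_subset.iloc[0])
print(vec.shape)  # (384,)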

Dense Vectors, FAISS Indexing, and Search Content

The next step is to create the pickle file for the dense vectors: load them from the file if it exists; otherwise, compute and save them. As shown below, I have defined the file names dense_vectors_file for storing the dense vectors and faiss_index_file for storing the FAISS index. Normalizing vectors is a common practice in machine learning and similarity-search tasks, and it ensures the vectors have a consistent scale. In our use case, we use the L2 norm, which means each vector has length 1 after normalization.

Note: It is crucial to save the pkl file and load it on subsequent runs. Recomputing the vectors every time for the same dataset would be a foolish move, significantly increasing the loading time of the web app. Such an approach won't bring joy to anyone.
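
To see what the normalization buys us: after faiss.normalize_L2, every vector has unit length, so the inner product used by the index below is exactly cosine similarity. A minimal illustration with toy values:

v = np.array([[3.0, 4.0]], dtype="float32")
faiss.normalize_L2(v)  # in-place; requires float32
print(v)                     # [[0.6 0.8]]
print(np.linalg.norm(v[0]))  # 1.0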

dense_vectors_file = 'dense_vectors.pkl'
faiss_index_file = 'faiss_index.index'

# Load cached vectors if they exist; otherwise encode, normalize, and cache them.
if os.path.exists(dense_vectors_file):
    with open(dense_vectors_file, 'rb') as file:
        faiss_title_embedding = pickle.load(file)
else:
    faiss_title_embedding = np.vstack(pdf_subset.apply(encode_title_or_text, axis=1).values)
    faiss.normalize_L2(faiss_title_embedding)
    with open(dense_vectors_file, 'wb') as file:
        pickle.dump(faiss_title_embedding, file)

Likewise, let's set up the FAISS index. This involves checking whether the index file exists; if not, we compute and save it. The normalized vectors, paired with their respective IDs from the DataFrame's "id" column, are then added to the index. Make sure you've created that "id" column. Finally, we write the index to the FAISS index file with faiss.write_index.

To understand this more readily, assume a library containing lots of books. Think of the below code snippet as an efficient librarian. If there’s already a catalog, it fetches it; if not, it builds one. The librarian tidies up the books, ensuring they’re all on the same page, assigns each a unique ID, like library cards, and organizes them. This way, when you ask for similar books later, the librarian can quickly find the right ones. It’s like a well-organized system for hassle-free book searches.

if os.path.exists(faiss_index_file):
    index_content = faiss.read_index(faiss_index_file)
else:
    content_encoded_normalized = faiss_title_embedding.copy()
    faiss.normalize_L2(content_encoded_normalized)
    # Inner product over unit-length vectors is cosine similarity;
    # IndexIDMap lets us attach the DataFrame's "id" values to each vector.
    index_content = faiss.IndexIDMap(faiss.IndexFlatIP(len(faiss_title_embedding[0])))
    index_content.add_with_ids(content_encoded_normalized, np.array(pdf_subset.id.values).astype("int64"))
    faiss.write_index(index_content, faiss_index_file)

The search_content function takes the query, the DataFrame to look results up in, and the number of top-k results as parameters, and returns the matching rows, which we will render on the webpage.

def search_content(query, pdf_to_index, k=5):
    # Encode and normalize the query the same way as the indexed vectors.
    query_vector = model.encode([query])
    faiss.normalize_L2(query_vector)

    # search returns (similarities, ids); map the ids back to DataFrame rows.
    top_k = index_content.search(query_vector, k)
    ids = top_k[1][0].tolist()
    similarities = top_k[0][0].tolist()
    results = pdf_to_index.loc[ids].copy()
    results["similarities"] = similarities
    return results
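
Before wiring up Flask, you can try the function from a console; the query string here is just an example:

results = search_content("land reform committee", pdf_subset, k=3)
print(results[["id", "similarities"]])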

Flask Web Application

Now, let's craft some fundamental code for our Flask web application, which features two routes. The first, our home route, renders the index.html page; the second, our search route, presents the results on the result.html page.

Note: It is up to you to make the frontend as interactive as you like. As a starter, I have created a basic working web app.

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/search', methods=['POST'])
def search():
    query = request.form['query']
    # Search against the subset we indexed above.
    results_df = search_content(query, pdf_subset, k=5)
    return render_template('result.html', tables=[results_df.to_html(classes='data')], titles=results_df.columns.values)

if __name__ == '__main__':
    app.run(debug=True)
Fig 3: Index page
Fig 4: Search results
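
Here is a minimal sketch of the two templates. Only the form field name (query), the POST target (/search), and the tables variable from the route above are fixed by the code; the rest is placeholder markup you can restyle freely.

index.html:

<!DOCTYPE html>
<html>
  <body>
    <h1>Semantic Search</h1>
    <form action="/search" method="post">
      <input type="text" name="query" placeholder="Enter your query">
      <button type="submit">Search</button>
    </form>
  </body>
</html>

result.html:

<!DOCTYPE html>
<html>
  <body>
    <h1>Results</h1>
    {% for table in tables %}
      {{ table|safe }}
    {% endfor %}
    <a href="/">Back to search</a>
  </body>
</html>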

What's next? Dive into new datasets, tweak the encoders, share your journey with fellow coders, and keep learning.

Update: I will publish an extended version of this code with an LLM prompt wrapper that wraps this module and acts as a contextualized question-answering pipeline.

Email me if you have any queries or suggestions: sumit.atlancey@gmail.com | Github
