LLM-Hands-On: Implementing a Vector Library in Python (RAG)

Filipe Pacheco
5 min read · Oct 17, 2023


Cover image from Heiko Hotz.

As you may have noticed in my first post on Medium, I’m currently focused on upskilling in Data Science. I have been working as a Data Scientist since 2021, and during this time, many things have changed. The IT world evolves at a rapid pace, as you may already know.

Today, I would like to share an easy implementation of a Vector Library with you. This is the first of many upcoming posts in which I will showcase practical LLM use cases, going beyond the ChatGPT hype. While the post may appear lengthy, the code snippets make it a quick and easy read.

Vector Library

In my previous post, “What I learned in my first 60 days working with LLM?”, I discussed how a Vector Library is a valuable tool for enhancing the responses of an LLM (Large Language Model). With the RAG (Retrieval-Augmented Generation) approach, the Vector Library acts as a guide, enabling the LLM to find answers within a specific context. One major advantage of RAG is that it allows you to use smaller LLMs: rather than relying on a massive model like GPT-3.5/4 to contain all knowledge, you can ground the answer in a defined scope of documents.

For this implementation, I’m using the FAISS package. FAISS, which stands for Facebook AI Similarity Search, is an open-source library developed by Facebook’s AI research team (FAIR). It applies mathematical techniques to work efficiently with high-dimensional vector spaces, and its primary purpose is similarity search, which makes it suitable for tasks such as text search and recommender systems.
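
To make the idea concrete before diving into the class-based implementation, here is a minimal, standalone sketch of how a FAISS similarity search works. The vectors below are made up purely for illustration; in the actual implementation the embeddings come from a SentenceTransformer model.

# Minimal FAISS sketch (toy vectors, for illustration only)
import faiss
import numpy as np

vectors = np.array([[0.1, 0.3, 0.5, 0.7],
                    [0.2, 0.1, 0.9, 0.4],
                    [0.9, 0.8, 0.1, 0.2]], dtype="float32")
faiss.normalize_L2(vectors)                  # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product index
index.add(vectors)

query = np.array([[0.1, 0.3, 0.4, 0.8]], dtype="float32")
faiss.normalize_L2(query)
similarities, ids = index.search(query, 2)   # two most similar vectors
print(ids, similarities)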

Python Code

Now, let’s delve into the code. You can access my GitHub repo through this link to explore my experiments with LLMs. Each folder represents a different experiment, accompanied by its own requirements.txt file, which you can use to reproduce it, and the relevant data is always located in the same folder for easy reference. To keep things clear, I have organized the code walkthrough by the classes used in the implementation.

Import Libraries

I’m using the pandas library to read an Excel file that contains 20 examples of car problems and their corresponding solutions. The numpy library prepares the arrays that faiss consumes, and the faiss library itself builds the vector library. Finally, I import SentenceTransformer to load a model from HuggingFace and convert natural language into vectors, enabling advanced language understanding.

# Import Libraries

import pandas as pd
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

Load LLM Class

In this class, I simply load the model from HuggingFace. Specifically, for this experiment, I used the all-MiniLM-L6-v2 model. The class is initialized by specifying only the model name.

# Load LLM Class
class LoadLLM:

    def __init__(self, model_name):
        self.model_name = model_name
        self.model = SentenceTransformer(self.model_name)

    def main(self):
        return self.model
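
As a quick sanity check (this snippet is hypothetical usage, not part of the original script), encoding a sentence with all-MiniLM-L6-v2 returns a 384-dimensional vector per input:

# Hypothetical usage example of the LoadLLM class
model = LoadLLM("all-MiniLM-L6-v2").main()
embedding = model.encode(["The engine does not start"])  # made-up example sentence
print(embedding.shape)  # (1, 384)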

Load File Class

In this class, I load the Excel file as a pandas dataframe consisting of the 20 car problem examples and their corresponding solutions. I also create a new column called “DESCRIPTION”, which holds the complete context, combining both the problem and its associated solution in a single field (an example of the resulting text is shown right after the class).

# Load file Class
class LoadFile:

    def __init__(self, file_path):
        self.file_path = file_path
        self.df_raw = pd.read_excel(self.file_path)

    def __problem_solution_creation(self):
        # Combine problem and solution into a single context column
        self.df_raw['DESCRIPTION'] = (
            "The problem is identified as " + self.df_raw['PROBLEM']
            + ". The possible solution is: " + self.df_raw['SOLUTION']
        )

    def main(self):
        self.__problem_solution_creation()
        return self.df_raw
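
For a hypothetical row where PROBLEM is “Engine does not start” and SOLUTION is “Check the battery charge” (the values are made up; the real ones live in the Excel file in the repo), the resulting DESCRIPTION would read:

The problem is identified as Engine does not start. The possible solution is: Check the battery charge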

Data Treatment Class

I implemented this class to prepare the pandas dataframe for further processing. To use faiss effectively, the dataframe needs to be in a specific format, so this class creates a new column called “ID” and then resets the index based on those ID values. This step is crucial for the similarity search in the RAG approach: the search returns the IDs of the most relevant documents, which are then used to look up the original rows.

# Data Treatment Class
class DataTreatment:

    def __init__(self, df_raw):
        self.df_raw = df_raw

    def __index_creation(self):
        # Sequential IDs, later passed to faiss add_with_ids and used to retrieve results
        self.df_index = self.df_raw.copy()
        self.df_index['ID'] = np.arange(len(self.df_raw))
        self.df_index = self.df_index.set_index("ID", drop=False)

    def main(self):
        self.__index_creation()
        return self.df_index

Vector Library Creation Class

This is the largest class in the code, as its main objective is to create the vector library itself. It contains the specific commands needed to build a faiss index with IDs, which makes it straightforward to identify the matching documents when generating responses for the user.

The __export_faiss_vector_library(self) method is kept separate with the operationalization of the application in mind. This allows one piece of code to be dedicated to creating the Vector Library and another to focus on running searches, which matters because the library must always be kept up to date with the historical data. A short sketch of the corresponding reload step follows the class below.

# Vector Library Creation Class
class VectorLibraryCreation:

    def __init__(self, model, df_index, vector_library_path):
        self.model = model
        self.df_index = df_index
        self.vector_library_path = vector_library_path

    def __encoding_vectors(self):
        # Encode every DESCRIPTION into an embedding vector
        self.faiss_content_embedding = self.model.encode(self.df_index.DESCRIPTION.values.tolist())

    def __generate_embedding_vector(self):
        self.id_index = np.array(self.df_index.ID.values).flatten().astype("int")

        # Normalize the content so the inner product equals cosine similarity
        self.content_encoded_normalized = self.faiss_content_embedding.copy()
        faiss.normalize_L2(self.content_encoded_normalized)

        # Creation of the Vector Library, mapping each vector to its ID
        self.vector_library = faiss.IndexIDMap(faiss.IndexFlatIP(len(self.faiss_content_embedding[0])))
        self.vector_library.add_with_ids(self.content_encoded_normalized, self.id_index)

    def __export_faiss_vector_library(self):
        faiss.write_index(self.vector_library, f"{self.vector_library_path}.faiss")

    def main(self):
        self.__encoding_vectors()
        self.__generate_embedding_vector()
        self.__export_faiss_vector_library()

        return self.faiss_content_embedding, self.vector_library
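
Because the index is persisted to disk, the search side of the application can reload it later without re-encoding the historical data. A minimal sketch of that reload step, assuming the same path used in the run section below:

# Reload the persisted vector library in a separate search application
vector_library = faiss.read_index("faiss-vector-library/vector_library.faiss")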

Search Content Class

This class is responsible for receiving a query, which represents the user input, and returning the most relevant documents related to it. By default, the search method retrieves the top 5 most similar documents. The output is a pandas dataframe containing those documents, enhanced with a new column named “SIMILARITY”. Since the vectors are L2-normalized, this score is the cosine similarity between the query and each document, with values closer to 1 indicating a stronger match.

# Search Content Class
class SearchContent:

    def __init__(self, df_index, model, vector_library_path):
        self.df_index = df_index
        self.model = model
        self.vector_library_path = vector_library_path
        # Load the vector library persisted by VectorLibraryCreation
        self.vector_library = faiss.read_index(f"{self.vector_library_path}.faiss")

    def search(self, query, k=5):
        self.query_vector = self.model.encode([query])
        faiss.normalize_L2(self.query_vector)

        self.top_k = self.vector_library.search(self.query_vector, k)
        self.similarities = self.top_k[0][0].tolist()  # cosine similarity scores
        self.ids = self.top_k[1][0].tolist()           # IDs of the most similar documents
        self.results = self.df_index.loc[self.ids].copy()
        self.results['SIMILARITY'] = self.similarities

        return self.results

Run application

The final section of the code consists of the commands to invoke the classes, along with an example illustrating how to find the most relevant documents. In this particular example, I have requested content related to the keyword “Battery.” You can refer to the accompanying image below, which showcases the output of this search.

# Run application
if __name__ == "__main__":

    model = LoadLLM("all-MiniLM-L6-v2").main()
    df_raw = LoadFile('problem_solution.xlsx').main()
    df_index = DataTreatment(df_raw).main()
    faiss_content_embedding, vector_library = VectorLibraryCreation(model, df_index, "faiss-vector-library/vector_library").main()

    search_content = SearchContent(df_index, model, "faiss-vector-library/vector_library")
    result = search_content.search('Battery')
    print(result)
Output with the five most relevant cases related to “Battery”.

Conclusion

In conclusion, implementing a Vector Library makes it possible to handle high-dimensional data efficiently. By combining tools like faiss with techniques such as RAG, simpler and smaller models become viable, which expands what you can do with the models available on HuggingFace.

The integration of libraries like pandas, numpy, and SentenceTransformer enables advanced text search and analysis, facilitating recommender systems, information retrieval, and language understanding. This unlocks new opportunities in data analysis and natural language processing.

Stay tuned for more posts to learn further about implementing LLM. Follow me to stay updated!


Filipe Pacheco

Senior Data Scientist | AI, ML & LLM Developer | MLOps | Databricks & AWS Practitioner