Retrieval-Augmented Generation (RAG) model to answer questions based on information from Stephen Hawking’s “A Brief History of Time” and “The Universe in a Nutshell”

Sahil
4 min read · Apr 29, 2024

Introduction:

In today’s data-driven world, the ability to efficiently extract relevant information from vast amounts of textual data is crucial. Moreover, with the advances in Large Language Models (LLMs) and natural language processing (NLP), we now have powerful tools at our disposal to not only retrieve information but also generate responses based on user queries. In this article, we’ll explore a comprehensive approach to enhancing information retrieval and response generation using AI techniques.

The Need for Retrieval-Augmented Generation (RAG)

Traditional generative models are excellent at generating coherent and contextually relevant text based on the input they receive. However, they cannot access and incorporate specific information from external knowledge sources. On the other hand, retrieval-based models, like dense retrieval models or traditional information retrieval systems, are proficient at retrieving relevant information from large corpora but struggle with generating coherent and fluent text.

Roadmap:

PDF Processing:
The first step is extracting text from the PDF documents. Using a Python library such as PyMuPDF (imported as fitz), we can extract text from PDF files. Additionally, we clean and format the extracted text, preparing it for further analysis. This step is essential for ensuring the accuracy and quality of the extracted information.

import fitz  # PyMuPDF
from tqdm import tqdm

def text_formatter(text: str) -> str:
    # Replace newlines with spaces and lowercase the text
    return_text = text.replace("\n", " ")
    return_text = return_text.lower()
    return return_text

def open_and_clean_pdf(path1: str, pages_and_text: list, page_number=0):
    pdf = fitz.open(path1)
    for pages_no, pages_text in tqdm(enumerate(pdf)):
        text = pages_text.get_text()
        text = text_formatter(text)
        pages_and_text.append({
            "page_number": pages_no - page_number,   # offset to skip front matter
            "page_char_count": len(text),
            "number_of_tokens": len(text) / 4,       # rough rule of thumb: 1 token ~ 4 chars
            "pages_sentence_count": len(text.split(".")),
            "page_text": text,
            "Book_Name": path1.replace(".pdf", "")
        })

The next step is splitting the processed text into sentences. I have used the “sentencizer” pipeline component from spaCy.

from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")  # rule-based sentence segmentation

for item in tqdm(pages_and_text):
    page_text = item['page_text']
    item['pages_sentences'] = page_text.split('.')  # naive split, kept for comparison
    doc = nlp(page_text)
    spacy_sentences = list(doc.sents)
    item['pages_sentences_spacy'] = [str(sen) for sen in spacy_sentences]
    item['spacy_sentence_count'] = len(item['pages_sentences_spacy'])

After sentence segmentation, we group the sentences into chunks.
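The chunking code below reads from a “sentence_chunks” key that the earlier snippets never create, so a grouping step is needed in between. Here is a minimal sketch of that step; the split_list helper and the chunk size of ten sentences are assumptions for illustration, not taken from the original code.

def split_list(input_list: list, slice_size: int = 10) -> list:
    # Split a list of sentences into fixed-size chunks
    # (a slice_size of 10 sentences per chunk is an assumed value)
    return [input_list[i:i + slice_size]
            for i in range(0, len(input_list), slice_size)]

for item in tqdm(pages_and_text):
    item["sentence_chunks"] = split_list(item["pages_sentences_spacy"])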

import re

pages_and_chunks = []
for item in tqdm(pages_and_text):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences in the chunk and tidy the whitespace
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)  # ".A" -> ". A" for any full-stop/capital-letter combo
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

Now we convert the chunks into vectors using a sentence transformer.

What is Vectorization?

Vectorization in Natural Language Processing (NLP) is a method used to convert text data into a numerical representation that Machine Learning algorithms can understand and process. It involves transforming textual data, which is unstructured, into a structured format that facilitates improved data analysis and manipulation.

I have used the all-mpnet-base-v2 model.

from sentence_transformers import SentenceTransformer, util

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device='cuda')

# Embed each chunk one at a time (batch_size only matters when encoding a list of texts)
for item in tqdm(pages_and_chunks):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"], batch_size=64)

all-mpnet-base-v2 truncates inputs longer than 384 word pieces by default and produces an output vector of size 768.
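As a quick sanity check (illustrative, not from the original code), we can confirm the embedding dimensionality:

vec = embedding_model.encode("the universe in a nutshell")
print(vec.shape)  # (768,)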

Now we build the retrieval step of our RAG pipeline.

import torch
from time import perf_counter as timer

def retrieval_query_resources(query: str,
                              embeddings: torch.Tensor,
                              model: SentenceTransformer = embedding_model,
                              indices_to_return: int = 5):
    # Embed the query with the same model used for the chunks
    query_embedding = model.encode(query, convert_to_tensor=True)

    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()
    print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time - start_time}")

    # Return the top-k scores and their chunk indices
    scores, indices = torch.topk(dot_scores, k=indices_to_return)
    return scores, indices

def print_query_results(query: str, embeddings, pages_and_chunks=pages_and_chunks):
    scores, indices = retrieval_query_resources(query=query,
                                                embeddings=embeddings)

    # Print each top result with its similarity score
    for score, idx in zip(scores, indices):
        print(f"Score: {score:.4f}")
        print("Text:")
        print(pages_and_chunks[idx]['sentence_chunk'])
        print("\n")

The print_query_results function takes the user’s query and the chunk embeddings and passes them to retrieval_query_resources, which embeds the query and scores it against every chunk embedding with a dot product; since all-mpnet-base-v2 produces normalized embeddings, this is equivalent to cosine similarity. It returns the scores and indices of the top 5 matching chunks, which print_query_results then prints.
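A minimal usage sketch (the stacking step and the query string are illustrative, not from the original article):

# Stack the per-chunk embeddings into a single (N, 768) tensor
# and move it to the same device as the model
embeddings = torch.stack([torch.tensor(item["embedding"])
                          for item in pages_and_chunks]).to("cuda")

print_query_results(query="what happens at the event horizon of a black hole?",
                    embeddings=embeddings)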

What is Cosine Similarity?

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
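As a small illustration (not part of the pipeline itself), cosine similarity between two vectors is their dot product divided by the product of their magnitudes:

import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))  # tensor(1.) -- same direction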

For larger corpora, instead of exhaustively scoring every chunk, we could use Faiss, a similarity-search library whose nearest-neighbor search implementations for billion-scale datasets are reported to be some 8.5x faster than the previous state of the art, along with the fastest k-selection algorithm on the GPU known in the literature.
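A minimal sketch of a Faiss-backed search, assuming the chunk embeddings are gathered into a float32 NumPy array (this substitution is not part of the original code):

import faiss
import numpy as np

# Collect the chunk embeddings into an (N, 768) float32 array
chunk_embeddings = np.asarray([item["embedding"] for item in pages_and_chunks],
                              dtype=np.float32)

# With normalized embeddings, inner product equals cosine similarity
index = faiss.IndexFlatIP(768)
index.add(chunk_embeddings)

query_vec = embedding_model.encode(["how did the universe begin?"])
scores, indices = index.search(np.asarray(query_vec, dtype=np.float32), 5)  # top 5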

Conclusion

Before we conclude, I would like to extend my sincere gratitude to Mr. Daniel Bourke for his tutorial that inspired this article.

His YouTube channel: https://www.youtube.com/@mrdbourke

GitHub link to the complete code: https://github.com/SahilJain8/RAG-PIPELINE
