Empowering Albert with Retrieval Augmented Generation (RAG)

Rabia Eda Yılmaz · Published in albert-health · Apr 5, 2024

Table of Contents

  1. Introduction
  2. Retrieval Augmented Generation (RAG)
  3. Customer Support and FAQs with RAG
  4. Evaluation
  5. Conclusion

1. Introduction

In our previous article, we discussed how Albert utilizes large language models with disease-focused websites in the field of healthcare. However, large language models may encounter various challenges. One notable challenge is hallucination, which refers to a model generating incorrect or misleading information. This can affect the reliability and performance of the model in practice. To address this issue, a technique called Retrieval Augmented Generation (RAG) has recently become popular. By keeping the model faithful to the information in a provided document, RAG reduces the hallucination tendency of large language models and thereby produces more consistent and accurate outputs.

In this article, we will examine in detail how Albert uses large language models and RAG technique together to improve customer support and the process of answering frequently asked questions in our mobile application.

2. Retrieval Augmented Generation (RAG)

Hallucination is a challenge frequently encountered during text generation with large language models: the model may produce text that is convincingly written yet factually unrealistic. The RAG technique is an effective method developed to address this issue.

RAG reduces the risk of hallucination by ensuring that large language models adhere closely to source documents. Using information retrieval or semantic search methods, it selects the text snippets most relevant to a query by establishing connections within and between texts. As a result, the generated text adheres to the most relevant snippet found and becomes more consistent and reliable.

To implement this feature, we used the LangChain framework, which is well suited to building applications in which a large language model reasons over supporting documents and stays knowledgeable about the subject at hand. This approach stands out as an important step toward addressing consistency and reliability issues in text generation, allowing large language models to produce more dependable, higher-quality content.

3. Customer Support and FAQs with RAG

Figure 1. Albert Customer Support QA Pipeline with RAG

The first step in successfully implementing the RAG technique in Albert’s customer support operations is to choose an embedding model. An embedding model is an algorithm that helps represent texts as mathematical vectors. Representing texts with mathematical vectors is highly important in the fields of artificial intelligence and natural language processing.

Many algorithms have been developed for this purpose; among the most commonly used are Word2Vec and GloVe. Word2Vec is an algorithm used to represent relationships between words in texts. It uses the words surrounding a word to capture its context. For example, while the word “king” is often associated with words like “kingdom” or “throne,” the word “queen” can be associated with words like “princess” or “crown.” These relationships can be used to capture semantic similarities and connections in texts. GloVe (Global Vectors for Word Representation), on the other hand, builds word vectors from global co-occurrence statistics: by directly modeling how often words appear together across a large corpus, it captures the relationships between a word and the rest of the vocabulary more accurately. Because these statistics come from large text datasets, the resulting vectors give a better representation of word meanings.
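As a toy illustration of the idea (not part of Albert’s pipeline), a small Word2Vec model can be trained with the gensim library; the corpus and parameters below are made up purely for demonstration and assume gensim 4.x.

```python
# Toy Word2Vec example (requires: pip install gensim).
from gensim.models import Word2Vec

# A tiny, made-up corpus; real models are trained on millions of sentences.
sentences = [
    ["the", "king", "sat", "on", "the", "throne"],
    ["the", "queen", "wore", "a", "crown"],
    ["the", "kingdom", "welcomed", "the", "king"],
    ["the", "princess", "greeted", "the", "queen"],
]

# window controls how many surrounding words define a word's context.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=200)

# Words that appear in similar contexts end up with similar vectors.
print(model.wv.most_similar("king", topn=3))
```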

We prefer to use the ada model, one of the embedding models offered by OpenAI. The reason for this choice is its specialized design for measuring relationships between texts and its suitability for deep analysis of our data. Additionally, embeddings are a good option for various tasks such as search, clustering, recommendation, anomaly detection, and classification. These features support Albert in better understanding texts and achieving high performance in various natural language processing tasks.
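As a minimal sketch, a text can be embedded with the ada model through the OpenAI Python client roughly as follows; the exact model name (text-embedding-ada-002) and the v1 client interface are assumptions, since the article only mentions “ada”.

```python
# Minimal sketch: embedding a text with OpenAI's ada model
# (openai Python client v1.x; reads the API key from OPENAI_API_KEY).
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-ada-002",  # assumed model name
    input="How can I reset my password in the Albert app?",
)

vector = response.data[0].embedding  # a list of floats (1536 dimensions for ada-002)
print(len(vector))
```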

When creating a vector database, two hyper-parameters are determined: chunk size and chunk overlap. Chunk size refers to dividing a text into pieces of a fixed size for processing or indexing; for example, a chunk size of 1,000 words splits a document into chunks of 1,000 words each. It determines whether processing is done over larger or smaller pieces and can affect performance. Chunk overlap specifies how much each chunk overlaps with the next; for instance, when dividing a text into 100-word chunks, an overlap ensures each chunk repeats a certain number of words from the previous one. This helps preserve context across chunk boundaries and leads to more consistent and balanced handling of the information.
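A minimal sketch of this chunking step with LangChain’s text splitter is shown below; note that this particular splitter measures chunk size in characters rather than words, the file name is a made-up placeholder, and module paths vary across LangChain versions.

```python
# Splitting a document into overlapping chunks with LangChain.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum size of each chunk (in characters here)
    chunk_overlap=100,  # characters repeated between consecutive chunks
)

# "user_guide.txt" is a hypothetical source document.
with open("user_guide.txt", encoding="utf-8") as f:
    guide_text = f.read()

chunks = splitter.split_text(guide_text)
print(f"{len(chunks)} chunks created")
```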

Next, a vector database is created from documents written from the experience of the customer support team together with the existing documentation. This database consists of vectors that relate customer inquiries to the documents. Each document text is transformed into vector form using the embedding model, and these vector representations are used to measure similarity between documents and to select the most appropriate ones.

Afterward, the user query is also transformed into a vector with the embedding model, and the nearest document fragment in the vector database is determined through a mathematical calculation. Different algorithms exist for measuring vector proximity. Euclidean distance measures the straight-line distance between two vectors, while Manhattan distance calculates the sum of absolute differences. Jaccard similarity is particularly useful for sparse data because it measures the ratio of common elements to the total number of elements. Cosine similarity is insensitive to scale and sensitive to the direction of vectors, which makes it ideal and widely preferred for sparse text data. The closest fragment found using the selected cosine similarity measure is added to the designated location in the designed prompt. This prompt is then passed to the Large Language Model (LLM).
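A small NumPy sketch of the cosine-similarity lookup described above; the vectors here are random placeholders standing in for the embeddings of the document chunks and the user question.

```python
# Cosine-similarity lookup between a query vector and chunk vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|): only the direction of the vectors matters.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings; in practice these come from the embedding model.
chunk_vectors = np.random.rand(5, 1536)
query_vector = np.random.rand(1536)

scores = [cosine_similarity(query_vector, v) for v in chunk_vectors]
best_index = int(np.argmax(scores))
print(f"Most relevant chunk: {best_index} (score {scores[best_index]:.3f})")
```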

Figure 2. User Question Answering Flow Diagram

The LLM must answer the question by following the instructions specified in the prompt and staying faithful to the given document excerpt. This method keeps the model consistent with a known source and increases traceability by revealing which document excerpt a given answer is based on.

RAG determines the text fragment most closely related to the user’s question. By feeding this fragment together with the question into a large language model, the LLM produces a response in more natural language. Because the answer is built on text fragments taken from the document, the risk of hallucination is reduced. This process makes customer support operations faster, more effective, and more reliable.
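One possible way to wire these steps together with LangChain is sketched below. This is not Albert’s production code: the sample chunks, model names, and retriever settings are assumptions, and module paths differ across LangChain versions.

```python
# Sketch of the end-to-end flow: embed chunks, retrieve the closest one, and
# answer with the LLM while staying faithful to it.
# Requires: pip install langchain openai faiss-cpu
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# chunks: document fragments produced by the text splitter (sample content is made up).
chunks = [
    "To reset your password, open Settings and tap 'Reset password'.",
    "You can reach customer support from the Help section of the app.",
]

vectorstore = FAISS.from_texts(chunks, OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=retriever,
    return_source_documents=True,  # expose which fragment the answer is based on
)

result = qa_chain({"query": "How do I reset my password?"})
print(result["result"])
print(result["source_documents"])
```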

4. Evaluation

Evaluating LLM systems with metrics is an important way to track them objectively. Although efforts to evaluate RAG systems are still ongoing, some tools stand out. One of them, RAGAS, groups its metrics into two categories: generation and retrieval. For generation, there is faithfulness, which measures how factually consistent the generated response is with the retrieved context, and answer relevancy, which measures how relevant the generated response is to the question. For retrieval, there is context precision, which measures the signal-to-noise ratio of the retrieved text, and context recall, which measures how much of the information needed to answer the question the system can retrieve. With these metrics, the developed RAG system can be monitored and tracked visually.
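A sketch of how such an evaluation might be run with RAGAS; the sample record is invented for illustration, and the exact dataset column names and API vary between RAGAS versions.

```python
# Sketch of evaluating the RAG pipeline with RAGAS.
# Requires: pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Hypothetical evaluation sample: question, retrieved contexts, generated answer,
# and a reference answer (ground truth) written by the support team.
samples = {
    "question": ["How do I reset my password?"],
    "contexts": [["To reset your password, open Settings and tap 'Reset password'."]],
    "answer": ["You can reset your password from the Settings screen."],
    "ground_truth": ["Open Settings and tap 'Reset password'."],
}

results = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
```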

5. Conclusion

The Retrieval Augmented Generation (RAG) technique implemented in Albert’s customer support operations enables more effective and reliable answers to users’ questions. The process begins by converting documents into a vector database with an embedding model and then selecting the fragment most relevant to the user’s question. The selected fragment is added to a prompt designed for the LLM, which then answers the question.

Apart from the RAG system described and implemented in this text, there are also advanced-level RAG systems. These systems include additional operations in the pre-retrieval, retrieval, and post-retrieval steps. Additionally, modular RAG systems exist, which incorporate techniques such as hybrid search exploration, sub-queries, and recursive retrieval.

Certainly, this method has some disadvantages. Its performance is highly dependent on the quality of the selected closest fragment. This technique requires existing data, such as a document. Additionally, there is a limited context length that the model can process, and a well-designed prompt is necessary to stay within this limit.

Albert’s constantly updated and improved user guide also plays a fundamental role in providing the assistance users need. Therefore, a process is needed to ensure that the current version of the document is fed into the embedding model. Additionally, as the number of documents increases, a clean and well-designed process is required, since this growth affects the sustainability and consistency of the system. Furthermore, the most relevant text fragment may sometimes not be identified precisely, which negatively affects the response given by the LLM.

This article described a GPT-based feature that uses the RAG technique to make Albert’s customer support operations more effective, reliable, and user-centric while reducing the risk of hallucination and increasing customer satisfaction.

Until our next writing, take care!
