Gentle Introduction to Retrieval Augmented Generation

Sarang Sanjay Kulkarni
9 min readApr 21, 2024


This is the second article in the series on Retrieval Augmented Generation. You can find the previous article here.

Introduction:

As a quick refresher from the first part: Large Language Models (LLMs) have a remarkable ability to generate human-like text. Despite these advanced abilities, LLMs face challenges in providing accurate, current, and contextually appropriate information, especially when dealing with proprietary or domain-specific content they were not pre-trained on. To overcome these issues, knowledge-augmentation methods such as prompt engineering, Retrieval Augmented Generation (RAG), and fine-tuning have been introduced. These methods enhance LLMs by integrating external data sources or real-time information, thereby improving the precision and relevance of their responses for tasks that need up-to-date knowledge or specialised understanding.

In this article, we will cover the theory of Retrieval Augmented Generation; in a follow-up article, we will implement a simple RAG pipeline using LangChain, Python and Chroma.

What is Retrieval Augmented Generation?

Retrieval Augmented Generation (RAG) enhances the capabilities of LLMs by integrating “retrieved facts” from external sources, grounding the model's responses in the most accurate, up-to-date information.

You can think of asking questions to a RAG chatbot as an “open book” exam. Whenever the chatbot is asked a question, it first goes through the “book”, finds the relevant information, and produces an answer based on it.

The easiest way to understand this (as explained in the previous article) is to look at the conversation below. Here, the question asked to ChatGPT is beyond its knowledge cutoff date, and hence it cannot answer.

However, if you feed the relevant information directly into the prompt as context, the LLM is able to generate an answer that is grounded in that context. In the following example, you can see that after the relevant context is provided in the prompt, ChatGPT is able to answer a question it was not trained on.
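To make this concrete, below is a minimal sketch of such a “context-stuffed” prompt. The news snippet and question are illustrative placeholders, not the actual conversation from the example above.

```python
# Sketch: manually pasting retrieved context into the prompt so the answer is grounded in it.
# The context snippet and the question are illustrative placeholders.
context = (
    "On 12 March 2024, ExampleCorp announced the release of its new "
    "FooBar 2.0 product, which adds offline support and a redesigned UI."
)
question = "What new features were announced for FooBar 2.0?"

prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}"""

# This prompt can be pasted into ChatGPT (or sent to any chat model's API);
# the model now answers from the supplied context rather than relying on its training data.
print(prompt)
```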

RAG is a method that brings this “context” in automatically, based on the question asked. In any practical use case, the user needs such context to be added to the prompt automatically so that the LLM's responses are grounded in facts. A simplified way to look at the RAG architecture is shown below.

Idea behind RAG

RAG combines the best of both worlds: the depth of pre-trained language models and the precision of real-time data retrieval. This approach allows the model to generate more informed and contextually relevant responses by consulting external sources.

If you have read this far, you already know the idea behind RAG. You will also notice that the “Context Retriever” is the most important component in this architecture, as it brings in the relevant context for the question. If this component fails to bring the right text chunks as context, the chatbot won't be able to answer.

Here is how the RAG process unfolds:

  1. A user poses a question to the chatbot.
  2. The retrieval system searches through a vast database to find relevant information.
  3. This information, together with the original user question, is then handed over to the generative model to produce a well-crafted response.
Diagram 1.1 Simplified components in basic RAG system

As per the above illustration, we can divide a RAG system into two sections:

  1. Retrieval
  2. Generation

In the next section, let's go deeper into this “Retrieval” component.

Retrieval — Bring the relevant context:

Next, let's understand how RAG finds the relevant information. The short answer is “semantic search”. Let's explore what semantic search is and how it differs from traditional keyword search.

Semantic Search:

In a natural language interface, traditional keyword search engines can feel limiting, as they focus on literal word matches instead of the deeper meaning behind your query. Semantic search addresses this challenge, delivering more insightful and relevant results by understanding the intent and context of your search.

Key Elements of Semantic Search

  • Understanding Relationships: Semantic search recognizes the connections between words and concepts. A search for “yoga” may include results related to “meditation” or “mindfulness” due to their strong semantic similarity, while a search for “soccer” may surface documents about “football,” “FIFA,” and “sports,” even if the word “soccer” itself is not explicitly present in them.
  • Avoiding Misinterpretations: Semantic search distinguishes between similar-sounding words with different meanings (like “flower” and “flour”), preventing irrelevant results.
  • The Power of Context: If you search for “Apple,” semantic search infers whether you’re interested in the fruit or the company, depending on the context. If you type “Apple products”, it will not fetch fruits, tailoring results according to context.
  • Polysemy and Synonyms: If you search for “Bank,” the search engine retrieves results related to “financial institution,” “river bank,” and “banking services,” demonstrating its ability to handle polysemy (multiple meanings) and synonyms.

How does semantic search work?

Embeddings: Technically, semantic search relies on embeddings, which are detailed numerical representations that capture the meanings of words, phrases, and sentences. Embeddings are created by machine learning models that analyse massive amounts of text and convert it into these numerical representations.

Examples of embedding models include OpenAI's text-embedding-3-small/large and text-embedding-ada-002, BERT-based models, and others.
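To get a feel for what “semantic similarity” means numerically, here is a rough sketch that embeds a few words with an open-source sentence-transformers model and compares them using cosine similarity. The model name is just one commonly used choice, not a requirement.

```python
# Sketch: embeddings place semantically related words close together in vector space.
# "all-MiniLM-L6-v2" is one widely used open-source embedding model; any embedding model works.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["yoga", "meditation", "flour", "flower"]
vectors = model.encode(words)  # one embedding vector per word

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("yoga  vs meditation:", cosine(vectors[0], vectors[1]))  # expected: relatively high
print("yoga  vs flour     :", cosine(vectors[0], vectors[2]))  # expected: relatively low
print("flour vs flower    :", cosine(vectors[2], vectors[3]))  # similar spelling, different meaning
```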

Vector Database: These embeddings are stored in vector databases, where texts with similar meanings end up close together. Think of it like a giant map where related concepts are clustered together. In the example below, you can see that the embeddings for animals and for fruits form separate groups.

Distance Calculations: When a query is presented to the vector database, its embedding is calculated and compared to the embeddings of the stored documents. The closest documents (typically measured using cosine similarity or another distance metric) are considered the most semantically relevant to the query. In the following example, if we query for “Kitten”, the query embedding will likely fall into the “Animals” group, and we can retrieve the k nearest neighbours, which will be highly relevant to the semantic meaning of the query.


In a RAG chatbot, a large document is split into smaller chunks, which are then embedded and stored in a vector database. As in the example above, chunks are retrieved based on the shortest distance from the query's embedding.
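Since the follow-up article will use Chroma, here is a minimal sketch of storing a few chunks and querying them by semantic similarity. The collection name and example texts are made up for illustration; Chroma embeds the documents with its default embedding function.

```python
# Sketch: store a few text chunks in Chroma and retrieve the nearest ones for a query.
import chromadb

client = chromadb.Client()  # in-memory instance; use a persistent client in real projects
collection = client.create_collection("demo_chunks")

collection.add(
    ids=["1", "2", "3", "4"],
    documents=[
        "Cats are small domesticated animals.",
        "Dogs are loyal companion animals.",
        "Apples and bananas are common fruits.",
        "Mangoes are tropical fruits rich in vitamin C.",
    ],
)

# Chroma embeds the query and returns the k nearest chunks by distance.
results = collection.query(query_texts=["Kitten"], n_results=2)
print(results["documents"])  # expected to surface the animal-related chunks
```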

In conclusion, semantic search enables finding the most relevant information based on meaning and not just keywords. This is the core idea behind Retrieval Augmented Generation.

When we want to productionise our chatbot, semantic search alone may fall short. There are other methods to bring in the relevant context, such as:

  • “Hybrid Search” — where keyword and semantic search are combined (see the score-fusion sketch below this list)
  • “Knowledge graph” — where relationships between data are explicitly mapped, facilitating a more structured retrieval of information. This approach is particularly effective in complex domains where understanding the interconnections between different entities enhances the relevance and accuracy of the responses provided by the chatbot.

In practice, semantic search alone often fails to bring back all the relevant chunks, which necessitates more advanced retrieval techniques. Each of these topics warrants a separate article of its own.
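As one illustration of the “Hybrid Search” idea above, a common approach is to run keyword and semantic retrieval separately and then merge their ranked lists, for example with reciprocal rank fusion. The sketch below assumes you already have two ranked lists of chunk IDs from the two retrievers.

```python
# Sketch: merge keyword-search and semantic-search rankings with reciprocal rank fusion (RRF).
# keyword_ranked and semantic_ranked are assumed to be lists of chunk IDs, best match first.

def reciprocal_rank_fusion(keyword_ranked, semantic_ranked, k=60):
    scores = {}
    for ranking in (keyword_ranked, semantic_ranked):
        for rank, chunk_id in enumerate(ranking):
            # Standard RRF: each list contributes 1 / (k + rank), with rank starting at 1
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)  # highest fused score first

keyword_ranked = ["c3", "c1", "c7"]
semantic_ranked = ["c1", "c4", "c3"]
print(reciprocal_rank_fusion(keyword_ranked, semantic_ranked))
# chunks ranked highly by both retrievers ("c1", "c3") float to the top
```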

Indexing

So far we have discussed semantic search as the key idea behind retrieving the relevant context. However, to be able to retrieve this information, the data needs to be organized in a way that makes it easily searchable. This is done through a process called indexing.

The diagram below describes the indexing/ingestion process.

Following are the steps that you need to take for indexing:

  1. Text Extraction: In most cases, an organization's data resides in PDFs, PPTs, Word docs, or other file formats. The first step is extracting the text from these files. This is the most crucial step, as any error or data corruption during extraction can make documents non-searchable or cause them to show up in irrelevant searches, a phenomenon known as “garbage in, garbage out”.
  2. Text Chunking/Splitting: Once you extract text from these files, the next step is to split the documents into meaningful chunks. This ensures that we can pass several such chunks to the language model along with a question, so that it generates a relevant answer for the user query. “Meaningful” is the most important word here: our responsibility is to make sure that each chunk conveys meaning on its own, and splitting text in a way that breaks its context can cause errors. There are various approaches to text chunking, including splitting by character, recursive splitting by character, and semantic chunking.
  3. Embedding & Vector Store: Once we have right-sized chunks of text, we embed each chunk using an embedding model. Once the embeddings are generated, they are placed in a vector store. Popular vector stores include Weaviate, Qdrant, Pinecone, Milvus and many others. (A short indexing sketch follows this list.)
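Putting these three steps together, a minimal indexing sketch might look like the following. It assumes the raw text has already been extracted (step 1), uses LangChain's recursive character splitter for chunking, and lets Chroma handle embedding and storage. The chunk size, file path, and collection name are illustrative choices, and the splitter's import path can differ between LangChain versions.

```python
# Sketch: one-time indexing pipeline - chunk the extracted text and store it in Chroma.
import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1 (assumed done): text has already been extracted into this placeholder file
raw_text = open("extracted_document.txt", encoding="utf-8").read()

# Step 2: split into overlapping chunks so each chunk stays meaningful on its own
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(raw_text)

# Step 3: embed and store (Chroma applies its default embedding function here)
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection("company_docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
)
print(f"Indexed {len(chunks)} chunks")
```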

All the steps so far are data-preparation steps and are a one-time activity. By the end of this process, your data is available in the vector store, ready to be retrieved based on the user's query.

Generation

Generation is the final step in the RAG pipeline. The retrieved context from the previous stage is presented to the LLM along with the original query. Using proper prompt engineering, the LLM produces a response that is grounded in the context and in the expected format.

Since the answer is generated on the basis of specific text chunks, it becomes easy to provide citations in the response, which further improves the trustworthiness of the answer: the user can look up the exact text chunks from which the information was retrieved.
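A minimal sketch of this generation step is shown below, assuming the retriever has returned a list of text chunks. Numbering the chunks in the prompt lets the model cite which chunk each statement came from; the prompt wording and model name are illustrative choices, not the only way to do this.

```python
# Sketch: generation step - pass numbered chunks as context and ask the model to cite them.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(question, retrieved_chunks):
    # Number the chunks so the model can cite them, e.g. "[2]"
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    prompt = (
        "Answer the question using only the numbered context below. "
        "Cite the chunk numbers you used, like [1]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```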

Putting all these points together, a basic RAG chatbot can be visualised as below.
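In code, the whole loop reduces to “retrieve, then generate”. Here is a compact sketch that reuses the Chroma collection built during indexing and the generate_answer helper sketched above:

```python
# Sketch: end-to-end RAG loop - retrieve the most relevant chunks, then generate a grounded answer.
# Assumes `collection` was built in the indexing sketch and `generate_answer` is defined as above.
def rag_chat(question, collection, k=4):
    hits = collection.query(query_texts=[question], n_results=k)
    retrieved_chunks = hits["documents"][0]  # top-k chunks for this question
    return generate_answer(question, retrieved_chunks)

# Example usage:
# print(rag_chat("What does our leave policy say about parental leave?", collection))
```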

Real-World Applications of RAG

To bring the concept of RAG closer to home, let’s look at some real-world applications:

  • Customer Support Bots: RAG is the backbone of powerful customer service chatbots that can pull precise answers from extensive databases. These chatbots can handle a large share of repetitive customer requests and complaints.
  • Search Engines: perplexity.ai is an excellent example of a RAG-based chatbot that provides real-time search results and instant answers to user queries. By leveraging this technology, users can receive accurate and up-to-date information without having to manually search through multiple sources.
  • Efficiency in Specialized Domains: In the finance, healthcare, and legal sectors, RAG helps solve complex problems by integrating generative AI with domain-specific knowledge. In healthcare, being able to precisely access a person's entire medical record helps find the root causes of ailments very effectively. In life-science R&D, we at ThoughtWorks have built a RAG-based chatbot for a client that saves scientists a lot of time by precisely finding the information needed for their research, information that was practically impossible to find without such a chatbot. This is a boon to drug development.

Conclusion

RAG is a method that is reshaping the landscape of Gen AI chatbots. By understanding and utilizing RAG, you can create intelligent systems that not only answer questions but do so with an impressive level of detail and relevance.

In the next article, we will get our hands dirty with code and build a basic RAG chatbot from scratch. You can find the next article here.
