How to Build a Retrieval-Augmented Generation (RAG) System💡

Learnings from a real-world RAG application

Lars Wiik
10 min read · Mar 20, 2024

Have you ever wondered how the giants of the tech world manage to keep their Large Language Models (LLMs) up-to-date in our rapidly changing digital landscape?

The secret lies not in constant retraining, but in an innovative solution known as Retrieval-Augmented Generation (RAG).

Throughout 2023, I had the opportunity to build a RAG system from the ground up, collaborating closely with a team of skilled machine learning engineers.

This venture has provided me with a wealth of knowledge on the benefits and challenges associated with RAG — which I’m eager to share!

RAG illustration generated by ChatGPT

Introduction to RAG 🌟

The concept of enhancing Large Language Models (LLMs) using custom knowledge has generated considerable excitement across the tech community.

A significant breakthrough in this realm is the development of RAG systems, which represent a leap forward in our ability to update LLMs efficiently.

By merging a dynamic knowledge base with a retrieval mechanism, RAG systems can refresh the LLM’s accessible information swiftly, bypassing the need for comprehensive retraining.

But before we dive into how to build such a system, let's take a quick look at the short history of RAG.

Understanding the history of RAG is crucial not only for appreciating its current capabilities but also for grasping the rapid pace of innovation in artificial intelligence.

A Quick Look at the History of RAG 🔍

In May 2020, researchers at Facebook AI (now Meta AI) released "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al.), the paper that coined the term RAG and marked the dawn of a new era in knowledge integration with language models.

The idea of this new system was to search an up-to-date knowledge base for relevant information before generating a response with the language model, making the model's answers more accurate, factual, and current.

However, LLMs and RAG were not mainstream back in 2020. It wasn't until months after ChatGPT’s release on the 30th of November 2022 that interest in RAG started to rise.

The diagram below shows the Worldwide Google Trends for “Retrieval Augmented Generation” from 2020 to 2024.

Note the upward trend starting to gain momentum in May of 2023.

Worldwide Google Trends for “Retrieval Augmented Generation” from 2020–2024

RAG — Explained 💡

The idea of a RAG system is to enhance a general large language model with additional context, giving the LLM access to relevant, case-specific information before it replies to the user.

Large Language Models often struggle with providing current or context-specific answers, as they’re trained on past data. RAG addresses this by pulling in the latest information from a wide range of sources in real-time.

You can think of RAG as a support person who quickly googles your issue and reads the top 10 related articles to offer you an up-to-date and informed answer, all in real time.

RAG systems can revolutionize various sectors, from customer service bots to educational platforms offering personalized learning experiences, and even the financial sector by enhancing analysis tools that utilize the latest market data for more accurate forecasting.

We can simplify a RAG system into three main components (a toy code sketch of how they fit together follows this list):

  • Knowledge Component: The Knowledge Component is essentially the system’s database. It is a data store holding all relevant information: documents, news articles, scientific journals, web pages, PDFs, etc.
  • Retrieval Component: The Retrieval Component searches the Knowledge Component to extract the most relevant information based on the user query. The Retrieval Component is crucial for the efficiency and accuracy of the RAG system, as it determines what the Text Generation Component can see.
  • Text Generation Component: The Text Generation Component is usually a large language model responsible for converting the user query and the retrieved knowledge into a meaningful response in natural language. The LLM is prompted to read the retrieval context in addition to a specific instruction set according to the case at hand.
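To make these three components concrete, here is a toy, self-contained sketch of how they fit together. The in-memory knowledge base, the keyword-overlap retrieval, and the canned generation step are stand-ins of my own so the example runs without any external services; a real system would swap in a proper search index and an LLM.

# Toy sketch of the three RAG components working together.
# Knowledge Component: a tiny in-memory "database" of documents.
KNOWLEDGE_BASE = {
    "shipping": "We offer free shipping on all orders over $50.",
    "returns": "Items can be returned within 30 days of delivery.",
}

def retrieve(query: str, top_n: int = 1) -> list[str]:
    """Retrieval Component: rank documents by naive keyword overlap."""
    words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE.values(),
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]

def generate(query: str, context: list[str]) -> str:
    """Text Generation Component: in a real system this is an LLM call."""
    return f"Based on our records: {' '.join(context)}"

print(generate("Do you offer free shipping?", retrieve("Do you offer free shipping?")))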

The query from the user is known as the query parameter.

When designing a RAG system, it is crucial to define the format of the query parameter, since it will be used for context retrieval.

Here are some examples of query parameters:

  • Chat history
  • A single short query question
  • A long complicated paragraph
  • A set of keywords
  • etc…

Let’s take a look at an example scenario

Consider a scenario where a shopper navigates to an online e-commerce site and initiates a chat with the website’s chatbot to inquire about the availability of free shipping.

The query parameter in this case can be formulated as a chat history. For now, let’s assume there is only one question:

“USER: Do you offer free shipping?”

Given this question, the RAG system will convert the conversation into a standardized format optimized for the retrieval component.

This transformation might involve converting the text into an embedding — a lengthy vector of numbers that encapsulates the essence of the text.
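As an illustration, turning the query into an embedding could look like the snippet below. The sentence-transformers library and the specific multilingual model name are my own assumptions for this sketch, not part of the original stack.

# Sketch: convert the user query into an embedding vector.
# The model choice is an assumption; any sentence-embedding model works the same way.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query = "USER: Do you offer free shipping?"
query_embedding = model.encode(query)  # a fixed-length vector of floats (384 dimensions for this model)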

The Retrieval Component then searches the Knowledge Component to identify and rank the top N most relevant documents. These documents are known as the Retrieval Context.

The Retrieval Context might include pages like the return policy and the Frequently Asked Questions (FAQ) section in our scenario.
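Continuing the sketch, retrieving the top N chunks could be a kNN query against ElasticSearch, assuming an 8.x cluster with a dense_vector field. The index name and field names below are illustrative assumptions.

# Sketch: fetch the top-N most relevant chunks from ElasticSearch via kNN search.
# Index/field names ("knowledge-base", "embedding", "text") are illustrative.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query_embedding = model.encode("USER: Do you offer free shipping?")

response = es.search(
    index="knowledge-base",
    knn={
        "field": "embedding",
        "query_vector": query_embedding.tolist(),
        "k": 10,                # top N documents to return
        "num_candidates": 100,  # candidates considered per shard
    },
    source=["text"],
)
retrieval_context = [hit["_source"]["text"] for hit in response["hits"]["hits"]]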

Upon completing the retrieval phase, the system feeds the chat dialogue and the Retrieval Context into the Text Generation Component (LLM) along with our specific instruction set.

# INSTRUCTIONS
Act as a support person for an ecommerce shop.
Read the retrieved context.
Answer the user question politely.

# RETRIEVED CONTEXT
...
...
...

# CHAT
Do you offer free shipping?

The LLM will generate a response based on the prompt, which is then passed back to the user within the chat.
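For completeness, here is a hedged sketch of that generation call using OpenAI's Python SDK. The placeholder retrieval context and the model name are assumptions for illustration.

# Sketch: feed the instructions, retrieved context, and chat into the LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

retrieval_context = [  # placeholder for the documents returned by the retrieval step
    "Orders over $50 ship for free.",
    "Standard delivery takes 3-5 business days.",
]
prompt = (
    "# INSTRUCTIONS\n"
    "Act as a support person for an ecommerce shop.\n"
    "Read the retrieved context.\n"
    "Answer the user question politely.\n\n"
    "# RETRIEVED CONTEXT\n" + "\n\n".join(retrieval_context) + "\n\n"
    "# CHAT\n"
    "Do you offer free shipping?"
)
response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)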

The result is that the user receives a relevant and informative response, grounded in the specific details pulled from the Retrieval Context.

This process not only ensures accuracy but also significantly improves the user experience by providing answers that are tailored to their specific inquiries.

RAG illustration generated by ChatGPT

Designing a RAG System 🎨

It is critical to define the system’s capabilities and limitations prior to architecting a RAG system.

Here is a list of things to consider:

  • File type support: What data types do you want to support? Only text? Long documents or only paragraphs?
  • The data storage dilemma: Where are you storing your data?
  • The choice between databases: Should you use a vector database or not?
  • Data management tactics: How do you update and delete items within your storage component?
  • Confidentiality concerns: Is there any secret information within your stored data that might be exposed?
  • Synchronizing the retrieval component: How will the retrieval component handle updates of your storage component?
  • Assessing scalability: How scalable will your storage and retrieval component be?
  • Speed of access: How do you ensure fast retrieval?
  • Retrieval configurations: How many documents should we fetch per retrieval?
  • Language support: Will you support only a single language or will it be multilingual?
  • Performance metrics: How do you evaluate the performance of your RAG system?
  • Limiting factors: What is the context limit of your LLM? And how can you ensure your prompt fits within that context window?
Illustration of long-term planning

Practical Example of a RAG stack 🌟

Let’s create an example of a RAG stack and see what technical questions arise.

For the sake of simplicity, let’s use AWS S3 for our Storage Component, ElasticSearch for our Retrieval Component, and OpenAI’s GPT-4 for our Text Generation Component.

ElasticSearch: I will not go over the details of ElasticSearch for now. All you need to know is that we can store and retrieve data from ElasticSearch, and, to save us even more time, it also supports similarity search over custom embeddings!

Here is our example RAG stack:

  • Storage component: AWS S3.
  • Retrieval component: ElasticSearch.
  • Text Generation Component: OpenAI GPT-4.
  • Supported files: PDFs, TXT files, Word documents.
  • Retrieval method: Embedding search.
  • Multilingual: Yes.

With this configuration in place, let’s walk through the operational questions that arise for each component.

Data Storage

Since we want to be able to process PDFs, TXT files, and Word documents, we would need to create three transformation components, one for each input type.

The output of each component will be preprocessed text divided into chunks, which is inserted into ElasticSearch, while the raw input files themselves are stored in our AWS S3 bucket.

With this setup, we can overwrite the raw data files, recreate the preprocessed files based on the new files, and update ElasticSearch on the fly.
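As a minimal sketch of one such transformation component (the TXT one), ingestion could look like the snippet below. The bucket name, index name, and naive fixed-size chunking are assumptions of mine, and in practice you would also compute and store an embedding per chunk for the similarity search.

# Sketch: ingest a TXT file, store the raw file in S3, index text chunks in ElasticSearch.
# Bucket/index names and the fixed-size chunking are illustrative assumptions.
import boto3
from elasticsearch import Elasticsearch

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")

def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split preprocessed text into fixed-size character chunks."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

def ingest_txt(path: str, doc_id: str) -> None:
    # Storage Component: keep the raw file in the S3 bucket.
    s3.upload_file(path, "my-rag-raw-files", f"raw/{doc_id}.txt")
    # Retrieval Component: index the preprocessed chunks in ElasticSearch.
    with open(path, encoding="utf-8") as f:
        text = f.read()
    for i, chunk in enumerate(chunk_text(text)):
        es.index(index="knowledge-base", id=f"{doc_id}-{i}", document={"text": chunk})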

Retrieval

Selecting an appropriate model for multilingual embeddings is crucial for our indexing and retrieval strategy.

To design for multilingual embedding search, we can leverage an open-source model such as bert-multilingual-passage-reranking-msmarco from Huggingface. As noted in this model’s description:

“Purpose: This module takes a search query [1] and a passage [2] and calculates if the passage matches the query. It can be used as an improvement for Elasticsearch Results and boosts the relevancy by up to 100%.”

You can think of a query as a user’s question to the system, and a passage as a longer chunk of raw text.

Note that the distinction between matching a query to a passage and a passage to a passage is relevant for optimizing the search system:

  • Query-to-Passage Matching: This process involves finding the most relevant passages (long text chunks) that answer or relate to a user’s search query (a reranking sketch for this case follows the list).
  • Passage-to-Passage Matching: This involves comparing two pieces of text to identify similarities or differences; it is useful for tasks like document deduplication, similarity detection, or content recommendation.
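To make query-to-passage matching concrete, here is a hedged sketch of reranking retrieved passages with the model mentioned above via the transformers library. The Hugging Face repo id and the assumption that the second logit corresponds to "relevant" are taken from the public model card and should be verified for your setup.

# Sketch: rerank retrieved passages against the user query with a cross-encoder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "amberoad/bert-multilingual-passage-reranking-msmarco"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def rerank(query: str, passages: list[str], top_n: int = 10) -> list[str]:
    inputs = tokenizer(
        [query] * len(passages), passages,
        padding=True, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    scores = torch.softmax(logits, dim=1)[:, 1]  # assumed: index 1 = "relevant"
    ranked = sorted(zip(passages, scores.tolist()), key=lambda p: p[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]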

Text Generation (LLM)

Determining the optimal number of documents to retrieve per query and ensuring that our context does not exceed GPT-4’s context window are important operational concerns that must be addressed and planned for.

For our retrieval system, we can decide to fetch the top 100 paragraphs from ElasticSearch as an initial filter.

Thereafter, we should drop the worst-matching paragraphs based on the number of tokens remaining in our LLM prompt.

For this, we can use a library from OpenAI called tiktoken that counts the number of tokens given a text string.

You can install Tiktoken with pip by running:

pip install tiktoken

To get an understanding of what a token really is, we can use the following rule of thumb:

OpenAI tokens: 1 token ~= 4 characters in English, 1 token ~= ¾ of a word, and 100 tokens ~= 75 words.

However, to ensure that all 100 paragraphs fit in the context window, we can also select an LLM with a very large context window, such as gpt-4-turbo-preview, which supports 128,000 tokens.
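A minimal sketch of that trimming step with tiktoken might look like the following. The 128,000-token figure is the context window mentioned above, while the headroom reserved for instructions, chat history, and the model's reply is an arbitrary assumption.

# Sketch: keep only as many of the ranked paragraphs as fit within the token budget.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4 models

CONTEXT_WINDOW = 128_000  # gpt-4-turbo-preview context window
HEADROOM = 4_000          # assumed reserve for instructions, chat, and the reply

def trim_to_budget(ranked_paragraphs: list[str]) -> list[str]:
    budget = CONTEXT_WINDOW - HEADROOM
    kept, used = [], 0
    for paragraph in ranked_paragraphs:  # best matches first
        n_tokens = len(encoding.encode(paragraph))
        if used + n_tokens > budget:
            break
        kept.append(paragraph)
        used += n_tokens
    return kept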

Now we have set up a fully functional RAG system running on ElasticSearch with custom embeddings and GPT-4.

This setup ensures not only efficient handling of diverse document types but also enables high-quality, multilingual content generation and retrieval.

There are a number of other challenges I have not covered here, including web scraping, LLM grounding, LLM hallucination, and prompt-injection protection.

And perhaps the hardest of them all: LLM evaluation and retrieval evaluation.

ChatGPT’s illustration of the challenges related to RAG — Note ChatGPT’s spelling imperfection

However, this article is already getting long, so that part will have to wait for another time.

Conclusion

In summary, RAG systems enhance the capabilities of large language models by integrating them with a dynamic, searchable knowledge base.

As these systems continue to evolve, they will undoubtedly play an increasingly important role in how we interact with information and digital assistants.

The discussion about RAG’s future role in AI is ongoing, with some people favoring the advantages of fine-tuning models instead.

However, based on my extensive work with RAG, I can vouch for its unique advantages, such as immediate knowledge updates at a fraction of the cost of fine-tuning.

Moving forward into 2024 and 2025, I anticipate the rise of businesses specializing in providing RAG as a Service due to its complexity and its usage becoming more mainstream.

I will end the article by encouraging you to keep an eye on how RAG evolves throughout 2024 and into 2025. Perhaps you’ll witness firsthand how AI will transform information retrieval forever.

And do not hesitate to reach out if you have any questions!

Through my articles, I share cutting-edge insights into LLMs and AI, offer practical tips and tricks, and provide in-depth analyses based on my real-world experience. Additionally, I do custom LLM performance analyses, a topic I find extremely fascinating and important in this day and age.

My content is for anyone interested in AI and LLMs — Whether you’re a professional or an enthusiast!

Follow me if this sounds interesting!


Lars Wiik

MSc in AI — LLM Engineer ⭐ — Curious Thinker and Constant Learner