Retrieval Augmented Generation (RAG) with LLMs - Part 1

Riya Joshi · Published in PAL4AI · Nov 6, 2023 · 7 min read

Authors: Riya Joshi, Parin Jhaveri

This article is the first in the series where we teach you how to build RAG pipelines using LLMs. This article will introduce you to certain concepts required to understand what RAG pipelines are. Future articles will walk you through the code that will help you build your own pipeline.

Large Language Models (LLMs)

Over the past year, Large Language Models have proved to be incredibly powerful at a variety of natural language tasks. These models are based on the transformer architecture and contain billions of parameters (GPT-3 and GPT-3.5 have about 175 billion parameters; the number of parameters in GPT-4 has not been disclosed). They are trained on an extensive corpus of articles, Wikipedia entries, books, and other internet-based resources to generate responses similar to those of humans. This scaling up of parameters and training data has enabled LLMs to significantly outperform previous language models like BERT.

In fact, this has led to a new paradigm in NLP: prompt engineering. Rather than fine-tuning the model’s parameters for a particular task, all you have to do is type out a natural language instruction (a prompt) describing the task and pass it as input to the LLM.
This allows anyone and everyone to prompt an LLM to help them solve tasks such as summarization, paraphrasing, and more. Designing the right prompt to extract the desired output from an LLM is called prompt engineering.
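As a minimal sketch of what this looks like in practice (assuming the openai Python package, v1.x, with an API key in the OPENAI_API_KEY environment variable; the model name and prompt are illustrative):

```python
from openai import OpenAI

# Assumes the OPENAI_API_KEY environment variable is set.
client = OpenAI()

# The "prompt" is just a natural-language instruction plus the text to work on.
prompt = (
    "Summarize the following paragraph in one sentence:\n\n"
    "Large Language Models are transformer-based models with billions of "
    "parameters, trained on large text corpora to generate human-like responses."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # any chat-capable model works here
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Changing nothing but the instruction in the prompt turns the same call into a paraphraser, a translator, or a classifier; that is the essence of prompt engineering.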

Applications of LLMs

Many companies are now utilizing LLMs due to their diverse applications, such as:

  • Microsoft has integrated the GPT-3.5 and GPT-4 models into many of its products, including Bing AI, a chatbot on the Bing search engine. This enables users to have conversational interactions and revolutionizes the way we search. These models also power GitHub Copilot, which provides code-completion suggestions.
  • Google has developed its own chatbot, Bard, and plans to integrate it with Search.
  • Udacity has integrated GPT-4 into its online course platform to provide personalized guidance and explain concepts to students through a virtual tutor.
  • Grammarly, the online writing assistant, has integrated GPT-4 for improved writing suggestions.

One of the major applications of LLMs is building chatbots. OpenAI’s ChatGPT models (conversation-tuned LLMs based on GPT-3.5 and GPT-4) do really well at answering questions about data they have seen during their pre-training phase (such as general information present in Wikipedia), because they have in some sense “memorized” this information.

Now, think of a scenario where you work at a legal firm and wish to build a chatbot that promptly responds to your customers’ questions about their particular legal cases. This data is private, and since ChatGPT has never seen it before, it wouldn’t be able to answer the questions customers ask about their legal cases.

Could we re-train or fine-tune ChatGPT to learn this data? Sure, if you have tons of dollars to spare!

Fortunately for us, NLP researchers have found a better way to deal with this problem — Retrieval Augmented Generation (RAG).

Retrieval Augmented Generation (RAG)

RAG allows you to use internal data or new data that ChatGPT may or may not have seen to generate answers to queries. This helps companies build conversational agents or chatbots on top of their proprietary and private data (think back to the legal firm case).

Overview

Fundamentally, the RAG pipeline is designed to do the following (a short code sketch of these steps follows the list):

  1. Create vector embeddings of the documents.
  2. Given a query/question from the user, create its vector embedding (the query embedding).
  3. Retrieve the top-k documents most semantically similar to the query, using a similarity function such as dot product or cosine similarity between the query embedding and each document embedding.
  4. Include the query and the text of the top-k documents from step 3 in the prompt to an LLM. The LLM’s response is the answer to the query.
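Here is that flow in a few lines of Python. The embed and generate arguments are hypothetical placeholders for an embedding model and an LLM call; concrete sketches of each step appear in the component sections below.

```python
import numpy as np

def rag_answer(query, documents, embed, generate, k=3):
    """Sketch of the four RAG steps; `embed` and `generate` are
    placeholders for an embedding model and an LLM, respectively."""
    # 1. Embed every document.
    doc_vectors = np.array([embed(doc) for doc in documents])
    # 2. Embed the user query.
    query_vector = np.array(embed(query))
    # 3. Score each document by dot product and keep the top-k.
    scores = doc_vectors @ query_vector
    top_k = np.argsort(scores)[::-1][:k]
    context = "\n\n".join(documents[i] for i in top_k)
    # 4. Ask the LLM to answer the query using the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```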

Components of RAG

Let us discuss the components of the RAG pipeline in more detail using the diagram below. Throughout this series of articles, we will use OpenAI LLMs. You can also use other closed-source or open-source LLMs, such as Claude 2 (closed-source) or Llama-2-13B (open-source).

A typical RAG pipeline

For any RAG pipeline, we have at least these 3 basic components:

1. Data Ingestion

As part of this component, we preprocess, chunk, and create vector embeddings of the data so that it is available to be used as part of the RAG pipeline.

The data can be any company-specific internal dataset over which you need to build a Conversational AI solution, or it can be newer, updated data that is not part of the LLM’s implicit “memory”. Company-specific data is often not readily consumable. You may need to spend a considerable amount of time preprocessing and cleaning the data to bring it into the required format.

Once the data is preprocessed, you will need to chunk the data. Yes, I know this is the third time we have used the word “chunk” without explaining it to you. Don’t worry! We got you!

Why chunk? Well, if you check the model configurations page, you will see that every model has a defined context length: the maximum number of tokens that can be included in a prompt to the LLM. Context length is model-dependent and can range from 512 tokens or fewer to 16K tokens or more. Exact token counts depend on the tokenizer (the component that splits text into smaller units called tokens; OpenAI models use a byte-pair-encoding tokenizer, available through the open-source tiktoken library), but as a conservative simplification we will treat the smallest possible token as a single character. GPT-3.5-Turbo, the LLM we will be using in this series, has a context length of 4K tokens.
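If you want an exact count rather than the conservative estimate, here is a quick sketch with tiktoken (the text below is just an example string):

```python
import tiktoken

# Look up the encoding used by gpt-3.5-turbo; other models may use different encodings.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Retrieval Augmented Generation lets an LLM answer questions over your own data."
tokens = enc.encode(text)
print(len(tokens))         # number of tokens this text consumes in the prompt
print(enc.decode(tokens))  # decoding reproduces the original text
```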

If our data consists of more than 4K tokens, then to fit it into the prompt we need to break the data into smaller units, or chunks. But how do we determine our chunk size? This is a design decision, and it depends on the last step in the RAG pipeline: prompting the LLM to answer the given query with the top-k most relevant documents (or, in our case, chunks).

The prompt consists of the following:

  • Instructions on how to answer any given query
  • Query given by the user
  • Top-k chunks that are most relevant to the user query

The total length of the prompt must stay under 4K tokens. Based on the length of your instructions, an estimate of the query length, and the number of top chunks you choose to retrieve (top-k), you can work out your chunk size.
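As a rough sketch of that arithmetic, plus a naive character-based chunker that relies on the one-character-per-token simplification above (the instruction, query, and response budgets and the file name are purely illustrative):

```python
def budget_chunk_size(context_length=4096, instruction_tokens=500,
                      query_tokens=100, response_tokens=500, k=5):
    """Illustrative token budget: whatever remains after the instructions,
    the query, and room for the model's answer is split across k chunks."""
    return (context_length - instruction_tokens - query_tokens - response_tokens) // k

def chunk_text(text, chunk_size=512):
    """Naive fixed-size chunking by characters. Every token is at least one
    character, so a 512-character chunk can never exceed 512 tokens."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(budget_chunk_size())  # 599 -> each chunk can be up to ~599 tokens
chunks = chunk_text(open("legal_docs.txt").read())
```

The sketch also reserves some tokens for the model’s answer, since for chat models the context length is shared between the prompt and the generated response.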

Let’s say we choose a chunk size of 512 tokens. Once we have chunked our data into pieces of 512 tokens, the next step of the data ingestion pipeline is to vectorize each chunk. This is necessary because, during retrieval, we need to apply a similarity function between the vector embeddings of the chunks (we use “document” and “chunk” interchangeably) and the query embedding.

For this series, we will use the OpenAI Embeddings API to create vector embeddings of our documents.
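A minimal sketch with the openai package (v1.x); text-embedding-ada-002 was OpenAI’s standard embedding model at the time of writing, and chunks is the list produced by the chunking sketch above:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_texts(texts, model="text-embedding-ada-002"):
    """Return one embedding vector (a list of floats) per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

chunk_embeddings = embed_texts(chunks)  # one vector per chunk
```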

Once we have the vector embeddings of our data, the next step is to store them somewhere they can be accessed easily. Vector stores are special databases that store embedded vectors and make searching over them much easier and faster. Some examples of vector DBs are Chroma, AWS OpenSearch, and Lance. For our project, we will use an in-memory vector store built on the Facebook AI Similarity Search (FAISS) library, which enables efficient similarity search (finding the top-k most relevant documents).
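A sketch of such an in-memory index with FAISS; IndexFlatIP performs exact inner-product (dot-product) search over everything it holds in memory:

```python
import faiss
import numpy as np

# chunk_embeddings comes from the embedding step above.
vectors = np.array(chunk_embeddings, dtype="float32")  # shape: (num_chunks, dim)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner-product (dot-product) index
index.add(vectors)                           # store all chunk vectors in memory
```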

2. Retriever

When an input query comes in, it is embedded using the same embedding model (the OpenAI Embeddings API in our case).

A similarity function such as cosine similarity or dot product is applied between the query embedding and each chunk embedding stored in the vector DB. In our case, we will use the dot product. The higher the value of the dot product, the more semantically similar and hence relevant the chunk is to the query. The top-k most semantically similar chunks are returned.

Here we have narrowed down our search space for the answer to this query from the whole set of documents to only the top-k most relevant ones.
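Continuing the sketch (reusing embed_texts, index, and chunks from the data ingestion step; the query string is just an example):

```python
import numpy as np

def retrieve(query, k=5):
    """Embed the query and return the k chunks with the highest dot-product scores."""
    query_vector = np.array(embed_texts([query]), dtype="float32")  # shape: (1, dim)
    scores, indices = index.search(query_vector, k)                 # exact FAISS search
    return [chunks[i] for i in indices[0]]

top_chunks = retrieve("What is the current status of my case?")
```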

3. Reader

Once we have retrieved the top-k documents most relevant to our input query, the next step is to generate a human-like response. As part of the prompt, we pass these top-k chunks of text, along with the query and a set of instructions, to a large language model, in our case GPT-3.5-turbo-16k.

The set of instructions describes how you want the answer to look. For example, you can pass guidelines such as: answer in a specific format (e.g., bullet points), avoid harsh language, start with a polite greeting, return references for the answer (such as page numbers and other metadata), and so on. We will get into more detail in the next articles of the series.

Finally, the LLM generates a response based on this prompt.
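A sketch of this reader step (reusing client from the embedding sketch and top_chunks from the retriever; the instructions and model name are illustrative):

```python
def answer(query, top_chunks, model="gpt-3.5-turbo-16k"):
    """Ask the LLM to answer the query using only the retrieved chunks."""
    context = "\n\n".join(top_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("What is the current status of my case?", top_chunks))
```

The “say you don’t know” instruction is a first, very light guardrail against hallucination; the upcoming articles will cover more robust ones.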

Voila! Not too hard, right? For the most part, yes, but RAG comes with its own challenges, and we will discuss those in the following articles. In the next two articles, we will build a simple RAG pipeline from scratch, implement some guardrails to prevent hallucinations, and track conversation history.

Stay tuned!
