🤯AI: Context Stuffing and RAG Patterns in Prompt Engineering

Zoiner Tejada
4 min read · May 8, 2023

--

A core pattern that unlocks powerful capabilities from large language models.

[This article is a part of the 🤯AI series]

In a different post, Memory & Recall in Large Language Models, we introduced the concept of context stuffing, which, as the name implies, gives the LLM more context (usually knowledge that is more recent than the data it was trained on).

Context stuffing is a useful pattern in certain scenarios, but it comes with limitations, namely an upper limit on the amount of text that can be provided as context. There are two factors to this limit. One is the hard upper limit on the number of tokens that can be sent to the model; the other is a practical one: the cost of processing those tokens. The issue with cost is fairly obvious (more tokens cost more), so we’ll delve into the Max Tokens issue here.

49.152 pages

The largest version of GPT-4 is called “gpt-4-32k” and, according to the documentation, it supports a maximum of 32,768 tokens.

OpenAI Documentation for GPT 4 (https://platform.openai.com/docs/models/gpt-4)

OK, so what does this limit mean in practical terms? Let’s do some simple approximations to get a sense of just how much text this supports.

According to the OpenAI documentation, a rough rule of thumb is that 1 token is about 0.75 words of English text.

  • So, this means a maximum of about 24,576 words.
  • Next, assume a typical page of text has about 500 words.
  • Dividing 24,576 words by 500 words/page gets us 49.152 pages (the short calculation below works this through).
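Here is that back-of-the-envelope arithmetic as a few lines of Python; the 0.75 words-per-token ratio and the 500 words-per-page figure are just the rough assumptions from above:

```python
# Back-of-the-envelope math for how much text fits in a 32,768-token window.
MAX_TOKENS = 32_768        # context window of gpt-4-32k
WORDS_PER_TOKEN = 0.75     # OpenAI's rough rule of thumb for English text
WORDS_PER_PAGE = 500       # assumed word count for a typical page

max_words = MAX_TOKENS * WORDS_PER_TOKEN   # 24,576 words
max_pages = max_words / WORDS_PER_PAGE     # 49.152 pages

print(f"{max_words:,.0f} words ≈ {max_pages} pages")
```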

So that’s the upper limit. If you can squeeze all your context into 49.152 pages of data, great. But as you start to combine lots of text to “catch up” your model’s knowledge (as we explain with the 50 First Dates Metaphor) with any form of conversational history (e.g., any previous interactions and results you need to provide the model to keep a dialogue going), you will likely find you quickly exhaust this limit.

Importantly, you don’t get to use all of these 49.152 pages for your context, as the Max Tokens limit also has to cover the tokens used by the generated completion.
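In practice, then, you have to budget the window between everything you send (instructions, stuffed context, conversation history, question) and the completion you expect back. A minimal sketch of that bookkeeping, assuming OpenAI’s tiktoken tokenizer package and an arbitrary reserve for the completion:

```python
import tiktoken  # OpenAI's tokenizer library

MAX_TOKENS = 32_768          # gpt-4-32k context window
COMPLETION_RESERVE = 1_024   # illustrative budget held back for the model's answer

enc = tiktoken.encoding_for_model("gpt-4")

def fits_in_context(prompt_text: str) -> bool:
    """Return True if the prompt leaves enough room for the completion."""
    prompt_tokens = len(enc.encode(prompt_text))
    return prompt_tokens + COMPLETION_RESERVE <= MAX_TOKENS
```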

Context Stuffing is an Optimization Problem

This gets us to an interesting realization. If you can’t put everything in the context, how do you choose what to include in the context?

This is where we like to think of context stuffing as an optimization problem: from all of the possible context, how do you search through that candidate data and provide as context only the most relevant text?

This is the essence of the Retrieval Augmented Generation (or RAG) pattern.

Retrieval Augmented Generation

With Retrieval Augmented Generation (the core idea being a loose interpretation of this research paper), you use some external mechanism to select the text to include in the context.

So, our prompt structure can be something like this:

Example prompt structure
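In plain text, such a prompt template might look something like the following (the wording and field names here are illustrative, not the exact structure from the figure):

```python
# Hypothetical RAG-style prompt template; wording and placeholders are illustrative.
prompt_template = """You are a helpful assistant. Answer the question using ONLY the context below.
If the context does not contain the answer, say you don't know.

Context:
{context}

Question:
{question}

Answer:"""
```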

And our question might be something like this:

Example user question

But because the underlying data used to train the model dates back to 2021, it won’t have an up-to-date answer for this. For the context, we need to find some text, say a recent news article, that provides the most up-to-date information. We could do something like a full-text search for terms matching the question. More typically, this is done with vector search (i.e., an embeddings database): the vectorized form of the query is used to search a database of vectorized documents, and the text document behind the most strongly matching vector is retrieved and used as context.
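A minimal sketch of that pipeline in Python, assuming the openai package’s 0.x-style API (current as of this writing) and a tiny illustrative in-memory document list in place of a real vector database:

```python
import numpy as np
import openai  # assumes the 0.x-style openai Python API

EMBED_MODEL = "text-embedding-ada-002"

def embed(text: str) -> np.ndarray:
    """Vectorize a piece of text with the embeddings endpoint."""
    resp = openai.Embedding.create(model=EMBED_MODEL, input=[text])
    return np.array(resp["data"][0]["embedding"])

# Illustrative corpus of recent documents, embedded once up front.
documents = ["<recent news article 1>", "<recent news article 2>"]
doc_vectors = [embed(d) for d in documents]

def retrieve(question: str) -> str:
    """Return the document whose vector most strongly matches the question."""
    q = embed(question)
    sims = [np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)) for d in doc_vectors]
    return documents[int(np.argmax(sims))]

def answer(question: str) -> str:
    """Stuff only the best-matching document into the prompt as context."""
    context = retrieve(question)
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```

In production the brute-force cosine scan would be replaced by a real vector database, but the shape of the pipeline is the same: embed, search, stuff the best match into the prompt, generate.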

Illustration of processing pipeline using RAG (source: author/Solliance)

Clearly, this helps make significantly better use of however many tokens the model supports.

Crazy🤯.

--


Zoiner Tejada

CEO Solliance | Entrepreneur | Investor | AI Aficionado | Microsoft MVP | Recognized as Microsoft Regional Director | Published Author