Giving Large Language Models Context

Simon Attard
9 min read · May 16, 2023


This article explores the advantages and disadvantages of providing context to Large Language Models to improve performance (instead of fine-tuning). It also explores the use of Vector Databases as a context information source.

Large Language Models (LLMs) are pre-trained on a massive number of training examples at extraordinary cost. This pre-training computes the underlying parameters (weights and biases) of the neural network so as to minimise the differences (or losses) between the model's predictions and the actual training examples. The large number of weights (tens of billions or more) and the diverse training data sometimes allow the model to generalise well.

It is possible to take one of these models and fine-tune it to improve performance on a particular task by performing additional downstream training on new examples. Such fine-tuning is done on a relatively small set of new training examples, meaning that the cost and training time are also relatively small. The additional training examples are curated specifically for the new use case.

This could be done using a number of techniques. Some oversimplified examples of fine-tuning include:

  • Training on the new set of training examples by modifying the parameters (weights and biases) of each layer in the neural network.
  • Adding new output layers to the end of the neural network and freezing the parameters in the original layers. The fine-tuning would be faster because it would only modify the new layers. The original model would also be left intact allowing it to be used for other fine-tuning tasks.
  • Using efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA), where only a subset of the parameters is updated during the fine-tuning process (a minimal sketch follows below).
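As an illustration of the last technique, here is a minimal LoRA sketch using the Hugging Face transformers and peft libraries. This is a sketch only; the base model, target modules and hyperparameters are placeholder choices, not recommendations.

# Minimal LoRA fine-tuning sketch (illustrative only).
# Assumes the Hugging Face transformers and peft libraries are installed;
# the model name and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the updates
    target_modules=["c_attn"],  # which weight matrices get adapters (model-specific)
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights are trainable

# Training then proceeds as usual on the small, task-specific dataset;
# only the LoRA adapter weights are updated, the original weights stay frozen.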

Downstream fine-tuning is very powerful, since the original large neural network would have learnt general features which are then used to quickly learn new features related to your use case.

To understand the concept of how models learn general features it might be easier to look at convolutional neural networks used for computer vision. You can get a pre-trained image classifier trained on a large set of images. This neural network would have learnt features such as identifying edges in an image. In deeper layers more abstract features such as 'eyes' or 'cars' would have been learnt. Now imagine you need to fine-tune a neural network to detect defects on products in a factory production line. The pre-trained model can be fine-tuned quickly on a small sample of your product images. The more general features that it learnt beforehand (such as edge detection or recognising different materials) would be used to achieve good performance on the smaller training sample.
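A minimal PyTorch sketch of the "freeze the pretrained layers, train a new head" idea described above; the two-class defect / no-defect setup is assumed purely for illustration.

# Transfer-learning sketch: reuse a pretrained image backbone, train only a new head.
# Illustrative only; assumes torch and torchvision are installed.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained on ImageNet

# Freeze the pretrained layers so the general features they learnt
# (edges, textures, shapes) are preserved.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with a new, trainable one
# for our task (e.g. two classes: "defect" / "no defect").
model.fc = nn.Linear(model.fc.in_features, 2)

# Only model.fc is updated when fine-tuning on the small product-image dataset.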

Image generated by AI using OpenAI's DALL·E 2

In-Context Learning

An alternative method avoids fine-tuning the model and leaves the model's weights unaltered. Instead, specific training examples can be passed to the model at inference time as part of the prompt. This technique is sometimes referred to as in-context learning.

(In-context learning can also be used in conjunction with fine-tuning)

A simple example would be to include a sample of the required output format in the prompt:

"I will be providing a list of real estate properties. 
Identify the properties that have more than one bedroom
and output a list of selected properties in JSON format.
Use the following sample format:

{
"Properties": {
"House1": {
"NoOfBedrooms:3": {},
"Size:200sqm": {},
"Price:100,000": {}
},
"House2": {
"NoOfBedrooms:2": {},
"Size:150sqm": {},
"Price:75,000": {}
}
}
}

In-context learning refers to a set of techniques in machine learning where multiple input/output example pairs are provided to the LLM as part of the prompt.
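For instance, a handful of input/output pairs can be placed in the prompt before the new input. A minimal sketch, with an invented sentiment-classification task:

# Few-shot (in-context) prompt: example input/output pairs followed by a new input.
examples = [
    ("The delivery was fast and the product works great.", "positive"),
    ("The item arrived broken and support never replied.", "negative"),
]

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += "Review: The packaging was damaged but the product itself is fine.\nSentiment:"

# `prompt` is then sent to the LLM, which completes the final label.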

The rest of this article explores techniques for passing any type of context data to an LLM and not strictly in-context learning.

Context Length Limits

For most LLMs, the prompt context length has been limited to a few thousand tokens at most.

A token is typically about three-quarters of an English word on average, although this depends on the tokenizer used. The context length limit also differs between models and architectures.
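As a quick illustration, OpenAI's tiktoken library can count how many tokens a string consumes for OpenAI models (other providers use different tokenizers, so counts will differ):

# Counting tokens with tiktoken (OpenAI's tokenizer library).
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5 / GPT-4
text = "Large Language Models are pre-trained on a massive amount of text."
tokens = encoding.encode(text)

print(len(text.split()), "words ->", len(tokens), "tokens")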

Recently, new models have started to increase the token context length limit significantly:

  • OpenAI GPT-4 was launched in beta with an 8k token limit.
  • A version of GPT-4 with a 32k token limit is now being rolled out.
  • Anthropic has just released a version of Claude with a 100k context limit.

A 100k context limit translates into approximately 75k words for Claude (depending on the model and tokenizer, this estimate can vary for other LLMs).

To understand the significance of this, consider how much data 75k words can represent. Some shorter novels have fewer than 75k words: Brave New World by Aldous Huxley has around 63k words, while Nineteen Eighty-Four by George Orwell, at around 89k words, is not far beyond that limit.

Therefore you could, in theory, input the entire text of Brave New World in a single prompt to Anthropic's Claude LLM and immediately ask the model questions about the novel. The responses would then be grounded in the novel's text. (Note that this is an assumption; I haven't tried this, and one would need to be careful of potentially high API costs.)

By picturing the amount of text in such novels, we can imagine what other types of data could be used as context:

  • Full technical documentation of a framework / platform / software application.
  • Large subsets of the source code of a software application.
  • Large structured datasets.
  • Entire PDFs and other unstructured data sources.
  • Long legal contracts, as well as large collections of legal contracts.
  • Lengthy patient medical histories.
  • Student notes, essays and course documents.
  • Collections of research papers.
  • Large collections of software system logs, event logs and bug reports.
  • Daily news, weather reports, financial reports and earning calls.

It is interesting to note that the data types listed above fall into two categories:

  1. Private / corporate data which is not in the public domain and therefore would not have been used for training foundation LLMs.
  2. Real-time / time-sensitive data which cannot be included in the LLM training data because it was produced after the training cut-off date.

For both categories above, fine-tuning might not be ideal due to privacy concerns, cost and the constantly changing nature of the data. In addition, fine-tuning requires advanced ML expertise and resources which many companies may not have access to. Being able to supply these categories of data as context therefore opens up important use cases for LLMs and makes them more accessible.

Once the LLM has been grounded by passing this context data as a system or user message in the prompt API call, the user can 'chat with the data' and ask for summaries. The LLM is now temporarily grounded and personalised, and is able to respond to prompts about information it has not seen in its pre-training data.
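A minimal sketch of this grounding step using the OpenAI Python SDK (v1+); the model name and the file used for the context are placeholders:

# Grounding a chat model by passing context data in the system message.
# Sketch only; assumes the OpenAI Python SDK (openai>=1.0) and that
# OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
context_text = open("context_document.txt").read()  # placeholder context source

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Answer only using the context below.\n\nCONTEXT:\n" + context_text},
        {"role": "user", "content": "Summarise the main points of this document."},
    ],
)
print(response.choices[0].message.content)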

It is worth noting that even though the provided context is now being used to build responses, the underlying model has not actually learnt this context, since its parameters have not been altered.

The problem with passing context via prompts

There is a significant issue with in-context learning using the current closed LLMs.

As mentioned above, the context data has not actually been learnt by the LLM. In addition, requests to such models are stateless, so the context data must be sent to the LLM with every prompt. There is no concept of maintaining a session in which all previous prompts remain in scope. If we are sending 100k tokens with each prompt, the technique becomes infeasible for most use cases.
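To make the statelessness concrete, here is a sketch of a multi-turn exchange; `client` and `context_text` are the placeholders from the previous snippet, and the full message list (context included) is re-sent on every call:

# Because the API is stateless, the whole context and conversation history
# must be re-sent with every request.
messages = [
    {"role": "system", "content": "Use only this context:\n" + context_text},
]

def ask(question: str) -> str:
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep history for next turn
    return answer  # every call re-sends (and pays for) the entire `messages` list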

Many of the cutting-edge models offered by AI companies charge a per-token fee for consumption of their APIs. If a per-token fee is also applied to prompt tokens, then context data running to thousands of tokens will incur significant API usage fees.

Open source or self-hosted models may offer a more cost effective solution, but such lengthy prompts will still require significant compute resources.

Vector Databases as a context source

One solution could be to use a vector database to store any context you would like to make available to an LLM. The vector database acts as an intermediary between the user / application and the LLM whenever data is required for context.

First of all, it is important to understand that embeddings can be used to convert text strings into vectors, which can then be stored in a vector database. This article will not attempt to explain in detail what embeddings are and how they work, since there are many excellent resources online that do a better job of illustrating this.

An oversimplified explanation is that embeddings are vector representations of input text strings, typically taken from a hidden layer of a trained language model. The embedding model is trained so that related text strings end up a small distance apart, while unrelated strings end up far apart. Since the strings have been translated into vectors, the distance between any two of them can be computed (e.g. using Euclidean, dot-product or cosine distance metrics). Taking single words as an example: 'queen' and 'king' would be plotted close to each other, and 'female' would be plotted close to 'queen'. This extends to sentences and longer phrases.
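A sketch of that idea in code, using an OpenAI embeddings endpoint and cosine similarity; the embedding model name is a placeholder and the exact scores will differ between embedding models:

# Cosine similarity between embedding vectors: related strings score higher.
# Sketch only; assumes the OpenAI Python SDK (openai>=1.0) and numpy.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(result.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king, queen, bicycle = embed("king"), embed("queen"), embed("bicycle")
print(cosine(king, queen))    # relatively high: related concepts
print(cosine(king, bicycle))  # lower: unrelated concepts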

Note: different embedding models can output different vectors for the same string, depending on their training and architecture.

Vector databases are ideal because the LLM's query can be used to perform a semantic search: the vectors closest (in terms of distance) to the search query are returned using some form of nearest-neighbour algorithm. These neighbouring vectors represent the data most 'related' to the query. This contrasts with the keyword search we are used to in search engines and SQL databases, which only returns data containing matching keywords.

The process of using a vector database as our context source

An over-simplified version of this process starts by decomposing the text data to be used for context into smaller chunks. For example, a book could be divided into chapters, paragraphs or sentences; tabular data could be decomposed into records or tables; news and weather forecasts could be divided into individual items.
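A minimal chunking sketch that splits a long document on paragraph boundaries into chunks of roughly bounded size (the 1,000-character limit is an arbitrary illustrative choice):

# Split a long text into chunks of roughly bounded size, on paragraph boundaries.
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

book_text = open("context_document.txt").read()  # placeholder source document
chunks = chunk_text(book_text)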

Next, each text chunk is tokenized and passed through an LLM embeddings API, which converts it into an embedding vector (a list of numbers).

Once we have these vectors, they can be inserted into a vector database.
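Continuing the sketch, the chunks can be embedded (reusing the `embed` helper above) and stored in a vector index; FAISS is used here as a simple local stand-in for a hosted vector database:

# Embed each chunk and store the vectors in a FAISS index
# (a local stand-in for a hosted vector database). Sketch only; assumes faiss.
import numpy as np
import faiss

vectors = np.array([embed(chunk) for chunk in chunks], dtype="float32")

index = faiss.IndexFlatL2(vectors.shape[1])  # exact L2 nearest-neighbour index
index.add(vectors)                           # row i of `vectors` corresponds to chunks[i]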

The upfront cost of using an embeddings API to compute the vectors can be significant; in fact, per-token fees often apply to such APIs. The advantage is that we only do this once for the entire dataset. Furthermore, once stored in the vector database, the vectors are persistent and can be reused across different LLM sessions, prompts and even different LLMs.

The challenge now would be to retrieve the context data from the vector database and pass it to an LLM prompt.

There are two approaches that we could try here:

  1. Send the prompt to the vector database first and append the results to the LLM prompt.
    The first approach is to obtain a vector representation of the user / application query and submit it to the vector database index to perform a semantic search. The original string data from the top N results is then extracted and appended to the prompt, and the prompt is sent to the LLM (see the sketch after this list).
  2. Use the LLM to generate queries to the vector database when it 'decides' it needs to.
    A more elegant approach is to let the LLM itself generate queries for the vector database, depending on the prompt received. This way the LLM can 'choose' what information it requires. For example, the prompt could include: "If you do not know the information being requested, formulate a natural language query to look it up in a vector database." Once the LLM has been prompted to generate such queries whenever it needs to look up context data, frameworks can be used to chain the query to the vector database and to feed the results back to the LLM in a follow-up prompt.
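A sketch of the first approach, reusing `embed`, `index`, `chunks` and `client` from the earlier snippets: embed the user's question, retrieve the nearest chunks from the vector index, and prepend them to the prompt.

# Approach 1: semantic search first, then append the top results to the LLM prompt.
import numpy as np

def answer_with_context(question: str, top_n: int = 3) -> str:
    query_vector = np.array([embed(question)], dtype="float32")
    _, neighbour_ids = index.search(query_vector, top_n)  # nearest-neighbour lookup
    retrieved = "\n\n".join(chunks[i] for i in neighbour_ids[0])

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Answer using only the context below.\n\nCONTEXT:\n" + retrieved},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content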

The desired outcome of either approach is that a smaller amount of data is passed to the LLM with each prompt. It also allows models with smaller context lengths to be used. Furthermore, by selecting only the most relevant context to provide with each prompt (instead of the entire context), we ensure that the LLM receives good-quality prompts, reducing the risk of poor response quality.

Conclusion

This article does not attempt to provide tried and tested guides on how to perform in-context learning.

It simply attempts to explore the possibilities of improving LLM performance and adding time-sensitive and personalised capabilities by including relevant context in prompts instead of fine-tuning.

It also explores ideas on how to improve the performance and costs by ‘chaining’ a vector database as part of the process. Whether this approach is effective or not depends on the application and use cases.


Simon Attard

Tech company co-founder, software developer and product manager. Writing about AI, LLMs and software development. www.linkedin.com/in/simonattard/