How to Compress LLM Contexts with LangChain

Matt Kwiatkowski
4 min readFeb 1, 2024

Retrieval Augmented Generation (RAG) is a powerful framework for providing LLMs with the necessary context for a specific use case. Unfortunately, token usage can get quite expensive, especially if your RAG setup returns dozens of larger files. Recent research has also indicated that LLMs may struggle to find the correct answer when it is buried in the middle of the retrieved context (the “lost in the middle” problem).

The solution to this: Contextual Compression


Context Compression in LangChain

In this tutorial, we will set up contextual compression in LangChain using ContextCrunch for efficient compression.

Aside: LangChain

LangChain is a powerful tool for building LLM-centric data pipelines in an intuitive way. This tutorial assumes you are already somewhat familiar with RAG and LangChain; if you’d like to learn how to build a simple LangChain RAG pipeline, check out their documentation.

Prerequisites

In addition to LangChain, you need to install the ContextCrunch-LangChain integration as well as the OpenAI Python SDK. You can install everything with pip install contextcrunch-langchain openai.
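If you’re starting from a clean environment, the installs look roughly like this; the second command is only needed if LangChain itself isn’t already present:

pip install contextcrunch-langchain openai
pip install langchain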

Get Started

We’ll first initialize the OpenAI chat model, and create a mock retriever to stand in for whatever document retriever you’re actually using. This way, we can obtain a consistent output.

from typing import List
from langchain_core.retrievers import BaseRetriever
from langchain_core.documents import Document
from langchain.chat_models import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.callbacks import get_openai_callback
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
import requests

gpt4_chat = ChatOpenAI(model_name="gpt-4-1106-preview", max_tokens=50, temperature=0, api_key="YOUR_OPENAI_API_KEY_HERE")

class MockBaseRetriever(BaseRetriever):
    documents: List[Document] = []

    def __init__(self, documents):
        super().__init__()
        self.documents = documents

    def get_relevant_documents(self, *args, **kwargs):
        return self.documents

Next, we’ll retrieve a text sample emulating a large RAG QA result from the lost-in-the-middle dataset and convert it into separate LangChain Documents. Finally, we’ll wrap the documents in our MockBaseRetriever from before. For the sake of this tutorial, I’ve extracted a single example text from the dataset, available at https://raw.githubusercontent.com/Speuce/Blogs/master/2024/rag_text.txt

  • Aside: Documents are a LangChain wrapper object that stores the core text of a RAG document along with associated metadata; see the snippet below.
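For illustration, a Document can be constructed by hand; the metadata keys here are purely made up:

from langchain_core.documents import Document

doc = Document(
    page_content="The Mandate of Heaven is the right to rule...",
    metadata={"source": "rag_text.txt"},  # arbitrary metadata, illustrative only
)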
url = 'https://raw.githubusercontent.com/Speuce/Blogs/master/2024/rag_text.txt'

response = requests.get(url)
text = response.text
documents = [Document(page_content=content) for content in text.split('\n\n')]
retriever = MockBaseRetriever(documents=documents)

Prompt Template and Question

Let’s define a prompt template to use, as well as the corresponding question.

prompt_template = ChatPromptTemplate.from_template(
"""
Write a high-quality answer for the given question using only the provided search results.{context}
Question: {question}
Answer:
"""
)
question = "in the dynastic cycle what is the right to rule called"
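If you want to sanity-check what the rendered prompt will look like before calling the model, you can format the template directly. This step is purely illustrative and isn’t part of the chain:

# Render the template with the full retrieved text and preview the start of it
messages = prompt_template.format_messages(context=text, question=question)
print(messages[0].content[:500])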

Baseline Usage & Performance without Compression

To get a baseline, we’ll first plug this into GPT-4, and measure the corresponding token usage.

We’ll first define a pipeline using LangChain Expression Language (LCEL):

rag_chain = (
    prompt_template
    | gpt4_chat
    | StrOutputParser()
)

Then we’ll wrap the invocation in get_openai_callback, which is LangChain’s way of tracking token usage:

with get_openai_callback() as cb:
    result = rag_chain.invoke({"question": question, "context": text})
    print(f'Result: {result},\n callback: {cb}')
    original_prompt_cost = cb.total_cost

Result:

Result: In the context of the dynastic cycle, particularly as it pertains to Chinese history and philosophy, the right to rule is called the "Mandate of Heaven" (Tianming). This concept held that the Emperor was chosen by Heaven—the,
callback: Tokens Used: 3912
Prompt Tokens: 3862
Completion Tokens: 50
Successful Requests: 1
Total Cost (USD): $0.04012

The result is correct! “Mandate of Heaven” is what we’re looking for here. Unfortunately, nearly 4,000 tokens were used for this single request. Let’s look at reducing that.

Context Compression

Now that we have a set of documents retrieved, we can get to the meat of prompt compression.

First, we import and instantiate a ContextCrunchDocumentCompressor. Make sure you have your ContextCrunch API key, which you can get at https://contextcrunch.com/console/keys

from contextcrunch_langchain import ContextCrunchDocumentCompressor
cc_compressor = ContextCrunchDocumentCompressor(compression_ratio=0.9, api_key="YOUR_CONTEXT_CRUNCH_API_KEY_HERE")

Next, we will wrap the document compressor around the retriever from earlier using LangChain’s ContextualCompressionRetriever. If you're using any other retriever (such as that from a vector DB), you would similarly wrap that retriever in the ContextualCompressionRetriever as well.

from langchain.retrievers import ContextualCompressionRetriever
contextcrunch_compression_retriever = ContextualCompressionRetriever(base_compressor=cc_compressor, base_retriever=retriever)
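Optionally, you can invoke the compression retriever on its own to inspect the compressed documents before wiring it into a chain; retrievers are Runnables, so invoke works directly:

# Optional sanity check: fetch and compress the documents for our question
compressed_docs = contextcrunch_compression_retriever.invoke(question)
print(f"Returned {len(compressed_docs)} compressed document(s)")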

Next, we create a modified pipeline that uses the compression retriever we created. In order to format the output documents as a single text block to feed into the prompt, we also declare a format_docs function to join the documents with newlines.

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain_2 = (
    {"context": contextcrunch_compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt_template
    | gpt4_chat
    | StrOutputParser()
)

Finally, we use the chain (this time only invoking it with the question, since the context comes from the retriever), and compare token usage.

with get_openai_callback() as cb:
    result = rag_chain_2.invoke(question)
    print(f'Result: {result},\n callback: {cb}')
    new_prompt_cost = cb.total_cost

Result:

Result: In the dynastic cycle, the right to rule is called the "Mandate of Heaven." This concept originated in ancient China and was used to justify the rule of the Emperor. According to this belief, Heaven, which was a supreme force of,
callback: Tokens Used: 438
Prompt Tokens: 388
Completion Tokens: 50
Successful Requests: 1
Total Cost (USD): $0.00538

As you can see, the correct answer of “Mandate of Heaven” is still present. The prompt compression worked!

Cost Savings

Now we can calculate how much we saved as a percentage of the original cost.

cost_savings = (original_prompt_cost - new_prompt_cost) / original_prompt_cost
print(f"Cost savings: {cost_savings*100}%")

Result:

Cost savings: 86.5902293120638%

Conclusion

Hopefully you now see the power of contextual compression and know how to integrate ContextCrunch into a LangChain RAG pipeline to save significantly on GPT-4 usage!

The full Python Notebook for this tutorial is available at: https://github.com/Speuce/Blogs/blob/master/2024/compress_rag_langchain.ipynb


Matt Kwiatkowski

AI scholar, tech lover, software developer and macroeconomics enthusiast.