Cutting through the quagmire of using RAG with LLMs — introducing BRAG
If you’re reading this then I assume you’re like me in that you love AI, you love the opportunity it offers, and you are blown away by what we can achieve with GenAI straight out of the box. Add to that one-shot / few-shot learning, prompt engineering and the inclusion of Retrieval Augmented Generation (RAG), and the capabilities and the power got greater; but, lest we forget, the responsibility went through the roof too.
In a previous article, I talked about LLM / GPT Prompt Engineering and going beyond the basics, and how you can mitigate against confusion. In recent months, though, I find my head spinning with all of the new types of RAG that seem to be coming out:
- Using multiple chunk sizes for embeddings, in simple or more convoluted ways (e.g. using a small chunk size to search against and then expanding to that chunk's parent)
- Using Knowledge Graphs to hold the data
- Asking the LLM to validate the context of the results, or summarize, or process them in some other way
- etc
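The first of those, small-to-big chunking, can be sketched in a few lines. This is purely illustrative (the function names and naive keyword matching are my own, not the LangChain or Llama Index API): you search over small child chunks, but hand the LLM the larger parent chunk the hit came from.

```python
# Illustrative sketch of "small-to-big" retrieval: match against small
# child chunks, then expand each hit to its larger parent chunk.

def make_chunks(text, parent_size=200, child_size=50):
    """Split text into parent chunks, each mapped to smaller child chunks."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    children = []  # (child_text, parent_index) pairs
    for p_idx, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append((parent[j:j + child_size], p_idx))
    return parents, children

def retrieve_parent(query_term, parents, children):
    """Naive keyword match over child chunks; return the parent of the first hit."""
    for child_text, p_idx in children:
        if query_term.lower() in child_text.lower():
            return parents[p_idx]
    return None
```

A real implementation would use embeddings and cosine similarity over the child chunks rather than a keyword match, but the parent-expansion step is the same.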
Bottom line, these are all good ideas. They work. They are supported by some of the best libraries out there, like LangChain and Llama Index. And that’s not to mention the newer models with larger context windows that basically make a lot of RAG irrelevant, insofar as you can pass in hundreds of pages to look for your needle in a haystack. So what’s the problem?
Well, for me, I am a firm believer in the K.I.S.S. principle: Keep It Simple, Stupid! So can we maybe take things back a step? Can we cut back on all the extended LLM calls with unnecessary tokens when we know the exact section of the document we’re interested in? Can we save the multiple LLM calls? Can we have a more Balanced RAG? Or, as I call it, BRAG.
(OK, I do appreciate the irony of commenting on the number of articles that come out constantly about approaches to RAG, and then being guilty of doing the same, but hey ho :) ).
So what am I thinking, specifically?
Well, what about this: I firmly believe that for the majority of Q&A activities (certainly those I have done) the question I am asking will largely be answered in specific sections of the document. More to the point, I will want to ignore things I find in irrelevant sections. So how about we:
- Pre-parse the documents to extract sections, but do not bother chunking and embedding them
- Where we know we want information from a specific section(s), pass that section straight into the prompt and skip the embeddings, cosine similarity etc.
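The idea in those two bullets can be sketched like so. Everything here is hypothetical (the section texts, the keyword-to-section routing table, and `build_prompt` are all mine): sections are extracted once, and a question is matched to a section by a simple lookup, with no embeddings involved.

```python
# Illustrative sketch of the BRAG idea: route a question to a known
# section and build a prompt from only that section's text.

# Pre-extracted sections, e.g. produced by a one-off parsing pass.
SECTIONS = {
    "abstract": "(abstract text of the paper)",
    "authors": "(author list from the first page)",
}

# Hand-maintained mapping of question keywords to sections.
ROUTES = {"license": "abstract", "author": "authors"}

def build_prompt(question, sections, routes):
    """Build a prompt from the first section whose keyword appears in the question."""
    for keyword, section_name in routes.items():
        if keyword in question.lower():
            context = sections[section_name]
            return ("Use the following context to answer the users question:\n"
                    "-----\n" + context + "\n-----\n\n" + question)
    return question  # no route matched: fall back to the bare question
```

The routing here is deliberately dumb; the point is that when you already know which section answers which kind of question, the retrieval step collapses to a dictionary lookup.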
I have created a Jupyter Notebook if you want to follow along: https://gist.github.com/mkeywood1/063fcfa3aa996b6b6d8d1eaab1197df5. If you do follow along, you will need to set your system variables for your OpenAI Key (this is using Azure OpenAI but it’s easy enough for you to change to regular OpenAI).
In this we will use the wonderful Mixtral paper and test a couple of things with conventional RAG and with the BRAG approach.
OK, so first up we set up Llama Index, OpenAI etc. All of this is in the notebook and is just standard stuff, so I won’t show it here. The document is ingested using the standard Llama Index PDF reader. So let’s get into the interesting functions:
def ask(question, context="", system=""):
    # Set default System Message
    if system == "":
        system = """You are an expert in Artificial Intelligence Research Papers.
Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
"""
    # Prepend context if used
    if context != "":
        question = "Use the following context to answer the users question:\n```\n" + context + "\n```\n\n" + question
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",
        messages=[{"role": "system", "content": system}, {"role": "user", "content": question}],
        temperature=0.0,
        max_tokens=500,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None)
    return response['choices'][0]['message']['content']
This first one, ask, is simply a wrapper around calling OpenAI GPT-3.5 Turbo, including a System Prompt about looking through research papers. It also accepts a context variable which is included in the prompt as necessary.
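The prompt assembly inside ask can be pulled out as a pure function so its behaviour is easy to see (and test) without an API call. This is a restatement of the logic above, not extra functionality; the names FENCE, DEFAULT_SYSTEM and build_messages are mine.

```python
# Mirror of ask's prompt assembly: the returned list is what gets handed
# to openai.ChatCompletion.create as the messages argument.

FENCE = "`" * 3  # the triple-backtick delimiter ask wraps the context in

DEFAULT_SYSTEM = """You are an expert in Artificial Intelligence Research Papers.
Use the following pieces of context to answer the users question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
"""

def build_messages(question, context="", system=""):
    if system == "":
        system = DEFAULT_SYSTEM
    if context != "":
        question = ("Use the following context to answer the users question:\n"
                    + FENCE + "\n" + context + "\n" + FENCE + "\n\n" + question)
    return [{"role": "system", "content": system},
            {"role": "user", "content": question}]
```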
def extract_section(documents, section_name, debug=False):
    section_page = ""
    section_text = ""
    for idx, page in enumerate(documents):
        if section_text == "" and section_name in page.text.lower():
            if debug: print(idx)
            context = page.text
            # Include the next two pages in case the section spans a page break
            if idx < len(documents)-2:
                context += "\n" + documents[idx+1].text
                context += "\n" + documents[idx+2].text
            answer = ask(f"Does the above have the section called '{section_name}' or similar, and does it, in detail, explain the {section_name}?", context)
            if answer.startswith("Yes"):
                answer = ask(f"\n-----\nWhat is the {section_name} in the document? Return everything in this section, up to the next heading. Do not interpret it, give me the verbatim text.", context)
                if debug: print(answer + "\n----------")
                section_page = idx + 1
                section_text = answer
    if debug: print(section_page, section_text)
    return section_text, section_page
In the extract_section function, we do a couple of things:
- We use the section_name we pass in to do a really simple check: we iterate through all pages in the document and see if section_name appears in the lower-cased text of the page
- If it does, we take that page and the two subsequent pages and pass them into a couple of LLM prompts, first to check whether they contain a section named section_name, and if so, to extract that section verbatim
- Finally, we return a tuple of the section text and the page on which it was found
Of course, this is a one-time activity. In reality this would be run once to extract the relevant sections and cache them for future use.
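That caching step could look something like this. It's an illustrative sketch; get_section, the JSON layout and the cache directory name are my own invention, and the extractor argument stands in for a call that wraps extract_section.

```python
# One possible caching layer: extract each section once, store the result
# as JSON keyed by document, and reuse it on later runs.
import json
import os

def get_section(doc_id, section_name, extractor, cache_dir="section_cache"):
    """Return (section_text, page), calling extractor only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{doc_id}.json")
    cache = {}
    if os.path.exists(path):
        with open(path) as f:
            cache = json.load(f)
    if section_name not in cache:
        text, page = extractor(section_name)  # e.g. wraps extract_section
        cache[section_name] = [text, page]
        with open(path, "w") as f:
            json.dump(cache, f)
    return tuple(cache[section_name])
```

On the second and subsequent runs the LLM calls inside extract_section are skipped entirely, which is where the token savings compound.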
So let’s first start to build up a sections variable. For the first section I am actually going to cheat a little and not use the extract_section function, because the section I want, authors, does not have a section heading. Instead we just use the ask function and pass in the first page of the document.
sections = {}
sections["authors"] = (ask("Who are the authors mentioned before the abstract", documents[0].text), 1)
sections["authors"]
OK, that looks good. Now let’s use the extract_section function to extract the abstract section.
sections["abstract"] = extract_section(documents, "abstract")
sections["abstract"]
OK, so let’s see if what we’ve done is of any use.
First let’s look at what license is applicable to this. We’ll start with the Llama Index search:
%%time
query = 'What licenses are mentioned?'
print(query)
answer = query_engine.query(query)
print(answer.response)
Oh that's a little disappointing. It couldn't find anything.
What about if we use just the abstract section?
%%time
ask(query, sections["abstract"][0])
That looks good. Not only did it get the right answer, it was also quicker, because we only used the section of interest in the prompt, and not the k chunks that the semantic search thought would be relevant.
OK another quick check. Let’s ask a question about an author. This author was responsible for one of the papers in the References, but not actually an author of this paper. So asking if they are an author of this paper should say no, right?
%%time
query = 'Is Jacob Austin an author of this paper?'
print(query)
answer = query_engine.query(query)
print(answer.response)
OK, well that’s a little odd. It thinks he was an author of this paper, and it thinks that because the semantic search found him listed as an author, without distinguishing that he was an author of a paper in the References and not of the paper itself.
What about using the sections specifically?
%%time
ask(query, sections["authors"][0])
Well, yes, of course it works and recognises that he is not an author of this paper. And of course it’s quicker, because we only used the section of interest in the prompt, and not the k chunks that the semantic search thought would be relevant.
I personally use this approach a fair bit. I’m not saying it’s better. I’m saying it’s simpler. More Balanced. An alternate approach and another tool in your arsenal.
So, how about it? KISS and BRAG?
Thanks for reading.