Jingle Bell RAG

Published in

Mad Chatter Tea Party

8 min readDec 11, 2023

“A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation.”

TL;DR

Large Language Models (LLMs) are driving change in the new A.I. industrial revolution by creating, erasing and automating work and becoming the core of so many systems and applications we use every day.
If end-user use through APIs, libraries, and services has been made super-easy, it is less straightforward to drop these giants into domain-specific applications.

LLM pros

Ready to use models were trained on a gigantic set of (publicly available) data;
Pretrained models save costs;
Simple integration using API;
Pretrained models has a lower carbon footprint.

LLM Cons

They haven’t internal knowledge base:

No access to internal documents/resources in train (As we mentioned among the pros, LLMs are trained on publicly available datasets, there are vertical markets with very few resources on which an LLM can learn, think for example of medicine, although scientific papers are a huge amount a great many are copyrighted so not available, some niche branches may have little information, …)
No access to domain — specific vocabulary

Fine-Tuning

In machine learning fine — tuning means the “retraining” of a pre — trained model to fit it into a particular context.

“Fine-tuning a large language model involves adjusting and adapting a pre trained model to perform specific tasks or to cater to a particular domain more effectively”

Imagine a model that is used to recognize animals, it will have been trained on billions of photographs of animals, we only need it to recognize dog breeds but we do not have billions of photos of dogs.
We can fine — tune it, i.e., keep the part of the neural network that recognizes the “elementary parts” of the image such as triangles, outlines, shading, … and train only the final part, i.e., recognizing the type of dog.
Something very similar happens with the language, you take a pre-trained LLM, a set of documents with private or industry information, and hone in on that.

Supervised Fine-Tuning on LLMs. Source: Neo4j

Described in this way it all sounds perfect but in the community of researchers, developers and LLM users it’s a very open discussion, I personally do not lean toward this solution because:

Unclear if it will learn your internal data
Looses instruction capabilities (i.e. «Give me a bullet points list of…»)
Host an LLM is costly (although lately with GPTs it’s a lot cheaper but then you would have to talk about privacy and that’s something I don’t want to get into)
Needs to be done for every new data (every market, every customer, every dataset version)

Retrieval — Augmented Generation

“Retrieval Augmented Generation (RAG) means fetching up-to-date or context specific data from an external database and making it available to an LLM when asking it to generate a response”

The internal knowledge of the LLM is not relied upon to produce responses. Instead, the LLM is used only to extract relevant information from the submitted documents and summarize it.
To schematize this process, one can use the example of the library:

Imagine we are a pre-trained LLM, we have a lot of knowledge but we are not omniscient, however, we can extract new knowledge from a context;
We go to the library and ask the librarian what we are looking for;
He will return the books that might deal with the topic we are interested in;
After reading them we can formulate our response.

In a RAG pipeline, the librarian is a Semantic Search system. In the Cheshire Cat this system is based on Qdrant vector database and Dense Retrievers.
I wrote about it on the Cheshire Cat Blog here and here.

The RAG pipeline is a conceptually very simple but very powerful system with many pros:

application retrieves the most relevant data for the LLM directly at the time of response generation. So no fine-tuning, no need for us “by hand” pass the context into the LLM prompt;
reduces the likelihood of hallucinations. Giving it a focused context makes it more difficult for the LLM to “brag knowledge” as is often the case when using APIs directly for questions about particular contexts;
the LLM can provide its own sources. A fine — tuned LLM (or even the basic one) loses this ability;
possibility to use lightweight, open source retrievers;
the system is easily scalable. By not having to do fine-tuning but only changing the domain knowledge base the system is replicable N times with very little effort.

RAG also has cons but let’s see how to overcome them easily.

Better context for better answers

The retriever plays a key role in RAG because this is where the documents in which our LLM will have to look for information to respond start from, so providing a good context is the basis for having a good RAG system and application.

Chunking

When sentences are uploaded to the Vector DB they are first chunked. Responses will be processed from these sections.
There is no one chunking strategy that is better than the others, for example for Q&A applications small chunks with overlap are better, for summaries longer chunks have better performance.
One strategy I have tested successfully is multiple chunks, for the same sentence you have chunks of different size and overlap. This exponentially increases the size of the Vector DB so be careful!

Metadata

When we create a vector to store into the Vector DB, we can associate some metadata with it.
In the search phase these are exploited to filter the possible points between which to apply similarity.

Multi Indexing

Retrieve documents from different collections.
In Cheshire Cat we use three collection in Long Term Memory.

Episodic Memory, contains an extract of things the user said in the past;
Declarative Memory, contains an extract of documents uploaded to the Cat;
Procedural Memory, contains the set of Python functions that defines what the Cat is able to do.

Hybrid Retrieval

Like the LLM, the retriever in the train phase also stores knowledge from public datasets so it may not recognize domain terms, always using the example of medicine we can think of the various acronyms, molecules, … so how to solve knowing that even for these models fine — tuning does not guarantee that the model will be able to store knowledge in the new documents?

Hybrid Retrieval is a technique that merges embeddings-based Dense Retrievers with classic keyword-based Sparse Retrievers like BM25.

Why does this system bring improvements? Because alongside the Dense Retriever that finds results based on meaning we have the Sparse Retriever that works directly on the word, so in case the embedder does not know specific terms they will still be retrieved from the keyword search.
In this case, the most complicated part lies in choosing the join function because the two lists of documents are not comparable by score. The easiest method to apply is to do a concatenation of the results, eliminating duplicates, and have the score recalculated which will then give the final sorting to another type of models, Rankers.

Retrieve and Re-Rank

The retrieve and rerank (R&R) approach is a search technique that combines two steps: retrieving a large set of relevant documents and then reranking those documents to identify the most relevant ones.
The first step in the R&R approach is to use a retrieval system to find a large set of documents that are likely to contain the information the user is looking for.
The second step in the R&R approach is to use a reranker to reorder the retrieved documents based on their relevance to the user’s query. This can be done using a variety of techniques, such as machine learning, natural language processing, knowledge graphs or simple reordering.

Hypothetical Document Embeddings (HyDE)

The HyDE method is based on the assumption that in order to find relevant documents, it is better to feed the semantic search system a similar document (albeit a false one with incorrect information) rather than a query.
To do this at the beginning of the pipeline an LLM is inserted that taken the question generates this hypothetical document that is then used to search for relevant documents.
You can use HyDE on Cheshire Cat installing the plugin.

RAG-fusion

Very similar to HyDE but in this case the LLM at the beginning of the pipeline does not create a hypothetical document but new queries from the initial one.
All queries are passed to the semantic search system, the result lists are eventually merged using a function, usually Reciprocal Rank Fusion is recommended.

Lost In The Middle

In the summer of 2023 an article is published and had a lot of hype, not least because of the really catchy name, “Lost in the Middle”.
In summary in the paper the authors show that LLMs perform best if the relevant information is at the beginning or the end of the context getting lost in the middle part, e.g. if given a query we get 40 documents returned the LLM will perform very well on the first and last 10 and less in the 20 in between.
If you want to learn more check our article about LITM.

Prompt Enineering

Another hot topic is prompt writing, here you need not only technical expertise but also domain expertise because you need to know what to ask as well as how to ask for it.
The prompt stops not only at the request for the information but also how that information should be provided (one thing is a RAG system for a chat another for extracting information to write to a relational DB or produce structured documents).
A good prompt can save a lot of time and resources; it can also correct model behavior by mitigating bias.