LLM Architectures, RAG — what it takes to scale

Daniel Belém Duarte
8 min read · Oct 21, 2023


AI is the topic of the moment; it is mentioned in every sector and line of business, from generating images for a designer, to drafting a marketing campaign, to advanced chatbots that support customer service.

These use cases are supported by AI, specifically Generative AI (GenAI), a subset of Artificial Intelligence. GenAI uses advanced Machine Learning algorithms to create general-purpose models named Large Language Models (LLMs). These models were trained on vast amounts of unlabeled data, allowing them to learn how humans communicate. In a nutshell, the model's objective is to predict the next most probable token; for simplicity, let's assume a token is a word. When we ask the model to complete a sentence, it uses the question as context and calculates the probability of the next words in that context.

Before jumping into the details of this article's headline, I want to set a baseline understanding of a few topics.

Traditional Machine Learning Models vs Large Language Models (LLMs)

LLMs are created using Machine Learning algorithms, which have been a topic of discussion since the 1950s, when the first ML algorithms were described and implemented.

Traditional machine learning models are mostly focused on solving a particular use case, in contrast to a generic model for generic tasks. They are trained with either unlabeled data (clustering, k-means, etc.) or labeled data for a particular use case. The labeled data provides the model with samples of valid occurrences: it "sees" so many confirmed examples that, when given a new example it has never "seen", it can accurately solve the challenge.

An example would be to train a model on thousands of images of cars (labeled data that we knew contained only cars); when provided with a new image of a car outside the training data, it would be able to classify it as a car and not as a bicycle. From the training data, the model extracts the features that characterize a car, creating a mathematical reference of what a car is.

On the other hand, LLMs are still machine learning models, but instead of being trained on pre-classified, labeled data, they were trained on unlabeled data. This means text documents without any categorization or features, just billions of lines of text like the ones I'm writing in this article. These models use an architecture called the Transformer, which was the breakthrough that made it possible to train them efficiently.

As an example, GPT (the LLM behind ChatGPT) was trained on text from Wikipedia, Reddit, Quora, BookCorpus, WebText, CommonCrawl, and potentially every other good source of data you can think of on the internet.

Traditional ML Models vs GenAI Models

A traditional ML model is trained with a specific purpose or task in mind, while a Large Language Model is trained to understand how humans communicate in natural language, and is therefore capable of addressing tasks generically.

Large Language Models are not databases of knowledge; they generate text based on probabilities and are therefore not always reliable. If a topic is widely covered in the training data, there is a high probability the model will write correctly about it; if not, the model may provide answers that are completely wrong even though they seem plausible.

LLM concepts to know:

  • Prompt — the input (question) to the LLM; the better the prompt, the better the likelihood of a good response
  • Temperature — the parameter that controls the model's creativity. A higher temperature means more creative responses, and a lower temperature means more factual responses.
  • Tokens — the units the model uses to process prompts and responses. They can be seen as pieces of words, and each LLM can handle a different number of tokens as context.
  • Hallucinations — the term used when a model creates a response that is not real or completely different from what was expected.
  • Embeddings — an embedding is a vector, a mathematical representation of a word (or piece of text) as numbers. Embeddings are used to identify similarities between words or phrases: the distance between two vectors measures their relatedness (see the sketch after this list).
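
To make the embeddings idea concrete, here is a minimal sketch of how cosine similarity measures the relatedness of two vectors. The vectors below are made-up, low-dimensional examples (real embedding models return hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Relatedness between two embedding vectors: close to 1.0 = very similar, close to 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-dimensional embeddings for illustration only.
emb_car = np.array([0.81, 0.10, 0.05, 0.30])
emb_vehicle = np.array([0.78, 0.15, 0.07, 0.28])
emb_banana = np.array([0.02, 0.90, 0.40, 0.01])

print(cosine_similarity(emb_car, emb_vehicle))  # high score: similar meaning
print(cosine_similarity(emb_car, emb_banana))   # low score: unrelated meaning
```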

Limitations on the off-the-shelf LLMs

Moving from a PoC implementation to a production-ready solution requires a lot more than simply invoking LLM APIs and writing a few prompts.

Deploying use cases that leverage Generative AI LLMs brings up some challenges, such as:

  • input context is limited (for example, GPT-4 has 8k and 32k token limits)
  • a larger input context does not mean better results — scaling is a problem with very large input contexts, which can produce more hallucinations
  • a larger input context increases the model costs — more context means more tokens, hence a higher cost (see the token-counting sketch after this list)
  • not knowing the exact data the model was trained on
  • not knowing which documents/information were used to produce a response
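
To put the token limit and cost bullets into perspective, here is a minimal sketch using OpenAI's tiktoken tokenizer. The price per 1k tokens below is a hypothetical figure, not a quoted rate; check your provider's current pricing:

```python
import tiktoken  # OpenAI's tokenizer library

# cl100k_base is the encoding used by the GPT-4 family of models.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached contract and list the termination clauses."
context = "..."  # the retrieved documents you plan to include in the prompt

num_tokens = len(encoding.encode(prompt + context))
print(f"Prompt size: {num_tokens} tokens")

# Hypothetical price per 1k input tokens, for illustration only.
PRICE_PER_1K_INPUT_TOKENS = 0.03
print(f"Estimated input cost: ${num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.4f}")
```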

These challenges can be addressed with a proper LLM architecture, which requires more than LLM API calls or prompt engineering.

90% of the work to build a production-ready use case is not on the core LLM but rather on data ingestion, data storage, data categorization, validation guardrails, caching, and a pipeline to orchestrate all of these.

RAG — Retrieval Augmented Generation Architecture

One of the most promising development patterns that address the limitations above is called RAG — Retrieval Augmented Generation, especially for use cases that require:

  • Complex and Large Summarizations
  • Knowledge Base Q&A
  • Particular in-house and recent documentation that the model does not know

High-Level RAG Architecture

In a nutshell, this pattern takes a query (the ask) as input, fetches additional context information that the LLM is not aware of (for example, about the recent war in Israel), and provides it as part of the prompt sent to the LLM. With this approach, when the question reaches the LLM, it already includes the relevant context data for the model to provide an accurate answer.

Even in scenarios where the LLM may already have the answer, sending the right context as input in the prompt reduces the risk of hallucinations.

These contexts are typically obtained from a Vector Database — the trick is how to properly load this Database using the Data Ingestion Pipeline and how to semantically search for the relevant data.

The vector database holds all relevant supporting documentation split into chunks, with their respective vector embeddings for search. If we have a 100-page document to use as input, we break it into chunks, by paragraph or by chapter, and calculate an embedding for each chunk. A similarity search using the question then fetches only the relevant chunks of data from the vector database and passes them dynamically as part of the prompt. Chunks of the document that are not related to the question are not considered.
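
Here is a minimal sketch of that flow. The embed() function below is a toy hashing-based stand-in purely so the example runs end to end; a real pipeline would call an embedding model and store the vectors in a proper vector database:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy hashing-based embedding for illustration; replace with a real embedding model."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    return vec

def chunk_document(document: str, max_chars: int = 1000) -> list[str]:
    # Naive chunking by paragraph; production pipelines usually chunk by tokens, with overlap.
    chunks, current = [], ""
    for paragraph in document.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    # Similarity search: only the chunks most related to the question make it into the prompt.
    q_vec = embed(question)
    def score(chunk: str) -> float:
        c_vec = embed(chunk)
        return float(np.dot(q_vec, c_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(c_vec) + 1e-9))
    return sorted(chunks, key=score, reverse=True)[:k]
```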

A well-designed prompt is typically composed of different sections to interact with the LLM:

Prompt structure with an example for a summarization task
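
As an illustration, a summarization prompt of this kind can be assembled from an instruction, the retrieved context, and the user question. The template below is a hypothetical sketch; the section names and wording are illustrative, not a standard:

```python
# Hypothetical prompt template for a summarization task.
PROMPT_TEMPLATE = """You are an assistant that answers strictly from the provided context.

### Instruction
Summarize the key points relevant to the user's question.
If the context does not contain the answer, say you don't know.

### Context
{context}

### Question
{question}
"""

def build_prompt(question: str, context_chunks: list[str]) -> str:
    # The retrieved chunks are injected dynamically as the context section.
    return PROMPT_TEMPLATE.format(context="\n\n".join(context_chunks), question=question)
```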

The challenge becomes how to fetch data efficiently so it can be passed dynamically as the context in the prompt!

RAG Architecture’s main challenges:

An efficient fetch of data relies on the similarity search between the question and the data available in the vector database. Some of the main challenges are:

  • Identifying the best similarity algorithm for the vector database
  • Getting humans to ask well-composed questions
  • Loading and tagging the right data into the vector database

Why are well-composed questions important?

The success of a great context search starts with the user query; if the query is not great, the result may be mediocre even with an amazing technical implementation. Humans tend to write mediocre queries. Here is why:

  • they make typos
  • they use a limited vocabulary
  • they ask ambiguous questions
  • they ask very short and simplistic questions
  • they leave context out of their questions

Whenever this happens, it reduces the chances of a great similarity search for context!

To address this, instead of using a single query to search against the vector database, we can leverage the LLM to generate variations of the user query to expand it. This means we can search for context not just with the one user query but with as many queries as the LLM created, with any typos removed. As a next step, we can run the intermediate results through a ranking system to select the top chunks of data for our final answer.

Multiple Query Generation for Context Search
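
Here is a sketch of that idea, assuming a hypothetical call_llm() helper that wraps whichever chat completion API you use, and reusing the top_k_chunks() retrieval function from the earlier sketch:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: call your LLM provider's chat/completions endpoint here."""
    raise NotImplementedError

def expand_query(user_query: str, n_variations: int = 3) -> list[str]:
    # Ask the LLM to rewrite the question, fixing typos and making it more specific.
    prompt = (
        f"Rewrite the following question in {n_variations} different ways, "
        f"correcting any typos and making it more specific. One rewrite per line.\n\n"
        f"Question: {user_query}"
    )
    variations = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    return [user_query] + variations

def retrieve_with_expansion(user_query: str, chunks: list[str], k: int = 3) -> list[str]:
    # Search once per query variation, then rank the union of results and keep the top chunks.
    scores: dict[str, float] = {}
    for query in expand_query(user_query):
        for rank, chunk in enumerate(top_k_chunks(query, chunks, k)):
            # Simple rank-based scoring; a real system might use a cross-encoder re-ranker instead.
            scores[chunk] = scores.get(chunk, 0.0) + 1.0 / (rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```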

Fetching data efficiently

The data ingestion pipeline for a RAG architecture should be no different from traditional ML and data engineering pipelines. Solutions such as Airflow for orchestration and Apache Hadoop or Kafka for data ingestion and pre-processing can be used.

Whenever RAG architecture is mentioned, a vector database always comes along to store the data (vectors) and support an efficient search for context. What must be understood is that the secret sauce is not the vector database itself but rather how and what to load into the vector store, regardless of which one it is. Some of the most mentioned vector stores are:

  • Pinecone
  • Chroma
  • Postgres pgvector
  • Elasticsearch
  • Cloud providers vector engines (Azure Vector Search; AWS OpenSearch; Google Vertex AI Vector Search)
  • Others

The most important thing is to build a robust data ingestion pipeline, not to pick a particular type of vector store. The store could be almost anything, as long as the data is properly chunked and categorized.

In the RAG architecture, some of the data ingestion components that must be in place are:

  • Data loaders (to process PDFs, Word documents, images with OCR, audio files, etc.)
  • Document pre-processing (cleaning up garbage from files, for example non-relevant footnotes)
  • Document tagging (time, context keywords, etc.)
  • Document duplication detection
  • Document chunking (split large documents into smaller files in a logical way)
  • Document storage (store the chunks and their embeddings)
  • Document indexing
  • Document caching
  • Data retrieval algorithms

These must be considered as part of the architecture to achieve better results.
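
To make those components concrete, here is a minimal sketch of how they might be wired together, reusing the chunk_document() and embed() helpers from the earlier sketch. Every name here is illustrative, not a specific library's API:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    tags: dict            # e.g. {"source": "contracts", "year": 2023}
    embedding: list[float]

def clean(text: str) -> str:
    # Placeholder cleanup: trim each line; real pipelines strip footers, headers, OCR noise, etc.
    return "\n".join(line.strip() for line in text.splitlines())

def tag_document(doc_id: str, text: str) -> dict:
    # Placeholder tagging: in practice, derive timestamps, keywords, business domain, and so on.
    return {"source": doc_id, "length": len(text)}

def ingest(documents: dict[str, str], vector_store: list[Chunk]) -> None:
    """Hypothetical end-to-end ingestion: load -> clean -> de-duplicate -> tag -> chunk -> embed -> store."""
    seen = set()
    for doc_id, raw_text in documents.items():
        text = clean(raw_text)
        fingerprint = hash(text)
        if fingerprint in seen:                    # duplicate detection
            continue
        seen.add(fingerprint)
        tags = tag_document(doc_id, text)          # document tagging
        for piece in chunk_document(text):         # document chunking, as in the earlier sketch
            vector_store.append(Chunk(doc_id, piece, tags, list(embed(piece))))  # embed + store
```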

Not everything in a RAG architecture is solved by simply loading data into a vector database! Focus on the data ingestion pipelines, efficient categorization, and chunk splitting, along with the right approach for data segmentation.

In part 2 of this post, I will share some code on how to create a RAG architecture! I hope you'll wait for it :)

Thank you,

Daniel
