RAG-ing Success: Guide to choose the right components for your RAG solution on AWS

Vikesh Pandey
10 min read · Jul 5, 2023


With the rise of Generative AI, Retrieval Augmented Generation (RAG) has become a very popular approach for harnessing the power of Large Language Models (LLMs). It simplifies the overall Generative AI workflow while reducing the need to fine-tune, or even train, an LLM from scratch. Some of the reasons why RAG has become so popular are:

  • You can reduce hallucinations, where the model tries to be “creative” and provides false information by making things up.
  • You can always get the latest information/answer on a topic or question without worrying about when the foundation model’s training cut-off was.
  • You can avoid spending time, effort and money on the complex process of fine-tuning, or even training from scratch, on your data.
  • Your architecture becomes loosely coupled.

The diagram below depicts a simplified component architecture of RAG:

Credits: Vikesh Pandey

The diagram above has the following components:

  1. An embedding component converts all the raw text into embeddings (vector representations of text).
  2. A vector storage and retrieval engine stores all the vector information and provides quick retrieval.
  3. The user submits a query via a chat interface. The query gets converted into an embedding by the same embedding component.
  4. The query embedding is searched against the vector engine. The vector engine retrieves the relevant context (chunks which potentially contain the answer) and sends the query and context to the LLM.
  5. The LLM reads the context together with the query and provides a precise answer. A minimal orchestration sketch of this flow follows below.
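To make the flow concrete, here is a minimal orchestration sketch in Python. The embed, vector_search and llm_generate functions are hypothetical placeholders standing in for whichever embedding model, vector store and LLM you end up choosing from the sections below.

from typing import Callable, List

def answer_question(
    query: str,
    embed: Callable[[str], List[float]],                      # placeholder: your embedding component
    vector_search: Callable[[List[float], int], List[str]],   # placeholder: your vector store
    llm_generate: Callable[[str], str],                        # placeholder: your LLM
) -> str:
    # Step 3: convert the user query into an embedding
    query_embedding = embed(query)

    # Step 4: retrieve the chunks whose embeddings are closest to the query
    context_chunks = vector_search(query_embedding, 3)

    # Step 5: let the LLM read the context together with the query
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n".join(context_chunks) + "\n\n"
        "Question: " + query + "\nAnswer:"
    )
    return llm_generate(prompt)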

There are many ways in which each of the embedding, vector store and LLM components can be designed. Currently, there is a lack of clear guidance on which tool/model/library to use for each of them.

This blog is an attempt to provide guidance on how to select the right tool for each of the components of RAG while designing the solution on AWS.

NOTE: Since it’s hard to cover every tool out there for each of the components, the tools here are chosen based on their popularity for implementing RAG on AWS.

Another Note on cost estimates:

The costs mentioned here are reference values based on default instance recommendations; they would change if a different instance is chosen. Also, single-AZ deployments have been assumed here; for production workloads, the recommendation would be a multi-AZ deployment. The calculations are based on standard on-demand instance pricing; the cost comes down if you use reserved instances or any other discount plan.

With that said, let’s begin the comparative analysis, starting with the embedding component.

Embedding component

This component is responsible for converting raw text into vectors, also known as embeddings. So let’s jump into the options.

  • BERT (base uncased): The first embedding model chosen for comparison is BERT (base uncased). This model can be deployed directly from SageMaker JumpStart. It’s known to perform well at generating contextualized embeddings. These models are around ~500 MB in size, so running them on Amazon SageMaker can easily be done on a modest GPU instance such as ml.p2.xlarge. There are many fine-tuned variants available, both for multi-lingual use-cases and in different parameter sizes. It works well in cases where multi-lingual support is needed and where teams are looking for more flexibility in the choice of models.
Credits: Vikesh Pandey
  • GPT-J: The most popular option referenced in many RAG-based blogs on AWS. Again, it can be hosted on SageMaker with a single click from SageMaker JumpStart. It’s a fairly large model, hence requires more powerful compute, but provides really high-quality embeddings. In fact, it’s one of the top-performing embedding models as per published evaluation results. A deployment sketch appears after this list.
Credits: Vikesh Pandey
  • Hugging Face sentence-transformers: Another very popular sentence embedder, provided by the Hugging Face library and deployable via SageMaker JumpStart. One of the smallest, yet very powerful, options on offer. Due to its small size, it can be quite cheap to run on a SageMaker endpoint, and it produces good-quality embeddings. Apart from this, there is a variety of fine-tuned versions available, including ones trained on domain-specific data.
Credits: Vikesh Pandey

NOTE: This model can very well run on AWS Lambda as well, which can reduce the cost significantly. The reason for showing it as a SageMaker endpoint is to keep the architecture consistent, so that any of the components can be replaced without any change in the design.

  • Amazon Bedrock: The next option is to use the Amazon Titan Text Embeddings model or the Cohere Embed model via the Amazon Bedrock API, which is still in preview. Since Amazon Bedrock is serverless, it would be quite easy to use these models in a RAG solution with no operational overhead. The pricing information is not yet available but will be published once the service becomes generally available.
Credits: Vikesh Pandey
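For illustration, here is a minimal sketch of deploying the GPT-J embedding model from SageMaker JumpStart and generating embeddings with it. The model_id and instance type are assumptions based on the JumpStart catalogue; verify the exact identifiers available in your region before using them.

# A minimal sketch, assuming the JumpStart GPT-J embedding model id below is
# available in your region; check the id and instance type in JumpStart first.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-textembedding-gpt-j-6b-fp16")  # assumed id
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

# JumpStart embedding models accept a list of texts and return one vector per text.
response = predictor.predict({"text_inputs": ["What is Retrieval Augmented Generation?"]})
query_embedding = response["embedding"][0]

predictor.delete_endpoint()  # clean up to stop paying for the instance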

To summarize, this is how all the options stack up against each other.

The next component in the RAG architecture is the vector store; let’s explore what options we have there.

Vector Store

This component stores all the embeddings in a way that makes it easy to retrieve the relevant embeddings for a query. Let’s dive deep into the options.

Note on cost: Since some of the options discussed here are not AWS-native offerings, we discuss Total Cost of Ownership (TCO), which includes not only the usual compute and storage costs but also the cost of maintaining the solution in terms of skilled hours and other operational overheads around testing, environment setup, etc.

  • Amazon OpenSearch Service (managed hosting): OpenSearch is a distributed, community-driven, Apache 2.0-licensed, 100% open-source search and analytics suite that can be used for storing and retrieving embeddings. Powered by the Apache Lucene search library, it supports a number of search and analytics capabilities, such as k-nearest neighbors (k-NN) search, which is ideal for vector retrieval. You can quickly set it up via the boto3 APIs or the AWS console, and it scales really well on storage and compute via sharding and dedicated master nodes. General Python knowledge is needed to set it up; see the sketch after this list.
Credits: Vikesh Pandey
Credits: Vikesh Pandey
  • Amazon RDS/Aurora (PostgreSQL) using pgvector: Another promising option is using Amazon RDS (PostgreSQL) or Amazon Aurora (PostgreSQL) with the open-source pgvector extension. Just set up an RDS cluster and install pgvector. As installing pgvector is a manual effort, there is some operational overhead involved in keeping the extension up-to-date and, in some cases, tuning it. Also, an understanding of SQL along with Python will be needed to run this solution, so the TCO will be a bit higher than with the previous option. It uses L2 distance, cosine similarity and inner product as some of the techniques to find relevant embeddings.
Credits: Vikesh Pandey
  • FAISS: The most popular non-AWS alternative for storing vectors. It’s open source and provides lightning-fast retrieval. It offers a myriad of indexing approaches, including product quantization, HNSW and IVF. It can scale very well, with GPU support available for even faster performance. But since it’s a manual installation, the total cost of ownership is quite high: as a developer, you need to set up the whole FAISS cluster, update it, patch it, secure it and tune it as per your requirements. You can run it on Amazon EC2, Amazon ECS, Amazon EKS or any other persistent compute offering on AWS.
Credits: Vikesh Pandey
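As a concrete example for the OpenSearch option above, here is a minimal sketch of creating a k-NN index and running a vector query with the opensearch-py client. The domain endpoint, credentials and embedding dimension (768) are assumptions; adjust them for your own domain and embedding model.

# A minimal sketch of k-NN indexing and querying on Amazon OpenSearch.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],  # assumed endpoint
    http_auth=("user", "password"),                                         # assumed credentials
    use_ssl=True,
)

# Create an index with a knn_vector field matching your embedding dimension.
client.indices.create(
    index="rag-chunks",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": 768},
            }
        },
    },
)

# Index a document chunk together with its embedding ...
client.index(index="rag-chunks", body={"text": "some chunk", "embedding": [0.1] * 768})

# ... and retrieve the k nearest chunks for a query embedding.
results = client.search(
    index="rag-chunks",
    body={"size": 3, "query": {"knn": {"embedding": {"vector": [0.1] * 768, "k": 3}}}},
)
context_chunks = [hit["_source"]["text"] for hit in results["hits"]["hits"]]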

Special Mention: Amazon Kendra — replacing embedder and vector store with a single component

There is one more option which deserves a mention due to its ability to replace both the embedding and vector store components.

Amazon Kendra: A fully managed service that provides out-of-the-box semantic search capabilities over documents and passages. There is no need to deal with word embeddings, document chunking, a vector store, etc. Amazon Kendra provides the Retrieve API, designed for the RAG use case. It also comes with pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence, and websites, and supports common document formats such as HTML, Word, PowerPoint, PDF, Excel, and plain text files. Since it’s a serverless experience replacing two components at once, the TCO can be quite low.

Credits: Vikesh Pandey
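For reference, here is a minimal sketch of calling the Kendra Retrieve API with boto3. The index id is a placeholder; you would create the index and ingest your documents (for example via the S3 connector) beforehand.

# A minimal sketch of using Amazon Kendra as a combined retriever for RAG.
import boto3

kendra = boto3.client("kendra", region_name="us-east-1")

response = kendra.retrieve(
    IndexId="00000000-0000-0000-0000-000000000000",   # placeholder index id
    QueryText="What is Retrieval Augmented Generation?",
    PageSize=3,
)

# Each result item is a semantically relevant passage you can pass to the LLM as context.
context_chunks = [item["Content"] for item in response["ResultItems"]]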

NOTE: The TCO for Kendra is mentioned as low primarily for two reasons:

1. It replaces two components of the RAG architecture (embedding and vector store). So, compared with the combined TCO of managing an embedding model and a vector store, Kendra’s TCO comes out cheaper. Plus, Kendra also offers volume discounts: the higher the consumption, the lower the cost.

2. It’s serverless.

And to end this section, here is the summary of all the options we discussed:

Credits: Vikesh Pandey

Apart from the above, some honorable mentions that were not covered in this vector database comparison are Weaviate, Pinecone and Chroma, which also have good adoption in the developer community.

Next up is the last component of RAG, which is the LLM. Let’s explore the choice of LLMs on AWS.

Large Language Model

This space is moving extremely fast, so I will cherry-pick the ones which are quite popular in RAG-based reference implementations on AWS.

NOTE: Some of the third-party models mentioned below can very well be accessed via their own APIs, but using them via AWS has the advantage that the models are hosted inside AWS itself, which improves the security and networking posture.

Jurassic-2 via SageMaker JumpStart: The first option here is Jurassic-2 from AI21 Labs, which is also available via SageMaker JumpStart. You can deploy the model to a SageMaker endpoint within minutes from SageMaker Studio or via the SageMaker APIs. Remember, the developer experience here is not serverless. The cost is made up of the vendor’s model price and SageMaker instance pricing. Its biggest advantage is that it’s multi-lingual and has been rated as one of the top-performing LLMs on Stanford’s HELM benchmark, so you can expect high-quality output.

Credits: Vikesh Pandey

Falcon via SageMaker JumpStart: Another very popular alternative is the Falcon model. Being open source, the software pricing is $0, which makes it a really compelling choice on AWS: you only pay the instance pricing. It’s ideal for Q&A and advanced information extraction. Keep in mind that the context length is around 2k tokens, which might be small for some use-cases. A deployment sketch follows below.

Credits: Vikesh Pandey
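Here is a minimal sketch of deploying Falcon-7B-Instruct via SageMaker JumpStart and prompting it with retrieved context. The model_id and instance type are assumptions; verify them in the JumpStart catalogue for your region.

# A minimal sketch, assuming the JumpStart Falcon-7B-Instruct model id below.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")  # assumed id
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

prompt = (
    "Answer the question using only the context below.\n\n"
    "Context: <retrieved chunks go here>\n\n"
    "Question: What is RAG?\nAnswer:"
)
response = predictor.predict({"inputs": prompt, "parameters": {"max_new_tokens": 128}})
print(response)  # typically a list like [{"generated_text": "..."}]

predictor.delete_endpoint()  # remember the ~2k context limit when building prompts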

Llama 2 via SageMaker JumpStart: The top-performing LLM on the Hugging Face leaderboard as of writing this blog, Llama 2 is also available as part of Amazon SageMaker JumpStart. Being published as open source*, the software pricing is $0, which makes it a really compelling choice on AWS: you only pay the instance pricing. There are multiple variants of it, in 7B, 13B and 70B parameter sizes.

Credits: Vikesh Pandey

*Read the EULA carefully; there are certain restrictions on commercial use.

Special mention: Amazon Bedrock — pick the LLM of your choice

Amazon Bedrock: Amazon Bedrock is the most unique offering in this list. It is a fully serverless experience where a single Bedrock API can be used to invoke the Amazon Titan models, Jurassic-2 from AI21 Labs, Claude 2 from Anthropic, Command and Embed from Cohere, and the Stable Diffusion models from Stability AI. Bedrock lets you choose the model that’s best suited for your use case and handles all the infrastructure and scaling behind the scenes. The simple fact that it’s serverless makes a very strong case for the TCO to be quite low. An invocation sketch follows below.

Credits: Vikesh Pandey
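As an illustration, here is a minimal sketch of invoking a model through Amazon Bedrock with boto3. Since Bedrock was still in preview at the time of writing, the client name, model id and request body shown here follow the API as it later became generally available and should be treated as assumptions.

# A minimal sketch of invoking Claude 2 through Amazon Bedrock.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: What is Retrieval Augmented Generation?\n\nAssistant:",
    "max_tokens_to_sample": 300,
})
response = bedrock.invoke_model(modelId="anthropic.claude-v2", body=body)
print(json.loads(response["body"].read())["completion"])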

And to end this section, here is the summary of all the options discussed:

Credits: Vikesh Pandey

Conclusion

To conclude, there is no universally correct answer to finding the ideal components for a RAG-based solution. And from a technical standpoint, there should not be a universal answer: every customer problem, environment and domain dataset is unique, so it would be really sub-optimal to apply the same solution everywhere. This article attempts to compare different options but cannot (and does not) claim to be an exhaustive study of all the facets of all the options mentioned here. As a reader, you are advised to use your discretion, and lots of data, while making architectural choices for RAG.

If you liked what you read, please give it a clap and share it with your network. As a technologist who loves to write, I will be sharing a lot more interesting articles here, so feel free to follow me here and on LinkedIn.


Vikesh Pandey

Sr. ML Specialist Solutions Architect@AWS. Opinions are my own.