Beyond a Simple RAG Architecture: Leveraging the AWS Cloud
A robust and scalable infrastructure is crucial for businesses to thrive in today’s competitive landscape. In this article, I explore how to design a Retrieval Augmented Generation (RAG) architecture on Amazon Web Services (AWS), focusing on the three pivotal decisions I made while implementing it and how each one leverages the benefits of cloud computing.
Seamless User Onboarding
The first decision I made was to ensure seamless user onboarding. To manage user authentication within the RAG application, I chose to integrate Amazon Cognito. This flexible user identity and access management service simplifies sign-ups, logins, and password resets, so users can access the application and start using its features without friction.
The workflow begins with the user submitting their username and password to request a token from Amazon Cognito [1]. Upon successful validation, Amazon Cognito generates the token and returns it to the client [2]. When the client makes an API request to the server, it includes this token in the header of the HTTP request [3]. Amazon Cognito verifies the token presented by the client during each API call [4]. If the token is valid, the system proceeds with executing the requested action [5].
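Below is a minimal sketch of this flow using boto3 and requests. The region, app client ID, credentials, and API endpoint are placeholders; it assumes a user pool app client with the USER_PASSWORD_AUTH flow enabled and an API that accepts the token as a Bearer header (for example, behind an Amazon API Gateway Cognito authorizer).

```python
# Hypothetical sketch of the Cognito token flow; IDs, credentials, and the
# endpoint below are placeholders, not values from a real deployment.
import boto3
import requests

cognito = boto3.client("cognito-idp", region_name="eu-west-1")

# [1] The user submits their username and password to request a token.
response = cognito.initiate_auth(
    ClientId="YOUR_APP_CLIENT_ID",  # assumption: your user pool app client ID
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "alice", "PASSWORD": "s3cret"},
)

# [2] On successful validation, Cognito returns the tokens to the client.
id_token = response["AuthenticationResult"]["IdToken"]

# [3] The client includes the token in the header of each API request;
# [4]-[5] the backend (e.g. an API Gateway Cognito authorizer) verifies it
# before executing the requested action.
api_response = requests.get(
    "https://api.example.com/documents",  # assumption: your API endpoint
    headers={"Authorization": f"Bearer {id_token}"},
)
print(api_response.status_code)
```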
With user authentication taken care of, the next step is to ensure that the RAG architecture is scalable and cost-effective. This brings me to my second decision: the ability to scale down to zero.
Scale Down to Zero
Next, I will describe a serverless solution specifically designed for the RAG architecture’s data ingestion process. Given the event-driven nature of data ingestion, it is essential to select a scalable and cost-effective approach that utilizes resources only when required while incurring minimal costs during idle periods.
To handle the data ingestion process for the RAG architecture, I have used a combination of Amazon Simple Storage Service (S3), Amazon EventBridge, AWS Lambda, and Amazon Aurora Serverless.
Amazon S3 is an object storage service that provides scalable and secure storage for the incoming PDF documents. Amazon EventBridge is a serverless event bus that makes it easy to connect applications using events from various sources. AWS Lambda is a serverless compute service that lets me run code without provisioning or managing servers. Amazon Aurora Serverless is a fully managed relational database that automatically scales capacity up and down based on the application’s needs.
For the RAG architecture, the data ingestion process begins when a new object is added to the Amazon S3 bucket [1]. To handle this event, I have configured Amazon EventBridge to trigger an AWS Lambda function [2]. The Lambda function processes each PDF by splitting it into chunks and calling a Cohere inference API to obtain the embeddings [3]. The chunks and their embeddings are then stored in an Amazon Aurora Serverless PostgreSQL database, ensuring secure storage and efficient retrieval at query time [4].
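A hypothetical sketch of such a Lambda handler is shown below. It assumes the pypdf, cohere, and psycopg2 packages are bundled with the function, that the Cohere key and database credentials shown are placeholders (in practice they would come from environment variables or Secrets Manager), and that the target table uses the pgvector extension; chunking here is simply one chunk per page.

```python
# Hypothetical ingestion Lambda sketch; the API key, database credentials,
# table schema, and embedding model name are assumptions for illustration.
import boto3
import cohere
import psycopg2
from pypdf import PdfReader

s3 = boto3.client("s3")
co = cohere.Client("YOUR_COHERE_API_KEY")  # assumption: key injected at deploy time

def handler(event, context):
    # [1]-[2] EventBridge delivers the S3 "Object Created" event to this function.
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    # [3] Download the PDF and split it into chunks (here: one chunk per page).
    s3.download_file(bucket, key, "/tmp/input.pdf")
    reader = PdfReader("/tmp/input.pdf")
    chunks = [page.extract_text() for page in reader.pages if page.extract_text()]

    # [3] Call Cohere's embed endpoint to get one embedding per chunk.
    embeddings = co.embed(
        texts=chunks,
        model="embed-english-v3.0",  # assumption: embedding model name
        input_type="search_document",
    ).embeddings

    # [4] Store chunks and embeddings in Aurora Serverless PostgreSQL
    # (assumes a "chunks" table with a pgvector "embedding" column).
    conn = psycopg2.connect(host="YOUR_AURORA_ENDPOINT", dbname="rag",
                            user="postgres", password="YOUR_PASSWORD")
    with conn, conn.cursor() as cur:
        for chunk, emb in zip(chunks, embeddings):
            vector_literal = "[" + ",".join(str(x) for x in emb) + "]"
            cur.execute(
                "INSERT INTO chunks (content, embedding) VALUES (%s, %s::vector)",
                (chunk, vector_literal),
            )
    conn.close()
    return {"chunks_ingested": len(chunks)}
```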
Using a serverless solution for the RAG architecture’s data ingestion process offers several benefits. Firstly, it eliminates infrastructure management, allowing me to focus on developing and improving the application’s core functionality. Secondly, serverless services scale automatically to handle varying workloads, ensuring high availability and responsiveness even during peak usage. Thirdly, the pay-per-use pricing of services like AWS Lambda and Amazon Aurora Serverless means paying only for the compute resources actually consumed, which yields significant cost savings compared to traditional server-based architectures.
Deploy Mistral 7B on AWS EC2
Moving on to the deployment of a large language model (LLM), I assessed multiple options and believe Mistral 7B is an excellent choice due to its balance of quality and manageable size. Mistral 7B was created by Mistral AI, a company focused on developing open-source language models to advance the field of artificial intelligence. Its VRAM requirement of roughly 14 GB in half precision makes it a good fit for AWS g5.xlarge instances, which provide 24 GB of VRAM on an NVIDIA A10G GPU.
For high-throughput and memory-efficient LLM inference and serving, consider the Python package vLLM. For larger models that need more than one GPU, the Python package Ray may be required for efficient distributed inference.
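As a minimal sketch, vLLM’s offline-inference API can serve Mistral 7B on a single g5.xlarge GPU; the Hugging Face model ID and sampling settings below are illustrative assumptions.

```python
# Minimal vLLM sketch for single-GPU inference; model ID and settings are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumption: instruct variant of Mistral 7B
    # tensor_parallel_size=2,  # for larger models spread across multiple GPUs (uses Ray)
)

sampling = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Answer the question using the retrieved context:\n..."],  # prompt built from retrieved chunks
    sampling,
)
print(outputs[0].outputs[0].text)
```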
While deploying Mistral 7B on AWS g5.xlarge instances offers greater control and customization, beginners might also consider using Amazon Bedrock. This fully managed service simplifies the deployment and integration of foundation models like Mistral 7B, allowing you to focus on building your application without the need to manage infrastructure.
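For comparison, here is a hedged sketch of invoking a Mistral model through Amazon Bedrock with boto3; the model ID and the request/response fields are assumptions based on Bedrock’s Mistral integration and should be checked against the current documentation.

```python
# Hypothetical Bedrock invocation sketch; model ID and payload shape are assumptions.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.invoke_model(
    modelId="mistral.mistral-7b-instruct-v0:2",  # assumption: Bedrock model ID for Mistral 7B
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "prompt": "<s>[INST] Answer using the retrieved context: ... [/INST]",
        "max_tokens": 256,
        "temperature": 0.2,
    }),
)

result = json.loads(response["body"].read())
print(result["outputs"][0]["text"])  # assumption: Mistral response shape on Bedrock
```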
That’s a wrap on my discussion of leveraging AWS for a RAG architecture.