Retrieval Augmented Generation (RAG) on AWS

Carlo Peluso
Storm Reply
Mar 1, 2024 · 7 min read

🤔 Imagine empowering a language model with the expertise of a seasoned domain expert, without the need for extensive training and costly development efforts.

❗️ The RAG framework revolutionizes Generative AI by seamlessly integrating your domain knowledge into an existing language model, eliminating the time-consuming and costly process of building a model from the ground up.

RAG: Retrieval Augmented Generation framework

Generative AI models, while impressive, lack the contextual understanding and knowledge base needed to truly excel in specific domains.

The Retrieval Augmented Generation framework, as its name suggests, combines retrieval and generation capabilities to craft insightful and comprehensive responses, even when utilizing Generative AI models lacking domain knowledge.

But how does it work?

Retrieval: because the Generative AI model has a limited understanding of the domain, it may struggle to provide context-aware answers on its own. The RAG framework therefore employs a search engine to locate the information you seek within your documents.

Augmented Generation: the search engine returns the documents pertinent to your query. The LLM then receives (1.) the query and (2.) the retrieved information within a single, comprehensive prompt.
Fortified with the contextual knowledge from the retrieved documents, the Generative AI model crafts a response that is not only factually grounded but also contextual and insightful.

Wrapping up, RAG systems are based on two main components:

  1. a Search Engine, for retrieving contextual information within the Data Lake
  2. a Large Language Model, for generating comprehensive responses given the contextual information and input query
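
To make this two-step flow concrete, here is a minimal, framework-agnostic sketch in Python; `embed`, `vector_store`, and `llm` are hypothetical placeholders standing in for an embedding model, a vector database client, and a text generation model, not a specific library.

```python
def answer(query: str) -> str:
    # 1. Retrieval: embed the query and look up the most similar documents
    query_embedding = embed(query)                                  # hypothetical embedding helper
    context_docs = vector_store.search(query_embedding, top_k=3)    # hypothetical search engine client

    # 2. Augmented Generation: pass query and retrieved context to the LLM in one prompt
    context = "\n\n".join(doc.text for doc in context_docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm.generate(prompt)                                     # hypothetical LLM client
```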

Search Engines

Nowadays, textual search engines rely on vector databases.

In this context, vector databases are meant to store embeddings — i.e., the vector representation of textual documents.

Searches are performed by measuring the similarity between the vector representation of the input query and the document embeddings stored in the vector database (see Appendix).

Similarity measures evaluate how relevant a document is to a query.
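
As a minimal illustration, cosine similarity is one of the most common measures; the toy embeddings below are made up purely for the example.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    # 1.0 = same direction, 0.0 = unrelated, -1.0 = opposite
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_embedding = [0.9, 0.1, 0.0]                          # toy query embedding
document_embeddings = [[0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]   # toy document embeddings

scores = [cosine_similarity(query_embedding, d) for d in document_embeddings]
best_match = int(np.argmax(scores))                        # index of the most relevant document
```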

The Role of Large Language Models in Search Engines

Large Language Models (LLMs) have revolutionized search engine development, making it far more accessible than ever before. LLMs are pre-trained to create their own word representations, essentially mapping words into a high-dimensional space.

Simply feed an LLM a text document, and it will output a vector embedding that captures the document’s core meaning within this learned space.

These embeddings can then be seamlessly stored in a vector database for future retrieval operations.

Summing up, the two main implications of LLMs in RAG systems are:

  1. LLMs employed as text embedding models, i.e. as models that take a document as input and return the embeddings (i.e. the vector representation) of that document (see the sketch after this list)
  2. LLMs employed as text generation models, i.e. as models that take a prompt as input (in the case of RAG, the combination of the input query and the retrieved context) and output a comprehensive, context-aware response
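
For the first point, here is a minimal sketch of an embedding call using Amazon Titan Embeddings through Amazon Bedrock (introduced later in this article) as one possible choice; the model ID, region, and sample text are illustrative, and error handling is omitted.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text: str) -> list[float]:
    """Return the vector representation of a text via Amazon Titan Embeddings."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

document_embedding = embed("Our refund policy allows returns within 30 days.")
print(len(document_embedding))  # embedding dimensionality (1,536 for this model)
```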

Retrieval Augmented Generation on AWS

Amazon Web Services (AWS) provides a wide variety of services that can meet your performance and cost-effectiveness requirements.

In the following sections, we will present the most popular vector databases available on AWS along with Amazon Bedrock, the AWS service that allows you to effortlessly integrate state-of-the-art Large Language Models within your applications.

Vector Databases on AWS

Amazon Neptune Analytics

Amazon Neptune Analytics is a graph database that excels at handling highly interconnected data.

Beyond traditional graph traversals, Amazon Neptune Analytics supports vector search over graph embeddings: it can efficiently find documents similar to your query based on their semantic meaning.

This capability aligns perfectly with RAG’s reliance on vector representations for information retrieval on complex knowledge graphs.

Furthermore, Amazon Neptune Analytics integrates seamlessly with Amazon S3 and with Amazon Neptune Database endpoints, making it easy to load graph datasets.

Amazon OpenSearch Serverless

Amazon OpenSearch Serverless is a managed service that allows seamless provisioning, configuration, and auto-scaling of an OpenSearch cluster.

Along with OpenSearch’s full-text search and log analytics capabilities, Amazon OpenSearch Serverless provides similarity search capabilities exploiting OpenSearch’s k-NN search plugin.
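
A minimal k-NN query sketch with the opensearch-py client, assuming a collection endpoint, SigV4 authentication, and an index with a vector field named "embedding" are already in place (all names and values below are illustrative).

```python
from opensearchpy import OpenSearch, RequestsHttpConnection

# Hypothetical collection endpoint; SigV4 authentication setup omitted for brevity
client = OpenSearch(
    hosts=[{"host": "your-collection-id.eu-west-1.aoss.amazonaws.com", "port": 443}],
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

query_embedding = [0.12, -0.03, 0.88]  # in practice, produced by your embedding model

response = client.search(
    index="documents",
    body={
        "size": 3,
        "query": {"knn": {"embedding": {"vector": query_embedding, "k": 3}}},
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```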

Amazon RDS (with PostgreSQL extensions)

Amazon RDS is a managed database service that simplifies setting up, operating, and scaling relational databases in the cloud. It automatically handles administrative tasks, allowing you to focus on data and applications.

In particular, Amazon RDS for PostgreSQL supports the pgvector extension, allowing you to store and query embeddings directly within PostgreSQL databases without losing ACID compliance and the other features of Postgres.
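
A minimal sketch of a pgvector similarity query from Python, assuming a "documents" table with id, content, and an embedding vector column (connection details and dimensions are illustrative).

```python
import psycopg2

# Hypothetical RDS endpoint and schema: documents(id, content, embedding vector(3))
conn = psycopg2.connect(host="your-rds-endpoint", dbname="rag", user="app", password="secret")
cur = conn.cursor()

query_embedding = [0.12, -0.03, 0.88]  # in practice, produced by your embedding model

# Order by L2 distance (<->); pgvector also provides cosine (<=>) and inner product (<#>) operators
cur.execute(
    "SELECT id, content FROM documents ORDER BY embedding <-> %s::vector LIMIT 5",
    (str(query_embedding),),
)
for doc_id, content in cur.fetchall():
    print(doc_id, content[:100])
```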

Amazon DocumentDB

Amazon DocumentDB is a fully managed document database service that excels at storing and retrieving semi-structured data.

Natively supporting JSON documents, it aligns naturally with knowledge bases made up of heterogeneous collections of textual documents.

Amazon DocumentDB combines vector search functionalities with the flexibility of JSON-based document database features, enabling the development of Generative AI use-cases based on semi-structured data.
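
A minimal sketch of what a vector search against DocumentDB can look like with pymongo, assuming a vector index already exists on the "embedding" field (endpoint, credentials, and field names are illustrative).

```python
from pymongo import MongoClient

# Hypothetical cluster endpoint; TLS certificate options omitted for brevity
client = MongoClient("mongodb://user:secret@your-docdb-cluster-endpoint:27017/?tls=true")
collection = client["rag"]["documents"]

query_embedding = [0.12, -0.03, 0.88]  # in practice, produced by your embedding model

# Vector search stage over the "embedding" field, returning the 3 nearest documents
pipeline = [
    {
        "$search": {
            "vectorSearch": {
                "vector": query_embedding,
                "path": "embedding",
                "similarity": "cosine",
                "k": 3,
            }
        }
    }
]
for doc in collection.aggregate(pipeline):
    print(doc["title"])
```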

Amazon Kendra

Amazon Kendra is a fully managed intelligent search engine that makes it easy to search for information within the documents stored in your data lake.

Unlike the aforementioned AWS services, Amazon Kendra is a search engine that provides built-in connectors to third-party data sources, allowing you to effortlessly retrieve information from your documents with no need for complex vector similarity queries.
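
A minimal sketch of querying Kendra with boto3, using its Retrieve API to get back relevant passages (the index ID and question are placeholders).

```python
import boto3

kendra = boto3.client("kendra", region_name="eu-west-1")

# Hypothetical index ID; Kendra returns relevant passages, no embeddings required on your side
response = kendra.retrieve(
    IndexId="your-kendra-index-id",
    QueryText="What is our refund policy?",
)
for item in response["ResultItems"]:
    print(item["DocumentTitle"], "->", item["Content"][:200])
```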

Vector Databases on AWS: Features

Choosing the right vector database hinges on your specific requirements. Consider factors like:

  • Data volume and complexity: How much data do you have, and how intricate are the relationships between documents?
  • Query performance: How quickly do you need results returned?
  • Scalability: Can the database handle growing data volumes and query loads?
  • Cost: Evaluate the pricing models of each service to find the most cost-effective solution. AWS provides detailed documentation and pricing information for each service, empowering you to make an informed decision.

As described in the previous section, each AWS service presented has its own characteristics.

When choosing a specific vector database or search engine, other factors to take into account include:

  • Seamless integration with third-party data repositories:
    Amazon Kendra leverages built-in connectors to a wide variety of data repositories, seamlessly integrating data within its searchable indexes.
    Amazon Neptune Analytics offers direct connections to S3 Buckets and Amazon Neptune endpoints.
    Integrated connectors streamline the ingestion pipeline (i.e., loading documents into the database), minimizing development effort and ensuring ongoing synchronization of fresh content within the vector database.
  • Effortless embedding management:
    Amazon Kendra handles vector embedding creation and management for you, so you don’t have to deal with low-level complexities.
    On the other hand, the other services empower you to (1.) tailor the embedding process to your specific needs and (2.) evaluate different LLMs to identify the optimal fit for your application.

Large Language Models on AWS: Amazon Bedrock

Harness the power of large language models (LLMs) seamlessly with Amazon Bedrock. This service simplifies LLM integration, enabling you to leverage pre-trained models from providers such as Anthropic (Claude), AI21 Labs (Jurassic-2), and Amazon (Titan) for various tasks, including:

  • Text generation: Craft compelling product descriptions, marketing copy, or even creative content.
  • Question answering: Empower your applications with the ability to answer user queries in an informative and comprehensive manner.
  • Code generation: Generate code snippets based on natural language descriptions, accelerating development processes.

Bedrock is fully managed and serverless: you access models through a unified API, with transparent, usage-based pricing (on-demand or Provisioned Throughput).
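
As a minimal sketch of the generation step of a RAG pipeline on Bedrock, here is an invoke_model call with Amazon Titan Text as one possible model; the model ID, region, prompt, and parameters are illustrative, and any other Bedrock text model could be swapped in.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# RAG-style prompt: the retrieved context and the user query combined in a single request
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n<retrieved documents go here>\n\n"
    "Question: What is our refund policy?"
)

response = bedrock.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {"maxTokenCount": 512, "temperature": 0.2},
    }),
)
result = json.loads(response["body"].read())
print(result["results"][0]["outputText"])
```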

Takeaways

  • RAG Framework: Seamlessly incorporate your domain expertise into existing language models without extensive training.
    Skip building models from scratch and accelerate time-to-market.
  • Leverage a variety of AWS services, such as vector databases and Amazon Bedrock, to tailor your RAG solution.
    Select the tools best suited to your data, performance, and cost requirements.

Appendix

Word embeddings are vector representations of words within a N-dimensional vector space.

Consider the following 2-dimensional vector space example:

Semantically different words are represented far from each other.
Semantically similar words, instead, are vectors close to each other.

When searching for information within a vector database, the query is projected within the same vector space of the word embeddings.

For instance, consider the query “synonyms of happy”:

Similarity measures are used to retrieve the embeddings closest to the input query.

In this simple example, “happy” is projected close to its most semantically similar words — i.e., the “joyful”, “elated”, “excited” word embeddings cluster.

In the domain of Text Retrieval (TR), the objective is to retrieve, given a user query, a subset of documents from a Data Lake that exhibit the highest degree of semantic similarity or relevance to the query.

Imagine having N documents, each represented as a data point within a 2-dimensional Euclidean space.

Now imagine projecting a query into the same 2-dimensional Euclidean space.

The simplest semantic similarity measure you can think of is the 2-dimensional Euclidean distance between the query and each of the N documents.

In this simplified context, the most similar document (with respect to the query) is the P1 data point.
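
A tiny numeric illustration of this idea, with made-up 2-dimensional coordinates chosen so that P1 turns out to be the closest document:

```python
import numpy as np

# Toy 2-dimensional example: three documents (P1, P2, P3) and one query point
documents = np.array([[1.0, 2.0], [4.0, 4.5], [7.0, 1.0]])
query = np.array([1.5, 2.5])

distances = np.linalg.norm(documents - query, axis=1)  # Euclidean distance to each document
closest = int(np.argmin(distances))                     # index of the most similar document
print(f"Most similar document: P{closest + 1}")         # -> P1
```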
