AIOps — Improve operational efficiency on AWS with AI & GenAI

Fabrizio Ronzino
Published in Storm Reply

How to set up enhanced production monitoring by leveraging Amazon Bedrock.

In traditional IT environments, Cloud Operations teams often rely on ITIL (Information Technology Infrastructure Library) guidelines to manage and monitor production workloads. ITIL provides a comprehensive framework for IT service management, ensuring that Key Performance Indicators (KPIs) are effectively measured and that predefined procedures are in place to handle incidents. This standard approach works well for common and predictable issues, where automated processes can resolve most incidents efficiently. However, challenges arise when dealing with unexpected events or complex incidents involving multiple assets, which require more sophisticated solutions to minimize downtime and false positives.

The main phases of monitoring and managing service availability are built upon:

  • collection of focal KPIs (Key Performance Indicators) that measure how effectively the system achieves the expected performance, and mapping each of them to a specific priority level
  • deployment of one or more probes for each KPI, set at specific measurable thresholds that define the minimum acceptable quality level for that dimension of the system
  • setup of alarms via a notification process that alerts all relevant stakeholders when a threshold is breached (see the sketch after this list)
  • definition of a standard procedure for each alarm, aimed at resolving the ongoing disruption of the system
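As a concrete example, on AWS the probe and alarm steps map naturally to Amazon CloudWatch. The sketch below (instance ID, alarm name, threshold, and SNS topic ARN are illustrative assumptions) creates an alarm on EC2 CPU utilization that notifies stakeholders through SNS when the threshold is breached:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Probe: the CPUUtilization metric of a specific instance, sampled every 5 minutes.
    # Alarm: breaching the 80% threshold for 2 consecutive periods notifies an SNS topic.
    cloudwatch.put_metric_alarm(
        AlarmName="prod-customerA-projectX-ec2-cpu-above-80",  # illustrative naming convention
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],  # hypothetical topic
    )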

This standard approach fits trivial and recurrent use cases, where the procedure is automated and resolves most of the incidents.

However, when unexpected events or distributed incidents involving multiple assets occur simultaneously, a single predefined procedure may not suffice.

Moreover, we would like to spend time on in-depth troubleshooting only for real incidents, avoiding false positives.

This article proposes a technical solution to set up a custom monitoring system that leverages AWS services and Gen-AI to enhance the quality of your troubleshooting capabilities and to reduce MTTD (Mean Time To Detection) and MTTR (Mean Time To Resolution).

Goal of the solution

The focus of the solution is to provide direct support to on-duty operators on enterprise workloads during their day-to-day activities, both at the functional and infrastructural levels, shrinking the time gap between when an incident occurs and its resolution.

The operator will have access to a tool that provides smart resolution suggestions tailored to custom KPIs.

Amazon Bedrock

During our analysis, we were looking for the best AWS service to fit our use case. We examined different technologies:

  • Amazon SageMaker
  • Amazon Bedrock
  • Amazon Q Business

Amazon SageMaker allows you to create, train, and deploy ML models, enabling highly customized solutions. The drawback is the high effort required and an overall accuracy that is similar to, or even worse than, the “Bedrock with RAG” solution.

Amazon Q Business, on the other hand, comes preconfigured: you just connect it to your data source in a click and you are ready to go. The drawback is the lack of flexibility in terms of customization, leading to worse accuracy compared to the “Bedrock with RAG” solution.

Amazon Bedrock was the best compromise between the other two technologies, offering the best balance between what comes pre-configured and what can be customized to our use case, and it let us achieve the best accuracy and quality of answers from the model.

Amazon Bedrock is a fully managed service by AWS that provides high-performing Foundation Models for building generative AI applications. It offers access to pre-trained AI models from leading providers, enabling developers to integrate advanced AI capabilities into their applications without the burden of developing their own models.

Amazon Bedrock can augment response generation with information from your data sources, can adapt models to specific tasks and domains with training data, and can also prevent inappropriate or unwanted content by implementing guardrails.

RAG Pattern Architecture

Before deep diving into the technical solution, it is important to understand the high-level key features of the approach we are going to implement.

We are talking about RAG!

RAG stands for Retrieval-Augmented Generation. It optimizes the output of a Large Language Model by referencing an authoritative knowledge base outside its training data before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization’s internal knowledge base, all without the need to retrain the model. It is also a cost-effective approach.

The RAG architecture is described by the following image:

The overall process is split in 2 main phases:

  • in the first one, we feed the Embedding Model with the input documentation and populate the database with processed data
  • in the second one, we provide the input and retrieve the relevant information

The flow of this second phase is forked into 2 main parts:

  • a search-engine pattern, which leverages an Embedding Model and extracts the high-quality, relevant information we were looking for
  • an LLM pattern, which leverages a Foundation Model that manipulates the retrieved information and returns it to the operator as a human-readable answer (sketched in pseudocode below)
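In Python-flavored pseudocode, this second phase boils down to a retrieve-then-generate pipeline (all function names below are placeholders, not a real API):

    def answer(alarm_name: str) -> str:
        # Search-engine part: embed the query and fetch the closest chunks.
        query_vector = embedding_model(alarm_name)   # placeholder
        chunks = vector_db_top_k(query_vector, k=2)  # placeholder

        # LLM part: let the Foundation Model rework the chunks into a readable answer.
        prompt = build_prompt(alarm_name, chunks)    # placeholder
        return foundation_model(prompt)              # placeholder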

Now let’s take a look at the technical implementation.

Setup Gen-AI monitoring with Amazon Bedrock

In this scenario, we set up Amazon Bedrock with the RAG pattern architecture, since it allows us to achieve better results in terms of accuracy and quality of the provided output.

The first step is to configure the two Amazon Bedrock models that will constitute the infrastructure:

  • the Embedding Model
  • the Foundation Model (LLM)

The Embedding Model will be configured with Titan Text Embeddings V2, provided by Amazon. It will receive text input (such as the alarm name) and return the corresponding embedding vector.
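A minimal embedding call through the Bedrock Runtime API could look like the following sketch (the region and the number of dimensions are our assumptions):

    import json
    import boto3

    bedrock_runtime = boto3.client("bedrock-runtime", region_name="eu-west-1")

    def embed(text: str) -> list[float]:
        # Titan Text Embeddings V2 accepts plain text and returns a dense vector.
        response = bedrock_runtime.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",
            body=json.dumps({"inputText": text, "dimensions": 1024, "normalize": True}),
        )
        return json.loads(response["body"].read())["embedding"]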

The Foundation Model (LLM) will be configured with Claude 3.5 Sonnet by Anthropic, the most recent and advanced model released by Anthropic, on June 20, 2024.

We chose Claude 3.5 Sonnet among the various Foundation Models because it provided the best results and the highest accuracy, leveraging its advanced NLP capabilities such as deep text comprehension and text generation for complex, context-sensitive tasks.

Now we have to embed the knowledge base. In this case, we used our internal enterprise documentation, uploaded to an S3 bucket in docx format. This process can be enhanced and automated to manage the update lifecycle of the documentation, but that is outside the scope of this article. An AWS Lambda function then fetches the documentation and splits it into chunks, which define the basic unit of data.
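A minimal version of such a chunking Lambda might look like this sketch, assuming the python-docx library is packaged with the function and using an illustrative fixed-size split with overlap (bucket and key are hypothetical):

    import io
    import boto3
    from docx import Document  # python-docx, packaged with the Lambda

    s3 = boto3.client("s3")

    def handler(event, context):
        # Fetch the docx uploaded to the knowledge-base bucket (names are illustrative).
        obj = s3.get_object(Bucket="kb-docs-bucket", Key=event["key"])
        doc = Document(io.BytesIO(obj["Body"].read()))
        text = "\n".join(p.text for p in doc.paragraphs)

        # Fixed-size chunks with a small overlap so procedures are not cut mid-step.
        size, overlap = 1000, 100
        return [text[i : i + size] for i in range(0, len(text), size - overlap)]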

Then, in the embedding phase, we feed the Embedding Model with all the prepared chunks: the model processes each string and returns an embedding array for each chunk.

Once the Amazon Bedrock model calls complete, we store this information in a Postgres Vector DB as tuples, structured as follows:

<array>:<chunk>

This datastore will be accessed every time the operator provides an alarm name as input, and it will return a predefined number of procedures extracted from the most similar entries among the K nearest neighbors.

In this example, we retrieve the two most similar procedures from the Vector DB:
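Assuming the Postgres instance runs the pgvector extension, storing the tuples and retrieving the two nearest neighbors by cosine distance can be sketched as follows (table, column, and connection details are illustrative):

    import psycopg2

    conn = psycopg2.connect("dbname=kb user=ops")  # hypothetical connection string

    def store_chunk(embedding: list[float], chunk: str) -> None:
        with conn.cursor() as cur:
            # Table created beforehand with: CREATE EXTENSION vector;
            # CREATE TABLE procedures (embedding vector(1024), chunk text);
            cur.execute(
                "INSERT INTO procedures (embedding, chunk) VALUES (%s::vector, %s)",
                (str(embedding), chunk),
            )
        conn.commit()

    def top_k(query_embedding: list[float], k: int = 2) -> list[str]:
        with conn.cursor() as cur:
            # <=> is pgvector's cosine distance operator: smaller means more similar.
            cur.execute(
                "SELECT chunk FROM procedures ORDER BY embedding <=> %s::vector LIMIT %s",
                (str(query_embedding), k),
            )
            return [row[0] for row in cur.fetchall()]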

Once this first setup is complete, we can move on to the Text Generation phase. The retrieved procedures, along with the alarm name, compose the input to an AWS Lambda function that uses them to query the LLM Foundation Model:

The model will return a refined answer containing all relevant information for resolving the production incident.
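The generation step can be sketched with the Bedrock Converse API, reusing the bedrock_runtime client from the embedding example above (the prompt template is our assumption, not a prescribed format):

    def generate_answer(alarm_name: str, procedures: list[str]) -> str:
        # Enrich the prompt with the procedures retrieved from the Vector DB.
        prompt = (
            f"An alarm named '{alarm_name}' fired in production.\n"
            "Relevant internal procedures:\n" + "\n---\n".join(procedures) +
            "\nProvide a step-by-step resolution procedure for the operator."
        )
        response = bedrock_runtime.converse(
            modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
        )
        return response["output"]["message"]["content"][0]["text"]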

As a final step, we expose a simple frontend to enter the input, which will act as our virtual assistant. Bear in mind that the setup of the virtual assistant will not be discussed in this brief article, but we would like to share some basic ideas for an implementation:

  • developing a custom GUI to let the user dynamically interact with the backend
  • integrating the backend with popular chat products, such as Microsoft Teams or Slack
  • directly invoking the backend with an API call through Amazon API Gateway.

The setup phase is complete! Now we can try the solution by providing the name of an alarm as input, and the AI will return a complete and detailed procedure on how to fix the problem affecting the production workload.

Demo of the solution

In the context of a real-time monitoring system for production environments, the input will be the name of the alarm that describes the ongoing incident.

In our scenario, the monitoring system composes the alarm name from all relevant information about the impaired asset, such as:

  • name of the customer
  • name of the project
  • name of the resource
  • name of the breached threshold
  • value of the threshold

The process starts when the operator prompts the alarm name to the virtual assistant. The chatbot forwards the provided input to the Embedding Model, which returns an array representing the embedded alarm.

The embedded alarm is used as the key for a cosine-similarity search that finds the K nearest neighbors in the Postgres Vector DB, letting us retrieve the most similar procedures.

Now we combine the initial input and the retrieved output into a final question, enriching the prompt delivered to the LLM Foundation Model. The model elaborates on the provided information and returns a complete, accurate, and well-formatted answer: a detailed step-by-step resolution procedure for the ongoing incident, enriched with suggestions tailored to the customer workload.
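Putting the pieces together, the whole demo flow reduces to a few lines, reusing the embed, top_k, and generate_answer helpers sketched earlier (all of them our assumptions):

    def resolve_incident(alarm_name: str) -> str:
        # 1. Embed the alarm name with Titan Text Embeddings V2.
        query_vector = embed(alarm_name)
        # 2. Cosine-similarity search for the 2 most similar procedures in Postgres.
        procedures = top_k(query_vector, k=2)
        # 3. Ask Claude 3.5 Sonnet for a step-by-step resolution tailored to the alarm.
        return generate_answer(alarm_name, procedures)

    print(resolve_incident("prod-customerA-projectX-ec2-cpu-above-80"))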

Conclusion

In conclusion, Amazon Bedrock with the RAG pattern allows you to enhance production monitoring, providing high-quality responses with specific details by leveraging the customer’s knowledge base.

We explored the main features of this AIOps framework, the benefits of adopting it, and how to set up its infrastructure. This approach can significantly reduce the time required to resolve complex production incidents.
