Building intelligent alerting with Gemini & Function Calling


Introduction

We use alerting to notify the operations team so that problems can be solved or mitigated as soon as possible. Investigating an issue and coming up with a clear solution normally takes a relatively long time. I've always imagined how much nicer it would be if an alert arrived alongside a proposed solution, helping the engineer jump straight to problem solving rather than a tedious investigation process. Large Language Models (LLMs) give us the capability to move in this direction. This article explores how we can leverage Gemini and its native features to get there.

Let’s assume our real world scenarios are the following:

  • GKE as production environment
  • Microservices are running on top of GKE
  • Using Google Operation Suite as a centralized monitoring platform
  • We’re getting an alert triggered by a breach of the metric kubernetes.io/container/restart_count from a GKE cluster (an abridged example of the alert payload is shown below).
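For context, the Cloud Monitoring webhook notification for such an alert carries the labels of the offending resource. Here is a heavily abridged example, expressed as the Python dict the handler will receive; the field values are illustrative only, and the exact shape depends on your notification channel:

# Abridged, illustrative alert payload from Cloud Monitoring (not a verbatim capture)
message = {
    "incident": {
        "policy_name": "container-restart-count",
        "summary": "kubernetes.io/container/restart_count breached the threshold",
        "resource": {
            "type": "k8s_container",
            "labels": {
                "project_id": "my-project",
                "location": "us-central1",
                "cluster_name": "prod-cluster",
                "namespace_name": "default",
                "pod_name": "checkout-7d9f4b5c-x2k8v",
            },
        },
    },
}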

In traditional practice, when an operations engineer was paged by an alert, the following probably happened:

  1. Interpreted the message.
  2. Configured the credentials for the target GKE cluster in order to collect information.
  3. Ran the proper commands or tools to collect whatever information was necessary.
  4. Analyzed the variety of collected information.
  5. Composed an incident report with a potential solution.
  6. Sent it to the relevant person in Chat to fix or investigate further.

The LLM can help with analyzing and reasoning about the alert, and can work as a centralized control plane to handle various issues. However, an LLM has no access to real-time data or to external APIs and services; it is constrained to the information and knowledge it was trained on. This can be frustrating for end users who want the LLM to work with up-to-date information from external systems. Function calling, a feature of the Gemini 1.0 Pro models, lets developers do exactly this: it connects generative models to real-world data via API calls.

Function calling: the bridge to the external world

Function calling allows developers to define custom functions that the model can choose to call; the model outputs structured data naming the function and its arguments, which the application then uses to invoke external APIs. This enables LLMs to access real-time information and interact with various services, such as SQL databases, customer relationship management systems, document repositories, and anything else with an API.

The following diagram illustrates the sequence of interactions between the user, the application, the model, and the function API. It represents a complete set of interactions in the text modality, or a single conversation turn in the chat modality.

If you want to know more about how function calling works, the official Gemini documentation explains it in detail.
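To make the mechanics concrete, here is a minimal, self-contained sketch of a single function-calling round trip with the Vertex AI Python SDK. The get_weather function is a generic placeholder for illustration, not part of this article's application:

import vertexai
from vertexai.generative_models import (
    FunctionDeclaration,
    GenerativeModel,
    Part,
    Tool,
)

vertexai.init(project="your-project", location="us-central1")

# Declare a function the model is allowed to "call"
get_weather = FunctionDeclaration(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

model = GenerativeModel(
    "gemini-1.0-pro-002",
    tools=[Tool(function_declarations=[get_weather])],
)
chat = model.start_chat()

# 1. The model responds with a structured function call instead of text
response = chat.send_message("What is the weather in Paris?")
call = response.candidates[0].content.parts[0].function_call

# 2. The application executes the real API call (stubbed here)
weather = {"city": call.args["city"], "temperature_c": 18}

# 3. The result is sent back, and the model folds it into a natural-language answer
response = chat.send_message(
    Part.from_function_response(name=call.name, response={"content": weather})
)
print(response.text)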

Let’s dive into the details of how we are going to implement this idea and leverage the power of LLMs.

Architecture overview

This is a high-level diagram of what the workflow looks like, following the vision described above.

The major sequence is composed of the following steps:

  1. An alert is triggered when a metric threshold is breached; the metric can be a native one or a custom one.
  2. The Cloud Run service calls the model to determine the proper solution, and the mapped functions are invoked.
  3. Each step of the proposed solution is executed.
  4. The model summarizes the results of executing the solution and aggregates them into a report.
  5. The Cloud Run service pushes the report to Chat (a minimal sketch of this push follows the list).
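Step 5 can be as simple as posting to a Google Chat incoming webhook. A minimal sketch, assuming an incoming webhook has been configured for the target space (the URL below is a placeholder):

import requests

# Placeholder: the incoming-webhook URL generated for your Chat space
CHAT_WEBHOOK_URL = "https://chat.googleapis.com/v1/spaces/SPACE_ID/messages?key=KEY&token=TOKEN"

def push_to_chat(report: str) -> None:
    """Post the incident report to a Google Chat space via its incoming webhook."""
    requests.post(CHAT_WEBHOOK_URL, json={"text": report}, timeout=10)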

Build an intelligent alerting handler

At the heart of intelligent alerting is the handler, which interprets the alert message and takes the necessary actions, such as collecting logs, events, and so on, and then puts everything together to work towards a solution.

Firstly, we define each individual tool as a FunctionDeclaration to help us configure credentials for the GKE cluster and collect information about the pod. Here are the declarations:

from vertexai.generative_models import FunctionDeclaration

# Function to configure credentials for the GKE cluster
get_credential_func = FunctionDeclaration(
    name="get_credential",
    description="Configure the credential and connect to the GKE cluster.",
    parameters={
        "type": "object",
        "properties": {
            "cluster_name": {
                "type": "string",
                "description": "The name of the Kubernetes cluster"
            },
            "region": {
                "type": "string",
                "description": "The region of the Kubernetes cluster"
            },
            "project_id": {
                "type": "string",
                "description": "The project ID of the Kubernetes cluster"
            },
            "isZonal": {
                "type": "boolean",
                "description": "If the cluster is zonal, set this to True, otherwise set this to False"
            }
        },
        "required": [
            "cluster_name",
            "region",
            "project_id",
            "isZonal"
        ]
    }
)

# Function to collect information about the problematic pod for analysis
collect_pod_information_func = FunctionDeclaration(
    name="collect_pod_information",
    description="""
    Collect pod information from the GKE cluster.
    """,
    parameters={
        "type": "object",
        "properties": {
            "namespace_name": {
                "type": "string",
                "description": "The namespace where the pod is located, which is a lowercase RFC 1123 label: it must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc'; the regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?')"
            },
            "pod_name": {
                "type": "string",
                "description": "The name of the pod to look up"
            },
            "kubernetes_context": {
                "type": "string",
                "description": "The kubernetes context to use, defaults to the current context"
            }
        },
        "required": [
            "kubernetes_context"
        ],
    },
)

Secondly, we put these tools together as a toolset and leverage the Function Calling feature of Gemini to determine which tool should be used to achieve the goal.

from vertexai.generative_models import GenerativeModel, Tool

# Group the function declarations into a toolset for the model
gke_cluster_tool = Tool(
    function_declarations=[
        get_credential_func,
        collect_pod_information_func
    ]
)

...

# Set up the model with the toolset and a moderate temperature
model = GenerativeModel(
    "gemini-1.0-pro-002",
    generation_config={"temperature": 0.5},
    tools=[gke_cluster_tool]
)
chat = model.start_chat(response_validation=False)

Lastly, we prepare a prompt that instructs the model on what to do when an alert message arrives.

from typing import Union

from fastapi import FastAPI

# Assumption: the service is a FastAPI app, inferred from the @app.post decorator
app = FastAPI()

@app.post("/alerting")
def analyse_alerting(message: Union[str, dict]) -> dict:

    prompt = """
    You are a Kubernetes expert and highly skilled in all Google Cloud services, Linux, and shell scripts.
    Your task is to troubleshoot the problematic pod as per CONTEXT with the following steps:

    1. Configure the credential and connect to the GKE cluster.
    2. Collect the pod information from the GKE cluster.
    3. Provide a summary of the pod and a concise explanation of the issue, followed by step-by-step solutions to address the issues.

    Only use the information provided, do not make up information.
    CONTEXT:
    {}
    """.format(message["incident"]["resource"]["labels"])

    response = chat.send_message(prompt)
    ...
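The elided remainder of the handler needs to execute whatever function calls the model emits and feed the results back until the model produces a plain-text report. One way this loop might look, assuming handler functions like the get_credential and collect_pod_information sketches shown earlier, and with Part imported from vertexai.generative_models (again, a sketch rather than the original code):

    # Map the declared function names to the Python handlers that do the work
    function_handlers = {
        "get_credential": get_credential,
        "collect_pod_information": collect_pod_information,
    }

    part = response.candidates[0].content.parts[0]
    while part.function_call.name:  # the model is asking us to run a tool
        call = part.function_call
        result = function_handlers[call.name](
            **{key: value for key, value in call.args.items()}
        )
        # Feed the tool output back so the model can decide on the next step
        response = chat.send_message(
            Part.from_function_response(name=call.name, response={"content": result})
        )
        part = response.candidates[0].content.parts[0]

    # No more function calls: the final text is the incident report for Chat
    return {"report": response.text}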

Finally, we can see how the magic happens in Chat when an alert is triggered. Here is what I had:

Summary

To deploy the reference architecture that this document describes, see the Building an intelligent alerting with Gemini & Function calling on Google Cloud GitHub repository.

The foundation model can help you with reasoning and can aggregate a proper solution from the provided functions. Gemini is a very powerful foundation model that you can harness to build far more complex functions than what was built in this demo. I believe an end-to-end automated process can make operations engineers’ lives so much easier.
