Use Google Managed Prometheus and Triton Inference Server on GKE to Simplify LLM Observability and Monitoring

Rick (Rugui) Chen
Google Cloud - Community
8 min read · Mar 13, 2024

Background

Large Language Models (LLMs) are revolutionizing various industries, from chatbots to content creation. However, their deployment and maintenance demand careful observability, including monitoring metrics and tracing. You need insights into critical metrics such as CPU/GPU usage, latency, throughput, error rates, and resource utilization to ensure optimal performance and catch potential issues early.

Traditional monitoring setups can be complex, especially in a Kubernetes environment. Popular tools like a self-managed local Prometheus offer powerful metrics collection but add operational overhead and lack long-term persistence and centralized views. On the serving side, inference servers like Hugging Face TGI provide entry-level model serving but may not expose the same depth of performance metrics as NVIDIA's native Triton Inference Server.

Introduction

Here's where a powerful combination comes into play to simplify LLM inference monitoring: Google Managed Prometheus (GMP), Triton Inference Server model metrics, and Google Kubernetes Engine (GKE). GMP managed metric collection eliminates the hassle of operating your own Prometheus instance locally. Triton Inference Server provides optimized model serving through various backend frameworks such as vLLM, TensorRT, PyTorch, ONNX, and TensorFlow, and GKE offers a scalable Kubernetes platform to deploy AI/ML training and inference workloads. More importantly, Triton Inference Server includes many LLM enterprise operations tools, such as the model analyzer, model management, rate limiting, and tracing.

In this blog, we'll dive into how to integrate these technologies to establish a simplified yet robust LLM monitoring solution. You'll learn how to capture essential metrics, build insightful dashboards, and proactively manage your LLM deployments.

Prerequisites

Access to a Google Cloud project with L4 GPUs available and enough quota in the region you select.

A computer terminal with kubectl and the Google Cloud SDK installed. From the GCP project console you’ll be working with, you may want to use the included Cloud Shell as it already has the required tools installed.

Some models such as Llama 2 need a Hugging Face API token to download model files.

Meta access request: https://ai.meta.com/resources/models-and-libraries/llama-downloads/ (you need to register an email address to download).

Go to Hugging Face and create an account with the same email address registered in the Meta request. Then find the Llama 2 model and fill out the access request: https://huggingface.co/meta-llama/Llama-2-7b. You may need to wait a few hours for the approval email before you can use Llama 2.

Get a Hugging Face access token from your Hugging Face account profile settings; you will need it in the next steps.

1 Set up the GKE cluster and environment

To simplify the setup process, please download the sample repo from:

git clone https://github.com/llm-on-gke/triton-vllm-gke
cd triton-vllm-gke

Take a look at the create-cluster.sh bootstrap script that provisions the GKE cluster. It covers the following steps (a rough sketch of the corresponding commands follows the list):

  1. GKE Standard cluster
  2. GKE Spot node pool with 1 L4 GPU accelerator (g2-standard-8)
  3. IAM permissions
  4. GKE secret to store the Hugging Face access token
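
Here is an illustrative sketch of the kind of commands such a bootstrap script runs; the actual create-cluster.sh in the repo is authoritative, and the namespace and secret names below are placeholders:

# Illustrative sketch only; rely on the repo's create-cluster.sh for the real commands.
# Assumes PROJECT_ID and HF_TOKEN are exported as shown in the next step.

# 1. GKE Standard cluster (name and region match the get-credentials command used later)
gcloud container clusters create triton-inference \
  --project $PROJECT_ID --region us-central1 \
  --workload-pool=$PROJECT_ID.svc.id.goog

# 2. Spot node pool with one L4 GPU on g2-standard-8
gcloud container node-pools create g2-gpu-pool \
  --cluster triton-inference --region us-central1 \
  --machine-type g2-standard-8 \
  --accelerator type=nvidia-l4,count=1,gpu-driver-version=latest \
  --spot --num-nodes 1

# 3. IAM permissions (for example, read access to the model bucket) are granted here in the real script.

# 4. Kubernetes secret holding the Hugging Face token (namespace and secret names are illustrative)
kubectl create namespace triton
kubectl create secret generic huggingface -n triton --from-literal=HF_TOKEN=$HF_TOKEN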

Update the bootstrap script to fit your needs, especially the first two lines:

export PROJECT_ID=<your-project-id>
export HF_TOKEN=<paste-your-own-token>

Then run the bootstrap script to provision the base environment:

chmod +x create-cluster.sh
./create-cluster.sh

Wait 5–10 minutes until the script completes successfully, then execute the following to connect to the GKE cluster (update the command with the proper location and cluster name):

gcloud container clusters get-credentials triton-inference --location us-central1
kubectl get ns
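
Once connected, you can quickly confirm the GPU node pool; GKE labels GPU nodes with cloud.google.com/gke-accelerator, which should show nvidia-l4 here (if the Spot node pool scales from zero, the GPU node may only appear after the inference workload is scheduled):

kubectl get nodes -L cloud.google.com/gke-accelerator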

Note that GMP and managed collection are already enabled on GKE Standard clusters running GKE version 1.27 or greater.
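
You can double-check that managed collection is active by looking for the GMP OperatorConfig resource in the gmp-public namespace (the same resource edited in step 5 below):

kubectl get operatorconfig -n gmp-public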

2 Upload the Triton vLLM backend model config settings to a Cloud Storage bucket

Before we can deploy any open-source LLM, we need to prepare the backend Python code and config settings so that they are available when the model is deployed through Triton Inference Server.

Look at the sample folder structure under the model_repository sub-folder.

There are two files included: model.py, which is pure Python code that loads vLLM-backed models from Hugging Face, and config.pbtxt, which specifies the inferencing options.

You don't need to make any updates to these files since they apply to most vLLM-backend LLM models (Llama 2, Mistral, Falcon, GPT, etc.). Just run the following commands to upload the sample model repository to a Cloud Storage bucket (replace your-bucket-name):

gsutil mb gs://your-bucket-name
gsutil cp -r model_repository gs://your-bucket-name/model_repository
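
To double-check the upload, you can list the bucket contents. Triton generally expects a versioned layout of <model_name>/config.pbtxt plus <model_name>/1/model.py; the model folder name below is a placeholder for whatever the repo's model_repository actually contains:

# Verify the model repository layout in the bucket
gsutil ls -r gs://your-bucket-name/model_repository
# Expect config.pbtxt at the model level and model.py under a numeric version folder, e.g.
#   .../model_repository/<model_name>/config.pbtxt
#   .../model_repository/<model_name>/1/model.py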

3 Deploy the Llama 2 7B model to Triton Inference Server

Now we are ready to deploy an open-source LLM to Triton Inference Server through the vLLM backend.

As an example, take a look at the downloaded vllm-gke-deploy.yaml and update the following:

env:
- name: model_name
  value: meta-llama/Llama-2-7b-chat-hf
args: ["tritonserver", "--model-store=gs://your-bucket-name/model_repository",

After the updates, you may run the following command to deploy the Llama 2 7B chat model (or another open-source LLM) to Triton Inference Server:

kubectl -n triton apply -f vllm-gke-deploy.yaml

The deployment completes after 7–10 minutes, and you may check its status:

watch kubectl get po -n triton 

This shows the deployment is ready. You may also check the GKE workload console and the pod's logs for indications of a healthy state on the different ports (8001 for gRPC, 8000 for HTTP, 8002 for metrics).
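
You can also confirm the pods and ports from the command line. The Service name below is an assumption based on the TRITON_INFERENCE_SERVER_SERVICE_HOST environment variable used by the client in the next step; adjust it to whatever your manifest actually defines:

kubectl -n triton get pods
# The Service should expose 8000 (HTTP), 8001 (gRPC) and 8002 (metrics)
kubectl -n triton get svc triton-inference-server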

4 Client-side validation and metrics retrieval

A sample client is provided here to illustrate how you can run client-side validations.

Go to the downloaded client sub-folder:

cd client

You can check the cloudbuild.yaml file and replace gke-llm with your own Artifact Registry repo name:

steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', 'us-east1-docker.pkg.dev/$PROJECT_ID/gke-llm/triton-client:latest', '.' ]
images:
- 'us-east1-docker.pkg.dev/$PROJECT_ID/gke-llm/triton-client:latest'
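
If the Artifact Registry repo referenced in cloudbuild.yaml does not exist yet, you can create it first (the repo name and location here simply match the sample above; adjust them to your own):

# Create the Docker repo referenced in cloudbuild.yaml (skip if it already exists)
gcloud artifacts repositories create gke-llm \
  --repository-format=docker --location=us-east1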

Then run the following command to build your own client image:

gcloud builds submit . 

Once the image is pushed to the Artifact Registry repo, run the following commands to kick off the Triton client app (update the project ID and the repo where the client image is stored):

kubectl run -it -n triton --image us-east1-docker.pkg.dev/your-project/gke-llm/triton-client triton-client
kubectl exec -it -n triton triton-client -- bash

Ignore any timeout error from the first command. Once you are inside the client Pod prompt, you may run a few validation commands:

curl $TRITON_INFERENCE_SERVER_SERVICE_HOST:8000/v2
curl $TRITON_INFERENCE_SERVER_SERVICE_HOST:8002/metrics
python grpc-client.py

Make sure all of the commands complete successfully.
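
Beyond these checks, Triton also exposes the standard KServe v2 health endpoints on the HTTP port, which are handy for scripted readiness checks; replace <model_name> with the model folder name from your model repository:

# Each endpoint should return HTTP 200 once the server and model are ready
curl -s -o /dev/null -w "%{http_code}\n" $TRITON_INFERENCE_SERVER_SERVICE_HOST:8000/v2/health/live
curl -s -o /dev/null -w "%{http_code}\n" $TRITON_INFERENCE_SERVER_SERVICE_HOST:8000/v2/health/ready
curl -s -o /dev/null -w "%{http_code}\n" $TRITON_INFERENCE_SERVER_SERVICE_HOST:8000/v2/models/<model_name>/ready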

5 Configure GMP managed collection to target Triton Inference Server metrics

In the previous validation steps, the following command returned the CPU/GPU-related performance metric values, but only as an on-demand query.

curl $TRITON_INFERENCE_SERVER_SERVICE_HOST:8002/metrics
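
The output is in Prometheus exposition format. The exact series depend on your models and GPUs, but you should see Triton's standard metric families, for example:

# Filter to the Triton/NVIDIA metric families (names below are representative)
curl -s $TRITON_INFERENCE_SERVER_SERVICE_HOST:8002/metrics | grep "^nv_"
# Expect counters and gauges such as:
#   nv_inference_request_success      cumulative successful inference requests
#   nv_inference_count                cumulative number of inferences
#   nv_inference_request_duration_us  cumulative request latency in microseconds
#   nv_gpu_utilization                GPU utilization (0.0 to 1.0)
#   nv_gpu_memory_used_bytes          GPU memory currently in use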

Now it is time to set up a PodMonitoring resource in GMP to collect the metrics exposed on the Triton Inference Server metrics port and pass them to the Cloud Monitoring system.

Run the following command to update the GMP settings:

kubectl edit OperatorConfig -n gmp-public

Within the vi-based Kubernetes resource editor, insert (press the i key) the following section right above the line starting with metadata, and make sure features and metadata are both left-aligned:

features:
  targetStatus:
    enabled: true

To save and exit the vi editor, press Esc and then type :x (or :wq).
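
After the edit, the OperatorConfig should look roughly like this (other fields omitted; the resource is named config and lives in the gmp-public namespace):

apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
features:
  targetStatus:
    enabled: true
metadata:
  name: config
  namespace: gmp-public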

Take a look at vllm-podmonitoring.yaml, which creates a sample PodMonitoring resource. It points to the Triton Inference Server pods hosting the LLM model and collects metrics from the metrics port (8002) every 10 seconds:

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: triton-inference
spec:
  selector:
    matchLabels:
      app: triton-inference-server
  endpoints:
  - port: metrics
    interval: 10s

Then run the following command to create the PodMonitoring resource and target the Triton Inference Server metrics port:

kubectl -n triton apply -f vllm-podmonitoring.yaml

You can use the following command to make sure it has started to collect Triton Inference Server metrics; with targetStatus enabled, the Status section of the output should report the scraped endpoints and their health:

kubectl -n triton describe podmonitoring triton-inference 

6 PromQL queries and visualization of GMP metrics collected from Triton Inference Server

As a managed Kubernetes platform, GKE already provides out-of-the-box GPU-related dashboards in the console under the Kubernetes Engine > Clusters > Observability tab.

In addition, the GPU- and NVIDIA-specific metrics from Triton Inference Server are automatically collected through GMP and stored in the centralized Google Cloud Monitoring system for long-term persistence, ready to be queried and visualized in multiple ways.

First, you can run PromQL queries in Cloud Monitoring through Metrics Explorer, or create dashboards from the metrics provided by Triton Inference Server (they start with nv_). Make sure you switch to PromQL on the right side of the Monitoring > Metrics Explorer screen and start typing a metric name beginning with nv_ in the query box; a list of Triton Inference Server metrics will appear in the dropdown list.
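
As a starting point, here are a few example queries; the metric names come from Triton's standard metric set, so adjust them to whatever appears in your dropdown:

# GPU utilization reported by Triton (0.0 to 1.0)
nv_gpu_utilization

# Successful inference requests per second over the last 5 minutes
rate(nv_inference_request_success[5m])

# Approximate average time per inference, in microseconds
rate(nv_inference_request_duration_us[5m]) / rate(nv_inference_count[5m])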

Alternatively, you can use open-source tools such as Grafana or the Prometheus UI to visualize the metrics collected by GMP and build your own monitoring dashboards, since GMP is immediately available as a data source to plug into Grafana or the Prometheus UI. Please follow the links provided for details.

Conclusion

In this blog, we walked through the steps and demonstrated the benefits of combining GKE AI/ML infrastructure with Google Managed Prometheus and Triton Inference Server, empowering enterprise LLMOps and platform teams with performance and monitoring tools that can be applied to production environments to effectively manage and monitor LLM inference workload observability.

Don't forget to check out the other GKE-related AI/ML infrastructure resources offered by Google Cloud, including those in the AI/ML orchestration on GKE documentation.

For your reference, the code snippets listed in this blog can be found in this source code repo.
