High-resolution user-defined metrics for Google Managed Prometheus with managed collector

Ran Zhang
Google Cloud - Community
6 min read · Nov 17, 2022

This article was published in collaboration with Scott Hsieh, Salem Amazit, Vannick Trinquier, and Ethan Han.

Background

This guide introduces a solution to a limitation of Google Cloud Managed Service for Prometheus (GMP) with managed collector. The solution discussed in this article provides additional flexibility, allowing users to adjust scrape intervals and collect fine-grained metrics at higher resolution.

GMP with managed collector natively supports scrape interval adjustment in PodMonitoring: users can adjust scrape intervals for applications monitored by GMP using the interval parameter in the PodMonitoring resource. However, GMP with managed collector does not allow overriding the global scrape interval, which defaults to 60 seconds.
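
For illustration, the interval knob sits under spec.endpoints in the PodMonitoring spec; the names below are placeholders, not part of any real deployment:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: example-app
spec:
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: metrics
    interval: 30s  # per-target scrape interval; the global 60s default stays fixed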

The default behavior of GMP with managed collector does not support higher-resolution metrics for cAdvisor or other Kubernetes system metrics.

Higher-resolution metrics are critical for monitoring dynamically changing environments. This is the case for mission-critical systems such as:

  • the financial services industry,
  • autoscaling bursty workloads on Kubernetes clusters.

Design

To address this limitation of GMP with managed collectors, high-frequency scraping can be achieved by writing a custom Prometheus exporter and exposing the metrics from the application itself.

Custom Prometheus exporters can be written with the official Prometheus client libraries in supported languages to expose custom metrics in the Prometheus format.
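
Under the hood, an exporter simply serves plain text in the Prometheus exposition format. As a stdlib-only sketch of what a single rendered gauge looks like (the metric name and label here are made up for illustration; real exporters should use the client library rather than hand-formatting strings):

```python
# Illustrative sketch: render one gauge in the Prometheus text exposition
# format (version 0.0.4). Not a replacement for the official client library.

def render_gauge(name, help_text, value, labels=None):
    """Return exposition-format lines for a single gauge sample."""
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )

print(render_gauge(
    "custom_collector_cpu_usage",
    "Current CPU usage.",
    42.5,
    labels={"server_name": "demo"},
))
```

The # HELP and # TYPE comment lines are part of the format; the collector uses them to classify the metric.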

For example:

  • CPU/Memory usage
  • HTTP request success/failure count
  • API invocation latency

Managed Prometheus

Managed collection can scrape these custom metrics at a sampling interval as low as 5 seconds by using PodMonitoring.

Step by Step Guide

Prerequisites

You will need the following prerequisites to start creating your Prometheus custom exporter:

Enable the following APIs:

  • container.googleapis.com (Kubernetes Engine API)

Project Settings

The custom exporter in this guide is written in Python; you may use other languages such as Java or Go to write your own.

Refer to the following project-related parameters and configurations, and change them to match your environment.

Zone: asia-southeast1-a

Project ID: gmp-demo-project

GKE Cluster name: gmp-cluster

GKE Cluster namespace: gmp-test

All commands will be executed in Cloud Shell.

Initialize Project

You may replace each value with your own configuration.

  1. Configure the PROJECT_ID value to gmp-demo-project.

export PROJECT_ID=gmp-demo-project

2. Configure ZONE value to asia-southeast1-a

export ZONE=asia-southeast1-a

3. Configure CLUSTER value to gmp-cluster

export CLUSTER=gmp-cluster

4. Configure NAMESPACE value to gmp-test

export NAMESPACE=gmp-test

Deploy Kubernetes Cluster

  1. Set your project to gmp-demo-project with the environment variable PROJECT_ID.

gcloud config set project $PROJECT_ID

2. Deploy a GKE cluster with Managed Service for Prometheus enabled. The cluster will be deployed in the zone set in the ZONE environment variable.

gcloud beta container clusters create $CLUSTER --num-nodes=3 --zone $ZONE --enable-managed-prometheus

3. Retrieve GKE cluster credentials

gcloud container clusters get-credentials $CLUSTER --zone $ZONE --project $PROJECT_ID

4. Create a namespace in the GKE cluster.

kubectl create ns $NAMESPACE

Package Custom Exporter

We need to create an application, package it as a container image, and store it in Container Registry.

  1. Create a folder named gmp-demo-application.

mkdir gmp-demo-application && cd gmp-demo-application

2. In gmp-demo-application, create a Python file named app.py. This application collects Pod CPU utilization and exposes it through a /metrics endpoint.

#The content of this page is licensed under the Creative Commons Attribution
#4.0 License, and code samples are licensed under the Apache 2.0 License

import logging
import random
import time

import psutil
from flask import Flask
from prometheus_client import Counter, Gauge, generate_latest, Histogram, REGISTRY

logger = logging.getLogger(__name__)

app = Flask(__name__)

CONTENT_TYPE_LATEST = str('text/plain; version=0.0.4; charset=utf-8')

number_of_requests = Counter(
    'number_of_requests',
    'The number of requests, a counter so the value can increase or reset to zero.'
)

custom_collector_cpu_usage = Gauge(
    'custom_collector_cpu_usage',
    'The current value of cpu usage, a gauge so it can go up or down.',
    ['server_name']
)

PYTHON_REQUESTS_COUNTER = Counter("python_requests", "total requests")
PYTHON_FAILED_REQUESTS_COUNTER = Counter("python_failed_requests", "failed requests")
PYTHON_LATENCIES_HISTOGRAM = Histogram(
    "python_request_latency", "request latency by path"
)


@app.route('/metrics', methods=['GET'])
def get_data():
    """Returns all data as plaintext."""
    number_of_requests.inc()
    custom_collector_cpu_usage.labels('custom_collector_cpu_usage').set(int(psutil.cpu_percent() * 10))
    return generate_latest(REGISTRY), 200


@app.route("/")
@PYTHON_LATENCIES_HISTOGRAM.time()
def homepage():
    # count request
    PYTHON_REQUESTS_COUNTER.inc()
    # fail 10% of the time
    if random.randint(0, 100) > 90:
        PYTHON_FAILED_REQUESTS_COUNTER.inc()
        return ("error!", 500)
    else:
        # delay for a bit to vary latency measurement
        random_delay = random.randint(0, 5000) / 1000
        time.sleep(random_delay)
        return "home page"


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
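
Once the app is running locally, you can sanity-check the endpoint with curl localhost:8080/metrics. What the scraper receives is plain text; as a stdlib-only sketch, here is how a gauge value could be pulled out of such a payload (the payload below is hand-written for illustration, not live output):

```python
# Illustrative sketch: extract a gauge value from a Prometheus exposition
# payload, like the one served by the /metrics endpoint above.

SAMPLE_PAYLOAD = """\
# HELP custom_collector_cpu_usage The current value of cpu usage.
# TYPE custom_collector_cpu_usage gauge
custom_collector_cpu_usage{server_name="custom_collector_cpu_usage"} 123.0
"""

def read_gauge(payload, name):
    """Return the first sample value for the given metric name, or None."""
    for line in payload.splitlines():
        if line.startswith(name) and not line.startswith("#"):
            return float(line.rsplit(" ", 1)[-1])
    return None

print(read_gauge(SAMPLE_PAYLOAD, "custom_collector_cpu_usage"))  # -> 123.0
```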

3. Create a Dockerfile with the following content.

#The content of this page is licensed under the Creative Commons Attribution 
#4.0 License, and code samples are licensed under the Apache 2.0 License
FROM python:latest

COPY . .

RUN pip3 install -r requirements.txt

EXPOSE 8080

CMD python3 app.py

4. Create a requirements.txt with the following content.

prometheus_client
psutil
flask

5. Build the Docker image by executing the following command.

docker build -t custom-collector-demo:0.1 .

6. Tag the Docker image.
docker tag custom-collector-demo:0.1 gcr.io/$PROJECT_ID/custom-collector-demo:0.1

7. Push the Docker image to Container Registry (GCR) under your project.

docker push gcr.io/$PROJECT_ID/custom-collector-demo:0.1

Deploy Custom Exporter into GKE

Now we will create a Deployment for the application and deploy a PodMonitoring custom resource (CR).

  1. Create a deployment YAML file named gmp-custom-exporter-deployment.yaml. Remember to change gmp-demo-project to your project ID.

# The content of this page is licensed under the Creative Commons Attribution 4.0
# License, and code samples are licensed under the Apache 2.0 License

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-example
  labels:
    app: flask-example
spec:
  selector:
    matchLabels:
      app: flask-example
  replicas: 3
  template:
    metadata:
      labels:
        app: flask-example
    spec:
      containers:
      - image: gcr.io/gmp-demo-project/custom-collector-demo:0.1
        name: flask-example
        ports:
        - name: metrics
          containerPort: 8080

2. Create a custom resource file named gmp-podmonitoring.yaml. You may adjust the interval value to control the sampling frequency; as of the date of writing, the minimum sampling interval is 5 seconds.

# The content of this page is licensed under the Creative Commons Attribution 4.0
# License, and code samples are licensed under the Apache 2.0 License

apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: flask-example
  labels:
    app.kubernetes.io/name: flask-example
spec:
  selector:
    matchLabels:
      app: flask-example
  endpoints:
  - port: metrics
    interval: 5s

3. Deploy the deployment and PodMonitoring resources.

kubectl -n $NAMESPACE apply -f gmp-custom-exporter-deployment.yaml
kubectl -n $NAMESPACE apply -f gmp-podmonitoring.yaml

4. Optionally, check the Pod deployment status.

kubectl get pods -n $NAMESPACE -w

Visualize in Monitoring Dashboard

In this step, we will retrieve the custom metrics from Metrics Explorer.

  1. Visit the Monitoring page, and select Metrics Explorer.

Monitoring Dashboard

2. Click MQL and type the following query. custom_collector_cpu_usage is the custom metric we defined in app.py. Click RUN QUERY and you will see the collected metrics appear on the dashboard.

fetch prometheus_target
| metric 'prometheus.googleapis.com/custom_collector_cpu_usage/gauge'
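
Note that Metrics Explorer may align the chart at a coarser period than the scrape interval. As an assumption about this setup rather than a requirement, an explicit alignment window can be added to chart at full 5-second resolution:

```
fetch prometheus_target
| metric 'prometheus.googleapis.com/custom_collector_cpu_usage/gauge'
| every 5s
```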

3. You may zoom in on the dashboard to display only 30 seconds of results. The following two diagrams show the CPU utilization scraped at a 5-second interval.

Data Point 1
Data Point 2
