OpenShift 4 Monitoring — Exploring Grafana (Part I)

Jane Ho
6 min read · May 12, 2020


The monitoring stack (Prometheus, Alertmanager, Grafana) comes with your OpenShift Container Platform 4 cluster during installation. It is based on the Prometheus open source project and gives you a good overview of your cluster components.

Assuming you’ve successfully deployed your OpenShift 4 cluster, your monitoring stack is working fine and you’re able to browse to the Grafana web UI. Now what?

Let’s have a quick look at the different metrics and dashboards you can explore on the Grafana instance that comes out of the box with your cluster.

Accessing Grafana

Navigate to your Grafana dashboard by retrieving the route using oc get routes -n openshift-monitoring or go to your cluster’s web console and look for ‘Monitoring’ → ‘Dashboards’.

To view the Monitoring tab on the web console or the Grafana dashboard, you need a user with the cluster-admin role, or you can grant a user read access to the openshift-monitoring namespace:

$ oc adm policy add-cluster-role-to-user cluster-monitoring-view <username>

Note: The Grafana analytics platform provides dashboards for analyzing and visualizing the metrics. The Grafana instance that is provided with the monitoring stack, along with its dashboards, is read-only.

To customize a Grafana dashboard you have to deploy a separate instance of Grafana using the community Grafana operator.

The Grafana route looks something like this:

https://grafana-openshift-monitoring.apps.<cluster-id>.<base-domain>

You will first see an empty home dashboard. Fret not! There are a few important metrics you can first look at. For example, you might want to examine the overall health of your cluster.

Click on ‘Home’ at the top left of the screen and you will see a list of default dashboards.

This is the first blog post in a series where we look at each dashboard in turn to understand what each panel and metric means.

etcd Dashboard

The etcd dashboard provides some key metrics on your etcd cluster’s health and performance. You can find the etcd dashboard template here.

Taking each panel in turn from left to right, starting with the first row…

1. etcd leader existence

etcd_server_has_leader

The target of this panel takes the sum of the metric etcd_server_has_leader, which reports 1 when a member has a leader, so it tracks how many etcd members currently see a leader.
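As a rough sketch, the panel’s query boils down to something like the following (the job="etcd" selector is an assumption; the shipped dashboard uses a template variable):

# Number of etcd members that currently report having a leader
sum(etcd_server_has_leader{job="etcd"})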

2. gRPC metrics

grpc_server_started_total and grpc_server_handled_total

gRPC is an open-source Remote Procedure Call framework that is used for high-performance communication between services. etcd exposes its v3 API to clients over gRPC, and these metrics track the RPCs handled by each etcd server.

RPC Rate: grpc_server_started_total gives you the total number of RPCs started on the server.

RPC Failed Rate: grpc_server_handled_total is the total number of RPCs completed on the server, regardless of success or failure. The failed rate counts only the calls that complete with a status code other than OK, i.e. grpc_code!="OK".
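To turn these ever-increasing counters into rates, the panel applies PromQL’s rate() function. A minimal sketch, assuming a 5-minute window and a job="etcd" selector (the shipped dashboard uses template variables and filters on unary RPCs):

# RPC rate: unary RPCs started per second
sum(rate(grpc_server_started_total{job="etcd",grpc_type="unary"}[5m]))

# RPC failed rate: completed unary RPCs whose status code is not OK
sum(rate(grpc_server_handled_total{job="etcd",grpc_type="unary",grpc_code!="OK"}[5m]))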

3. etcd active streams

sum(grpc_server_started_total) - sum(grpc_server_handled_total)

Active streams are given by: sum(grpc_server_started_total) - sum(grpc_server_handled_total)

# Watch streams
sum(grpc_server_started_total{job="$cluster",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="$cluster",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})

# Lease streams
sum(grpc_server_started_total{job="$cluster",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="$cluster",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})

Watches are long-running requests and use gRPC streams to stream event data. A watch stream is bi-directional; the client writes to the stream to establish watches and reads to receive watch events. A single watch stream can multiplex many distinct watches by tagging events with per-watch identifiers. This multiplexing helps reduce the memory footprint and connection overhead on the core etcd cluster.

Leases are a mechanism for detecting client liveness. The cluster grants leases with a time-to-live. A lease expires if the etcd cluster does not receive a keepAlive within a given TTL period.

https://etcd.io/docs/v3.4.0/learning/api/

4. etcd database size

etcd_debugging_mvcc_db_total_size_in_bytes

etcd_debugging_mvcc_db_total_size_in_bytes tracks the total size of the underlying database in bytes.
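Since this is a gauge rather than a counter, the panel can graph it directly per member. A minimal sketch, again assuming a job="etcd" selector:

# Current backend database size per etcd member
etcd_debugging_mvcc_db_total_size_in_bytes{job="etcd"}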

5. Disk write performance

etcd_disk_wal_fsync_duration_seconds_bucket and etcd_disk_backend_commit_duration_seconds_bucket

These metrics describe the latency of etcd’s disk operations, which is important for monitoring etcd performance. A proposal is written to disk and fsynced before a follower acknowledges it to the leader.

WAL fsync: A wal_fsync is called when etcd persists its log entries to disk before applying them.

DB fsync: A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.

High disk operation latencies (etcd_disk_wal_fsync_duration_seconds or etcd_disk_backend_commit_duration_seconds) often indicate disk issues, and may cause high request latency or make the cluster unstable.
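The _bucket suffix means these are Prometheus histograms, so the panel derives a latency percentile with histogram_quantile(). A hedged sketch of a p99 query, assuming a 5-minute window and a job="etcd" selector:

# 99th percentile WAL fsync latency per member
histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))

# 99th percentile backend commit latency per member
histogram_quantile(0.99, sum(rate(etcd_disk_backend_commit_duration_seconds_bucket{job="etcd"}[5m])) by (instance, le))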

6. etcd memory

process_resident_memory_bytes

process_resident_memory_bytes is the resident memory size in bytes. That is, the number of memory pages the process has in real memory multiplied by the page size, excluding swapped-out memory pages.

https://povilasv.me/prometheus-go-metrics/

7. etcd network

etcd_network_client_grpc_received_bytes_total and etcd_network_client_grpc_sent_bytes_total

Client traffic in: etcd_network_client_grpc_received_bytes_total gives the total number of bytes received from gRPC clients.

Client traffic out: etcd_network_client_grpc_sent_bytes_total gives the total number of bytes sent to gRPC clients.

etcd_network_peer_received_bytes_total and etcd_network_peer_sent_bytes_total

etcd_network_peer_received_bytes_total counts the total number of bytes received from a specific peer. Usually follower members receive data only from the leader member.

etcd_network_peer_sent_bytes_total counts the total number of bytes sent to a specific peer. Usually the leader member sends more data than other members since it is responsible for transmitting replicated data.
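As these are byte counters, the panels graph their per-second rate. A minimal sketch, assuming a 5-minute window and a job="etcd" selector:

# Client traffic in/out, per second
rate(etcd_network_client_grpc_received_bytes_total{job="etcd"}[5m])
rate(etcd_network_client_grpc_sent_bytes_total{job="etcd"}[5m])

# Peer traffic in/out, per second, per member
sum(rate(etcd_network_peer_received_bytes_total{job="etcd"}[5m])) by (instance)
sum(rate(etcd_network_peer_sent_bytes_total{job="etcd"}[5m])) by (instance)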

https://github.com/etcd-io/etcd/blob/v3.2.17/Documentation/metrics.md#network

8. etcd server metrics

etcd_server_proposals_failed_total, etcd_server_proposals_pending, etcd_server_proposals_committed_total and etcd_server_proposals_applied_total

Write and configuration changes sent to etcd are called proposals. The four metrics etcd_server_proposals_committed_total, etcd_server_proposals_applied_total, etcd_server_proposals_pending and etcd_server_proposals_failed_total track the proposals sent to etcd.

The Raft consensus protocol ensures that once a proposal is committed by a quorum of members, it is applied consistently across the cluster.

proposals_committed_total records the total number of consensus proposals committed. This gauge should increase over time if the cluster is healthy. Several healthy members of an etcd cluster may have different total committed proposals at once. This discrepancy may be due to recovering from peers after starting, lagging behind the leader, or being the leader and therefore having the most commits. It is important to monitor this metric across all the members in the cluster; a consistently large lag between a single member and its leader indicates that member is slow or unhealthy.

proposals_applied_total records the total number of consensus proposals applied. The etcd server applies every committed proposal asynchronously. The difference between proposals_committed_total and proposals_applied_total should usually be small (within a few thousand even under high load). If the difference between them continues to rise, it indicates that the etcd server is overloaded. This might happen when applying expensive queries like heavy range queries or large txn operations.

proposals_pending indicates how many proposals are queued to commit. Rising pending proposals suggests there is a high client load or the member cannot commit proposals.

Failed proposals (proposals_failed_total) are normally related to two issues: temporary failures around a leader election, or longer downtime caused by a loss of quorum in the cluster.
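Put together, the panel combines the pending gauge with per-second rates of the other proposal metrics. A hedged sketch, assuming a job="etcd" selector and a 15-minute window:

# Proposals queued to commit right now
sum(etcd_server_proposals_pending{job="etcd"})

# Proposal commit and apply rates
sum(rate(etcd_server_proposals_committed_total{job="etcd"}[15m]))
sum(rate(etcd_server_proposals_applied_total{job="etcd"}[15m]))

# Proposal failure rate
sum(rate(etcd_server_proposals_failed_total{job="etcd"}[15m]))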

etcd_server_leader_changes_seen_total

leader_changes_seen_total counts the number of leader changes the member has seen since its start. Rapid leadership changes impact the performance of etcd significantly and signal that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.
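Because this is a per-member counter, a useful view is how many leader changes occurred over a recent window. A minimal hedged sketch, assuming a job="etcd" selector and a one-hour window:

# Leader changes observed per member over the last hour
increase(etcd_server_leader_changes_seen_total{job="etcd"}[1h])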

https://github.com/etcd-io/etcd/blob/v3.2.17/Documentation/metrics.md#server

That is all for Part I of this series exploring the Grafana dashboards that come out of the box with your OpenShift 4 cluster. Stay tuned for the next post, where we will take a closer look at the other dashboards!
