The monitoring stack (Prometheus, Alertmanager, Grafana) comes with your OpenShift Container Platform 4 cluster during installation. It is based on the Prometheus open source project and gives you a good overview of your cluster components.
Assuming you’ve successfully deployed your OpenShift 4 cluster, your monitoring stack is working fine and you’re able to browse to the Grafana web UI. Now what?
Let’s have a quick look at the different metrics and dashboards you can explore in the Grafana instance that comes out of the box with your cluster.
Accessing Grafana
Navigate to your Grafana dashboard by retrieving the route with oc get routes -n openshift-monitoring, or go to your cluster’s web console and look for ‘Monitoring’ → ‘Dashboards’.
To view the Monitoring tab on the web console or the Grafana dashboard, you need a user with the cluster-admin role. Alternatively, you can grant a user read access to the monitoring stack in the openshift-monitoring namespace:
$ oc adm policy add-cluster-role-to-user cluster-monitoring-view <username>
Note: The Grafana analytics platform provides dashboards for analyzing and visualizing the metrics. The Grafana instance that is provided with the monitoring stack, along with its dashboards, is read-only.
To customize a Grafana dashboard you have to deploy a separate instance of Grafana using the community Grafana operator.
You will first see an empty home dashboard. Fret not! There are a few important metrics you can first look at. For example, you might want to examine the overall health of your cluster.
Click on ‘Home’ at the top left of the screen and you will see a list of default dashboards.
This is the first blog post in a series where we look at each dashboard in turn to understand what each panel and metric means.
etcd Dashboard
The etcd dashboard provides some key metrics on your etcd cluster’s health and performance. You can find the etcd dashboard template here.
Taking each panel in turn from left to right, starting with the first row…
1. etcd leader existence
The target of this panel takes the sum of the metric etcd_server_has_leader. The metric is 1 when a member has a leader and 0 when it does not, so the sum tracks how many etcd members currently see a leader.
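In PromQL, the panel’s target can be sketched as follows (the job label value is an assumption; check the query behind the panel in your own dashboard):

```promql
# Number of etcd members that currently report having a leader
sum(etcd_server_has_leader{job="etcd"})

# A member reporting 0 has lost its leader; watching for this
# catches quorum trouble early
etcd_server_has_leader == 0
```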
2. gRPC metrics
gRPC is an open-source Remote Procedure Call framework that is used for high-performance communication between services. etcd uses gRPC to communicate between each of the nodes in the cluster.
RPC Rate: grpc_server_started_total gives you the total number of RPCs started on the server.
RPC Failed Rate: grpc_server_handled_total is the total number of RPCs completed on the server, regardless of success or failure. The failed rate is measured by counting only the calls that complete with a status code other than OK, i.e. with the label filter grpc_code!="OK".
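Both panels plot rates rather than raw counters. A sketch of the underlying queries, assuming the default etcd job label:

```promql
# RPCs started per second
sum(rate(grpc_server_started_total{job="etcd"}[5m]))

# RPCs completing with a non-OK status, per second
sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m]))
```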
3. etcd active streams
Active streams are given by: sum(grpc_server_started_total) - sum(grpc_server_handled_total)

# Watch streams
sum(grpc_server_started_total{job="$cluster",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="$cluster",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"})

# Lease streams
sum(grpc_server_started_total{job="$cluster",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"}) - sum(grpc_server_handled_total{job="$cluster",grpc_service="etcdserverpb.Lease",grpc_type="bidi_stream"})
Watches are long-running requests and use gRPC streams to stream event data. A watch stream is bi-directional; the client writes to the stream to establish watches and reads to receive watch events. A single watch stream can multiplex many distinct watches by tagging events with per-watch identifiers. This multiplexing helps reduce the memory footprint and connection overhead on the core etcd cluster.
Leases are a mechanism for detecting client liveness. The cluster grants leases with a time-to-live. A lease expires if the etcd cluster does not receive a keepAlive within a given TTL period.
https://etcd.io/docs/v3.4.0/learning/api/
4. etcd database size
etcd_debugging_mvcc_db_total_size_in_bytes tracks the total size of the underlying database in bytes.
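The raw byte value is easier to reason about when scaled or rated; a sketch of ad-hoc queries you might run alongside this panel:

```promql
# Database size in MiB
etcd_debugging_mvcc_db_total_size_in_bytes / 1024 / 1024

# Growth over the last hour, in bytes; steady growth toward the
# backend quota (2 GB by default) is worth investigating
delta(etcd_debugging_mvcc_db_total_size_in_bytes[1h])
```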
5. Disk write performance
These metrics describe the status of disk operations, which is important for monitoring etcd performance: proposals are written to disk and fsync-ed before followers can acknowledge a proposal from the leader.
WAL fsync: A wal_fsync is called when etcd persists its log entries to disk before applying them.
DB fsync: A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.
High disk-operation latencies (etcd_disk_wal_fsync_duration_seconds or etcd_disk_backend_commit_duration_seconds) often indicate disk issues and may cause high request latency or make the cluster unstable.
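Both metrics are histograms, so the panel typically plots a high quantile. A sketch, using the commonly cited rules of thumb that p99 wal_fsync should stay below roughly 10 ms and p99 backend_commit below roughly 25 ms on healthy disks (treat these thresholds as guidance, not hard limits):

```promql
# 99th percentile WAL fsync latency
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# 99th percentile backend commit latency
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))
```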
6. etcd memory
process_resident_memory_bytes is the resident memory size in bytes: the number of memory pages the process has in real memory multiplied by the page size, excluding swapped-out pages.
https://povilasv.me/prometheus-go-metrics/
7. etcd network
Client traffic in: etcd_network_client_grpc_received_bytes_total gives the total number of bytes received from gRPC clients.
Client traffic out: etcd_network_client_grpc_sent_bytes_total gives the total number of bytes sent to gRPC clients.
etcd_network_peer_received_bytes_total counts the total number of bytes received from a specific peer. Usually follower members receive data only from the leader member.
etcd_network_peer_sent_bytes_total counts the total number of bytes sent to a specific peer. Usually the leader member sends more data than the other members, since it is responsible for transmitting replicated data.
https://github.com/etcd-io/etcd/blob/v3.2.17/Documentation/metrics.md#network
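As with the gRPC panels, these counters are usually graphed as rates; a sketch of per-member throughput queries:

```promql
# Client traffic in and out, bytes per second, per etcd member
rate(etcd_network_client_grpc_received_bytes_total[5m])
rate(etcd_network_client_grpc_sent_bytes_total[5m])

# Peer traffic broken out by destination peer ("To" is the label
# name etcd attaches to this metric)
sum by (To) (rate(etcd_network_peer_sent_bytes_total[5m]))
```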
8. etcd server metrics
Write and configuration changes sent to etcd are called proposals. The four metrics etcd_server_proposals_committed_total, etcd_server_proposals_applied_total, etcd_server_proposals_pending and etcd_server_proposals_failed_total track the proposals sent to etcd.
The raft protocol ensures that proposals sent to etcd are applied successfully.
proposals_committed_total records the total number of consensus proposals committed. This gauge should increase over time if the cluster is healthy. Several healthy members of an etcd cluster may have different total committed proposals at once. This discrepancy may be due to recovering from peers after starting, lagging behind the leader, or being the leader and therefore having the most commits. It is important to monitor this metric across all the members in the cluster; a consistently large lag between a single member and its leader indicates that member is slow or unhealthy.
proposals_applied_total records the total number of consensus proposals applied. The etcd server applies every committed proposal asynchronously. The difference between proposals_committed_total and proposals_applied_total should usually be small (within a few thousand, even under high load). If the difference continues to rise, the etcd server is overloaded. This might happen when applying expensive queries like heavy range queries or large txn operations.
proposals_pending indicates how many proposals are queued to commit. Rising pending proposals suggest high client load or that the member cannot commit proposals.
proposals_failed_total counts failed proposals, which are normally related to two issues: temporary failures around a leader election, or longer downtime caused by a loss of quorum in the cluster.
leader_changes_seen_total counts the number of leader changes the member has seen since its start. Rapid leadership changes significantly impact etcd’s performance. They also signal that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.
https://github.com/etcd-io/etcd/blob/v3.2.17/Documentation/metrics.md#server
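These server counters map directly onto alert-style queries; an illustrative sketch (the thresholds here are conventional choices, not mandated values):

```promql
# Proposals failing at a sustained rate
rate(etcd_server_proposals_failed_total[15m]) > 5

# Committed-but-not-applied backlog; a steadily rising gap
# suggests an overloaded server
etcd_server_proposals_committed_total - etcd_server_proposals_applied_total

# Frequent leader churn over the last hour
increase(etcd_server_leader_changes_seen_total[1h]) > 3
```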
That is all for Part I of this series exploring the Grafana dashboards that come out of the box with your OpenShift 4 cluster. Stay tuned for the next post, where we will take a closer look at the other dashboards!