Google Cloud Pub/Sub: How to Monitor the Health of Your Subscription for Optimal End-to-End Latency
Pub/Sub offers durable message delivery with high availability and consistent performance at scale. The service is built on a core Google infrastructure component that many Google products, such as Ads, Search, and Gmail, have relied upon for over a decade. While users can trust the underlying Google infrastructure, they can take some additional steps to ensure consistently low end-to-end latency for their messages. This blog post discusses how you can measure the ‘health’ of your Cloud Pub/Sub subscription when it comes to end-to-end latency.
As Pub/Sub SREs, my colleagues and I are responsible for helping our customers with performance issues, among other things. We have observed that subscriptions with consistently low end-to-end latency share some common characteristics:
- Little to no seek requests.
- Little to no nacked messages.
- Little to no expired acks.
- A 99.9th-percentile ack latency of less than 30 seconds.
- Consistently low utilization. For example, a push subscription would have fewer than 1000 outstanding messages at any given time, a pull subscription would have more than 20 RPCs outstanding at any given minute, or a streaming pull subscription would have low utilization per stream.
Subscriptions with these characteristics are considered ‘healthy’ when it comes to consistently low end-to-end latency. It is perfectly reasonable, however, to be temporarily not ‘healthy’, because sometimes you need to perform the actions mentioned above. For example, there are legitimate reasons to issue a seek request, such as recovering from an unexpected client bug or testing, but this action will, unsurprisingly, impact end-to-end latency. Similarly, if you nack a message or let it expire without acknowledging it, Pub/Sub will redeliver it, which again increases latency. Sometimes your client may still want to nack a message because of a rare transient exception. Note that it is generally an anti-pattern to let a message expire, i.e., to neither ack nor nack it. Doing so takes a big toll on latency, because Pub/Sub waits until the ack_deadline_seconds period has elapsed before redelivering the message.
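The ack/nack guidance above can be sketched as a subscriber callback that always settles a message explicitly, so that nothing is left to expire. This is only an illustration under assumptions: the message object is assumed to expose `data`, `ack()`, and `nack()` (as message objects in the Python client library do), and `handle_message` and `process` are hypothetical names introduced for the example.

```python
from typing import Any, Callable


def handle_message(message: Any, process: Callable[[bytes], None]) -> str:
    """Process one message and always ack or nack it.

    Assumes `message` exposes `data`, `ack()` and `nack()`.
    Returns "ack" or "nack" so the caller can count outcomes.
    """
    try:
        process(message.data)
    except Exception:
        # Processing failed: nack explicitly so the message is redelivered,
        # rather than leaving it unsettled until the ack deadline passes.
        # This sketch nacks on any failure; how to treat failures that
        # will never succeed is deliberately left out.
        message.nack()
        return "nack"
    # Processing succeeded: ack so the message is not redelivered.
    message.ack()
    return "ack"
```

Counting the returned outcomes on the client side is one way to notice when nacks stop being rare.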
So if your goal is consistently low end-to-end latency, try to keep your subscription in a ‘healthy’ state as much as possible.
Delivery Latency Health Score
To help users monitor the latency ‘health’ of their subscriptions, we are exporting a new metric via Cloud Monitoring called the Delivery Latency Health Score. The metric is True for a given criterion when the subscription has met that criterion for all of the last 10 minutes. The possible criteria are ack_latency, expired_ack_deadlines, nack_requests, seek_requests, and utilization. To consistently get low end-to-end latency, a subscription should report True most of the time.
You may want to assign an actual score to your subscription based on this metric, such that you get a ‘score’ of 5 when all the criteria are healthy or a score of 3 when two of them are unhealthy. You can easily achieve that by creating a chart in Cloud Monitoring using an MQL query similar to this:
fetch pubsub_subscription
| metric 'pubsub.googleapis.com/subscription/delivery_latency_health_score'
| filter resource.project_id == 'YOUR_PROJECT_ID' && (resource.subscription_id == 'YOUR_SUBSCRIPTION_ID')
| align next_older(1m)
| every 1m
| group_by [resource.subscription_id, resource.project_id], [value_delivery_latency_health_score_count_true: count_true(value.delivery_latency_health_score)]
If you want to monitor individual criteria, add metric.criteria to your group_by clause.
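To make the scoring idea concrete, here is a minimal sketch of the same count done client-side: given the True/False value of each criterion, the score is simply the number that are True. The function names and the dictionary input are assumptions made for illustration; only the five criterion names come from the metric itself.

```python
from typing import List, Mapping

# The five criteria reported by the health score metric.
CRITERIA = (
    "ack_latency",
    "expired_ack_deadlines",
    "nack_requests",
    "seek_requests",
    "utilization",
)


def health_score(state: Mapping[str, bool]) -> int:
    """Return how many of the five criteria are currently True (healthy)."""
    missing = [name for name in CRITERIA if name not in state]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    return sum(1 for name in CRITERIA if state[name])


def unhealthy_criteria(state: Mapping[str, bool]) -> List[str]:
    """Return the names of the criteria that are currently False."""
    return [name for name in CRITERIA if not state.get(name, False)]


if __name__ == "__main__":
    # Hypothetical snapshot: two criteria unhealthy, so the score is 3.
    snapshot = {
        "ack_latency": False,
        "expired_ack_deadlines": True,
        "nack_requests": False,
        "seek_requests": True,
        "utilization": True,
    }
    print(health_score(snapshot), unhealthy_criteria(snapshot))
```

Listing the unhealthy criteria alongside the score mirrors what grouping by the criterion label gives you in a chart: the score says how healthy the subscription is, and the list says where to look.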
Other Helpful Metrics
Furthermore, to dig deeper into individual criteria, users can monitor the following metrics:
We will be exporting the utilization and nack_request_count metrics separately in the near future. Also keep in mind that the utilization criterion is tricky and may appear to be a black box. We will discuss how we calculate it in a separate blog post in the future.
Optionally, you can create an alerting policy that fires when the delivery_latency_health_score goes below a threshold that makes sense for your system. For instance, you may want to be alerted if a subscription has a consistent score of 3 or lower for more than an hour. Or, if you are more interested in a specific criterion, you can set up an alert for that criterion alone. For example, you can create an alert that fires when ack latency is consistently above a certain threshold. High ack latency means that the client is taking an abnormally long time to process a message, which can indicate either a bug or some sort of resource constraint on the client side.
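The condition behind such an alert (a score at or below 3, sustained for an hour) can be sketched as follows. In practice you would express this as a Cloud Monitoring alerting policy; the function below only illustrates the logic, and its name, its parameters, and the assumption of one score sample per minute are all hypothetical.

```python
from typing import Sequence


def should_alert(
    scores_per_minute: Sequence[int],
    threshold: int = 3,
    window_minutes: int = 60,
) -> bool:
    """Return True when every sample in the most recent window is <= threshold.

    Assumes `scores_per_minute` holds one score per minute, oldest first.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if len(scores_per_minute) < window_minutes:
        # Not enough history yet to claim the low score is sustained.
        return False
    recent = scores_per_minute[-window_minutes:]
    return all(score <= threshold for score in recent)
```

Requiring every sample in the window to be low, rather than an average, keeps a brief dip (for example, a one-off seek during testing) from firing the alert.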
If you are interested in learning more about Cloud Pub/Sub reliability, I recommend reading the excellent three-part reliability guide by Dr. Kir Titievsky. To see the additional metrics that Pub/Sub reports to Cloud Monitoring, view the Metrics List in the Cloud Monitoring documentation.