Google Cloud Pub/Sub Reliability Guide: Part 2, Subscribing

Kir Titievsky
Google Cloud - Community
4 min read · Oct 19, 2020

I recommend reading through Part 1 before starting on this one. This post is part of a series that will help users of Google Cloud Pub/Sub write reliable applications that use the service. Yes, Pub/Sub is very reliable and highly available. No, it is not perfectly reliable. And that is where you, as an application developer, come in. My hope is that this set of articles will give you the background to design for extreme reliability. The articles are written by the product manager of Cloud Pub/Sub with a lot of help from Kamal Aboul-Hosn and others.

General difference between publisher and subscriber reliability

The job of a subscriber application is to process messages as they arrive and acknowledge them to Cloud Pub/Sub. The progress a subscriber makes in acknowledging messages can be monitored through the size of the backlog (the number of messages not yet acknowledged) and the age of the oldest unacknowledged message, as described in the Cloud Pub/Sub Monitoring Guide. Subscribe-side issues cause at least one of these metrics to grow.
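To make this concrete, here is a minimal streamingPull subscriber written with the Java client library. This is a sketch, not code from the original post: the project and subscription names are placeholders, and the print statement stands in for your application logic.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class BasicSubscriber {
  public static void main(String[] args) {
    // Placeholder project and subscription names.
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");

    MessageReceiver receiver =
        (PubsubMessage message, AckReplyConsumer consumer) -> {
          // "Process" the message, then acknowledge it. Ack only after processing
          // succeeds, so a crash leads to redelivery rather than silent loss.
          System.out.println("Received message " + message.getMessageId());
          consumer.ack();
        };

    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    // Block so the subscriber keeps pulling in the background.
    subscriber.awaitTerminated();
  }
}
```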

Many of the concepts discussed in publishing apply here. Subscribe APIs, including pull and streamingPull, are designed to offer regional isolation of failures, and the sections on general types of API unavailability and retries apply as well. The key difference between subscriber and publisher applications is that the location of message data is determined by the publish operation: for a subscriber in region B to access messages published in region A, Cloud Pub/Sub must be up and available in both regions. Regional endpoints are a particularly useful tool when designing failover strategies.
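For example, a subscriber can be pinned to a regional endpoint so that it pulls through the region where the messages were stored. The sketch below assumes the Java client library's Subscriber.Builder exposes setEndpoint (recent versions do); the project, subscription, and region names are only examples.

```java
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

public class RegionalSubscriber {
  public static void main(String[] args) {
    MessageReceiver receiver = (message, consumer) -> consumer.ack();

    Subscriber subscriber =
        Subscriber.newBuilder(
                ProjectSubscriptionName.of("my-project", "my-subscription"), receiver)
            // Pin the connection to the regional endpoint of the region where the
            // messages are stored; "us-east1" is only an example.
            .setEndpoint("us-east1-pubsub.googleapis.com:443")
            .build();

    subscriber.startAsync().awaitRunning();
    subscriber.awaitTerminated();
  }
}
```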

A broad strategy for recovering from a regional failure is illustrated below.

[Figure: Strategies for recovering from regional failures.]
[Figure: Completing the mitigation of a regional failure.]

Another type of failure to consider is a network partition between two regions. Regional partitions can be detected by monitoring the per-region backlog size and age of the oldest unacknowledged message in addition to the global metrics. In general, alerts on the global age of the oldest unacknowledged message (topic/oldest_unacked_message_age metric) will lead you to investigate regional status (topic/oldest_unacked_message_age_by_region metric), so regional alerts may not be required. The diagrams below illustrate your options in this case.
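As a sketch of how to inspect the per-region metric, the snippet below reads topic/oldest_unacked_message_age_by_region through the Cloud Monitoring API for the last ten minutes, so you can see which region is falling behind. The project ID is a placeholder, and this is illustrative rather than part of the original post.

```java
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class RegionalBacklogCheck {
  public static void main(String[] args) throws Exception {
    String projectId = "my-project"; // placeholder

    try (MetricServiceClient client = MetricServiceClient.create()) {
      long now = System.currentTimeMillis();
      TimeInterval interval =
          TimeInterval.newBuilder()
              .setStartTime(Timestamps.fromMillis(now - 600_000)) // last 10 minutes
              .setEndTime(Timestamps.fromMillis(now))
              .build();

      ListTimeSeriesRequest request =
          ListTimeSeriesRequest.newBuilder()
              .setName(ProjectName.of(projectId).toString())
              .setFilter(
                  "metric.type=\"pubsub.googleapis.com/topic/oldest_unacked_message_age_by_region\"")
              .setInterval(interval)
              .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
              .build();

      for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
        if (series.getPointsCount() == 0) {
          continue;
        }
        // The metric labels include the region; the point value is the age in seconds.
        System.out.println(
            series.getMetric().getLabelsMap()
                + " -> "
                + series.getPoints(0).getValue().getInt64Value()
                + "s");
      }
    }
  }
}
```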

In addition, subscriber applications must take into account the failover or redundancy strategy chosen by the publisher. If the publisher application publishes to two separate topics, the subscriber may either pull from subscriptions on both topics and deduplicate the messages, or subscribe to one topic and fail over to a subscription on the other when problems arise, reprocessing any messages that were acknowledged in the failed subscription but remain unacknowledged in the failover subscription.
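A minimal sketch of the first option, pulling from subscriptions on both topics and deduplicating, is shown below. It assumes the publisher attaches a hypothetical event_id attribute to every message so that copies can be matched, and it uses an in-memory set where a production system would need a shared, expiring store. The project and subscription names are placeholders.

```java
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DualTopicSubscriber {
  // In-memory record of processed event IDs. A production system would use a
  // shared store with expiration instead of an unbounded in-process set.
  private static final Set<String> seen = ConcurrentHashMap.newKeySet();

  public static void main(String[] args) {
    MessageReceiver receiver =
        (PubsubMessage message, AckReplyConsumer consumer) -> {
          // "event_id" is a hypothetical attribute the publisher attaches so that
          // copies of the same event published to both topics can be matched.
          String eventId = message.getAttributesOrDefault("event_id", message.getMessageId());
          if (seen.add(eventId)) {
            System.out.println("Processing " + eventId);
          }
          // Ack in either case: a duplicate carries no new information.
          consumer.ack();
        };

    // Pull from subscriptions attached to both topics and deduplicate in the receiver.
    List<Subscriber> subscribers = new ArrayList<>();
    for (String name : new String[] {"sub-to-topic-a", "sub-to-topic-b"}) {
      Subscriber s =
          Subscriber.newBuilder(ProjectSubscriptionName.of("my-project", name), receiver).build();
      s.startAsync().awaitRunning();
      subscribers.add(s);
    }
    // Block while both subscribers keep pulling in the background.
    subscribers.get(0).awaitTerminated();
  }
}
```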

Stuck messages

It is important to detect and prepare for cases where the service remains accessible but a subset of messages cannot be delivered to the subscriber application. These messages are said to be “stuck,” which shows up as a growing message backlog or a rising age of the oldest unacknowledged message in one or more regions.

Messages can be stuck on clients or in the service. Messages can become stuck on a client if the client requests more messages than it can process and keeps sending modifyAckDeadline requests; during this time, Pub/Sub tries not to send the message to other clients. Client libraries automatically extend the deadline for messages a client instance has received, up to a configurable maximum. That maximum should be tuned based on your goals: too short a deadline may result in many duplicate deliveries, while too long a deadline might leave messages stuck on overloaded machines. The overall limit is configured with setMaxAckExtensionPeriod in the Java client library and with the max_lease_duration property of the FlowControl object in the Python library.
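Below is a sketch of such tuning with the Java client library. The 15-minute maximum extension and the flow control limit of 100 outstanding messages are illustrative values rather than recommendations, and the resource names are placeholders.

```java
import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import org.threeten.bp.Duration;

public class TunedSubscriber {
  public static void main(String[] args) {
    MessageReceiver receiver = (message, consumer) -> consumer.ack();

    Subscriber subscriber =
        Subscriber.newBuilder(
                ProjectSubscriptionName.of("my-project", "my-subscription"), receiver)
            // Stop extending ack deadlines after 15 minutes so that a message stuck
            // on an overloaded client is eventually redelivered to a healthier one.
            .setMaxAckExtensionPeriod(Duration.ofMinutes(15))
            // Limit how many messages this client holds at once, reducing the chance
            // of pulling more than it can process within the deadline.
            .setFlowControlSettings(
                FlowControlSettings.newBuilder().setMaxOutstandingElementCount(100L).build())
            .build();

    subscriber.startAsync().awaitRunning();
    subscriber.awaitTerminated();
  }
}
```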

When it comes to service-side stuckness, unfortunately, there is little that can be done outside the service. Generally, stuckness is rare, temporary, and monitored internally. However, it is always a good idea to submit a support request as soon as you suspect service-side stuckness to ensure it is not missed.

The critical task is to tell apart client-side and service-side issues. You might be able to detect client-side issues by looking for overloaded clients (high RAM, CPU, or network utilization) or for errors logged in downstream requests that result in extended processing times. For example, if processing a message involves a write to a database, repeated database connection errors may signal that you are dealing with a database issue or a hot key. A good general practice is to log all message lifecycle events, such as message receipt and acknowledgement, with the message ID and a timestamp, at DEBUG logging level. Including the message ID lets you connect the different events and will help you zero in on issues quickly when they arise. When your client monitoring is insufficient to rule out client-side issues, restarting the clients ensures that any messages stuck on them are released and redelivered. Of course, if the issue was caused by something specific to a given message, redelivery will land the system in the same state. For this reason, consider using dead-letter topics to limit the number of redeliveries.
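If you go the dead-letter route, the sketch below attaches a dead-letter policy to an existing subscription with the Java admin client. The resource names and the limit of five delivery attempts are placeholders; note that the dead-letter topic must already exist and the Pub/Sub service account needs permission to publish to it, as described in the documentation.

```java
import com.google.cloud.pubsub.v1.SubscriptionAdminClient;
import com.google.protobuf.FieldMask;
import com.google.pubsub.v1.DeadLetterPolicy;
import com.google.pubsub.v1.Subscription;
import com.google.pubsub.v1.UpdateSubscriptionRequest;

public class AttachDeadLetterPolicy {
  public static void main(String[] args) throws Exception {
    // Placeholder resource names.
    String subscription = "projects/my-project/subscriptions/my-subscription";
    String deadLetterTopic = "projects/my-project/topics/my-dead-letter-topic";

    try (SubscriptionAdminClient admin = SubscriptionAdminClient.create()) {
      Subscription updated =
          Subscription.newBuilder()
              .setName(subscription)
              .setDeadLetterPolicy(
                  DeadLetterPolicy.newBuilder()
                      .setDeadLetterTopic(deadLetterTopic)
                      // After five failed deliveries the message is forwarded to the
                      // dead-letter topic instead of being redelivered indefinitely.
                      .setMaxDeliveryAttempts(5)
                      .build())
              .build();

      admin.updateSubscription(
          UpdateSubscriptionRequest.newBuilder()
              .setSubscription(updated)
              .setUpdateMask(FieldMask.newBuilder().addPaths("dead_letter_policy").build())
              .build());
    }
  }
}
```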

Next steps

Take a look at Part 3, which covers administrative operations. And as always, there is much more to be found in the product documentation. Your comments and feedback are read and appreciated.
