Ensuring Service Continuity: How Circuit Breakers Safeguarded Our Web Push Notifications During the Google Incident

Reza Farzaneh
Insider Engineering
5 min read · Jul 8, 2024

Abstract

In May 2024, an unexpected outage in Google’s push messaging services disrupted our web push notification delivery to users. This blog details how we swiftly identified the issue and leveraged mechanisms like circuit breakers to mitigate the impact. We also share the lessons we learned and the proactive strategies we use to minimize service disruptions, underscoring the importance of an agile response in maintaining customer satisfaction.

Problem

In May 2024, we experienced a prolonged outage caused primarily by a misconfiguration introduced by developers on Google’s push messaging platform. A detailed analysis of the incident revealed that push messaging services were unexpectedly affected during an update to Google’s APIs. This misconfiguration in Google’s infrastructure significantly disrupted the stability of our web push notification service, leading to notable interruptions in reaching our users and to reliability issues.

This configuration error on Google’s part underscores how important it is for developers to exercise caution in API integrations and to anticipate the broader implications of their changes across the system. It became evident that errors in inter-company API usage can affect the entire ecosystem. The situation highlights the need for technology providers to diligently monitor updates and changes and to continually review integration points.

You can see detailed incident updates in Picture 1.

Picture 1 — Push messaging incident details in Google

The incident began at 2024-05-09 07:10 and ended at 2024-05-09 12:02 (all times are US/Pacific).

Incident details link: https://status.firebase.google.com/incidents/it11mtP2rU7xzwsWxkEe

How did we find out the problem?

As noted above, the prolonged outage we experienced in May 2024 was caused by a misconfiguration in Google’s push messaging services. We first noticed the issue when the alarms for our push deliveries turned red, indicating that notifications were not reaching end users. We quickly started investigating and determined that the deliverability issues did not originate from our internal systems; error messages and status updates from Google confirmed a fault in Google’s push messaging APIs.

This discovery highlighted the critical importance of swift intervention by our technical teams. Identifying and pinpointing the root cause underscored how sensitive API integrations are and why we must monitor them continuously to keep our systems and services reliable. The experience also offers valuable lessons for other technology providers facing similar situations.

Additionally, thanks to the circuit breakers between our internal services and Google’s push messaging API, alarms were triggered immediately, as depicted in Picture 2, prompting swift investigation.
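To illustrate the mechanism, here is a minimal circuit breaker sketch in Python. It is not our production implementation; the function name send_web_push, the thresholds, and the exception type are hypothetical, and the point where the circuit opens is where alerting such as the alarms in Picture 2 would be hooked in.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: fail fast after repeated provider errors."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: skip the call instead of waiting on timeouts.
                raise RuntimeError("circuit open: skipping call to push API")
            self.opened_at = None  # half-open: allow one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit; trigger alerting here
            raise

        self.failures = 0  # a success closes the circuit again
        return result
```

A caller would wrap every request to the push API, for example breaker.call(send_web_push, payload), so that repeated provider failures stop the traffic quickly instead of piling up timeouts.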

Furthermore, AWS services such as CloudWatch enabled us to detect the onset of the issue quickly. Alarms set up on AWS Lambda, another service we use, alerted us to prolonged request processing times and to failures to receive successful responses. Based on these findings, we identified the origin of the problem as Google’s end and promptly initiated our resolution and action plans.
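For context, an alarm on Lambda’s built-in Duration metric can be defined with a few lines of boto3. The sketch below is a simplified example rather than our actual alarm configuration; the function name, threshold, and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names: substitute the real Lambda function and SNS topic ARN.
FUNCTION_NAME = "web-push-sender"
ALARM_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:push-delivery-alerts"

cloudwatch.put_metric_alarm(
    AlarmName="web-push-sender-avg-duration",
    Namespace="AWS/Lambda",        # built-in Lambda metrics
    MetricName="Duration",         # reported in milliseconds
    Dimensions=[{"Name": "FunctionName", "Value": FUNCTION_NAME}],
    Statistic="Average",
    Period=60,                     # evaluate one-minute windows
    EvaluationPeriods=3,           # three consecutive breaching windows
    Threshold=5000,                # alarm above 5 seconds on average
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[ALARM_TOPIC_ARN],
)
```

A similar alarm on the Errors metric (with the Sum statistic) catches failed responses directly.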

We can see anomalies in Pictures 2, 3, and 4.

Picture 2 — Circuit Breaker Alerts

As shown in Picture 2, the failed responses from the push API activated the circuit breakers.

Picture 3 — Web Push Sender CloudWatch Alarm

Picture 3 shows the CloudWatch metrics dashboard indicating errors. These anomalies were reported to us as alarms, prompting immediate action. With accurate alarms, timely alerts, and swift intervention, significant damage can be avoided.

Picture 4 — Increased Lambda Function Average Duration

As mentioned earlier, when we investigated the failing resource, the logs indicated that the errors were caused by Google returning failures for some requests. In addition, increased processing times in the AWS Lambda services we use for parts of web push delivery led to failures in delivering web push notifications. As shown in Picture 5, the errors returned by Google confirmed that the incident originated on their end.

Picture 5 — Lambda Function Errors

Expected System Operation

In this section, we provide a brief overview of the Lambda service mentioned in this post. Lambda is AWS’s serverless compute service. For more detailed information, please refer to the following link: AWS Lambda

As shown in Picture 6, under normal conditions the system that sends web push notifications completes its work in very short durations.

Picture 6 — Lambda Function in Different Region

Lambda’s role here is to receive a request along with its payload, forward the request to the push API, and return the response to us.

The primary reason for using Lambda is its ability to scale rapidly during sudden spikes in load (such as t1 events) and reliably process and complete requests.

For example, Lambda handles requests from various APIs, swiftly forwards them to the push API, receives the responses, and continues. As depicted in Picture 4, there is a clear anomaly: the time Lambda spends processing requests is significantly prolonged, which is further evidence of the issue.
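The handler itself can be very small. The sketch below is a simplified Python Lambda handler, written under the assumption that the event carries the push endpoint, an access token, and the payload; the actual event shape and function are not shown in this post.

```python
import json
import urllib.error
import urllib.request


def lambda_handler(event, context):
    # Assumed event shape: {"push_endpoint": ..., "token": ..., "payload": {...}}
    req = urllib.request.Request(
        event["push_endpoint"],
        data=json.dumps(event["payload"]).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {event['token']}",
        },
        method="POST",
    )
    try:
        # Forward the request to the push API and return its response.
        with urllib.request.urlopen(req, timeout=5) as resp:
            return {"statusCode": resp.status, "body": resp.read().decode()}
    except urllib.error.HTTPError as err:
        # Surface provider errors so callers and alarms can see them.
        return {"statusCode": err.code, "body": err.read().decode()}
```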

Our Solutions

As shown in Picture 1, when the circuit breakers kicked in, we still attempted to deliver the messages that needed to go out. During the incident, which lasted 4.5 hours, continuous retry operations could have damaged our various services. However, with the right strategies in place, such as using circuit breakers, setting up and monitoring anomaly alarms properly, and intervening on time once the issues were recognized, we were able to deliver our customers’ web push notifications correctly.
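As an illustration of this fail-fast behaviour, the sketch below shows one way to avoid continuous retries while the circuit is open: park the notification on a delay queue and let a consumer replay it once the provider recovers. The queue URL and function names are hypothetical, and this is not our exact recovery flow.

```python
import json

import boto3

sqs = boto3.client("sqs")
RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/web-push-retry"  # placeholder


def deliver_or_defer(breaker, send_web_push, payload):
    """Send through the circuit breaker; defer the message instead of retrying hard."""
    try:
        return breaker.call(send_web_push, payload)
    except Exception:
        # The provider is failing or the circuit is open: do not hammer the API.
        # Park the notification and let a separate consumer replay it later.
        sqs.send_message(
            QueueUrl=RETRY_QUEUE_URL,
            MessageBody=json.dumps(payload),
            DelaySeconds=300,  # SQS supports delays of up to 900 seconds
        )
        return None
```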

For those interested in understanding the concept of circuit breakers, you can read the blog post at Circuit Breaker Pattern.

Conclusion

The Google push messaging incident we experienced underscored once again how critical the health of our systems and customer experience are. Through proactive measures like circuit breakers, we were able to successfully deliver web push notifications to our users promptly during the outage. These mechanisms helped minimize service disruptions and mitigate potential reputational losses for our customers.

What matters is not just having a robust technology infrastructure but also its flexibility and preparedness to handle such situations. The lessons learned from this incident guide us in responding more swiftly and effectively to similar situations in the future. Customer satisfaction and service reliability will always remain our top priorities, prompting us as technology providers to consistently monitor updates, continually test integration points, and implement improvements when necessary. The experiences gained in this process contribute to enhancing our operations and providing valuable insights to assist other technology providers facing similar challenges.
