Is Google Cloud down? Enter the Personalized Service Health Dashboard

Alistair Grew

Published in

Qodea Google Cloud Tech Blog

6 min read · Dec 12, 2023

Exploration of the new Personalized Service Health Dashboard and API

Source: https://as2.ftcdn.net/v2/jpg/05/76/26/87/1000_F_576268759_QNYLzMgQV9AQigONK9FYI3KdNsnvTfBM.jpg

Introduction

Clouds sometimes have outages, and Google is unsurprisingly no exception. The pace of innovation and, no doubt, the challenges of scale will inevitably conspire to produce the occasional service blip. I firmly believe in architecting all systems, including those in the cloud, for resilience using the best practices of the particular vendor. Systems do, however, sometimes fail in unexpected ways. When these systems are cloud-based, one of the first questions usually uttered is:

Is this us or ‘them’?

Here ‘them’ is, of course, your preferred provider, which in my case is Google Cloud.

Existing Efforts

Working in Google Cloud and for an accredited Google Cloud MSP, I am unsurprisingly very interested in any Google service incidents. To track these internally, we currently scrape the Atom feed and post each update to a Google Chat space monitored by our team, who are ready to respond and notify customers if required.
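For readers who want to try something similar, here is a minimal sketch of that feed-to-chat idea. It is not our production code: the function names are hypothetical, the feed is assumed to be standard Atom (title, link and updated per entry), and the webhook is assumed to be a Google Chat incoming webhook that accepts a simple JSON body with a "text" field.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Standard Atom namespace; the feed itself is assumed to be plain Atom.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def parse_incidents(atom_xml: str) -> list[dict]:
    """Extract title, link and updated timestamp from each Atom entry."""
    root = ET.fromstring(atom_xml)
    incidents = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        incidents.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "link": link.get("href", "") if link is not None else "",
        })
    return incidents


def to_chat_message(incident: dict) -> bytes:
    """Build the JSON body for a simple text message."""
    text = f"{incident['title']}\nUpdated: {incident['updated']}\n{incident['link']}"
    return json.dumps({"text": text}).encode("utf-8")


def post_to_chat(webhook_url: str, body: bytes) -> None:
    """POST the message; webhook_url is whatever your Chat space provides."""
    request = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json; charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

In practice you would also want to remember which entries have already been posted, so the space is not flooded with repeats on every poll.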

Step in ‘Personalized Service Health’

Whilst I am normally pretty good at keeping up with Google Cloud releases, the release of Service Health either passed me by or was quietly launched into preview. The first I heard of it was a helpful email from one of the Google Customer Engineers I work with stating:

Starting on 12 December 2023, Google Cloud is evolving how, when and where we post incident information. Going forward, Google Cloud will post to the public Cloud Service Health (CSH) dashboard as a default channel only for incidents that meet the following criteria:
High Scope — The incident has global impact or is affecting a significant percentage of customer projects across one or more regions.
High Severity — One or more products or services are unavailable or severely degraded.
Exceptions to this criteria may be made based on whether a product is onboarded to Personalized Service Health (PSH). If an incident affects multiple Cloud products, and any one of the products is not onboarded to PSH, it will be externalized to CSH for all of the affected products.
Other incidents will continue to be posted on the Personalized Service Health (PSH) as well as non-dashboard channels including support cases, the Google Customer Care Portal and the banner in Cloud Console.
CSH will remain a fallback for scenarios where posting to PSH is not possible.
As part of this change, it’s recommended that customers onboard their projects onto PSH prior to 12 December 2023.

Within about 2 minutes of reading the email I had enabled the API on a project and was having a poke around.

Before continuing I want to state that the Personalized Service Health Dashboard and API are currently in preview, so anything below is very much subject to change.

Dashboarding Heaven?

Source: Screenshot of Service Health Dashboard in one of my projects.

My initial thoughts on the new dashboard: quite simply, I like it. The first thing that stands out to me is how similar the design language is to the Cloud Monitoring suite, especially the alerting and uptime sections.

I like the fact that you can filter on several different properties to narrow your search when assessing impact. Looking at the properties in more detail, you see the familiar ‘event state’, title, impacted products and locations, and incident start and update times. Perhaps the most useful new field, though, is ‘Relevance’, for which Google provides the following explanation:

Relevance describes how an event may impact this project, based on whether related services or locations are currently known to be affected.
Impacted
The incident is verified to be impacting your project.
Related
The incident has a direct connection with your project and impacts a Google Cloud product in a location that your project uses.
Partially related
The incident is associated with a Google Cloud product that your project uses, but the incident may not be impacting your project. For example, the incident may be impacting a Google Cloud product that your project uses, but in a location that your project does not use.
Unknown
The impact of the incident on your project is not known at this point.
Not impacted
The incident is not impacting your project.

‘Relevance’ was always a challenge previously: whilst we would get a steady stream of messages, perhaps only 10% were relevant. To take a publicised example, the Paris region had a major incident, but none of our customers’ workloads were located there, so it had limited impact on us and the updates were just noise. Ultimately, if the relevance field helps mitigate alert fatigue, I don’t think that is a bad thing!
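To illustrate how relevance could cut that noise, here is a small sketch that keeps only the events worth waking someone up for. The event dictionaries and the upper-case spellings of the relevance values are my assumptions, modelled on the labels in the explanation above; the real field names may differ while the service is in preview.

```python
# Relevance labels from the dashboard help text, written in an assumed
# upper-snake-case form. Only the first two are treated as actionable here.
ACTIONABLE = {"IMPACTED", "RELATED"}


def actionable_events(events: list[dict]) -> list[dict]:
    """Keep events whose relevance suggests the project is really affected."""
    return [event for event in events if event.get("relevance") in ACTIONABLE]


# Hypothetical sample data for illustration only.
sample_events = [
    {"title": "Networking issue in a region we use", "relevance": "RELATED"},
    {"title": "Incident in a region we do not use", "relevance": "PARTIALLY_RELATED"},
    {"title": "Incident on a product we do not use", "relevance": "NOT_IMPACTED"},
]

for event in actionable_events(sample_events):
    print(event["title"])
```

Whether ‘Partially related’ and ‘Unknown’ belong in the actionable set is a judgment call; I would start strict and loosen it if something important gets missed.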

Alerting & Logs

Speaking of which, you can now trigger alerts through your pre-defined notification channels. The configuration is pretty complex so thankfully Google has provided some templated ideas that you can modify to your requirements:

Source: Screenshot of alert policy templates

Looking more closely, the alert policies are log-based alerts, and the JSON definition is even available.

Source: Screenshot of an example JSON alert (for location alerts for us-central1)

Because it is based on log-based alerting, it should also be possible (though I have not yet tested it) to use a log sink to aggregate the logs or pass them on to a third-party system via Pub/Sub. In theory, you can view the logs in Logs Explorer as per the documentation, though I couldn’t get the log type to show up, which I suspect is related to the service still being in preview.
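As a rough idea of what the receiving end of such a sink might look like, the sketch below decodes a Pub/Sub push message into a log entry and builds a one-line summary. Like the sink itself, this is untested against real Service Health logs: the push envelope shape (base64 data under "message") is the standard Pub/Sub one, but the "title" and "relevance" keys inside jsonPayload are assumptions of mine.

```python
import base64
import json


def decode_log_entry(push_envelope: dict) -> dict:
    """Decode the base64 'data' field of a Pub/Sub push message into a dict."""
    data = push_envelope["message"]["data"]
    return json.loads(base64.b64decode(data).decode("utf-8"))


def summarise(entry: dict) -> str:
    """Build a one-line summary; the jsonPayload keys are assumed, not confirmed."""
    payload = entry.get("jsonPayload", {})
    title = payload.get("title", "(no title)")
    relevance = payload.get("relevance", "UNKNOWN")
    return f"[{relevance}] {title}"
```

The summary string could then be forwarded to whichever third-party system you use for paging or ticketing.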

Who doesn’t love an API?

Alongside the dashboard, and as with everything in Google Cloud, there is also an API (also in preview), and Google even provides some helpful examples of how to use it to gather event information. The API has methods to get events at both project and organisation level, and impacts at organisation level.
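Here is a hedged sketch of listing project-level events over REST. The version segment in the base URL, the "events" and "nextPageToken" response keys, and the helper names are assumptions on my part, so check them against the API reference for the current preview. The access token is assumed to come from your usual Google Cloud credentials.

```python
import json
import urllib.parse
import urllib.request

# Assumption: adjust the version segment to match the current API reference.
BASE_URL = "https://servicehealth.googleapis.com/v1"


def events_url(project_id: str, page_token: str = "") -> str:
    """Build the list-events URL for a project, with an optional page token."""
    url = f"{BASE_URL}/projects/{project_id}/locations/global/events"
    if page_token:
        url += "?" + urllib.parse.urlencode({"pageToken": page_token})
    return url


def fetch_json(url: str, access_token: str) -> dict:
    """GET a URL with a bearer token and parse the JSON response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def list_events(project_id: str, fetch) -> list[dict]:
    """Collect events from every page; 'fetch' takes a URL and returns a dict."""
    events, token = [], ""
    while True:
        page = fetch(events_url(project_id, token))
        events.extend(page.get("events", []))
        token = page.get("nextPageToken", "")
        if not token:
            return events
```

Passing the fetcher in as a parameter keeps the paging logic easy to test; for a real call you could use something like list_events("my-project", lambda url: fetch_json(url, token)), where "my-project" and token are placeholders.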

Room for Improvement?

So what do I think? Well, generally my feelings are positive. However, there are a few areas of improvement that I would like to see, though again I want to state that the product is still very much in preview:

An organisation-level dashboard (like Security Command Center). I think managing this from project-level UIs will quickly become cumbersome. Based on what I see in the API I think this should be possible but hasn’t yet been implemented.
Organisation-level alerting. Linked to the above, I want to know if I am experiencing an impact across my org, and in which projects, without creating hundreds of alerting policies.
Cleaner integration with Cloud Monitoring. At the moment it’s a little clunky.

Conclusion

So I hope my little exploration into this new functionality was informative. I have to admit I am a little surprised Google is planning to make this the primary way some outages are communicated whilst the service is still in preview (perhaps it will go GA on the 12th?). Certainly, I will be investigating its use further especially in providing holistic monitoring to our customers. Until next time though keep it Googley :)