Stay Ahead of the Storm: Comprehensive Insights into Google Cloud Personalized Service Health

Damian Sztankowski
Google Cloud - Community
15 min read · Apr 24, 2024

What is Service Health in general

The official Google Cloud documentation describes Service Health as follows:

The Google Cloud Service Health (CSH) Dashboard provides status information of the Google Cloud products organized by region and global locale.

https://cloud.google.com/support/docs/dashboard

Referring to the documentation, there is a dedicated page where you can get information about outages, split by:

  • REGIONS and particular ZONES
  • PRODUCTS
Fig.1 Cloud Service Health

That page lets you obtain this information and consume it in the following ways:

  • Through an RSS feed
  • Through a JSON History file
Fig.2 Slack RSS Feed for Cloud Service Health

➡️ You can download the schema for the JSON file here.

➡️ More information about how regional info is presented

The RSS feed and JSON History file provide incident status information which can be consumed through integrations.

Use the fields marked Stable in the JSON History file, instead of the fields marked Unstable. Example: if you’re trying to programmatically identify incidents impacting a particular set of products, use the product IDs (affected_products>id), not their display names.
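If you want to consume the JSON History file programmatically, a minimal sketch of such a query could look like the following, assuming the public incidents.json file behind the dashboard and the jq tool (the URL and field layout should be verified against the schema linked above):

curl -s https://status.cloud.google.com/incidents.json \
  | jq -r --arg product "PRODUCT_ID" \
      '.[] | select(any(.affected_products[]?; .id == $product)) | .id'   # IDs of incidents affecting the product

Note that it filters on the stable affected_products id field, not on display names, as recommended above.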

Lifecycle of an incident

To make use of the information from Service Health, it is worth understanding how a particular malfunction becomes an incident.

Google presents the following lifecycle of an incident:

  • Detection
  • Initial response
  • Investigation
  • Mitigation / Fix
  • Follow up
  • Postmortem
  • Incident report
Fig.3 responsibilities of the product engineering and support teams / https://cloud.google.com/support/docs/dashboard#lifecycle_of_an_incident

I will go through each of them briefly, as this helps in understanding the Personalized Service Health product, the next generation of CSH (Cloud Service Health).

Detection

If your area of interest is monitoring, SRE, and so on, you have probably read, or at least heard about, the “Site Reliability Engineering” book written by Google. Chapter 6 of that book describes the black-box and white-box monitoring concepts Google uses to monitor its environment:

Monitoring

Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.

White-box monitoring

Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or an HTTP handler that emits internal statistics.

Black-box monitoring

Testing externally visible behavior as a user would see it.

https://sre.google/sre-book/monitoring-distributed-systems/

Google Cloud uses internal and black-box monitoring to detect incidents. Once an incident has been detected, the Google Cloud Customer Care team manages customer notification.

Initial response

The initial notification of an incident is often sparse, frequently only mentioning the product in question. This is because Google prioritizes fast notification over detail; detail can be provided in subsequent updates. To make sure customers are notified appropriately, different communication channels are used depending on the scope and severity:

Fig.4 Few ways of notifying customers / https://cloud.google.com/support/docs/dashboard#initial_response

Investigation

When an incident has been detected and customers are notified, it is time for the Product Engineering team to investigate the root cause of the incident. It is worth mentioning that the Product Engineering team is responsible for finding the root cause, not incident management. According to the “Site Reliability Engineering” book, this is often done by Site Reliability Engineers but might be done by software engineers or others, depending on the situation and product.

Mitigation / Fix

While an incident is in progress, Customer Care and the product team try to mitigate the issue. Mitigation means reducing the impact or scope of an issue, for example by adding new resources to reduce overload. If no mitigation has been found, Customer Care finds and communicates workarounds where possible.

Workarounds are steps that you can take to solve the underlying need despite the incident. A workaround might be to use different settings for an API call to avoid a problematic code path.

https://cloud.google.com/support/docs/dashboard#mitigationfix

Google marks an issue as fixed ONLY when changes have been made and Google is confident that the impact has ended for good.

Follow up

During an incident, the Customer Care team provides regular updates about the current status.

Postmortem

To fully understand an incident and apply reliability improvements, Google carries out a postmortem analysis each time an incident occurs.

Incident report

When incidents have very wide and serious impact, Google provides incident reports that outline the symptoms, impact, root cause, remediation, and future prevention of incidents. As with postmortems, we pay particular attention to the steps that we take to learn from the issue and improve reliability. Google’s goal in writing and releasing postmortems is to be transparent and demonstrate our commitment to building stable products for our customers.

https://cloud.google.com/support/docs/dashboard#incident_report

Personalized Service Health overview

The Service Health dashboard in the Google Cloud console shows incidents that are relevant to your project, their state, and the impacted Google Cloud products and locations.

Personalized Service Health takes the data from Cloud Service Health (CSH) and presents it in a human-friendly, project-relevant way.

Fig. 5 How Personalized Service Health works / https://cloud.google.com/service-health/docs/overview#how-personalized-service-health-works
Fig. 6 Service Health Dashboard

Incidents, event states and detailed states

Incidents are emerging and active Google Cloud service outages or degradations relevant to your projects. It is a category of a service health event.

https://cloud.google.com/service-health/docs/overview#incident

Each incident includes:

  • Impact — Details of the scope of the event, such as impacted Google Cloud products and locations.
  • Updates from Google — Periodic updates from Google Cloud support.
  • Relevance — Incident’s relevance to your Google Cloud project.
  • Symptoms, workarounds and ETAs — Information to help assess impact, apply a workaround, or learn more about the root cause.

A service health event (v1,v1beta) is any disruptive event impacting a Google Cloud product that is relevant to your projects or resources. Examples include network outages, configuration errors, and performance issues.

https://cloud.google.com/service-health/docs/overview#service_health_event
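These events can also be queried programmatically through the Service Health API. A minimal sketch is shown below; the endpoint path and filter syntax are assumptions based on the v1 API naming, so check the API reference for the exact format:

# List service health incidents relevant to a project (PROJECT_ID is a placeholder)
curl -s \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://servicehealth.googleapis.com/v1/projects/PROJECT_ID/locations/global/events?filter=category=INCIDENT"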

Event states

Each event has one of two states:

  • Active — event is actively affecting Google Cloud
  • Closed — event is no longer affecting any Google Cloud product

Detailed states

Detailed states apply only to incidents and can be one of the following, depending on the event state:

  • Emerging: Google engineers are actively investigating the incident to determine the impact. An emerging incident will become either a confirmed or resolved incident once the impact assessment is complete. An active incident can be an emerging incident. Support for emerging incidents is available for Google Cloud networking products only.
  • Confirmed: The incident is confirmed by Google engineers and impacting at least one Google Cloud product. Ongoing status updates will be provided until it is resolved. An active incident can be a confirmed incident.
  • Merged: The incident was merged into a parent incident. All further updates will be published to the parent only.
  • Resolved: The incident is no longer affecting any Google Cloud product after action was taken. There will be no further updates. A closed incident is usually a resolved incident.
  • False positive: Upon investigation, Google engineers concluded that the incident is not affecting a Google Cloud product. This state can change if the incident is reviewed again.
  • Auto-closed: The incident does not have a resolution because no action or investigation happened. If it is intermittent, the incident may reopen. The incident was automatically closed for one of the following reasons:
  • The impact of the incident could not be confirmed.
  • The incident was intermittent or resolved itself.

Relevance of the incident

Each incident has a relevance status to help you assess its impact on your project. If the incident’s impact on your project is possible or confirmed, it becomes available in the Service Health dashboard and API.

Relevance describes how an incident impacts your project. The relevance may change as the incident progresses:

  • Impacted — The incident is verified to be impacting your project
  • Related — The incident has a direct connection with your project and impacts a Google Cloud product in a location your project uses.
  • Partially Related — The incident is associated with a Google Cloud product your project uses, but the incident may not be impacting your project
  • Not Impacted — The incident is not impacting your project.
  • Unknown — The impact to your project is not known at this point.
Fig 7. Relevance assessment

Personalized Service Health event

The structure of each event looks similar. In each event we can find:

  1. Title
  2. Status
  3. EventID
  4. Last update
  5. Event Timeline
  6. Impacted products
  7. Relevance
  8. Recent Update
  9. Workaround
  10. Symptoms
  11. Full Incident History
Fig. 8 Event structure
Fig. 9 Full Incident History example

How to create an alerting policy within Personalized Service Health

So far we have gone through Cloud Service Health, the incident lifecycle, and Personalized Service Health. Now we will configure an example alert with notification channels (Slack and PagerDuty) for a particular project and for an entire organization. We will also customize the output for better readability.

Before you begin:

  • enable Personalized Service Health for a single project
  1. Go to your project -> API Library -> Search for “Service Health API”
  2. Select the Enable button
  • set the required permissions (a gcloud sketch covering the API enablement and these roles follows this list)
  1. roles/servicehealth.viewer — read-only access to service health events
  2. roles/serviceusage.serviceUsageConsumer — use APIs and services in your projects
  • permissions for log-based alerts
  1. To get the permissions that you need to read logs and to manage Logging notification rules, ask your administrator to grant you the Logging Admin (roles/logging.admin) IAM role on your project.

2. To get the permissions that you need to manage the alerting policies and channels used by log-based alerts, ask your administrator to grant you the following IAM roles on your project:

a. Monitoring AlertPolicy Editor (roles/monitoring.alertPolicyEditor)

b. Monitoring NotificationChannel Editor (roles/monitoring.notificationChannelEditor)

3. To get the permissions that you need to create an alerting policy in the Google Cloud CLI, ask your administrator to grant you the Service Usage Consumer (roles/serviceusage.serviceUsageConsumer) IAM role on your project.

➡️ If you don’t want to grant the Monitoring NotificationChannel Editor role (roles/monitoring.notificationChannelEditor), you can grant the Monitoring NotificationChannel Viewer role (roles/monitoring.notificationChannelViewer) instead, which allows you to link a notification channel to an alerting policy.
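If you prefer the command line, the single-project setup above roughly corresponds to the following gcloud commands (a sketch; PROJECT_ID and USER_EMAIL are placeholders):

gcloud services enable servicehealth.googleapis.com --project=PROJECT_ID    # enable the Service Health API

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" --role="roles/servicehealth.viewer"    # read-only access to health events

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:USER_EMAIL" --role="roles/serviceusage.serviceUsageConsumer"    # use APIs and services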

  1. Go to the Service Health dashboard and choose CREATE ALERT POLICY
Fig. 10 Service Health Alert Creation

2. Choose which template you want to use. The options are:

  • All incidents, all updates
  • Confirmed incidents
  • Emerging incidents — Receive an early alert when any new issue is first detected (but not yet a confirmed incident)
  • Location based — Receive alerts about any new incident that impacts a location you specify in the policy, even if there is no known impact on this project
  • Product alerts — Receive alerts about any new incident that impacts a service you specify in the policy, even if there is no known impact on this project
  • New incident — Receive new incident alerts in a shorter format that works better for SMS-based notification channels.

3. Customize the alert policy by choosing the time between notifications and the incident autoclose duration:

  • Time between notifications — Set the minimum amount of time between receiving notifications for logs that match the filter
  • Incident autoclose duration — Select a duration after which the incident will close automatically when matching log entries are absent

4. You can also customize the alert policy further before it is created. To do that, click the three dots and pick Customize alert policy.

Fig. 11 Configuring log-based alert for Service Health
Fig. 12 Customize log-based alert policy

➡️ You can change the alert policy name, set the severity of the alert, or fill in the Documentation section. The Documentation section supports Markdown, so you can create informative, clean incident descriptions.

🔗 More info about Markdown can be found here

Additionally, you can modify the log query and the extracted labels, which can then be used in the Documentation section.

Fig. 13 Customize log query and extracted labels
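For example, the log query can be narrowed so that the alert only fires for incidents confirmed to impact this project. A sketch, using the resource type and field names that appear later in this article (the relevance value follows the Relevance section above):

resource.type="servicehealth.googleapis.com/Event"
jsonPayload.category="INCIDENT"
jsonPayload.relevance="IMPACTED"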

5. Choose notification channels. For redundancy, I recommend choosing at least two independent channels, such as PagerDuty and Slack or email.

Fig. 14 Picking notification channels

➡️ If you don’t have notification channels yet, you can create them either directly from the channel-picking tab or from Monitoring -> Alerting -> Edit notification channels

Fig. 15 Editing notification channels
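➡️ Notification channels can also be created from the command line. A minimal sketch for an email channel (assuming the gcloud beta monitoring channels command; the address is a placeholder):

gcloud beta monitoring channels create \
  --display-name="Service Health email" \
  --type=email \
  --channel-labels=email_address=oncall@example.com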

6. Once everything is set up, click CREATE. You will see a black box with information about the alert policy. You can either click VIEW ALL ALERT POLICIES in that box or pick VIEW ALERT POLICIES in the top right corner.

Fig. 16 Black box information
Fig. 17 Top right corner URL

7. Pick one of the provided methods. Your log-based alert will be displayed.

Fig. 18 Newly created alert policy

8. You will be able to see the matched log entries for the log query configured in the log-based alert. Those entries are also available from Logs Explorer.

Fig. 19 Log-based alert content view
Fig. 20 Logs from health alert are also available from Logs Explorer

9. If the alert fires, you will be notified according to your configuration, just as with regular Operations Suite metric alerts.

Fig. 21 Example of email notification for log-based health alerts

10. The event will also be available directly in the Google Cloud console.

11. 🎉 You did it. You’ve created and enabled alert for log-based health events. Good job👌

Organization-wide health events notifications

So far we’ve configured health notifications for one project. However, what if we want to get notifications for all projects within our organization? We can achieve that with log routers and log-based metrics.

Before you begin:

  • Ensure that you have the following permissions:
  1. Permission to list projects under the parent: resourcemanager.projects.list

2. Permission to add IAM (Service Health Viewer role) for the specified IAM principal: resourcemanager.projects.setIamPolicy

3. Permission to enable Google Cloud services: serviceusage.services.enable.

Now you have two ways of doing this globally. One is to enable the APIs and grant the IAM permissions per project manually (we don’t like manual ways 😃); the second is to use the script provided by Google.

I don’t want to copy/paste content from the Google documentation, so please follow this link. The documentation contains all the information needed to enable the Service Health API globally.
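For orientation, the sketch below shows roughly what such an enablement loop does: for every project under the organization, it enables the Service Health API and grants the viewer role. This is not Google’s official script (follow the link above for that), and the project-list filter shown is an assumption:

ORG_ID="$1"      # organization ID, first script argument
MEMBER="$2"      # IAM principal, e.g. "user:YOUR_USER"
for PROJECT in $(gcloud projects list \
    --filter="parent.id=${ORG_ID} AND parent.type=organization" \
    --format="value(projectId)"); do
  gcloud services enable servicehealth.googleapis.com --project="${PROJECT}"
  gcloud projects add-iam-policy-binding "${PROJECT}" \
    --member="${MEMBER}" --role="roles/servicehealth.viewer"
done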

➡️ You can open the Cloud Shell editor and type the following commands to create, edit, and execute the script.

In the Cloud Shell terminal:

touch activateProjects.sh    # create the file

edit activateProjects.sh     # open it in the Cloud Shell editor

In the editor, copy/paste the script content from the documentation.

Back in the Cloud Shell terminal:

chmod a+x activateProjects.sh    # make the script executable

./activateProjects.sh ORG_ID "user:YOUR_USER"    # execute the script

ℹ️ Personalized Service Health will take up to 24 hours to start processing service health events.

  • Create log bucket
  1. Go to your desired project. By desired project I mean the project where the entire configuration for logs, metrics, and alerts will be stored. Ideally, choose a project that can be treated as a scoping project.
  2. Search for Logs storage -> Create bucket -> Provide a name and description. Set the location to global. The retention period can be left at the default.
  3. Optionally, you can use the following gcloud command to create the bucket:
gcloud logging buckets create health-event-bucket --location=global

4. Once done, execute the following command to obtain the bucket details. We need the bucket’s full resource name, as this value will be used for the log sink.

$ gcloud logging buckets describe health-event-bucket --location=global
analyticsEnabled: false
createTime: '2024-04-22T19:18:02.279680934Z'
lifecycleState: ACTIVE
name: projects/webapp-wordpress/locations/global/buckets/health-event-bucket <- copy this value
retentionDays: 30
updateTime: '2024-04-22T19:18:02.279680934Z'

➡️ More info about log storage bucket can be found here

ℹ️ You don’t have to upgrade your bucket to Log Analytics.

5. Once the bucket is created, we must create a log sink. A log sink routes logs, based on filters, to the provided destination.

The following command creates a log sink at the organization level, with the target set to our newly created global bucket and a log filter that matches only service health event logs. The --include-children flag makes the aggregated sink route matching logs from all projects under the organization as well.

gcloud logging sinks create cloud-event-sink \
  logging.googleapis.com/projects/PROJECT_ID/locations/global/buckets/health-event-bucket \
  --organization=ORG_ID --include-children \
  --log-filter='resource.type = "servicehealth.googleapis.com/Event"'
Fig. 22 log router creation via CLI
Fig. 23 Log router will be also available from UI

ℹ️ The following log query sample matches ALL events created by Service Health and routes them to our log bucket. We will then configure log-based metrics with our bucket as the source and set a more precise filter there. This approach lets us keep ONE log stream for all service health events instead of creating multiple log streams.

resource.type = "servicehealth.googleapis.com/Event"

6. Once we have our log sink, we can create a log-based metric pointing to our previously created log bucket as the log source. To do that, search for Log-based metrics -> Create metric. As you can see, there are a lot of options: metric type, details, the filter section, and labels. Let’s stop for a while and briefly discuss each of them.

  • Metric Type

Log-based metrics can extract data from logs to create metrics of the following types:

  1. Counter: these metrics count the number of log entries that match a specified filter within a specific period. Use counters when you want to keep track of the number of times a value or string appears in your logs.
  2. Distribution: these metrics also count values, but they collect the counts into ranges of values (histogram buckets). Use distributions when you want to extract values like latencies.
  • Details
  1. Log-based metric name — Name for your metric
  2. Description — Enter a description for this metric (optional)
  3. Units — The units of measurement that apply to this metric (for example, bytes or seconds). For counter metrics, leave this blank or insert the digit ‘1’. For distribution metrics, you can optionally enter units, such as ‘s’, ‘ms’, etc.
  • Filter section
  1. Log scope — this is where you define which log bucket the metric will read its logs from
  2. Build filter — the box where you compose the log query used for filtering and searching. Once the filter is done, you can preview the logs matching your query.

🔴 Important: Preview logs searches logs as of the time when you run it. This means that if you search for logs on a specific date without providing an exact time, the logs may not show up.

  • Labels

You can extract values from logs and use them as labels. This allows you to make your alert’s Documentation section more informative.

🔗 You can find more information about labels at log-based metrics here

7. Provide the mandatory info:

  • Metric type: counter
  • Log-based metric name : all_incidents
  • Units: leave blank
  • Log scope: pick your previously created bucket
Fig. 24 Log source for log-based metric
  • Build filter:
resource.type ="servicehealth.googleapis.com/Event" AND jsonPayload.category ="INCIDENT" AND jsonPayload.relevance !="NOT_IMPACTED"
  • Labels
  1. label name: status
  2. label type: STRING
  3. FieldName: protoPayload.status

You can add as many labels as you want in this way.

8. Once done, click CREATE. The log-based metric has been created and is ready for use.
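If you prefer scripting this step, a rough CLI equivalent for a simple counter metric is sketched below. Note that this creates a project-scoped metric with the same filter; the bucket-scoped metric with extracted labels, as configured above, is done in the console:

gcloud logging metrics create all_incidents \
  --description="Service Health incidents relevant to this organization" \
  --log-filter='resource.type="servicehealth.googleapis.com/Event" AND jsonPayload.category="INCIDENT" AND jsonPayload.relevance!="NOT_IMPACTED"'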

9. Now we have to create a log-based alert using our newly created log-based metric. Search for Alerting -> Create alerting policy -> Select metric -> Logging Bucket -> Choose your metric

Fig. 25 Alerting policy creation with log-based metric

ℹ️ It’s possible that you won’t see your metric yet. The most likely reason is that the metric hasn’t received any logs so far.

10. Set up your alerting policy as you want: choose notification channels, policy severity, etc. I recommend ticking the “Notify on incident closure” checkbox if you are using PagerDuty, as it will close acknowledged incidents automatically.

11. If you want to use the previously created labels, you can do so in the Documentation section. Simply add:

${metric.label.YOUR_LABEL_NAME}

This lets you take values extracted from the logs and pass them as variables. With this approach you can create a Documentation section like the following:

Fig. 26 Example of Documentation section with log-based labels
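As a rough text version of the same idea (assuming a label named status, as extracted earlier), such a Documentation section might look like this:

## Service Health incident
**Detailed state:** ${metric.label.status}

Check the Personalized Service Health dashboard for impacted products, locations, and workarounds.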

Conclusions

I believe that Google Cloud Personalized Service Health represents a significant advancement in how businesses can approach cloud service management and troubleshooting. By leveraging tailored health insights and proactive issue resolution, organizations are better equipped to maintain high availability and performance of their cloud infrastructure. This personalized approach not only optimizes operational efficiency but also enhances the reliability of services provided to end-users, ultimately contributing to stronger trust and satisfaction. As cloud technologies continue to evolve, tools like Google Cloud Personalized Service Health will become crucial in empowering businesses to navigate complex digital environments with greater confidence and control.

Do not forget the 👏✌️❤️ if you like this content!

Also, I will be glad if you hit the follow button so you get notified of my new posts.

You can also follow me on LinkedIn

Thank you!
