Increase the resilience of your workloads with Google Cloud

Ouissame Bekada
Google Cloud - Community
10 min read · Oct 9, 2023
A world map of Google fiber optic cables.
Meet the Google network. Google is the largest owner of and investor in submarine cables among hyperscalers. Source: https://cloud.google.com/about/locations#network

The life of a network packet is full of adventure, often traveling the world at the speed of light through fiber optic cables laid deep under the sea, bouncing around from one router to another like in a pinball game, before ending up in a chip made of billions of tiny transistors invisible to the human eye.

Have you ever wondered how your application would react if anything went wrong on this perilous journey? Is running on redundant infrastructure enough to protect your application? What are the actual risks your application is exposed to? Who should manage them? Do you need to redesign your application? Are there quick, non-disruptive options available to avoid those risks?

These are good questions, and this article addresses them through a simple 3-step methodology that will help you get started on building a solid resilience strategy for your applications on Google Cloud.

No specific technical knowledge is required to follow along, although some technical background will help you get the most out of it. A special effort has been made to keep the explanations simple and to the point.

This article is the first part of a series around resilience. Upcoming articles will focus more on resilience techniques such as snapshots, instance groups and load balancers.

Links to the next parts will be added as soon as they are ready.

First things first, let’s define “resilience” in the context of computing, where the term has become quite trendy in recent years, sharing the stage with “availability” and “reliability”.

There are many types of resilience such as physical, emotional, social, economic, etc.

In the field of computing, resilience can be defined as the ability of an application, a system or a piece of hardware to react to, absorb, adapt to, and recover from disruptive events such as a disk failure (true, hardware doesn’t last forever), broken code pushed into production, a power outage or even a fire.

Resilience must be applied at every layer of your application’s technical stack, from the data manipulated by your code (software) to the machine (hardware) that executes it. This holds even when you don’t own the full stack: you still need to understand the levels of resilience exposed by the underlying layers you don’t manage in order to build an effective end-to-end resilience strategy.

Traditional, virtualized and container deployments.
The 3 most common application deployments. Source: https://kubernetes.io/docs/concepts/overview/

A good resilience strategy should always start by answering the following 3 questions:

  1. What assets do I need to protect?
  2. What risks do I need to mitigate?
  3. What levels of protection do I need to set?

What assets do I need to protect?

Any piece of software, hardware, or even information that your application requires to run continuously as intended must be included in your resilience strategy, even if its usage is short-lived or it is not directly managed by your organization.

Do you know how many applications have failed to recover properly after an outage because of a missing DNS entry, file share or certificate? Too many.

Asset inventory

If you don’t already have one, start building an exhaustive inventory of your assets.

Google Cloud can assist you in this task with tools such as Cloud Asset Inventory and Network Topology (part of Network Intelligence Center). You can also scan assets outside of Google Cloud, whether on-premises or on other cloud providers, with Google Cloud Migration Center, an agentless managed discovery and assessment service.
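If you prefer to script this, here is a minimal sketch, assuming the google-cloud-asset Python client library is installed, credentials are configured, and “my-project” is a placeholder project ID, that lists the resources Cloud Asset Inventory can see:

```python
# Minimal sketch: list the resources visible to Cloud Asset Inventory for one project.
# Assumes the google-cloud-asset library is installed and the caller has the
# required Cloud Asset viewer permissions. "my-project" is a placeholder.
from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()
scope = "projects/my-project"  # a folder or an organization also works as a scope

for resource in client.search_all_resources(request={"scope": scope}):
    # Each result includes the asset type, full resource name and location,
    # which is a good starting point for a resilience-oriented inventory.
    print(resource.asset_type, resource.name, resource.location)
```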

Pro tips:

  • Watch for assets that don’t necessarily expose a network interface (private or public IPs), meaning they are not directly reachable through standard network protocols (RDP, SSH, HTTP/S, etc.)
  • Watch for idle or powered-off assets that your application uses only once in a while. They may not respond to network scanning tools or show up in the output of commands such as netcat or netstat

Application mapping

The purpose of this task, also known as dependency analysis, is to determine all the assets used by each application and understand their relationships.

In Google Cloud, you can use a tool called Migration Center discovery client (MCDC) CLI (previously known as mFit) to perform this task.

Although scanning tools are great for getting started or for verifying the information in our good old CMDB, the power of a good conversation with the application owners should never be underestimated when it comes to expediting this task.

Pro tips:

  • It is not always fun to do, but a well-documented low-level design of your app, along with its network traffic flows, comes in very handy when troubleshooting a major issue or working on improving the end-to-end resilience of your application. Even if it is only made of squares and circles that look like a kindergarten drawing, don’t feel bad about it: a bad drawing is always better than no drawing at all
  • Red-flag the assets shared by multiple apps; they are usually infrastructure services (identity servers, file servers, DNS servers, etc.) and need to be treated in a specific way. For instance, you don’t want to fail over all your DNS servers to a remote site because a couple of frontend servers from a single app are no longer responsive. A minimal dependency-mapping sketch follows this list.
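To make dependency analysis and the shared-asset red flag more concrete, here is a small, tool-agnostic sketch; the application names and assets are entirely made up for illustration:

```python
# Hypothetical application-to-asset map, e.g. assembled from MCDC exports or
# conversations with application owners. All names are illustrative only.
app_dependencies = {
    "billing-app": {"sql-db-1", "dns-internal", "file-share-a", "idp-server"},
    "web-portal": {"frontend-lb", "dns-internal", "idp-server"},
    "reporting": {"sql-db-1", "file-share-a"},
}

# Invert the map: which applications depend on each asset?
asset_usage = {}
for app, assets in app_dependencies.items():
    for asset in assets:
        asset_usage.setdefault(asset, set()).add(app)

# Red-flag the assets shared by multiple apps: they are usually infrastructure
# services and deserve their own resilience treatment.
for asset, apps in sorted(asset_usage.items()):
    if len(apps) > 1:
        print(f"SHARED: {asset} is used by {sorted(apps)}")
```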

What risks do I need to mitigate?

There is no perfect answer to this question simply because not all assets are exposed to the same risks.

One platform, multiple technical stacks

10 Google products are used by more than 1 billion users each month.
Google’s infrastructure runs 10 of the top 14 apps in the world. Each has 1B+ active users a month.

While each Google product is unique, they all share the same global infrastructure foundations that Google has been expanding over the past 25+ years. The same goes for Google Cloud services, where each service is architected, designed and developed to meet its own specific requirements, which include resilience.

Some services come with a built-in level of resilience that cannot be modified, such as Cloud DNS, which is globally available (more on this below).

Other services offer one or more options that let users configure the level of resilience they want. In this case, the decision is often a tradeoff between availability, performance and cost. For instance, Google Cloud Storage (GCS) offers 3 location types, which represent 3 different levels of resilience for each bucket: “region”, “dual-region” or “multi-region”.

Interestingly enough, bucket replicas in dual-region and multi-region scenarios are all active by design and therefore do NOT require any change to the storage path if an entire region goes down.
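As a small illustration, here is a minimal sketch, assuming the google-cloud-storage Python client and placeholder project and bucket names, showing that the location type is simply chosen when the bucket is created:

```python
# Minimal sketch: a bucket's location (and therefore its location type) is set
# at creation time. Project and bucket names are placeholders.
from google.cloud import storage

client = storage.Client(project="my-project")

# Regional bucket: data is redundant across zones within a single region.
client.create_bucket("my-app-regional-bucket", location="europe-west1")

# Multi-region bucket: data is redundant across regions, so it keeps serving
# even if an entire region becomes unavailable.
client.create_bucket("my-app-multi-region-bucket", location="EU")
```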

Levels of resilience are an abstraction of the underlying risks of failure for each service. Instead of treating every identified risk separately, risks are grouped based on the extent of their impact on the service, similar to a blast radius; these groups are called “location scopes”.

There are 4 of them: “global”, “multi-region”, “region” and “zone”.

Google Cloud services — Location scopes. Source: https://cloud.google.com/architecture/disaster-recovery#regions_and_zones

What levels of protection do I need to set?

The location scopes, as well as the resilience techniques (more on this below), that you choose for your application’s end-to-end protection should be based on how much downtime your organization is willing to tolerate if an incident occurs. I say organization because this is a decision that needs to be made at the business level, with the involvement of the technical teams.

Some companies in specific industries, mainly regulated ones, have no choice but to aim for an objective of zero downtime*, but for most applications it won’t be necessary to go that far. A few minutes to an hour of acceptable downtime often gives you a good balance between availability and cost, since it doesn’t involve building an additional, fully active technical stack.

Also, different application environments (production, staging, development, test, etc.) should be given different levels or tiers of protection based on their business criticality. Not all assets benefit from the highest level of protection; just don’t leave any of them out of your overall resilience strategy (e.g., development environments are often put in a lower protection tier than production).

*As an important side note, there is no such thing as absolute zero downtime. What we usually mean by zero downtime is a configuration in which a failure can still occur but goes unnoticed and causes no disruptive impact on the technical stack. The reason is simple: a request can never be processed instantly, and by the time it reaches its final destination to be executed, anything can happen. We are talking about nanoseconds or milliseconds here. Transparent mechanisms in the lower network layers detect the failure and resend the affected part of the request so the whole request doesn’t fail.

RTO & RPO: Serial vs. Parallel communications

A level of resilience should not be measured through downtime (or uptime, as you prefer) alone; it should also take into consideration how much data might be lost during an incident.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are two key metrics that cover both aspects.

RTO translates as “How long after a disaster before I’m up and running?” and RPO translates as “How much data can I afford to lose in the event of a disaster?”

I would like to make one point clear here: RTO and RPO should not be restricted to building disaster recovery plans; they should be part of any resilience strategy, since disasters are not the only cause of downtime.

The available location scope(s) for each service determines the service’s RTO/RPO which is documented here (bookmark this link).

The lower the RTO/RPO, the higher the resilience level.

Knowing that is not the end of the story, because an application is often, if not always, a combination of multiple services and resources that need to communicate with each other (a resource is an instantiation of a service).

Adding new services and resources to your application can either increase or decrease its end-to-end RTO/RPO, depending on how those new resources are connected to the rest of your application’s components. It is mainly a matter of serial versus parallel communication.

Serial vs. Parallel communication

Serialization increases the RPO/RTO while parallelization decreases it.

Indeed, serialization increases the number of hops that network packets have to go through to reach their final destination, and therefore the number of components to protect. Parallelization does not change the number of hops; it adds an additional path for the network packets to reach their final destination, which effectively adds redundancy.

Parallelization is often achieved using load balancers.
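A quick, simplified availability calculation illustrates the intuition behind this. It assumes independent failures and uses made-up availability figures, so treat it as a back-of-the-envelope sketch rather than a formal RTO/RPO model:

```python
# Back-of-the-envelope sketch: composing the availability of independent components.
# In a serial chain every hop must be up; with parallel paths, only one must be up.
def serial(*availabilities):
    result = 1.0
    for a in availabilities:
        result *= a  # every component must be available at the same time
    return result

def parallel(*availabilities):
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)  # an outage requires every path to fail at once
    return 1.0 - all_down

a, b = 0.999, 0.999  # illustrative figures
print(f"serial chain:   {serial(a, b):.6f}")    # ~0.998001, lower than either component
print(f"parallel paths: {parallel(a, b):.6f}")  # ~0.999999, higher than either component
```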

RTO & RPO: SLO vs. SLA

Do not confuse a service’s RPO and RTO with its Service Level Objectives (SLOs) and Service Level Agreements (SLAs).

  • An SLO, as its name indicates, is used to set an objective or target on a specific metric called a Service Level Indicator (SLI). For instance, an SLO could be that 99% of user requests to your website’s homepage must get a response time (the SLI) lower than 100 ms. RPO and RTO are SLOs used to set objectives on resilience.
  • An SLA is an agreement that outlines the conditions under which customers are eligible to receive financial credits when SLOs are not met.

Often the confusion between RPO/RTO and SLAs comes from the fact that the vast majority of SLAs are based on uptime/downtime SLOs.

As recently announced at Google Cloud Next ’23 (Google Cloud’s annual tech event), Google Cloud virtual machines have new uptime SLAs: a 99.95% uptime SLA for memory-optimized VMs, and 99.9% (up from 99.5%) for all other VM families. This shows a high level of confidence in the resilience of our infrastructure.
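To put those uptime figures in perspective, here is a small sketch that converts an uptime percentage into the downtime it allows over a 30-day month; it is plain arithmetic, with no Google Cloud API involved:

```python
# Convert an uptime objective into the downtime it allows per 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

for uptime in (0.995, 0.999, 0.9995):
    allowed_downtime = MINUTES_PER_MONTH * (1 - uptime)
    print(f"{uptime:.2%} uptime allows {allowed_downtime:.1f} minutes of downtime per month")
```

For example, 99.5% uptime allows 216 minutes of downtime per 30-day month, while 99.9% allows only 43.2 minutes.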

Those books about Site Reliability Engineering (SRE) cover the concepts of SLI, SLO and SLA in great detail if you want to dig deeper.

High Availability and Disaster Recovery resilience techniques

Last but not least, I wanted to touch on some resilience techniques that I will cover in more detail in upcoming articles.

High Availability (HA) and Disaster Recovery (DR) are 2 categories of techniques leveraged to increase the resilience of your workloads.

High Availability (HA)

High Availability techniques are meant to protect your workloads against small-scale failures and therefore avoid single points of failure (SPOFs). They are used to keep a piece of software or hardware operating continuously, with zero or near-zero downtime, following a small-scale failure, and this should happen in a fully automated way. The location scopes of HA techniques are generally zone and region because of the maximum network latency involved.

Storing a spare switch under your desk is definitely not an HA technique; however, stacking switches and teaming up ports are recommended HA techniques.

Google Cloud HA techniques & tools:

  • Live Migration
  • Synchronous replication (Regional Persistent Disks; see the sketch after this list)
  • Managed Instance Groups
  • Load Balancers
  • Instant Snapshots
  • Etc.
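To give a taste of the synchronous replication item above, here is a minimal sketch, assuming the google-cloud-compute Python client with placeholder project, region and zone names; check the current client documentation for the exact field names before relying on it:

```python
# Minimal sketch: a regional Persistent Disk is synchronously replicated across
# two zones of the same region. Project, region and zones are placeholders.
from google.cloud import compute_v1

project = "my-project"
region = "europe-west1"

disk = compute_v1.Disk(
    name="ha-data-disk",
    size_gb=100,
    replica_zones=[
        f"projects/{project}/zones/{region}-b",
        f"projects/{project}/zones/{region}-c",
    ],
)

client = compute_v1.RegionDisksClient()
operation = client.insert(project=project, region=region, disk_resource=disk)
operation.result()  # block until the disk is created
```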

Disaster Recovery (DR)

Disaster Recovery techniques are used to bring an application, with its full technical stack, back online following a large-scale failure (natural disaster, multiple power circuits or an entire power plant down, etc.) that has made the primary site unusable. This can be done in a manual, semi-automated or fully automated way. The location scopes of DR techniques are generally multi-region and global because of the large impact of the failures involved.

Google Cloud DR techniques & tools:

  • Asynchronous replication (Persistent Disk asynchronous replication, Machine image, Google Cloud Storage multi-region buckets, etc.)
  • Load Balancers
  • Backups
  • Google Cloud Backup & DR
  • Standard/Backup Snapshots
  • Google Cloud Storage versioning (see the sketch after this list)
  • Etc.
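And to illustrate the Google Cloud Storage versioning item, here is a minimal sketch, again assuming the google-cloud-storage Python client and a placeholder bucket name, that turns object versioning on so that overwritten or deleted objects can still be recovered:

```python
# Minimal sketch: enable object versioning on an existing bucket so previous
# generations of objects survive overwrites and deletions. Name is a placeholder.
from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.get_bucket("my-app-multi-region-bucket")

bucket.versioning_enabled = True
bucket.patch()  # push the updated configuration to the bucket

print(f"Versioning enabled on {bucket.name}: {bucket.versioning_enabled}")
```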

I hope that at this point you have a better understanding of the concepts around resilience and can start building a proper resilience strategy for your workloads on Google Cloud.

Stay tuned for the upcoming articles on resilience techniques and tools.
