Global Load Balancer failover?

DNS-based failover between global and regional Application Load Balancers on Google Cloud.

Oliver Frolovs
Devil Mice Labs
15 min read · Aug 3, 2023


UPDATE — DEPRECATION NOTICE:

Since July 2024 the set-up described in this article is no longer necessary because Cloud DNS now has health checks for external endpoints.

Rejoice! 😈

What?

The proof-of-concept deployment described in this article demonstrates a DNS-based failover from the Global External Application Load Balancer on Google Cloud to a newer type of load balancer on that platform, the Regional External Application Load Balancer.

This kind of setup is not yet supported by Google Cloud DNS, so I used the Amazon Route 53 DNS service instead.

To test the failover, I use fault injection, an advanced traffic management feature, to simulate a Global External Application Load Balancer failure.

To enable TLS between my example web service and its clients, I use both Google-managed and self-managed TLS certificates.

The article comes with a GitHub repository that contains all infrastructure-as-code discussed here:

devil-mice-labs/poc-lb-failover: Using Regional External Load Balancers as a failover option for Global Load balancers on Google Cloud. The failover is coordinated by Route 53 DNS. (github.com)

Why?

A recent Google Cloud Blog article titled “Increasing Resiliency with Load Balancers” stated that the newly launched Regional External Application Load Balancers on Google Cloud are independent and isolated from Global External Application Load Balancers.

The article went on to suggest that,

for users that prefer redundancy at every level of their architecture, Regional Load Balancers can be used as a failover option for our Global External Load Balancer. In this case if an outage is detected, typically by using a DNS solution with a health-check option, your traffic can be redirected to one of your Regional Load Balancers.

The article contained a very high-level architecture diagram of such a deployment and some recommendations on how to configure the Regional External Application Load Balancer as a failover option.

No proof-of-concept implementation was referenced in the article.

Since I have an active interest in load balancing technology on Google Cloud, I thought it was a good opportunity to learn more about the new load balancer type and its quirks.

So that’s how this article came about. It’s time to go on an adventure!

System overview

From the end user perspective, the system described in this article is a public web service. It offers a TLS-secured endpoint accessible at the following URL: https://hello-service.dev.devilmicelabs.com/

My sandbox might not be there by the time you read this, but the full Terraform source code is available, so you could easily spin up your own. It needs both Google Cloud and AWS accounts, though.

The following diagram shows the key parts of the design.

System overview: a public web service

As depicted in the diagram, the public web service is delivered by a number of Google Cloud and AWS resources working together:

A backend service deployed to Cloud Run; the choice was made without any loss of generality, as it could have been any other supported backend type. The service is not directly accessible from the Internet.

Public access to the service is provided by two external Application Load Balancers configured in two different modes:

  • Global external Application Load Balancer
  • Regional external Application Load Balancer

As we learned in the introduction to this article, there is no shared Google infrastructure between those two load balancer modes, so we can use the latter as a failover option for the former.

Although the load balancers are deployed in a hot-hot configuration, in the “default operating mode” for this system, client requests are served only by the global load balancer. This is achieved via DNS.

Under normal circumstances, the DNS zone record associated with the domain used in the public web service endpoint URL points to the global load balancer’s IP address. This is illustrated by the “primary” request-response flow represented by the directed blue lines in the system overview diagram. The corresponding DNS record is also known as the “primary” record for that domain.

The DNS zone also contains a “secondary” record, which maps the same domain to the regional load balancer’s IP address. But under normal circumstances, the DNS service does not respond to DNS queries with the value of the secondary record, so it is effectively ignored.

The DNS service also provides a health check: it monitors the state of the global load balancer by sending HTTPS requests to it. If the health check fails a configured number of consecutive times, the DNS service stops responding to DNS queries with the value of the primary record and begins to return the value of the secondary record instead.

This, effectively, means that if the global load balancer begins to return errors, the DNS is updated to resolve the public web service’s domain to the regional load balancer’s IP address.

The actual logic is a bit more involved, but this will do for a simple explanation of how the DNS-based failover works.

As I already mentioned in the introduction, Google Cloud DNS does not currently support health checks for external Application Load Balancers.

But Amazon Route 53 DNS service does! It supports health checks, failover, primary/secondary records, and some people even whisper that Route 53 can act as a… domain registrar! 🫢

It’s worth noting that the health check continues its work after the failover to the secondary record. Once the global load balancer is back online and this fact is picked up by the health check, the records are switched again, and traffic is sent to the global Application Load Balancer once more.
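To make this concrete, here is a minimal Terraform sketch of the Route 53 side of the mechanism: a health check that probes the global load balancer over HTTPS, and a PRIMARY failover record that is only served while that health check passes. All names, variables, and thresholds here are illustrative choices of mine, not necessarily what the accompanying repository uses.

resource "aws_route53_health_check" "global_alb" {
  # Probe the global ALB's IP address directly; fqdn supplies the host name
  # used for the TLS handshake and the Host header.
  ip_address        = var.global_alb_ip
  fqdn              = var.service_domain
  type              = "HTTPS"
  port              = 443
  resource_path     = "/"
  failure_threshold = 3
  request_interval  = 30
}

resource "aws_route53_record" "primary" {
  zone_id         = var.route53_zone_id
  name            = var.service_domain
  type            = "A"
  ttl             = 60
  records         = [var.global_alb_ip]
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.global_alb.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

A matching SECONDARY record pointing at the regional load balancer completes the pair; a sketch of it appears later, alongside the module that deploys the regional ALB.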

From what we have seen so far, this appears to be a relatively simple deployment. However, the devil is in the details.

The catch is that the diagram and the description that followed it presented a much simplified version of the actual service model. To enable successful failover between the two Application Load Balancers of different modes, a non-trivial number of “auxiliary” infrastructure resources must be created. This additional infrastructure is necessary to work around the limitations of the Google Cloud services that you see in the diagram. These limitations are discussed in detail in the next section.

Service limitations

Some Google Cloud services in our design have rather serious limitations which have to be worked around to achieve our aim. This section lists the limitations and formulates the strategies for overcoming them.

Cloud DNS — insufficient health check options

I’m aware that this is the third time I have mentioned this, but it’s a big one and I wanted all relevant limitations to be documented in a single place.

So, Cloud DNS does not support health checks for external Application Load Balancers (docs).

And so we introduce a managed service from another vendor, for whom I am sure Google has nothing but love in their collective heart ❤️

Hello, Route 53!

Regional External ALB — no Cloud Armor support

Sadly, the Regional External Application Load Balancer that we would like to use as a failover option does not support Cloud Armor (docs).

Although I had not introduced Cloud Armor into this proof-of-concept, it would likely have been useful in a production environment.

However, because there is no support, the Regional External Application Load Balancer would not be protected by Cloud Armor in a failover scenario.

Regional External ALB — SSL woes

As of the time of writing, the Regional External Application Load Balancer has what I would describe as an abysmal level of support for SSL certificate resources on Google Cloud:

  • Google-managed SSL certificates are not supported (docs)
  • Certificate Manager SSL certificates are not supported (docs)

Thus, before we can create our Regional External Application Load Balancer, we have to provision a public SSL certificate somewhere else and create a Compute Engine self-managed SSL certificate resource with it.
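As a rough illustration of what that resource looks like, assuming the certificate and private key files have already been obtained from a public CA (file paths and names below are placeholders):

resource "google_compute_region_ssl_certificate" "regional_alb" {
  name        = "hello-service-regional-cert"
  region      = var.region
  # Certificate chain and private key issued by an external CA,
  # Let's Encrypt in my case.
  certificate = file("${path.module}/certs/fullchain.pem")
  private_key = file("${path.module}/certs/privkey.pem")
}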

I opted to provision a public SSL certificate from Let’s Encrypt as this is the option I am familiar with from work. However, I could not use the more convenient DNS-01 challenge type to prove domain ownership because the same domain name is also being used with Google Certificate Manager to provision a Google-managed SSL certificate for the Global External Application Load Balancer 😔

The Certificate Manager’s requirements for DNS authorisation make it impossible to use DNS to prove ownership of this domain to other public CAs. So the only choice I had was to use the HTTP-01 challenge type to prove domain ownership to Let’s Encrypt.

This also presented a small challenge, because prior to that my deployment plans did not include setting up an HTTP endpoint for anything. So I had to create an additional load balancer front-end and some other resources for the Global External Application Load Balancer so that I could respond to domain verification HTTP requests.

The front-end has to be global because, in order to avoid running an actual HTTP server to respond to the domain verification queries, I used a GCS bucket as a backend to store the verification data provided by Let’s Encrypt. Unfortunately, backend buckets are not supported by Regional External Application Load Balancers (docs), so I had to piggyback the HTTP front-end onto the global load balancer.
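The bucket-backed part of that arrangement is small; here is a sketch with illustrative names (public read access on the bucket objects is also required, and is omitted here):

resource "google_storage_bucket" "acme_challenge" {
  name     = "hello-service-acme-challenge"
  location = "EU"
}

resource "google_compute_backend_bucket" "acme_challenge" {
  name        = "acme-challenge-backend"
  bucket_name = google_storage_bucket.acme_challenge.name
}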

Regional External ALB — VPC requirements

As odd as it may sound, the Regional External Application Load Balancer requires a VPC and a proxy-only subnet (docs).

To overcome this limitation, I created a dedicated VPC network resource and a proxy-only subnet in the region of interest.
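A sketch of those two resources, with illustrative names and an arbitrary CIDR range:

resource "google_compute_network" "lb" {
  name                    = "regional-alb-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "proxy_only" {
  # Proxy-only subnet reserved for the regional external ALB's managed proxies.
  name          = "regional-alb-proxy-only"
  region        = var.region
  network       = google_compute_network.lb.id
  ip_cidr_range = "10.129.0.0/23"
  purpose       = "REGIONAL_MANAGED_PROXY"
  role          = "ACTIVE"
}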

Wow, that escalated quickly!

As you can see, we had to provision and configure a non-trivial number of infrastructure resources to overcome the limitations in some key services that we use.

Infrastructure-as-code (Terraform)

This section serves as a brief introduction to my infrastructure-as-code for this project. I used a modern version of Terraform and stored my state in Terraform Cloud for convenience. This backend choice, however, isn’t essential, and I think you won’t have much difficulty updating versions.tf in the root module to use a different supported Terraform backend.
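For example, swapping Terraform Cloud for a GCS state bucket is a change confined to the terraform block, along these lines (version constraint, bucket name, and prefix are placeholders):

terraform {
  # Version constraint shown for illustration only.
  required_version = ">= 1.5"

  # Replace the Terraform Cloud configuration with any supported backend,
  # for example a GCS bucket.
  backend "gcs" {
    bucket = "my-terraform-state-bucket"
    prefix = "poc-lb-failover/dev"
  }
}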

The following diagram presents the dependency graph for my Terraform modules for this deployment. The actual deployment of the modules in the right order and with the right parameters is orchestrated by the root Terraform module in the project repo’s deploy-dev directory.

The modules that depend on Terraform AWS provider and credentials are marked on the diagram as well.

The dependency graph for Terraform modules. All these are orchestrated from the root module.

The rest of this section provides some details on each child module.

The backend child module deploys a Cloud Run service using a “Hello” container for Google Cloud Run to act as a backend service for the load balancers. This choice of backend technology is not special: any other backend service type supported by ALBs would have worked.
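A minimal sketch of that shape, assuming Google’s public sample “hello” image and illustrative names (the real module may differ): a Cloud Run service locked down to load balancer traffic, plus a serverless NEG for the ALB backend services to point at.

resource "google_cloud_run_v2_service" "hello" {
  name     = "hello-service"
  location = var.region

  # Keep the service off the public internet; traffic arrives via the ALBs.
  ingress = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER"

  template {
    containers {
      image = "us-docker.pkg.dev/cloudrun/container/hello"
    }
  }
}

resource "google_compute_region_network_endpoint_group" "hello" {
  name                  = "hello-service-neg"
  region                = var.region
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    service = google_cloud_run_v2_service.hello.name
  }
}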

The ssl-certs child module interacts with the Certificate Manager service to acquire a Google-managed SSL certificate with DNS authorisation for use with the global external ALB. This choice of certificate configuration method was made due to my familiarity with its use in zero-downtime migrations from other vendors to Google Cloud; I find it convenient to be able to provision Google-managed certificates in advance. However, using this method requires making changes to the DNS configuration that are not compatible with using the DNS-01 challenge from Let’s Encrypt for this domain name for as long as Certificate Manager is in use.
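The Certificate Manager flow boils down to two resources, sketched here with placeholder names: a DNS authorisation for the domain, and a Google-managed certificate that references it. The authorisation exposes a CNAME record that has to be published in the DNS zone before issuance completes.

resource "google_certificate_manager_dns_authorization" "hello" {
  name   = "hello-service-dns-auth"
  domain = var.service_domain
}

resource "google_certificate_manager_certificate" "hello" {
  name = "hello-service-cert"

  managed {
    domains            = [var.service_domain]
    dns_authorizations = [google_certificate_manager_dns_authorization.hello.id]
  }
}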

The lb-https-global child module deploys a global external Application Load Balancer, one of the two load balancers in this design, which provides the publicly accessible HTTPS endpoint for the backend service. The load balancer is HTTPS-only by design and is configured to reject old versions of the SSL/TLS protocol when negotiating TLS with clients. The global IPv4 address for the load balancer is provisioned outside of this module and passed in as an input variable, because the IP address is shared with another load balancer which is part of the “auxiliary” infrastructure described in the next paragraph. The module also creates a primary DNS record in Route 53 for the load balancer’s IP, along with the health check in Route 53. This is not enough to enable the failover, though: the failover logic only begins to work once the regional ALB is deployed as well and a secondary DNS record is created for that ALB.
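The “reject old TLS versions” part is typically expressed as an SSL policy attached to the load balancer’s target HTTPS proxy via its ssl_policy argument; a sketch, with an illustrative name:

resource "google_compute_ssl_policy" "modern_tls" {
  name            = "hello-service-tls-policy"
  min_tls_version = "TLS_1_2"
  profile         = "MODERN"
}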

The lb-http-global module deploys an HTTP load balancer and a GCS bucket solely to enable a response to the HTTP-01 challenge from Let’s Encrypt, which is needed to acquire a self-managed public SSL certificate for use with the regional external ALB. The existence of this HTTP load balancer and the other associated resources is an unfortunate consequence of the limitations I described in a dedicated section. Since I have to deploy this HTTP load balancer to respond to Let’s Encrypt requests, I also take the opportunity to add a bit of configuration to it so that it redirects all incoming traffic other than HTTP-01 challenge requests to the HTTPS global ALB deployed by the module described in the previous paragraph. Thus, despite my earlier claim that the global ALB is HTTPS-only, I actually set up an HTTP-to-HTTPS redirect for our Hello service.
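The URL map that achieves this routes the ACME challenge path to the backend bucket sketched earlier and redirects everything else to HTTPS. A sketch under the same illustrative naming assumptions:

resource "google_compute_url_map" "http" {
  name = "hello-service-http"

  # Anything that doesn't match a host rule is redirected to HTTPS.
  default_url_redirect {
    https_redirect = true
    strip_query    = false
  }

  host_rule {
    hosts        = [var.service_domain]
    path_matcher = "acme"
  }

  path_matcher {
    name = "acme"

    # Default for this host: redirect to HTTPS as well.
    default_url_redirect {
      https_redirect = true
      strip_query    = false
    }

    # Serve Let's Encrypt HTTP-01 challenge files from the GCS bucket.
    path_rule {
      paths   = ["/.well-known/acme-challenge/*"]
      service = google_compute_backend_bucket.acme_challenge.id
    }
  }
}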

The ssl-certs-unmanaged module uses local provisioners to request the self-managed certificate files from Let’s Encrypt. Those files are used to create a self-managed SSL certificate resource in Compute Engine. As I detailed in the limitations section, this is the only type of SSL certificate resource supported by the regional external ALB. Please note that Let’s Encrypt certificates expire after 90 days, and the current configuration does not provide the functionality to renew the certificate. Such functionality is out of scope for my work on this project.
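The provisioner pattern itself can be as simple as a null_resource shelling out to an ACME client; the script name below is purely hypothetical and stands in for whatever drives the client and copies the HTTP-01 challenge files into the GCS bucket.

resource "null_resource" "lets_encrypt" {
  provisioner "local-exec" {
    # issue-cert.sh is a placeholder, not a script from the repository.
    command = "./scripts/issue-cert.sh ${var.service_domain}"
  }
}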

Finally, the lb-https-regional module deploys the regional external ALB. This load balancer is deployed with an HTTPS front-end only, utilising the self-managed Compute Engine SSL certificate resource created by the module described in the previous paragraph. This module also completes the DNS configuration for the project by creating a secondary DNS record in Route 53, mapping the service domain name to the regional external ALB’s IPv4 address. Since the primary record and the health check had already been set up by the other module, once this module completes deployment the health check can action a failover in the event of a global load balancer failure. No further configuration is required for the failover.
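For completeness, here is a sketch of that SECONDARY record, matching the PRIMARY record and health check sketched earlier (again, names and variables are illustrative). It carries no health check of its own: Route 53 serves it whenever the primary record’s health check is failing.

resource "aws_route53_record" "secondary" {
  zone_id        = var.route53_zone_id
  name           = var.service_domain
  type           = "A"
  ttl            = 60
  records        = [var.regional_alb_ip]
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }
}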

Testing the happy path

Once all infrastructure-as-code modules are deployed without errors, the system should be in what I would call the default operating mode:

  • the Global External Application Load Balancer is healthy.
  • the DNS health check does not report any issues.
  • the DNS service responds with the Global load balancer’s IP address.
  • the Global load balancer is serving client traffic.

This is our happy path, so let’s verify these claims.

First, the health check monitoring data…

Route 53 health check monitoring data showing that the Global External Application Load Balancer is healthy.

Next, the DNS service response…

The DNS query to `hello-service.dev.devilmicelabs.com` resolving to the IP address of the Global External Application Load Balancer, as expected.

And an example request showing that the response did indeed come from the Global load balancer, because the client was presented with a Google-managed SSL certificate…

The response served by the Global External Application Load Balancer as evidenced by the SSL certificate presented by the server.

This concludes the testing of the happy path. The Global External Application Load Balancer is serving client requests. Time to test what happens if it goes down… 😈

Global External Application Load Balancer goes down in flames

Just because I happen to like Google Cloud, I’ve taken the difficult decision not to take their whole global load balancer infrastructure down for the purposes of the upcoming failover demonstration. As disappointing as it may be for those of you who have read this far hoping to see a core element of Google infrastructure go down in flames, I hope your lust for destruction will be satisfied with the following trick.

One of the advanced traffic management capabilities available for global external Application Load Balancers is fault injection (docs). It works by updating the load balancer’s URL map with a fault injection policy for the default route action, effectively returning an HTTP 500 server error response to any request directed at our Hello service.

The fault injection is controlled by setting the Terraform input variable simulate_failure in the lb-https-global module. The default value of this variable is false, so no fault injection was configured when I first ran the module to deploy the load balancer.
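Conceptually, the variable toggles a fault injection policy on the URL map’s default route action. Here is a sketch of the idea, not the repository’s exact code, with a placeholder backend service reference:

variable "simulate_failure" {
  type    = bool
  default = false
}

resource "google_compute_url_map" "https" {
  name = "hello-service-https"

  default_route_action {
    weighted_backend_services {
      backend_service = var.backend_service_id
      weight          = 100
    }

    # Only rendered when simulate_failure is true: abort every request
    # with an HTTP 500 response.
    dynamic "fault_injection_policy" {
      for_each = var.simulate_failure ? [1] : []
      content {
        abort {
          http_status = 500
          percentage  = 100
        }
      }
    }
  }
}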

For the purposes of the upcoming demonstration, I set the value of the variable to true and run a Terraform apply operation:

terraform apply -var "simulate_failure=true"
Output from Terraform CLI showing proposed changes to the load balancer’s URL map to enable fault injection

Once the Terraform apply operation completes successfully, it will take a few moments for the updated URL map to start returning HTTP 500 errors to all requests sent to the Global External Application Load Balancer. You can confirm this with your favourite tool for making HTTP requests, be that a web browser, curl, Postman, or whatever. I’m just going to skip that and proceed directly to the health check monitoring data instead.

Route 53 health check monitoring data confirming that there is an issue with the Global External Application Load Balancer.

It is pretty clear that, as far as the health check is concerned, our fault injection appears to have brought the global load balancer down!

The rise of the Regional External Application Load Balancer

After the configured number of consecutive failures, Route 53 fails the DNS record for the public web service domain over from primary to secondary.

Now when I query the DNS I see that the service domain name points to the Regional External Application Load Balancer.

The DNS query to `hello-service.dev.devilmicelabs.com` resolving to the IP address of the Regional External Application Load Balancer.

Note that since the DNS service returns the IP address of the regional load balancer, the failover must have already happened. However, depending on the DNS time-to-live and DNS client settings, we might have to wait a bit longer for the DNS changes to propagate to the client. Once they do, we can see an example of a response served by the Regional External Application Load Balancer.

The response served by the Regional External Application Load Balancer as evidenced by the SSL certificate presented by the server.

Given that only the regional load balancer had the self-managed certificate attached to it, it’s pretty clear that the requests are indeed being served by the failover option.

The return of Global External Application Load Balancer

Now that we are satisfied that the failover to the Regional External Application Load Balancer has happened, it’s time to test whether the DNS records are swapped back when the Global load balancer returns from rehab.

To restore the functionality, I revert the configuration change to the global load balancer’s URL map. After a preconfigured time, the Route 53 health check will pick up that the endpoint is healthy again and switch the DNS records back from secondary to primary.

This time, there is going to be no loss of service from a client’s perspective, because the regional load balancer continues to serve traffic until the DNS changes propagate and direct future requests to the global load balancer.

To restore the URL map, I redeploy my module with the simulate_failure input variable set to false:

terraform apply -var "simulate_failure=false"
Output from Terraform CLI showing proposed changes to the load balancer’s URL map to turn off fault injection.

I then continue to monitor the health check data until it begins to return success:

Route 53 health check monitoring data confirming that Global External Application Load Balancer is back up.

And then I validate that the DNS records have been restored to point to the Global External Application Load Balancer:

The DNS query to `hello-service.dev.devilmicelabs.com` resolving to the IP address of the Global External Application Load Balancer.

Now, once the DNS records propagate fully, new clients will begin to be served by the Global External Application Load Balancer once again.

After a fair timeout, we can check this using the same method as before, by validating the SSL certificate issuer:

The response served by the Global External Application Load Balancer as evidenced by the SSL certificate presented by the server.

Now that we have verified that Global External Application Load Balancer has “recovered” and is serving client traffic, our testing is complete!

Concluding thoughts

When I set out to write this article I did not know much about the limitations of the new Regional External Application Load Balancer and, as a consequence, severely underestimated both the effort required to finish this work and the number of Google Cloud resources that I would have to create and manage.

Nevertheless, it was a very illuminating experience, both on a tactical and on a strategic level.

On the tactical level, I learned a lot about this new Application Load Balancer mode.

On the strategic level, I confirmed (yet again) that unless something is built, it is not well understood — the Google Cloud blog article that sparked my curiosity in this project made it look like setting up the failover would be a simple thing. As it turned out, the devil was in the details.

But so was the salvation.

Please feel free to reuse the infrastructure-as-code from this project as you see fit and if you have any questions, suggestions, or any other feedback — I’m always happy to chat about the Clouds!

Oliver F.
Devil Mice Labs

Let’s connect!

Until next time 😊


Oliver Frolovs
Devil Mice Labs

Founder of Devil Mice Labs, a small and nimble engineering-led technical consultancy: https://devilmicelabs.com/