Respawn VMs like an RPG with Autohealing and Autoupdates

Season of Scale

Stephanie Wong
Google Cloud - Community
6 min read · Aug 26, 2020


Season of Scale

“Season of Scale” is a blog and video series to help enterprises and developers build scale and resilience into their design patterns. In this series we walk you through patterns and practices for creating apps that are resilient and scalable, two essential goals of many modern architecture exercises.

In Season 1, we’re covering Infrastructure Automation and High Availability:

  1. Patterns for scalable and resilient applications
  2. Infrastructure as code
  3. Immutable infrastructure
  4. Where to scale your workloads
  5. Globally autoscaling web services
  6. High availability (autohealing & autoupdates) (this article)

In this article I’ll walk you through how to use autohealing and autoupdates to create health checks and maintain HA for GCP Compute Engine instances.

Check out the video

Review

So far we have looked at how Critter Junction launched and globally scaled their gaming app on Compute Engine. With daily active users growing, we helped them set up autoscaling and global load balancing to handle globally distributed, constantly rising traffic. Today, let’s learn how they can make this social critter app more resilient by gracefully replacing failed instances.

A gaming nightmare

To keep users from losing their daily game streaks, Critter Junction needs to make sure their app is available all the time, without interruptions.

One way to do that is to set up High Availability or HA at all layers of the stack. Though that can mean distributed databases, networks, and application servers, we’re focusing on their game servers running on Compute Engine.

We know that managed instance groups provide features such as autoscaling, regional (multi-zone) deployments, autohealing, and autoupdates. The last two can be added to an existing Compute Engine configuration.

Autohealing proactively identifies unhealthy instances (those that fail health checks) and replaces them with healthy ones.

Autoupdates let you update instances without disrupting the service.

Autohealing

Let’s focus on Autohealing for a bit.

The first step is to create a health check, which detects not only whether the machine is running but also application-specific issues such as freezing, crashing, or overloading. If an instance is deemed unhealthy, the managed instance group deletes it and creates a new one in its place.

We’re building on the instance configuration we created in the previous article.

  1. First, create a health check in Compute Engine and give it a name.
  2. Set the protocol to HTTP.
  3. You can set the health check on any path, but let’s say the path is /health.

In our demo app, we added code that makes /health return an HTTP 200 OK response when the app is healthy and an HTTP 500 Internal Server Error when it is unhealthy.
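The demo app’s code isn’t shown in this article, but a /health endpoint like the one described can be sketched with Python’s standard library alone. The handler below is illustrative, not the actual demo app; a real service would derive its health from genuine signals (database connectivity, queue depth, and so on) instead of a flag:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Minimal /health endpoint for an autohealing health check."""

    # Flip to False to simulate an unhealthy instance; a real app
    # would compute this from actual service health signals.
    healthy = True

    def do_GET(self):
        if self.path == "/health":
            # 200 OK when healthy, 500 when not, as the health check expects.
            status, body = (200, b"OK") if self.healthy else (500, b"ERROR")
        else:
            status, body = (404, b"Not Found")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep demo output quiet


if __name__ == "__main__":
    # Serve on the port your health check probes (80 here).
    HTTPServer(("", 80), HealthHandler).serve_forever()
```

Flipping `HealthHandler.healthy` to `False` is enough to make the autohealer see consecutive probe failures and recreate the instance.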

Set up the health criterion

  1. Set Check interval to 10, which means the service will be probed for health every 10 seconds.
  2. Set Timeout to 5, which means we wait at most 5 seconds for a response to a probe.
  3. Set Healthy threshold to 2, the number of sequential probes that must succeed for the instance to be considered healthy.
  4. Set Unhealthy threshold to 3, the number of sequential probes that must fail for the instance to be considered unhealthy.
  5. Then click Create.

As a best practice, you want the health check to be conservative so you don’t preemptively delete and recreate instances.
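If you prefer the command line, the same health check can be created with gcloud. The flags mirror the console settings above; the check name and port are placeholders:

```shell
# Create an HTTP health check probing /health every 10s, with a 5s
# timeout; healthy after 2 consecutive successes, unhealthy after
# 3 consecutive failures.
gcloud compute health-checks create http critter-health-check \
    --request-path=/health \
    --port=80 \
    --check-interval=10s \
    --timeout=5s \
    --healthy-threshold=2 \
    --unhealthy-threshold=3
```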

Add a health check to an existing instance

Now, let’s go to the instance group we created in the previous episodes and add the health check to it.

  1. Select the health check with an initial delay of 90 seconds.

Ideally, this initial delay should be long enough for the instance to boot fully and be ready to respond as healthy.
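The same configuration can be applied with gcloud; the instance group name, health check name, and zone below are placeholders:

```shell
# Attach the health check to the managed instance group, giving new
# instances 90 seconds to boot before probe failures count against them.
gcloud compute instance-groups managed update critter-mig \
    --health-check=critter-health-check \
    --initial-delay=90s \
    --zone=us-central1-a
```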

Simulate failures

Let’s have some fun with this and simulate failures now.

  1. For that, go to the VM instance, click its external IP, and use the demo app to make the instance report itself as unhealthy.
  2. Wait for the autohealer to take action: the green checkmark next to the instance turns into a spinner, indicating that the autohealer has started recreating that instance.

What about when you update an instance?

One of the other concerns when it comes to HA is applying updates to instances without impacting the service. Managed instance groups allow you to control the speed and scope of an update rollout to minimize disruptions to your application. You can also perform partial rollouts, which allows for canary testing.

Let’s see that in action now!

  1. On our instance group, click the Rolling update button (“rolling” means the update is applied gradually).
  2. Add a second template for canary testing and set its target size to 20%.

This means 20% of the instances in the group will run the new template for canary testing.

  3. The update mode defaults to Proactive, which means Compute Engine actively schedules actions to apply the requested update to instances as necessary. In many cases this means deleting and recreating instances.

  • You can choose an Opportunistic update instead if a proactive update would be too disruptive. An opportunistic update is applied only when you manually initiate it on selected instances or when the managed instance group creates new instances.
  • Max surge sets how many extra instances you are willing to spin up as part of the update. A higher value speeds up the rollout but temporarily costs more, so you face a tradeoff between cost and speed.
  • Max unavailable and Min wait time: keep these at zero for now; they control how disruptive the update is to your service and the rate at which it is deployed.
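The console steps above map to a single gcloud invocation; the group name, template names, and zone are placeholders:

```shell
# Start a proactive rolling update: keep the current template as the
# primary version and canary the new template on 20% of the group,
# allowing up to 3 surge instances and no unavailable instances.
gcloud compute instance-groups managed rolling-action start-update critter-mig \
    --version=template=critter-template-v1 \
    --canary-version=template=critter-template-v2,target-size=20% \
    --type=proactive \
    --max-surge=3 \
    --max-unavailable=0 \
    --zone=us-central1-a
```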

And that’s it!

With these two high-availability features set up within managed instance groups, Critter Junction has a much more resilient architecture. Autohealing proactively identifies unhealthy instances and replaces them, while autoupdates roll out new versions without disrupting the service. Stay tuned to find out what more is in store for Critter Junction.

And remember, always be architecting.



Google Cloud Developer Advocate and producer of awesome online content. Creator of the series, GCP Networking End-to-End; host of Google’s Next onAir. @swongful