VME2E s01e06: Live Migrations

How to perform maintenance on running instances?

Imagine you were running your own infrastructure. You’ve got a server farm and some VM instances running in production. These instances need to be up to serve customers and business needs… but what happens when a server needs to be repaired? Or has to be updated?

Do you take down all of the instances running on it? Or maybe you create duplicate instances and configure a load balancer to send traffic to the new instances?

This isn’t an easy problem. For instance, you’d have to build your platform to handle multiple copies of an instance running at the same time, and any downtime could take critical features away from the customers who depend on them.

Luckily, running on Google Compute Engine helps mitigate these potential risks.

Live Migration to the rescue.

When you let Google handle your VMs, you get a feature called Live Migration.

Live Migration keeps VM instances running even when the host machines they run on need to be updated. No reboot needed. It does this by taking a running instance and moving it onto another machine in the same zone, so that your service keeps running while Google updates the hardware or software your services run on.

Because Live Migration simply moves your running VM onto another machine, it doesn’t change any of the VM’s attributes or properties. Its metadata, its internal and external IP addresses, its network settings, and more all remain unchanged.

What this means is that Compute Engine Live Migration happens without your VM instances or other services even knowing what’s going on behind the scenes.

Let’s take a look at how this works.

How does the live migration process work?

There are many components involved in making this work seamlessly, but the high-level steps are illustrated here:

[Figure: high-level steps of the live migration process]

First, something happens — like a signal from a hardware failure — that triggers a notification that a VM has to be evicted from its current host machine.

Google’s cluster management software constantly watches for these events and schedules them based on policies that control how many VMs a single customer can migrate at once, capacity utilization rates, and more.

Once a VM is selected for migration, Google notifies the VM that it’s going to be migrated. After a waiting period, a target host is selected and a new, empty “target” VM is spun up on it to receive the migrating VM. The source and target VMs use authentication to establish a secure connection between them.

During this migration, the VM goes through three stages: pre-migration brownout, blackout, and post-migration brownout.

  • During pre-migration brownout, the VM is still executing on the source, while most state is sent from the source to the target. For example, Google copies all the guest memory to the target, while tracking the pages that have been changed on the source. The time spent in pre-migration brownout is a function of the size of the guest memory and the rate at which pages are being changed.
  • During blackout, a very brief moment when the VM is not running anywhere, the VM is paused and all the remaining state required to begin running the VM on the target is sent. The VM enters the blackout stage when sending state during pre-migration brownout reaches a point of diminishing returns. An algorithm balances the number of bytes of memory being sent against the rate at which the guest VM is making changes.
  • During post-migration brownout, the VM executes on the target VM. The source VM is present and might provide supporting functionality for the target VM. For example, until the network fabric has caught up with the new location of the target VM, the source VM provides forwarding services for packets to and from the target VM.
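To make the interplay between memory size, send rate, and dirty rate concrete, here is a toy shell model of the decision to move from pre-migration brownout to blackout. It is only a sketch: the function name simulate_precopy, the "page" and "tick" units, and the percentage threshold are all assumptions made for illustration, not Google’s actual algorithm.

```shell
# Toy model of iterative pre-copy. NOT Google's actual algorithm:
# the units ("pages", "ticks") and the stopping rule are illustrative assumptions.
#
# simulate_precopy TOTAL_PAGES SEND_RATE DIRTY_RATE MIN_GAIN_PERCENT
#   TOTAL_PAGES       guest memory size, in pages
#   SEND_RATE         pages the source can send per tick (must be > 0)
#   DIRTY_RATE        pages the guest modifies per tick
#   MIN_GAIN_PERCENT  enter blackout once a round shrinks the dirty set by less than this
simulate_precopy() (
  # The ( ) body runs in a subshell, so these variables stay local.
  total=$1; send_rate=$2; dirty_rate=$3; min_gain=$4
  remaining=$total
  round=0
  while [ "$remaining" -gt 0 ]; do
    round=$((round + 1))
    # Ticks needed to send every page that is currently dirty (rounded up).
    ticks=$(( (remaining + send_rate - 1) / send_rate ))
    # Pages the guest dirtied while this round was being sent, capped at total memory.
    redirtied=$(( ticks * dirty_rate ))
    if [ "$redirtied" -gt "$total" ]; then redirtied=$total; fi
    echo "brownout round $round: sent $remaining pages, $redirtied dirtied meanwhile"
    # How much smaller the dirty set got this round, as a percentage.
    gain=$(( (remaining - redirtied) * 100 / remaining ))
    remaining=$redirtied
    # Diminishing returns: stop copying and pause the VM.
    if [ "$gain" -lt "$min_gain" ]; then break; fi
  done
  echo "blackout: pause VM and send final $remaining pages"
)
```

With the made-up inputs simulate_precopy 1000 100 20 50, the model runs four brownout rounds and then enters blackout with 20 pages left, because the fourth round no longer shrinks the set of dirty pages. A larger guest or a higher dirty rate lengthens the brownout, which matches the description above.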

Finally, the migration is complete and the system deletes the source VM. You can see that the migration took place in your VM logs.
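One way to look for completed migrations from the command line is to filter the project’s system operations by operation type. The wrapper name list_live_migrations below is just an illustrative choice, and the sketch assumes the gcloud tool is installed and authenticated against your project.

```shell
# List live-migration system operations recorded for the current project.
# (list_live_migrations is a hypothetical helper name for this article.)
list_live_migrations() {
  gcloud compute operations list \
    --filter="operationType=compute.instances.migrateOnHostMaintenance"
}
```

Calling list_live_migrations prints one row per recorded migration operation, if there are any.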

Live migration is a critical component of production infrastructure, so Google continuously tests live migration with a very high level of scrutiny. During testing, they use fault-injection to trigger failures at all of the interesting points in the migration algorithm. These generate both active and passive failures for each component. Achieving this complex and multifaceted process requires deep integration throughout the infrastructure and a powerful set of scheduling, orchestration, and automation processes.

Choosing availability policies

During live migration, your instance might experience a decrease in performance for a short period of time. You also have the option to configure your virtual machine instances to terminate and reboot away from the maintenance event.

This option is suitable for instances that demand constant, maximum performance, and when your overall application is built to handle instance failures or reboots.

A VM instance’s availability policy determines how it behaves when there is a maintenance event where Google must move your VM instance to another host machine. You can configure your VM instances to continue running while Compute Engine live migrates them to another host or you can choose to terminate your instances instead. You can update an instance’s availability policy at any time to control how you want your VM instances to behave.

You can change an instance’s availability policy by configuring the following two settings:

  • The VM instance’s maintenance behavior, which determines whether the instance is live migrated or terminated when there is a maintenance event.
  • The instance’s restart behavior, which determines whether the instance automatically restarts if it crashes or gets terminated.

The default maintenance behavior for instances is to live migrate, but you can change the behavior to terminate your instance during maintenance events instead.

To configure your virtual machines to live migrate, or to reboot instead of migrating, see Setting instance scheduling options.
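As a rough sketch of what those two settings look like with the gcloud tool, the helpers below wrap the set-scheduling command. The function names and the on/off argument are assumptions made for this example; the instance and zone values are whatever applies to your project.

```shell
# set_maintenance_policy INSTANCE ZONE POLICY
#   POLICY is MIGRATE (live migrate) or TERMINATE (stop during maintenance).
set_maintenance_policy() {
  case "$3" in
    MIGRATE|TERMINATE) ;;
    *) echo "policy must be MIGRATE or TERMINATE" >&2; return 1 ;;
  esac
  gcloud compute instances set-scheduling "$1" \
    --zone "$2" \
    --maintenance-policy "$3"
}

# set_automatic_restart INSTANCE ZONE on|off
#   Controls whether the instance restarts after a crash or termination.
set_automatic_restart() {
  case "$3" in
    on)  restart_flag=--restart-on-failure ;;
    off) restart_flag=--no-restart-on-failure ;;
    *) echo "third argument must be on or off" >&2; return 1 ;;
  esac
  gcloud compute instances set-scheduling "$1" --zone "$2" "$restart_flag"
}

# show_availability_policy INSTANCE ZONE
#   Prints the current maintenance behavior and automatic-restart setting.
show_availability_policy() {
  gcloud compute instances describe "$1" --zone "$2" \
    --format="value(scheduling.onHostMaintenance,scheduling.automaticRestart)"
}
```

For example, set_maintenance_policy my-vm us-central1-a TERMINATE followed by set_automatic_restart my-vm us-central1-a on would configure a hypothetical instance named my-vm to terminate and then restart around maintenance events.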

For more information about live migration, see the Live migration documentation.

Doesn’t work with GPUs or Preemptible instances

Compute Engine can also live migrate instances with local SSDs attached, moving the VMs along with their local SSD to a new machine in advance of any planned maintenance.

Instances with GPUs attached cannot be live migrated. They must be set to terminate and optionally restart. Compute Engine offers a 60-minute notice before a VM instance with a GPU attached is terminated. To learn more about these maintenance event notices, read Getting live migration notices.
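From inside a VM, one way to check for a pending maintenance event is to query the metadata server’s maintenance-event key. The sketch below assumes it runs on a Compute Engine instance with curl available; the helper names and the human-readable messages are invented for illustration.

```shell
# Metadata key that reports pending host maintenance for this VM.
METADATA_URL="http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event"

# Print the current maintenance-event value (only works inside a Compute Engine VM).
get_maintenance_event() {
  curl -s -H "Metadata-Flavor: Google" "$METADATA_URL"
}

# Turn the metadata value into a short message (message wording is made up here).
describe_maintenance_event() {
  case "$1" in
    NONE) echo "no maintenance event pending" ;;
    MIGRATE_ON_HOST_MAINTENANCE) echo "live migration is imminent" ;;
    TERMINATE_ON_HOST_MAINTENANCE) echo "instance will be terminated for maintenance" ;;
    *) echo "unrecognized value: $1" ;;
  esac
}
```

A shutdown-preparation script could call describe_maintenance_event "$(get_maintenance_event)" periodically and start checkpointing GPU work when the value changes from NONE.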

To learn more about handling host maintenance with GPUs, read Handling host maintenance in the GPUs documentation.

You can’t configure a preemptible instance to live migrate. The maintenance behavior for preemptible instances is always set to TERMINATE, and you can’t change this option. It is not possible to set the automatic restart option for preemptible instances, but you can manually restart them from the VM instance details page after they are preempted:

  1. Go to the VM instances page.
  2. Select your preemptible instance.
  3. At the top of the VM Instance details page, click Start.

If you need to change your instance to no longer be preemptible, detach the boot disk from your preemptible instance and attach it to a new instance that is not configured to be preemptible. You can also create a snapshot of the boot disk and use it to create a new instance without preemptibility.
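The snapshot route could look roughly like the following with the gcloud tool. This is a sketch under assumptions: the helper name and all five argument values are placeholders, and a real instance would usually also need a machine type, network settings, and so on.

```shell
# recreate_as_standard_vm SOURCE_DISK SNAPSHOT NEW_DISK NEW_INSTANCE ZONE
#   (hypothetical helper; every name passed in is a placeholder)
recreate_as_standard_vm() {
  # 1. Snapshot the preemptible instance's boot disk.
  gcloud compute disks snapshot "$1" --zone "$5" --snapshot-names "$2" || return 1
  # 2. Create a new disk from that snapshot.
  gcloud compute disks create "$3" --zone "$5" --source-snapshot "$2" || return 1
  # 3. Create the new instance from that disk. Omitting --preemptible
  #    makes it a standard (non-preemptible) instance.
  gcloud compute instances create "$4" --zone "$5" --disk "name=$3,boot=yes"
}
```

Because the new instance is created without the preemptible option, it picks up the default availability policy and can be live migrated.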

Want to know how it all works? Test your availability policies.

After you set your availability policies, you can simulate maintenance events to test the effects of these availability policies on your applications. For example, you might simulate a maintenance event on your instances in one of the following situations:

  • You have instances that are configured to live migrate during maintenance events and you need to test the effects of live migration on your applications.
  • You have batch jobs running on preemptible VM instances and you need to test how your applications handle preemption and shutdown of one or more instances.
  • Your instances are configured to terminate and restart during maintenance events rather than live migrate, and you need to test how your applications handle this shutdown and restart process.

Simulated maintenance events are subject to specific API Rate Limits.

You can simulate a maintenance event on an instance using either the gcloud command-line tool or an API request.

Run the instances simulate-maintenance-event command to force an instance to activate its configured maintenance policy action:

gcloud compute instances simulate-maintenance-event [INSTANCE_NAME] --zone [ZONE]

where:

  • [INSTANCE_NAME] is the name of the instance where you want to simulate the maintenance event. You can specify multiple instance names to simulate maintenance events on more than one instance in the same zone.
  • [ZONE] is the zone where the instance is located.
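For the API route, a minimal sketch using curl against the instances simulateMaintenanceEvent method might look like this. The helper name is made up, the project, zone, and instance arguments are placeholders, and the sketch assumes gcloud is available to mint an access token.

```shell
# simulate_maintenance_event_api PROJECT ZONE INSTANCE
#   (hypothetical helper; sends an empty POST body to the Compute Engine API)
simulate_maintenance_event_api() {
  access_token=$(gcloud auth print-access-token) || return 1
  curl -s -X POST \
    -H "Authorization: Bearer $access_token" \
    -d "" \
    "https://compute.googleapis.com/compute/v1/projects/$1/zones/$2/instances/$3/simulateMaintenanceEvent"
}
```

The response is a JSON description of the operation that was started. Remember that these calls count against the API rate limits mentioned above.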

Conclusions

If you run VMs on Google Compute Engine, you get advanced features like Live Migration by default. That means you’ll get Google’s automatic maintenance upgrades without your VMs having to be aware of them at all.

What used to be a headache for ops teams is now just a feature of running on GCE.

Stay tuned for the next article

This is a part of a series around understanding Compute Engine better.

Next article in the series is: TBD
The TOC is here.
The previous article, on disk images, is here.
