Recommendations for IP address failover for high availability on Google Compute Engine
TL;DR: Due to the (rare) possibility of Alias IP addresses being “stuck” on VMs, using Routes for High Availability is the preferred approach.
When migrating on-premises high availability applications to Google Cloud Platform, you often have to move fixed IP addresses between VMs to emulate the behaviour of the floating or virtual IP addresses in your on-premises architecture. Such an IP address moves between a set of VMs, usually two, when the service on the primary VM or the VM itself fails.
Since on premises solutions based on gratuitous ARP do not work on Google Compute Engine, the article Best Practices for Floating IP addresses provides several solutions that can be used in such a scenario.
However, that article does not cover Alias IP addresses, which can be moved between VMs in the same subnet.
So for cases where the application wants to initiate failover itself, should you use Option 4 (Routes) as detailed in the best practices paper, or should you use Alias IP?
Let’s look at the differences:
With Routes, when the heartbeat agent wants to initiate failover, it removes the existing route and adds a new route for the same IP address that points to the new VM instance.
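The two-step route swap described above could be scripted roughly as follows. This is a minimal sketch, not a reference implementation: the route, network, instance and zone names and the address 10.190.1.1 are placeholders, and the helper only builds the `gcloud compute routes` command lines rather than executing them.

```python
import shlex


def route_failover_commands(route_name, network, floating_ip,
                            new_instance, new_zone, priority=1000):
    """Build the two gcloud calls that repoint a floating IP route.

    All argument values are hypothetical placeholders supplied by the caller.
    """
    # Step 1: delete the route that still points at the failed VM.
    delete_cmd = ["gcloud", "compute", "routes", "delete", route_name, "--quiet"]
    # Step 2: recreate the route for the same /32, now pointing at the new VM.
    create_cmd = [
        "gcloud", "compute", "routes", "create", route_name,
        f"--network={network}",
        f"--destination-range={floating_ip}/32",
        f"--next-hop-instance={new_instance}",
        f"--next-hop-instance-zone={new_zone}",
        f"--priority={priority}",
    ]
    return [delete_cmd, create_cmd]


# Print the commands a heartbeat agent would run on failover.
for cmd in route_failover_commands("floating-ip-route", "my-vpc", "10.190.1.1",
                                   "vm-secondary", "europe-west1-c"):
    print(shlex.join(cmd))
```

To actually run the commands, each list could be passed to `subprocess.run(cmd, check=True)` on a machine with an authenticated gcloud installation.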
The main drawback with this approach is that the virtual/floating IP addresses need to be outside of the IP address space used by the VPC and those addresses are not managed by GCP. This also means that you need to use custom route advertisements to use those routes over VPN or Dedicated/Partner Interconnect. The failover needs two API calls or gcloud commands to move the IP address.
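As an illustration of the custom route advertisement mentioned above, the sketch below builds a `gcloud compute routers update` call that advertises the floating address from a Cloud Router. The router name, region and address are assumptions. Note that the `--set-advertisement-*` flags replace any custom advertisements already configured on the router, so treat this as a starting point only.

```python
def custom_advertisement_command(router, region, floating_ip):
    """Build a gcloud call that advertises the floating IP via Cloud Router.

    router, region and floating_ip are hypothetical placeholders.
    """
    return [
        "gcloud", "compute", "routers", "update", router,
        f"--region={region}",
        # Switch from default to custom advertisements.
        "--advertisement-mode=custom",
        # Keep advertising the VPC subnets alongside the floating IP.
        "--set-advertisement-groups=all_subnets",
        # Additionally advertise the address that lives outside the VPC ranges.
        f"--set-advertisement-ranges={floating_ip}/32",
    ]


print(" ".join(custom_advertisement_command("my-router", "europe-west1", "10.190.1.1")))
```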
Now, what is different with Alias IP addresses?
The failover mechanism is quite similar: you need two API calls or gcloud commands to move an IP address between the primary and secondary VM. Alias IP seems to address the main drawback of the Routes-based option, as the address is native to GCP's VPC address ranges and can be reached from anywhere within the VPC and, automatically, over VPN and Dedicated/Partner Interconnect as well. So it seems like a superior option in every regard, no?
Unfortunately not. The fact that the Alias IP address is natively attached to a VM, and is therefore part of the VM's metadata, is also its biggest drawback for high availability configurations. There are certain rare failure modes outside the user's control in which a VM becomes completely inaccessible and the VM, including its metadata, cannot be immediately deleted or modified. In those cases users might temporarily be unable to remove the Alias IP address from the failed VM, and therefore also unable to add it to another VM, as every Alias IP address can only exist once in the VPC.
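For comparison, the Alias IP move could be sketched as below. The instance names, zones and alias range are placeholders, and the sketch assumes the alias is the only alias range on the interface, since passing an empty `--aliases` value clears all alias ranges on that interface. The first command is the one that can fail when the old VM is stuck, which in turn blocks the second.

```python
def alias_ip_failover_commands(alias_range, old_instance, old_zone,
                               new_instance, new_zone, nic="nic0"):
    """Build the two gcloud calls that move an alias IP range between VMs.

    All argument values are hypothetical placeholders supplied by the caller.
    """
    base = ["gcloud", "compute", "instances", "network-interfaces", "update"]
    # Step 1: clear the alias ranges on the old VM. This modifies the VM
    # itself, so it depends on the VM still being modifiable.
    remove_cmd = base + [old_instance, f"--zone={old_zone}",
                         f"--network-interface={nic}", "--aliases="]
    # Step 2: attach the alias range to the new VM. This is rejected while
    # the range is still assigned to the old VM.
    add_cmd = base + [new_instance, f"--zone={new_zone}",
                      f"--network-interface={nic}", f"--aliases={alias_range}"]
    return [remove_cmd, add_cmd]


for cmd in alias_ip_failover_commands("10.0.0.50/32", "vm-primary", "europe-west1-b",
                                      "vm-secondary", "europe-west1-c"):
    print(" ".join(cmd))
```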
With Routes, however, the Route is a global resource that is not attached to a VM but points to a VM instance. Removing and creating a Route therefore does not depend on the availability of the VM or its metadata. In addition, if the keepalive/heartbeat process produces unexpected results for whatever reason, it can be overridden manually by creating a route with a higher priority (which means a lower priority value).
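The manual override works because, among routes to the same destination, the one with the lowest priority value wins. The toy model below illustrates only that rule. It is a simplification that ignores longest-prefix matching and equal-priority ties, and the `Route` class and route names are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    """Hypothetical, simplified stand-in for a VPC route."""
    name: str
    dest_range: str
    next_hop_instance: str
    priority: int = 1000  # gcloud's default priority for new routes


def effective_route(routes: List[Route], dest_range: str) -> Optional[Route]:
    """Return the route that wins for dest_range: lowest priority value first."""
    candidates = [r for r in routes if r.dest_range == dest_range]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.priority)


# The heartbeat agent left the route pointing at vm-primary ...
routes = [Route("floating-ip-route", "10.190.1.1/32", "vm-primary")]
# ... so an operator adds an override with a lower priority value.
routes.append(Route("manual-override", "10.190.1.1/32", "vm-secondary", priority=500))

winner = effective_route(routes, "10.190.1.1/32")
print(winner.name, winner.next_hop_instance)
```

Once the heartbeat process behaves again, deleting the override route hands control back to the automated route.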
Due to this difference, it is recommended to use Routes instead of Alias IP, even if it means managing the IP address space manually.
Of course, to reach the highest availability, make sure you distribute the VM instances across different zones, and check whether one of the other options from the best practices document might fit your use case better.