Maintain your application’s availability during a cloud update

Qing Hao
IBM Cloud
9 min read · Apr 25, 2019

Application availability in a cloud environment

When you’re working in a cloud environment, it’s inevitable that you will need to upgrade or update the version that you’re currently working in to the most current version available. But how do you maintain the availability of your containerized applications during an upgrade?

In this blog post, I’ll discuss some tips for doing this, using an upgrade of my IBM Cloud Private instance as an example. In an ideal situation, the application would reach zero downtime, which is the goal.

How a typical containerized application works

Let’s look at what a general containerized application looks like, using a sample online banking (OLB) application as our scenario. The following diagram illustrates the inner workings of a typical user app:

  1. It has a front-end application, which is running with replicas of pods and a back-end database server that is running with other replicas of pods. Front-end pods and back-end pods communicate with each other, and the front-end application exposes itself as an externally reachable URL to users.

  2. For pod-to-pod internal access, a service is usually used inside a cluster. A set of pods exposes its functionality as a service, and other pods can then access it by calling the service name. When a pod calls a service name, a domain name service (KubeDNS in this particular scenario) resolves the service name to the clusterIP, and traffic to the clusterIP is load balanced across the service’s back-end pods. In the OLB example, the back-end application exposes itself as a back-end service, and a front-end pod reaches a back-end pod by calling that service name. The front-end application exposes a front-end service name in the same way. (To be clear, the diagram does not show the workflow of the front-end service.)

  3. For external access, an ingress can be configured to expose the service as an externally reachable URL. In the OLB example, the front-end application also exposes itself as an ingress, and external users access the OLB through the ingress URL. (A minimal sketch of these service and ingress definitions follows this list.)
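To make this concrete, here is a minimal sketch of what the back-end service, front-end service, and front-end ingress could look like for an app like OLB. All of the names, labels, ports, and the host (olb-backend, olb-frontend, olb.example.com) are assumptions for illustration only, not taken from the real OLB application, and the ingress API version depends on your Kubernetes release:

apiVersion: v1
kind: Service
metadata:
  name: olb-backend                # assumed name; back-end pods are reached as http://olb-backend
spec:
  selector:
    app: olb-backend               # assumed label on the back-end pods
  ports:
    - port: 80
      targetPort: 8080             # assumed container port of the back-end pods
---
apiVersion: v1
kind: Service
metadata:
  name: olb-frontend               # assumed name for the front-end service
spec:
  selector:
    app: olb-frontend              # assumed label on the front-end pods
  ports:
    - port: 80
      targetPort: 8080             # assumed container port of the front-end pods
---
apiVersion: extensions/v1beta1     # ingress API group in Kubernetes versions of that era
kind: Ingress
metadata:
  name: olb-frontend
spec:
  rules:
    - host: olb.example.com        # assumed external host name that users access
      http:
        paths:
          - path: /
            backend:
              serviceName: olb-frontend
              servicePort: 80

With these definitions in place, a front-end pod calls the back end simply by its service name, and users reach the front end through the ingress host.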

What happens during an update?

So what can you expect from your application’s availability during your cloud version upgrade?

Generally, you can expect the following three stages for your containerized application during an update:

  1. Preparing the cluster for upgrade.
  2. Upgrading the cluster core components.
  3. Upgrading the cluster add-on components.

In the first stage of preparing the cluster for upgrade, the cluster data is normally backed up before the real upgrade process begins. In the second stage, the core components are upgraded to newer versions; using IBM Cloud Private as an example, components like the apiserver, controller-manager, and scheduler are upgraded. Generally, applications don’t call the core components directly, so the first two stages won’t affect your applications!

In the third stage, add-on components like Calico, KubeDNS, and the NGINX Ingress Controller are upgraded. Because your applications rely on these components, you can expect some (minimal) outage during this stage.

Tips to implement ahead of an update

There are four places where outages can occur during an update.

1. Container

Besides the three upgrade stages I mentioned, you might also want to upgrade the container runtime, such as Docker. Upgrading Docker restarts all the containers that are running on the host. This affects your application’s availability, and it is an outage that is hard to avoid.

2. Container network

Pod-to-pod communication depends on the stability of the container network, and upgrading network components might affect it. How stable the container network stays during an upgrade depends on your cloud cluster. Using IBM Cloud Private as an example, it uses Calico as the default Container Network Interface (CNI) plug-in, and there is no downtime in the container network during the upgrade.

3. DNS

As the typical containerized application shows, internal pod communication depends on the cluster domain name service. To reach a zero-downtime upgrade, DNS must achieve both a graceful shutdown and a rolling upgrade. A graceful shutdown ensures that in-flight requests are finished before a pod exits. For a rolling upgrade, you need at least two DNS pods in your cluster and must ensure that at least one pod is always available during the upgrade.

If the upgrade can’t achieve graceful shutdown and rolling upgrade, the outage can last a few seconds during a DNS upgrade. During the outage, internal calls to the pod by service name might fail.

4. Load balancer

If the load balancer upgrade can’t achieve graceful shutdown and rolling upgrade, there will be an outage and external access will be affected.

In general, it is the DNS and load balancer outages during an update that might affect your applications and therefore your users’ experience.

You can implement the following tips ahead of an update to improve application availability.

1. Address a domain name service outage (KubeDNS)

Using KubeDNS as an example: in Kubernetes, upgrading KubeDNS through its daemonset can achieve a rolling upgrade. The outage is mainly because KubeDNS is not shut down gracefully.

To fix this gap, you can add a preStop script in the KubeDNS daemonset to gracefully shut down the KubeDNS pod within 10 seconds:

lifecycle:
  preStop:
    exec:
      command:
        - sleep
        - 10s
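For context, here is a minimal sketch of how the preStop hook fits into the daemonset spec, together with a RollingUpdate strategy that replaces one DNS pod at a time. The daemonset name, namespace, labels, and image below are placeholders for illustration; the actual values depend on your cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-dns                   # assumed name; check your cluster for the real daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns            # assumed label
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1            # upgrade one DNS pod at a time so the others keep serving
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      containers:
        - name: kubedns
          image: example/kube-dns:placeholder   # placeholder image for illustration
          lifecycle:
            preStop:
              exec:
                command:
                  - sleep
                  - 10s            # give in-flight DNS queries time to finish before the pod exits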

I also strongly suggest that you add a retry mechanism at the code level ahead of a known update. Many factors can affect pod-to-pod communication, such as an unstable network, and a retry mechanism helps improve your application’s availability. For example, the following piece of code tries the connection up to 10 times:

int retries = 10;
for (int i = 0; i < retries; i++) {
    try {
        // your connection code logic to http://service-name
        break;   // the connection succeeded, so stop retrying
    } catch (Exception e) {
        // the connection failed; try again on the next iteration
    }
}

2. Address a load balancer outage (NGINX Ingress Controller)

Let’s use the NGINX Ingress Controller as an example. NGINX, the core binary used inside the NGINX Ingress Controller, already achieves a graceful shutdown, and a rolling upgrade can be achieved by upgrading the daemonset.

In an HA environment, an external load balancer (for example, HAProxy) is usually used to route requests to the cluster NGINX Ingress Controller. In this case, the outage is mostly due to the time window before the external load balancer detects the exit of the old NGINX Ingress Controller pod.

The NGINX Ingress Controller provides a default health check URI, http://node-ip/healthz, which can be used to make a better health check. Using HAProxy as an example, you can perform the health check against an HTTP service. Here is a configuration example for the health check:

listen icp-proxy
  bind :80,:443
  mode tcp
  option tcplog
  option httpchk GET /healthz
  http-check expect status 200
  server server1 172.16.205.111 check fall 3 rise 2
  server server2 172.16.205.112 check fall 3 rise 2
  server server3 172.16.205.113 check fall 3 rise 2

Notes:

  • option httpchk GET /healthz means that GET is the method used to build the HTTP request and /healthz is the URI used for the request.
  • http-check expect status 200 sets the expected response status code to 200.
  • check fall 3 rise 2 enables health checks on the server: fall 3 marks the server as DOWN after 3 consecutive failed checks, and rise 2 marks it as UP again after 2 consecutive successful checks.

The accuracy of the health check depends on your external load balancer. For HAProxy in my testing, there was a window of around 2–3 seconds before it detected that an NGINX Ingress Controller pod was DOWN. So I strongly suggest that your application implement a retry mechanism if it’s sensitive to connection failures.

During your cloud upgrade, you might need to revert to an earlier version of your cloud instance if the upgrade fails. If you do revert, you can still keep your application available by avoiding the creation of new workloads or deployments during the upgrade: the revert rolls the cluster data back to its previous state, so any workloads or deployments created in the meantime would be lost.

Testing results

Let’s look at the online banking (OLB) application’s availability during an upgrade. HAProxy is used as the load balancer, and JMeter tests the connection throughout the upgrade.

# ./jmeter -n -t OLB.jmx -JHOST=load-balancer-hostname -JPORT=80 -j LOGS/jmeter/jMeter_test_log -l results.jtl -e -o LOGS/jmeter/resultReport -JTHREAD=100 -JDURATION=6000 -JRAMP=300
Creating summariser <summary>
Created the tree successfully using OLB.jmx
Starting the test @ Wed Jan 09 06:36:29 PST 2019 (1547044589185)
Waiting for possible Shutdown/StopTestNow/Heapdump message on port 4445
summary +      1 in 00:00:00 =    2.7/s Avg:   122 Min:   122 Max:   122 Err:     0 (0.00%) Active: 1 Started: 1 Finished: 0
summary +   1572 in 00:00:30 =   52.5/s Avg:   104 Min:   102 Max:   118 Err:     0 (0.00%) Active: 11 Started: 11 Finished: 0
summary =   1573 in 00:00:30 =   51.9/s Avg:   104 Min:   102 Max:   122 Err:     0 (0.00%)
summary +   4427 in 00:00:30 =  147.5/s Avg:   104 Min:   102 Max:   129 Err:     0 (0.00%) Active: 21 Started: 21 Finished: 0
summary =   6000 in 00:01:00 =   99.4/s Avg:   104 Min:   102 Max:   129 Err:     0 (0.00%)
summary +   7313 in 00:00:30 =  243.9/s Avg:   104 Min:   102 Max:   124 Err:     0 (0.00%) Active: 31 Started: 31 Finished: 0
summary =  13313 in 00:01:30 =  147.4/s Avg:   104 Min:   102 Max:   129 Err:     0 (0.00%)
summary +  10172 in 00:00:30 =  339.1/s Avg:   104 Min:   102 Max:   131 Err:     0 (0.00%) Active: 41 Started: 41 Finished: 0
summary =  23485 in 00:02:00 =  195.2/s Avg:   104 Min:   102 Max:   131 Err:     0 (0.00%)
summary +  12990 in 00:00:30 =  433.0/s Avg:   104 Min:   102 Max:   237 Err:     5 (0.04%) Active: 51 Started: 51 Finished: 0
summary =  36475 in 00:02:30 =  242.7/s Avg:   104 Min:   102 Max:   237 Err:     5 (0.01%)
summary +  14292 in 00:00:30 =  476.4/s Avg:   114 Min:    69 Max:  6013 Err:    48 (0.34%) Active: 61 Started: 61 Finished: 0
summary =  50767 in 00:03:00 =  281.5/s Avg:   107 Min:    69 Max:  6013 Err:    53 (0.10%)
summary +  18618 in 00:00:30 =  620.6/s Avg:   106 Min:   102 Max:  6011 Err:     3 (0.02%) Active: 71 Started: 71 Finished: 0
summary =  69385 in 00:03:30 =  329.9/s Avg:   107 Min:    69 Max:  6013 Err:    56 (0.08%)
summary +  21588 in 00:00:30 =  719.5/s Avg:   104 Min:   102 Max:   133 Err:     0 (0.00%) Active: 81 Started: 81 Finished: 0
summary =  90973 in 00:04:00 =  378.5/s Avg:   106 Min:    69 Max:  6013 Err:    56 (0.06%)
summary +  24390 in 00:00:30 =  812.9/s Avg:   104 Min:   102 Max:   135 Err:     0 (0.00%) Active: 91 Started: 91 Finished: 0
summary = 115363 in 00:04:30 =  426.8/s Avg:   106 Min:    69 Max:  6013 Err:    56 (0.05%)

The summary report shows that JMeter kept running during the entire cloud upgrade process. It counts the failed connections every 30 seconds and accumulates them into a running total. From the results, we can also see that:

  • Between 2:30 and 3:00, there is a short downtime, which happens while the NGINX Ingress Controller is upgrading, because of the health-check time window I mentioned.
  • In those 30 seconds, JMeter sent 14292 requests and 48 failed, a failure rate of about 0.34%. The failures are reported as scattered points within the 30-second window, but in actuality the downtime is concentrated at the moment when the old NGINX Ingress Controller pod exits. So, if your retry mechanism can handle it, your cloud upgrade won’t affect your app’s availability!

IBM Cloud Private and upgrade features

With the recent update for IBM Cloud Private version 3.1.2, developers who are interested in a platform for developing on-premises, containerized applications can benefit from the new feature of a multi-version upgrade. (This means that you can upgrade from version 3.1.1 to 3.1.2 and version 3.1.0 to 3.1.2.) IBM Cloud Private also provides support for user application availability when you upgrade to 3.1.2 in a high availability (HA) IBM Cloud Private cluster.

If you already use or want to use IBM Cloud Private, upgrading to IBM Cloud Private 3.1.2 means that management components are rolling-upgraded to newer versions while application pods continue to run. In general, traffic to applications continues to be routed. (Refer to the IBM Cloud Private 3.1.2 Knowledge Center and note that there is still a short outage during the upgrade.)

Other features of IBM Cloud Private’s upgrades include:

  1. User application pods won’t be affected. All the pods keep running, which means no pods exit or restart.
  2. For internal pod-to-pod communication, there is no downtime in the pod container network.
  3. For internal access, KubeDNS always works when calling a service name. (As the IBM Cloud Private 3.1.2 Knowledge Center mentions, there is still a short outage during the KubeDNS upgrade.)
  4. For external access, IBM Cloud Private uses the NGINX Ingress Controller. You need to configure the health check in the external load balancer to avoid an outage.

Next steps

Multicloud environments are on the rise, so it’s important to evaluate your cloud needs. If you don’t already have a cloud computing environment where access is limited to members of an enterprise and partner networks, then you might want to consider checking out IBM Cloud Private. Already an IBM Cloud Private user? See what you can do with our code patterns that are based on IBM Cloud Private. Or test your knowledge on multi-cloud management with our Learning Path.

Originally published at https://developer.ibm.com.
