OpenShift is an amazing container platform, providing Kubernetes that can be deployed on premises or in a public cloud environment. See more information at https://www.openshift.com/
Now with OpenShift 4.1, it has become super straightforward to deploy a cluster. You simply give the installer access to your cloud environment, and, voilà, the installer does everything: it creates the VMs, installs and configures the product, and makes it ready to use. Amazing!
With the use of Kubernetes Operators, OpenShift 4.1 goes one step further and makes configuration and scaling even more trivial: you simply define a certain Kubernetes Custom Resource (Machine), and OpenShift does the corresponding work of creating the VM and configuring it as a node.
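To give a feel for it, here is a rough sketch of what such a Machine custom resource looks like on AWS. The names, labels, and providerSpec values below are illustrative placeholders, not from my environment; the installer generates the real ones for your cluster:

```yaml
# Sketch of a Machine custom resource for an AWS worker node.
# All names, labels, and providerSpec fields are placeholders.
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  name: mycluster-x7k2p-worker-us-east-1a-abcde
  namespace: openshift-machine-api
  labels:
    machine.openshift.io/cluster-api-cluster: mycluster-x7k2p
spec:
  providerSpec:
    value:
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      kind: AWSMachineProviderConfig
      instanceType: m4.large
      placement:
        region: us-east-1
        availabilityZone: us-east-1a
```

When a resource like this is created, the machine-api Operator provisions the corresponding EC2 instance and joins it to the cluster as a node.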
Now in the light of all this automation and resilience, the question that comes to my mind is, “what happens when I start disrupting OpenShift?”
So I decided to do a series of disruption tests to see how OpenShift would recover from a failure.
In this article, I will describe the first test; the other tests will follow in later articles (keeping it simple…).
I deployed a typical OpenShift 4.1 (on AWS) with the following configuration:
- 3 masters
- 3 nodes
You can see the nodes by running the following command:
oc get nodes
Test 1: Destroying a master
The first test is to answer the question: “What happens if I destroy a master?”
There are a few ways to do it: destroying the VM, making it inaccessible from the other masters, etc. I decided to ride the Kubernetes custom resource wagon and delete the master's Machine custom resource.
So let’s find all the machines by running the following command:
oc get machines -n openshift-machine-api
So I will destroy a master node by running the following command:
oc delete machine <master-1> -n openshift-machine-api
So, what happens after I destroyed the master machine?
If you list the machines again, you can see that OpenShift did not create a new master, so the cluster is operating with just 2.
Well, I asked OpenShift to delete the machine, and it did.
Like a Kubernetes Pod, a Machine doesn’t have a recovery mechanism: when it dies, it’s dead.
In order to provide resilience for Machines, you need to use a MachineSet (which is to a Machine what a ReplicaSet is to a Pod).
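For reference, here is a rough sketch of a MachineSet: like a ReplicaSet, it keeps `replicas` Machines matching its selector alive, recreating any that are deleted. All names and labels below are illustrative placeholders:

```yaml
# Sketch of a MachineSet for workers in one availability zone.
# Like a ReplicaSet for Pods, it recreates Machines that are deleted.
# All names and labels are placeholders.
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-x7k2p-worker-us-east-1a
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: mycluster-x7k2p-worker-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: mycluster-x7k2p-worker-us-east-1a
    spec:
      providerSpec:
        value:
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          kind: AWSMachineProviderConfig
          instanceType: m4.large
```

Had the master been owned by a MachineSet like this, deleting its Machine would have triggered a replacement.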
So, let’s look at the MachineSets in the environment by running the following command:
oc get machinesets -n openshift-machine-api
You see that there are MachineSets for the workers in the different AWS Availability Zones, but not for the master.
I guess the masters don’t need scalability (there are always 3), so OpenShift decided not to create a MachineSet for them.
In this first experiment, we saw that when we delete the Machine associated with a master, OpenShift doesn’t recreate it.
Well, life is not perfect. The cluster remains operational, but in a risky state: if we lose another master (for any reason), etcd will lose quorum and stop working, and consequently so will OpenShift.
In the next article, I will describe how to recover from this situation.
> Learn more about how you can co-create with the IBM Garage.