So, you installed an IBM Cloud Private (ICP) environment, and now you need to back it up. The questions that come to your mind are:
- Do I use traditional VM backups (such as VMware snapshots) to back up the ICP VMs?
- How do I back up specific ICP components?
In this article, I will talk about what you should do (and what you should not do) to back up your ICP environment.
But first, one question.
Why should you back up your ICP environment?
Great question. There are many reasons to back up an ICP environment:
- In the case of a complete failure of the IT infrastructure supporting ICP, you may be required to recover ICP data (logging and monitoring information, for example).
- There are certain ICP components that can’t be created on demand, especially the master nodes: ICP doesn’t provide a way to create a new master node.
- You need to recover an ICP environment after a catastrophic failure in one component.
But that doesn’t replace an HA strategy. In order to achieve high availability, you should deploy multiple instances of ICP, probably in different data centers or availability zones.
How not to back up ICP
Before we talk about how to back up ICP, let’s talk about how not to do it.
You might be inclined to back up all the VMs supporting ICP, using the capability provided by the hypervisor. However, there is a sticking point: every ICP master node runs etcd, and the etcd documentation clearly warns against restoring an etcd member from a traditional VM backup:
A user should avoid restarting an etcd member with a data directory from an out-of-date backup. Using an out-of-date data directory can lead to inconsistency as the member had agreed to store information via raft then re-joins saying it needs that information again. For maximum safety, if an etcd member suffers any sort of data corruption or loss, it must be removed from the cluster. Once removed the member can be re-added with an empty data directory. (https://coreos.com/etcd/docs/latest/v2/admin_guide.html#disaster-recovery)
So you should not count on traditional VM backup technology to back up the master nodes.
Let’s then divide the discussion on how to back up ICP into the different kinds of nodes:
If we can’t use a VM backup to back up etcd, how can we do it, especially knowing that there is currently no way to create a new master node? Even if only a single master node fails in a multi-master environment, we need a way to bring that node back.
The key component in the ICP master node is etcd. To back it up, you should use the etcd tooling to create and restore snapshots. This page provides step-by-step instructions on how to back up and restore etcd.
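As an illustration of what an etcd snapshot looks like in practice, here is a minimal sketch that assembles an `etcdctl snapshot save` command (etcd v3 API). The endpoint, the certificate paths, and the `/backup` directory are assumptions for illustration only, not values taken from an ICP installation; follow the linked instructions for the real ones. The helper only builds the command, so you can review it before running it on one master node.

```python
import shlex
from datetime import datetime, timezone
from typing import List, Optional


def snapshot_filename(now: datetime) -> str:
    """Return a timestamped snapshot file name (naming scheme is an assumption)."""
    return now.strftime("etcd-snapshot-%Y%m%dT%H%M%SZ.db")


def etcd_snapshot_command(
    snapshot_path: str,
    endpoint: str = "https://127.0.0.1:2379",  # assumed etcd client endpoint
    cacert: Optional[str] = None,              # hypothetical path to the CA certificate
    cert: Optional[str] = None,                # hypothetical path to the client certificate
    key: Optional[str] = None,                 # hypothetical path to the client key
) -> List[str]:
    """Build the argument list for `etcdctl snapshot save`."""
    cmd = ["etcdctl", f"--endpoints={endpoint}"]
    if cacert:
        cmd.append(f"--cacert={cacert}")
    if cert:
        cmd.append(f"--cert={cert}")
    if key:
        cmd.append(f"--key={key}")
    cmd += ["snapshot", "save", snapshot_path]
    return cmd


if __name__ == "__main__":
    path = "/backup/" + snapshot_filename(datetime.now(timezone.utc))
    # Older etcdctl builds need ETCDCTL_API=3 to select the v3 API.
    print("ETCDCTL_API=3 " + " ".join(shlex.quote(a) for a in etcd_snapshot_command(path)))
```

Printing the command instead of executing it keeps the sketch safe to try on any machine; scheduling it (for example with cron) is left to your environment.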
However, we still need to provide a way to recreate a master node. So, my recommendation is the following:
- After you deploy an ICP environment, take a VM backup of every master node, using whatever technology you have available (VMware snapshots, Spectrum Protect, etc.). You can continue taking regular VM backups, but the initial one suffices.
- Take regular snapshots of the ICP components, as described here. Note that you should take the snapshot on only one of the master nodes.
- In the case of a disaster, recover the VMs using a snapshot.
Now, to recover the ICP components running on the master node, there are two cases:
- If a single master node is lost in a multi-master environment, you can simply restore the VM from a snapshot, and the etcd members running on the other master nodes will bring the restored node up to date.
- If all the master nodes were lost, you need to use the etcd snapshot and follow the procedure described here to restore etcd to the latest backup.
- Take at least one VM snapshot for each master after installation.
- Take regular snapshots of ICP components.
- Restore the VM snapshot.
- If all the master nodes were lost, restore the etcd snapshot.
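The two recovery cases for master nodes can be written down as a small decision helper. This is only a sketch of the reasoning in this article: the function name and the returned labels are hypothetical, and the situation where more than one but not all masters are lost is flagged as not covered, because the article does not address it.

```python
def master_recovery_plan(total_masters: int, lost_masters: int) -> str:
    """Map a master-node failure to the recovery path described in this article."""
    if total_masters < 1 or not 0 <= lost_masters <= total_masters:
        raise ValueError("lost_masters must be between 0 and total_masters")
    if lost_masters == 0:
        return "nothing-to-recover"
    if lost_masters == total_masters:
        # Every etcd member is gone: restore the VMs, then restore the etcd snapshot.
        return "restore-vm-then-etcd-snapshot"
    if lost_masters == 1:
        # The surviving etcd members bring the restored node up to date.
        return "restore-vm-only"
    # More than one master lost, but not all: not discussed in this article.
    return "not-covered"


if __name__ == "__main__":
    print(master_recovery_plan(total_masters=3, lost_masters=1))
    print(master_recovery_plan(total_masters=3, lost_masters=3))
```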
Although the management node is an optional component in the ICP installation, I recommend deploying it, as it allows us to split the master responsibility into the Kubernetes management and the ICP add-ons (monitoring, metering, and logging).
To back up the Management Node, I recommend using a traditional VM backup mechanism. The frequency of the backups determines the RPO (recovery point objective). For example, if you take daily backups, you can lose as much as an entire day of monitoring and metering information.
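To make the relationship between backup frequency and RPO concrete, here is a small worked example. The function names and the sample timestamps are made up for illustration; the only rule it encodes is the one stated above, namely that the worst-case loss equals the interval between backups.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Data written between the last backup and the failure is lost."""
    if failure < last_backup:
        raise ValueError("failure cannot precede the last backup")
    return failure - last_backup


def meets_rpo(backup_interval: timedelta, target_rpo: timedelta) -> bool:
    """The worst-case loss equals the backup interval, so it must not exceed the target."""
    return backup_interval <= target_rpo


if __name__ == "__main__":
    # Daily backup taken at midnight, failure one minute before the next backup.
    lost = data_loss_window(datetime(2020, 1, 1, 0, 0), datetime(2020, 1, 1, 23, 59))
    print(lost)
    print(meets_rpo(timedelta(days=1), timedelta(hours=4)))
```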
- Take regular VM snapshots of the Management Nodes.
- Restore the snapshot.
Worker Nodes can be created on demand in ICP, so they should not be backed up and restored. If a certain worker node fails, you should simply create another one.
Notice that Kubernetes will take care of re-deploying the Kubernetes Pods to the other worker nodes in the case of a failure of a worker node.
Do not back up or restore the Worker Node VM.
Create a new Worker Node VM, using the procedure documented here.
Starting with ICP 188.8.131.52, it’s possible to create Proxy Nodes as needed. So, as with Worker Nodes, you should not back up or restore a Proxy Node. If needed, just create another Proxy Node.
Do not back up or restore a Proxy Node VM.
If needed, create a new Proxy Node.
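The per-node recommendations above can be collected into a single lookup table, which may be handy if you script your backup jobs. This is just a restatement of the article's guidance as data; the `Strategy` type and `strategy_for` helper are hypothetical names.

```python
from typing import Dict, NamedTuple


class Strategy(NamedTuple):
    backup: str
    restore: str


# Summary of the recommendations in this article, keyed by node role.
STRATEGIES: Dict[str, Strategy] = {
    "master": Strategy(
        backup="one VM snapshot after installation, plus regular etcd snapshots",
        restore="restore the VM snapshot; restore the etcd snapshot if all masters were lost",
    ),
    "management": Strategy(
        backup="regular VM snapshots",
        restore="restore the latest VM snapshot",
    ),
    "worker": Strategy(backup="none", restore="create a new worker node"),
    # Applies only to ICP versions that allow creating proxy nodes on demand.
    "proxy": Strategy(backup="none", restore="create a new proxy node"),
}


def strategy_for(role: str) -> Strategy:
    """Look up the backup and restore strategy for a node role (case-insensitive)."""
    return STRATEGIES[role.strip().lower()]


if __name__ == "__main__":
    for role, strategy in STRATEGIES.items():
        print(f"{role}: backup={strategy.backup}; restore={strategy.restore}")
```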
Backing up a Kubernetes environment is not a trivial task, especially considering etcd.
In this article we presented the solution to back up the critical ICP components and nodes.
Thanks to Hans Kristian Moen (@HansMoen), Peter Van Sickel, Gang Chen, and Tonny French for their work on this guidance.