Implementing Disaster Recovery for IBM Cloud Private on Power using Geographically Dispersed Resiliency

[Authors: Adhish Kapoor, Dishant Doriwala, Peeyush Gupta, Pradipta Banerjee, Vikas Bhardwaj]

There are two possible disaster recovery (DR) approaches for IBM Cloud Private: the application-driven approach and the infrastructure technology-driven approach.

Application driven

When you use the application-driven approach, the application or databases are DR-ready, and they ensure that the appropriate data replication and consistency exist across the instances of the application or database.

The application-driven approach requires a complete, running IBM Cloud Private platform on the DR site.

Infrastructure technology driven

In the infrastructure technology-driven approach, you use the following DR and data replication concepts that are provided by the infrastructure layer:

  • A data-replication solution that maintains a copy of the production data on the DR site
  • A recovery orchestrator that re-provisions the required components on the DR site

Note that an infrastructure technology-driven approach can only provide “crash consistency”. Crash consistency means that the solution can create the same condition that happens in a datacenter when there is an instantaneous power failure.

You can read more about the two DR approaches for IBM Cloud Private in "Private cloud for maximum control with the benefits of cloud".

This article describes how you can configure an infrastructure technology-driven DR approach for IBM Cloud Private that is running on IBM Power Systems with the PowerVM hypervisor by leveraging Geographically Dispersed Resiliency (GDR).

Using GDR for the DR of IBM Cloud Private on Power can shorten the recovery time objective (RTO), improve the recovery point objective (RPO), and provide simple, automated disaster recovery operations.

Geographically Dispersed Resiliency

The Geographically Dispersed Resiliency for Power Systems™ solution is a disaster recovery solution that is easy to deploy and provides automated operations to recover the production site. The GDR solution is based on the Geographically Dispersed Parallel Sysplex™ (GDPS®) offering, which provides concepts that optimize the use of resources. Because this solution does not require you to deploy backup virtual machines (VMs) for disaster recovery, it reduces software license and administrative costs.

The GDR solution is based on the VM restart technology. The VM restart-based high availability (HA) and DR solution relies on an out-of-band monitoring and management component that restarts the virtual machines on another hardware infrastructure when the primary host infrastructure fails.
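The restart decision described above can be illustrated with a minimal Python sketch. It is only a model of the idea, not how KSYS is implemented: the function name, the host names, and the health flags are hypothetical, and a real monitor would obtain host state through the HMC rather than from a dictionary.

```python
def plan_restarts(host_pairs, host_up, vms_by_host):
    """Return {vm: target_host} for VMs whose primary host has failed.

    host_pairs  maps each primary host to its paired backup host.
    host_up     maps host name to True (healthy) or False (failed).
    vms_by_host maps host name to the VMs currently placed on it.
    """
    plan = {}
    for primary, backup in host_pairs.items():
        if host_up.get(primary, False):
            continue  # primary host is healthy, so its VMs stay put
        if not host_up.get(backup, False):
            raise RuntimeError(f"no healthy restart target for {primary}")
        for vm in vms_by_host.get(primary, []):
            plan[vm] = backup  # restart this VM on the paired host
    return plan


# Hypothetical example: the primary host is down, the DR host is up.
restarts = plan_restarts(
    host_pairs={"p8-primary": "p8-dr"},
    host_up={"p8-primary": False, "p8-dr": True},
    vms_by_host={"p8-primary": ["master1", "worker1"]},
)
print(restarts)
```

The point of the model is that the VMs themselves do not take part in the decision; the monitoring component sits outside them and only needs host state and a host pairing.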

The following diagram shows the VM restart-based disaster recovery model used by GDR:

Ref: https://www.ibm.com/support/knowledgecenter/en/SS3RG3_1.2.0/com.ibm.gdr/vmrm_introduction.htm

A typical GDR deployment looks like the following model:

Ref: https://www.ibm.com/support/knowledgecenter/en/SS3RG3_1.2.0/com.ibm.gdr/vmrm_introduction.htm

As shown in the previous topology diagram, a typical GDR solution uses the following subsystems:

  • Controller system (KSYS): KSYS is a fundamental component and provides a single point of control for the entire environment managed by the GDR solution. The KSYS uses the Hardware Management Console (HMC) to interact with the hosts and Virtual I/O Server (VIOS), and uses the storage controller to interact with the storage subsystem.
  • Site: Sites are logical names that represent the primary and disaster sites. You must create sites at the KSYS level. All of the HMCs, hosts, VIOS, and storage devices are mapped to one of the sites.
  • HMC: The HMC is an appliance that is used to manage the Power hosts. The controller system (KSYS) interacts with the HMC.
  • Host: A host is a managed system in the HMC that is primarily used to run the workloads.
  • VIOS: A VIOS is a special Power server partition that virtualizes system resources, allowing hardware resources to be shared between several virtual machines (VMs). VIOS is available when using the PowerVM hypervisor.
  • Virtual machines (VMs) or logical partitions (LPARs): VMs or LPARs are created using the resources allocated from the VIOS.
  • Storage: The GDR solution relies on storage replication from the primary site to the backup site.
  • Network: The network must already be configured for the existing resources that include hosts, HMCs, VIOSes, and storage devices. The GDR solution requires that the KSYS node is directly connected to the HMCs and the storage controllers at both of the sites.
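To make the relationships among these subsystems concrete, here is a minimal Python sketch that models a GDR topology as plain data and checks two of the rules stated above: every resource is mapped to one of the sites, and the KSYS node can reach the HMCs and storage controllers at both sites. All class, field, and resource names are hypothetical and chosen for illustration; nothing here calls a GDR API.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Resource:
    """An HMC, host, VIOS, or storage controller mapped to a site."""
    name: str
    kind: str   # "hmc", "host", "vios", or "storage"
    site: str   # logical site name defined at the KSYS level


@dataclass
class Topology:
    sites: Set[str]
    resources: List[Resource]
    # Names of the HMCs and storage controllers the KSYS node can reach.
    ksys_reachable: Set[str] = field(default_factory=set)


def validate(topology):
    """Return a list of problems; an empty list means the checks pass."""
    problems = []
    for res in topology.resources:
        if res.site not in topology.sites:
            problems.append(f"{res.name} is mapped to unknown site {res.site!r}")
    for site in sorted(topology.sites):
        for kind in ("hmc", "storage"):
            members = [r for r in topology.resources
                       if r.site == site and r.kind == kind]
            if not members:
                problems.append(f"site {site!r} has no {kind}")
            for res in members:
                if res.name not in topology.ksys_reachable:
                    problems.append(f"KSYS cannot reach {kind} {res.name}")
    return problems


# Hypothetical two-site layout; the DR storage controller is unreachable.
example = Topology(
    sites={"primary", "dr"},
    resources=[
        Resource("hmc-primary", "hmc", "primary"),
        Resource("hmc-dr", "hmc", "dr"),
        Resource("v7000-primary", "storage", "primary"),
        Resource("v7000-dr", "storage", "dr"),
        Resource("s822-primary", "host", "primary"),
        Resource("s822-dr", "host", "dr"),
    ],
    ksys_reachable={"hmc-primary", "hmc-dr", "v7000-primary"},
)
print(validate(example))
```

In this example the check reports the one storage controller that KSYS cannot reach, which is the kind of gap worth catching before KSYS discovery is attempted.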

You can read more about these subsystems in the GDR documentation in the IBM Knowledge Center.

Deployment Architecture

The following diagram shows the example deployment that was used as part of this article:

The following sections identify the key components of the deployment architecture that are shown in the previous diagram:

IBM Systems Storage

IBM Systems Storage provides the underlying storage infrastructure in this solution. Volumes for the various workloads and for the IBM Cloud Private components, including the worker nodes, master node, and proxy node, are configured on IBM Systems Storage. IBM Storwize V7000 storage is used at both sites in this example.

IBM Power Systems

IBM Cloud Private components, including the worker nodes, master node, and proxy node, are deployed on IBM Power Systems servers that run the PowerVM hypervisor. POWER8 processor-based S822 servers are used at both sites in this example.

IBM Cloud Private

IBM Cloud Private is a Kubernetes-based container management platform. The required components of the setup, including the boot node, management node, proxy node, master nodes, and worker nodes, are deployed as guest VMs on IBM Power servers. IBM Cloud Private is deployed only at the primary site. When DR is triggered, IBM Cloud Private and the deployed workloads fail over to the DR site.

Boot node

A boot or bootstrap node is used for running installation, configuration, node scaling, and cluster updates. Only one boot node is required for any cluster.

Master nodes

A master node provides management services and controls the worker nodes in a cluster. Master nodes host the processes that are responsible for resource allocation, state maintenance, scheduling, and monitoring. For an HA environment, multiple master nodes can be used. If the leading master node fails, failover logic automatically promotes a different node to the master role.

Proxy node

A proxy node is a node that transmits external requests to the services that are created inside a cluster.

Worker nodes

A worker node is a node that provides a containerized environment for running workloads. As demands increase, more worker nodes can easily be added to a cluster to improve performance and efficiency. A cluster can contain any number of worker nodes, but a minimum of one worker node is required.

KSYS controller node

KSYS is configured at the DR site. In this setup, VIOS servers are configured and powered on at both sites, and Resource Monitoring and Control (RMC) is kept in an active state. VM profiles for the IBM Cloud Private nodes are created and activated only at the primary site.

Deployment Procedure

The deployment is carried out in three parts:

1. Infrastructure deployment - Includes configuration of server, storage, network and virtualization components at both sites

2. IBM Cloud Private deployment - Includes IBM Cloud Private deployment and configurations at the primary site

3. GDR deployment - Includes deployment and configuration of the GDR components at both sites

Infrastructure deployment

  • The Power servers at each site are added to the respective IBM Storwize V7000 storage solution. Volumes are created on each storage solution and used for the VMs. Ensure that each volume at the DR site is the same size as its counterpart at the primary site.
  • Configure replication from the source storage disk to the target storage disk. Put all of the master, proxy, and management nodes in one consistency group, and put all of the worker nodes in another group. Keeping the master, proxy, and management nodes in the same consistency group ensures the correct write sequence on the etcd datastore and the related configuration data.
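The two storage rules in this list (matching volume sizes at both sites, and one consistency group for the master, proxy, and management nodes with another for the worker nodes) can be sketched as a small pre-flight check in Python. This is only an illustration under stated assumptions: the node names, the group names icp-control and icp-workers, and the sizes are hypothetical, and the code does not talk to a Storwize system.

```python
from dataclasses import dataclass

# Roles whose volumes share one consistency group so that writes to etcd
# and the related configuration data are replicated in order.
CONTROL_ROLES = {"master", "proxy", "management"}


@dataclass(frozen=True)
class VolumePair:
    node: str          # hypothetical VM name
    role: str          # "master", "proxy", "management", or "worker"
    primary_gib: int   # volume size at the primary site
    dr_gib: int        # size of the replication target at the DR site


def size_mismatches(pairs):
    """Return the nodes whose primary and DR volumes differ in size."""
    return [p.node for p in pairs if p.primary_gib != p.dr_gib]


def consistency_groups(pairs):
    """Map each (hypothetical) consistency group name to its nodes."""
    groups = {"icp-control": [], "icp-workers": []}
    for p in pairs:
        if p.role in CONTROL_ROLES:
            groups["icp-control"].append(p.node)
        elif p.role == "worker":
            groups["icp-workers"].append(p.node)
        else:
            raise ValueError(f"unknown role {p.role!r} for {p.node}")
    return groups


pairs = [
    VolumePair("master1", "master", 200, 200),
    VolumePair("proxy1", "proxy", 100, 100),
    VolumePair("mgmt1", "management", 200, 200),
    VolumePair("worker1", "worker", 150, 150),
    VolumePair("worker2", "worker", 150, 120),  # deliberately mismatched
]
print(size_mismatches(pairs))
print(consistency_groups(pairs))
```

Running a check like this against an inventory before replication is configured surfaces size mismatches early, when they are still cheap to fix.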

IBM Cloud Private deployment

  • IBM Cloud Private is installed only on the primary site. Instructions for installing IBM Cloud Private are detailed in the Installing topic of the IBM Knowledge Center (https://www.ibm.com/support/knowledgecenter/en/SSBS6K_3.1.1/installing/install.html).
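As part of the installation, the IBM Cloud Private installer reads an INI-style hosts file that lists node IP addresses under role sections such as [master], [worker], [proxy], and [management]. The following Python sketch renders such a file from the node roles described earlier. The function name and the IP addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical; check the Installing topic for the exact file location and the full set of supported sections.

```python
def render_hosts(nodes_by_role):
    """Render {role: [ip, ...]} as INI-style sections, one IP per line."""
    order = ["master", "worker", "proxy", "management"]
    sections = []
    for role in order:
        ips = nodes_by_role.get(role, [])
        if not ips:
            continue  # skip roles that have no nodes assigned
        sections.append("\n".join([f"[{role}]"] + list(ips)))
    return "\n\n".join(sections) + "\n"


# Hypothetical single-master layout at the primary site.
hosts_text = render_hosts({
    "master": ["192.0.2.10"],
    "worker": ["192.0.2.21", "192.0.2.22"],
    "proxy": ["192.0.2.30"],
})
print(hosts_text)
```

Generating the file from one inventory keeps the role layout consistent with whatever is used to plan the storage volumes for the same nodes.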

GDR deployment

  • The GDR software contains the KSYS package that you must install on a logical partition at the backup site to manage the disaster recovery environment. Instructions for installing KSYS are detailed in the Installing the KSYS filesets topic of the IBM Knowledge Center (https://www.ibm.com/support/knowledgecenter/en/SS3RG3_1.2.0/com.ibm.gdr/install_ksys.htm).
  • After the KSYS component is installed, use the ksysmgr command to manage the entire environment for disaster recovery. The steps for KSYS setup, resource discovery, and configuration verification are described in the Configuring GDR topic in the IBM Knowledge Center (https://www.ibm.com/support/knowledgecenter/en/SS3RG3_1.2.0/com.ibm.gdr/config_vmrm.htm).

Disaster Recovery Testing

GDR provides a failover rehearsal feature that lets you rehearse the disaster recovery operation without performing a real DR failover. Detailed steps are described in the Failover rehearsal of the disaster recovery operation topic in the IBM Knowledge Center (https://www.ibm.com/support/knowledgecenter/SS3RG3_1.2.0/com.ibm.gdr/recover_failover_rehearsal.htm).

Disaster Recovery Process

The following flow chart provides a summary of the disaster recovery process using GDR:

Ref: https://www.ibm.com/support/knowledgecenter/SS3RG3_1.2.0/com.ibm.gdr/admin_flow_chart.htm

As shown in the previous flowchart, you can initiate the site switch by using the ksysmgr command to simulate an outage. In an unplanned outage, the KSYS subsystem analyzes the situation and notifies you about the disaster or potential disaster. Based on the information about the disaster, you can determine whether or not a site switch is required.

Detailed steps for determining whether a site switch is required are provided in the Initiating the disaster recovery topic in the IBM Knowledge Center (https://www.ibm.com/support/knowledgecenter/SS3RG3_1.2.0/com.ibm.gdr/admin_initiate_dr.htm).

After the VMs are active at the DR site, log in to the IBM Cloud Private dashboard and verify the health of the cluster, including the workloads.
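In addition to the dashboard, one way to check node health from the command line is to parse the output of kubectl get nodes -o json and confirm that every node reports a Ready condition with status True. The following Python sketch assumes that kubectl is already configured against the recovered cluster; the function name and the sample data are hypothetical, and the check covers nodes only, not workloads.

```python
import json


def not_ready_nodes(node_list):
    """Return names of nodes whose Ready condition is not "True".

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    """
    failing = []
    for item in node_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        ready = any(c.get("type") == "Ready" and c.get("status") == "True"
                    for c in conditions)
        if not ready:
            failing.append(item["metadata"]["name"])
    return failing


# Hypothetical sample shaped like the kubectl output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "master1"},
   "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
  {"metadata": {"name": "worker1"},
   "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
]}
""")
print(not_ready_nodes(sample))  # prints ['worker1']
```

To use it against a live cluster, capture the command output (for example with subprocess.run and capture_output=True), pass it through json.loads, and hand the result to not_ready_nodes.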

I hope this helps in planning for your IBM Cloud Private on Power deployment.