MuleSoft CloudHub (1.0) High Availability and Disaster Recovery

Published in Another Integration Blog, 12 min read, Dec 1, 2022

A brief discussion on a more resilient CloudHub model

Introduction:

MuleSoft’s CloudHub iPaaS platform supports high availability and disaster recovery to offer more reliability. MuleSoft manages many of the complexities of a high availability architecture, but as MuleSoft iPaaS platform license subscribers, customers benefit from understanding some of these topics in order to make better decisions and designs. Disaster recovery on CloudHub, on the other hand, needs a multi-region CloudHub deployment model with a very complex architecture. Many key aspects of this architecture should be considered from the beginning to reap the benefits of this resilient platform setup. This blog is based on the features of CloudHub 1.0, and details might differ if the same disaster recovery is planned with multi-region CloudHub 2.0.

The terms high availability and disaster recovery are often used interchangeably; however, they are two separate concepts. High availability is the ability of a system to remain functional when a fraction of the system components is not available due to some fatal error or unplanned outage. Disaster recovery is the process of recovering from an entire system failure caused by any natural (flood, earthquake, etc.) or man-made (power failure, hardware misconfiguration, etc.) disaster in order to keep the business operational. High availability does not impact business continuity, but disaster recovery has some impact on business continuity while switching from the primary system to the disaster recovery system. Before delving into the detailed discussion, let’s also learn two terms we often use as KPIs to measure the DR strategy: RTO and RPO.

Recovery Time Objective (RTO): the amount of system downtime a business can tolerate under the disaster. So, if it is decided that the disaster recovery system needs to be made online (when the primary system goes down) in 1 hour, then the RTO is 1 hour and the disaster recovery needs to plan accordingly.

Recovery Point Objective (RPO): the maximum acceptable age of the data in a recovered system, that is, how much recent data the business can afford to lose. So, if it is decided that losing up to 24 hours of data is acceptable, then the RPO is 24 hours for the DR strategy and the data backup frequency needs to be set up accordingly.
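To make the two definitions concrete, here is a minimal sketch that checks a DR plan against RTO and RPO targets. The `DrPlan` class, its field names, and the sample numbers are illustrative assumptions, not part of any MuleSoft tooling. It assumes that worst-case data loss is roughly one backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrPlan:
    """Hypothetical description of a DR plan; field names are illustrative."""
    backup_interval: timedelta    # how often data is backed up or replicated
    expected_failover: timedelta  # estimated time to bring the DR system online


def meets_objectives(plan: DrPlan, rto: timedelta, rpo: timedelta) -> dict:
    """Compare a plan against the RTO and RPO targets.

    Assumption: worst-case data loss is about one backup interval, so the
    interval must not exceed the RPO. The failover estimate must not exceed
    the RTO.
    """
    return {
        "rto_met": plan.expected_failover <= rto,
        "rpo_met": plan.backup_interval <= rpo,
    }


plan = DrPlan(backup_interval=timedelta(hours=12),
              expected_failover=timedelta(minutes=90))
result = meets_objectives(plan, rto=timedelta(hours=1), rpo=timedelta(hours=24))
print(result)  # {'rto_met': False, 'rpo_met': True}
```

In this example the 12-hour backup interval satisfies a 24-hour RPO, but a 90-minute failover estimate misses a 1-hour RTO, so the failover procedure would need to be faster.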

The disaster recovery strategy and architecture are largely dictated by the RTO and RPO targets. Let us now focus on single region recovery first, and then move to the multi region recovery process. But before that, let's have a glimpse at the CloudHub architecture.

CloudHub Architecture, Regions and Availability Zones

AWS Regions are separate geographic areas that are isolated from each other. For example, the US-East-1 Region is in Northern Virginia and the US-East-2 Region is in Ohio. Each region comprises multiple isolated locations known as availability zones.

The diagram above depicts the high-level architecture of MuleSoft CloudHub. The MuleSoft control plane (Runtime Manager and Platform Services) is deployed across multiple AWS availability zones within a region. If one AWS zone goes down entirely, these control plane services remain functional in the other availability zones of that region. Because the same volume of workload is then served with reduced computing capacity, there might be some performance degradation. However, unless all the availability zones within a region go down, the MuleSoft control plane in that region remains available to users. The RTO for one availability zone to recover is 72 hours and the RPO for that zone is 24 hours. Also, remember that because CloudHub is implemented on the AWS cloud, CloudHub availability is largely driven by Amazon’s availability strategy.

When the option to automatically restart an application that is not responding is selected in Runtime Manager (at the time of MuleSoft application deployment), CloudHub monitors the worker and restarts the application if required. If the restart is not successful, it will retry the restart five times. CloudHub sends notifications about each restart attempt and whether it succeeded or failed. However, once the maximum number of restart attempts is exhausted, CloudHub takes no further action and the application needs to be restarted manually. This option allows applications to recover automatically without any manual intervention, and it is selected by default during deployment. The subsequent discussion assumes that this option is selected for the applications deployed on CloudHub.
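The restart behavior described above can be sketched as a bounded retry loop. This is only a model of the described behavior, not CloudHub's implementation: the `restart` and `notify` callables are hypothetical stand-ins, and the sketch assumes the limit of five applies to the total number of attempts.

```python
from typing import Callable

# Assumption: five attempts in total, based on the limit quoted in the text.
MAX_RESTART_ATTEMPTS = 5


def auto_restart(restart: Callable[[], bool], notify: Callable[[str], None]) -> bool:
    """Try to restart an unresponsive application up to the attempt limit.

    Returns True if a restart succeeded, False if the limit was exhausted
    and a manual restart is needed.
    """
    for attempt in range(1, MAX_RESTART_ATTEMPTS + 1):
        ok = restart()
        notify(f"restart attempt {attempt}: {'succeeded' if ok else 'failed'}")
        if ok:
            return True
    notify("restart limit reached; manual restart required")
    return False


# Simulated application that comes back on the third attempt.
outcomes = iter([False, False, True])
messages = []
recovered = auto_restart(lambda: next(outcomes), messages.append)
print(recovered, len(messages))  # True 3
```

Every attempt produces a notification, and the loop stops at the first success, which mirrors the notify-then-give-up behavior the paragraph describes.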

MuleSoft CloudHub Single Region Recovery

In case of a single region CloudHub deployment model, MuleSoft does not provide any high availability if the MuleSoft application is deployed on a single worker. MuleSoft still offers an auto healing feature on single worker deployment. That means if an application is deployed on a single worker and that worker goes down, then MuleSoft spins up a new worker and it takes 1 to 15 minutes to complete (as per AWS EC2 machine provision SLA). MuleSoft provides an out-of-the-box alert “Worker not responding” to indicate that the worker is unavailable.

However, when applications are deployed on multiple CloudHub workers to support high availability (HA Architecture), then MuleSoft will spin up each worker on a different AWS availability zone. Therefore, if one worker becomes unavailable, then MuleSoft spins up a new worker in a different zone other than the zone where the other healthy worker is operational. If an AWS Availability Zone goes down entirely, a MuleSoft load balancer routes traffic to workers running on healthy availability zones. This will ensure zero downtime and zero impact of the disaster on business. If the workers become unavailable, then the load balancer will route the traffic to healthy nodes only.

Single Region Deployment Model — Disaster Recovery

This recovery strategy is depicted in the diagram above. If the initial application is deployed on two workers, then during deployment MuleSoft automatically allocates one worker from Availability Zone 1 and another worker from Availability Zone 2. If the worker on Availability Zone 1 goes down, then MuleSoft automatically spins up a new worker on Availability Zone 3 and allocates the same to the application. At any point in time, MuleSoft ensures, with high availability architecture, that if an entire Availability Zone goes down then workers from other availability zones will continue serving the applications. MuleSoft load balancers will automatically route the traffic to healthy workers on Availability Zones that are operational. With single region recovery strategy, MuleSoft does the heavy lifting and customers only need to ensure that they deploy applications on two or more workers to achieve a higher degree of availability and zero down time recovery.
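The zone selection in this example can be modeled in a few lines. The sketch below is a simplified illustration of the placement rule described in the text (replace a lost worker in a healthy zone that does not already host a healthy worker); the function names and zone labels are made up for the example and do not reflect how CloudHub schedules workers internally.

```python
from typing import Dict, List, Optional, Set


def healthy_workers(workers: Dict[str, str], down_zones: Set[str]) -> Dict[str, str]:
    """Return the workers (name -> zone) whose zone is still up."""
    return {name: zone for name, zone in workers.items() if zone not in down_zones}


def replacement_zone(zones: List[str], workers: Dict[str, str],
                     down_zones: Set[str]) -> Optional[str]:
    """Pick the first healthy zone that hosts no healthy worker yet."""
    occupied = set(healthy_workers(workers, down_zones).values())
    for zone in zones:
        if zone not in down_zones and zone not in occupied:
            return zone
    return None  # no spare zone available


zones = ["az-1", "az-2", "az-3"]
workers = {"worker-1": "az-1", "worker-2": "az-2"}
down = {"az-1"}

print(healthy_workers(workers, down))          # {'worker-2': 'az-2'}
print(replacement_zone(zones, workers, down))  # az-3
```

While the replacement worker starts in the third zone, the load balancer only has `worker-2` to route to, which is why two or more workers are needed for zero-downtime recovery.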

MuleSoft CloudHub Multi Region Recovery

The MuleSoft CloudHub multi region recovery architecture is much more complex than single region recovery. Multi region recovery ensures that even if the entire AWS region becomes unavailable, then the backup or DR (Disaster Recovery) region will take over to manage business continuity. It is very unusual for an entire AWS region to go down completely, but some organizations plan for such a disaster with recovery planning.

Multi Region Deployment Model — Disaster Recovery

Within a specific region, be it the primary region or a DR region, the recovery mechanism and high availability features are the same as those already discussed under single region recovery. However, the recovery strategy is different when one entire region goes down due to a disaster. Please note that there should be a customer-provisioned global load balancer (refer to the diagram above) that continuously checks the health of the regional VPCs and routes traffic to the primary and/or DR region, according to the customer’s multi region licensing model.

Before focusing on the different multi region license models, let’s look into some additional key considerations that customers need to take into account if they plan a multi region deployment and disaster recovery strategy.

A global load balancer is responsible for routing traffic between primary region VPCs and DR region VPCs and must be provided by the customer. MuleSoft will not provision any global load balancers. Customers can use AWS Route 53 or any DNS router that can load balance across multiple regions.
Dedicated Load Balancer (DLB) names need to be unique across the regions. Ideally, the DLBs from the primary and DR regions should be mapped to a customer vanity domain through CNAME records on the DNS server.
Application names need to be unique across the regions. For example, a CI/CD pipeline can be set up such that before deployment it can append the region name (or a region code) with applications to maintain the uniqueness in application name across primary and DR regions.
As part of the deployment strategy, the CI/CD pipeline should be configured to deploy applications to both the primary region and the DR region, so that all applications are available in both regions.
Ideally, the capacity of the primary and DR region needs to be identical for all infrastructure aspects.
Persistent storage needs to be periodically synced between the two regions. Storage replication must be implemented by the customer, not MuleSoft. Persistent storage can also be a single storage shared between the VPCs of the primary and DR regions. However, customers need to ensure that the shared storage is highly available; otherwise it can become a single point of failure for both regions.
Object Store v2 is regional and cannot be accessed across regions. Other persistent storage options should be employed for a multi-region disaster recovery strategy.
A proper DR test strategy should be adopted by the customer and periodic DR drills should be conducted. In the case of active/active DR, a canary deployment strategy can be used to audit both regions.
Items such as firewall rules and VPN tunnel setup should be planned and maintained so that on-prem applications are reachable from the VPCs in both the primary and DR regions.
TLS certificate maintenance and management should cover VPCs in both primary and DR regions for a seamless recovery.
There can be some performance degradation as the DR region might not be as close to the end users of the MuleSoft applications or the back end systems as it was with the primary region. There should be a DR SLA agreement setup and communicated by the customer to their stakeholders.
In active/active mode, switching from the primary region to the DR region VPC is fast at the time of disaster. The global load balancer identifies that the primary region is not healthy and routes all traffic to the healthy active region (the DR region). However, in active/passive mode, you must manually deploy the applications to the DR region VPC in case of disaster. This is because, in passive mode, MuleSoft applications are not in a deployed state on the DR VPC runtime. This manual deployment effort consumes additional time to activate the DR region, and I recommend using scripts to expedite the deployment if a customer has a significant number of applications deployed.
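The naming and dual-deployment considerations above can be sketched as a small CI/CD helper. The region codes, application names, and function names below are hypothetical; a real pipeline would pass each resulting pair to whatever deployment tool it uses.

```python
from typing import Dict, List, Tuple

# Hypothetical region codes; any convention works as long as names stay unique.
REGION_CODES: Dict[str, str] = {"us-east-1": "use1", "us-east-2": "use2"}


def regional_app_name(base_name: str, region: str) -> str:
    """Append a region code so the name is unique across regions."""
    return f"{base_name}-{REGION_CODES[region]}"


def deployment_plan(apps: List[str], regions: List[str]) -> List[Tuple[str, str]]:
    """List (application name, region) pairs covering every app in every region."""
    plan = [(regional_app_name(app, region), region)
            for app in apps for region in regions]
    names = [name for name, _ in plan]
    if len(set(names)) != len(names):
        raise ValueError("application names are not unique across regions")
    return plan


plan = deployment_plan(["orders-api", "customers-api"], ["us-east-1", "us-east-2"])
for name, region in plan:
    print(f"deploy {name} to {region}")
```

Generating the plan from one list of base names keeps the primary and DR regions from drifting apart, because an application cannot be added to one region without also appearing in the other.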

Next, let’s focus on different multi-region licensing models.

Multi-Region Licensing Model:

Cold Standby: MuleSoft applications will be deployed on both the primary region and the DR region in CloudHub runtime. However, operating systems will not be running on DR regions. That means on DR regions, the CloudHub MuleSoft application will appear, but it is not deployed. If an outage is detected, the environment and MuleSoft runtime will be started in the DR region.

The advantage with this licensing model is that at any one point in time, the workers will be utilized from any one region and thus it is a less costly solution. On the other hand, once the primary region goes offline, all applications on the DR region need to be manually deployed. Even with scripts to do deployment in an automated fashion, this increases the RTO and there will be some time lag to switch from the primary to DR region VPC during a disaster.

Hot Standby: Active/Active: MuleSoft applications are deployed to VPCs on both primary regions and DR regions, and the environments (Operating System and Runtime) are up and running on both the regions. There are no contractual obligations to restrict traffic to be routed to applications on DR region VPC. Therefore, traffic load will be shared across both regions.

The advantage is that both regions are simultaneously available to accept and process traffic, so this deployment model supports a canary or blue/green deployment strategy and provides a higher degree of reliability. However, since both regions are active all the time, this is the costliest of the options.

Hot Standby: Active/Passive: MuleSoft applications are deployed to VPCs on both primary regions and DR regions, and the environments (Operating System and Runtime) are up and running on both regions. Technically, MuleSoft Applications deployed on DR regions can handle both ingress and egress traffic at any point in time. However, with contractual obligation, MuleSoft applications deployed on DR region will not receive any traffic unless the primary region becomes unavailable.

The advantage is that the switchover from primary region to DR region is very fast. In fact, the switch is made immediately after the primary region goes down. However, since applications are deployed on both regions, worker consumption is redundant which makes the entire licensing model costly (though some discounts may be applied at MuleSoft’s discretion). Also, application smoke testing on DR regions as part of post deployment validation is not possible when the primary region is active, so there can be some unforeseen issues that appear when the DR region becomes functional.
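The routing behavior that the three models imply for the customer-provisioned global load balancer can be sketched as a single decision function. This is an illustration of the rules described above, not Route 53 configuration or a MuleSoft API; the model names, the `health` dictionary, and the `dr_started` flag (which stands for cold-standby applications having been deployed and started after an outage) are assumptions made for the example.

```python
from typing import Dict, List


def route_targets(model: str, health: Dict[str, bool],
                  dr_started: bool = False) -> List[str]:
    """Return the regions that should receive traffic under a licensing model."""
    primary_ok = health.get("primary", False)
    dr_ok = health.get("dr", False)

    if model == "hot-active-active":
        # Both regions share the load; unhealthy regions are dropped.
        return [name for name, ok in (("primary", primary_ok), ("dr", dr_ok)) if ok]

    if model in ("hot-active-passive", "cold-standby"):
        if primary_ok:
            return ["primary"]  # DR receives no traffic while primary is up
        # Cold standby can only take traffic once its applications are started.
        dr_ready = dr_ok and (model == "hot-active-passive" or dr_started)
        return ["dr"] if dr_ready else []

    raise ValueError(f"unknown model: {model}")


outage = {"primary": False, "dr": True}
print(route_targets("hot-active-active", {"primary": True, "dr": True}))
print(route_targets("hot-active-passive", outage))
print(route_targets("cold-standby", outage))
print(route_targets("cold-standby", outage, dr_started=True))
```

The empty result for cold standby before `dr_started` is set corresponds to the extra RTO that this model incurs: there is nowhere to send traffic until the DR deployment completes.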

DR Architecture and Consideration of a Hybrid Deployment Model

A DR strategy for a hybrid deployment model, where MuleSoft applications run on both MuleSoft CloudHub and customer data centers, is complex but effective if implemented with proper planning. The DR strategy for a hybrid model needs to cover recovery for both CloudHub and on-prem. As discussed earlier, if a customer is using a single region MuleSoft CloudHub Runtime, then for CloudHub recovery the customer only needs to follow a high availability architecture and deploy MuleSoft applications on multiple workers, even in a hybrid architecture. The heavy lifting of the recovery strategy implementation will still be done by MuleSoft. However, customers need to ensure a high availability strategy when setting up the connectivity between the CloudHub Runtime and the on-prem Runtime (or on-prem back-end applications) to ensure better reliability and quick failover in case the primary connection goes down. Customers often follow a DR strategy for the on-prem MuleSoft Runtime that aligns with their overall data center DR strategy.

At a high level, we can depict a DR architecture for a hybrid model as follows:

High Availability and DR Architecture on Hybrid Model

For simplicity, the diagram is depicted with the CloudHub single region deployment model. However, the same scenario can be imagined with the CloudHub multi region deployment model, with another CloudHub DR region and a global load balancer to control traffic between the primary region and the DR region. Let’s now discuss the key considerations of this architecture.

Key Considerations

One primary and one secondary VPN tunnel are required between the CloudHub VPC and on-prem to ensure failover. The same is true if direct connect or VPC peering is used for CloudHub VPC to on-prem connectivity. However, the primary and secondary connections need to be of the same type: both IPSec VPN tunnels, both direct connect, or both VPC peering. They cannot be a combination of two different connection types. In a hybrid deployment model with multi-region CloudHub, each VPC from both regions should be connected to on-prem with one primary and one secondary connection.

CloudHub VPC is connected to on-prem data center with one pair of VPN tunnels and to the on-prem DR Center with another pair of VPN tunnels.
Firewall whitelisting needs to be done for the CIDR ranges of the CloudHub VPCs in the primary and DR regions, the on-prem data center, and the DR center. The VPCs, on-prem data center, and DR center should all be reachable from each other to ensure smooth failover.
Static IP addresses from both primary and DR region VPCs should be whitelisted on the on-prem data center and DR center firewalls.
The Border Gateway Protocol (BGP) router should quickly switch the connectivity (VPN tunnel) from the data center to the DR center during a failover.
If the on-prem data center and DR center are working on active/passive mode, then switching over from data center to DR center might need some additional time.
The CI/CD pipeline should be set up such that all on-prem MuleSoft applications are deployed on both the data center and DR center.
A proper DR test strategy should be adopted by the customer and periodic DR drills should be conducted to continuously validate the DR architecture. This will ensure quick recovery from any disaster situation without any surprises.
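The firewall whitelisting consideration above lends itself to an automated check during DR drills. The following sketch uses Python's standard `ipaddress` module to find site pairs whose whitelists do not cover each other. The site names, CIDR ranges, and the `unreachable_pairs` function are invented for illustration; real firewall rules would have to be exported from the actual devices.

```python
import ipaddress
from typing import Dict, List, Tuple


def unreachable_pairs(whitelists: Dict[str, List[str]],
                      site_cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs where the destination's whitelist
    does not cover the source site's CIDR range."""
    missing = []
    for dst, allowed in whitelists.items():
        allowed_nets = [ipaddress.ip_network(cidr) for cidr in allowed]
        for src, cidr in site_cidrs.items():
            if src == dst:
                continue
            src_net = ipaddress.ip_network(cidr)
            if not any(src_net.subnet_of(net) for net in allowed_nets):
                missing.append((src, dst))
    return missing


# Hypothetical CIDR ranges for the four sites in a multi-region hybrid setup.
site_cidrs = {
    "cloudhub-primary": "10.10.0.0/22",
    "cloudhub-dr": "10.20.0.0/22",
    "datacenter": "192.168.0.0/16",
    "dr-center": "172.16.0.0/16",
}
whitelists = {
    "datacenter": ["10.10.0.0/22", "10.20.0.0/22", "172.16.0.0/16"],
    "dr-center": ["10.10.0.0/22", "192.168.0.0/16"],  # DR VPC range forgotten
    "cloudhub-primary": ["192.168.0.0/16", "172.16.0.0/16", "10.20.0.0/22"],
    "cloudhub-dr": ["192.168.0.0/16", "172.16.0.0/16", "10.10.0.0/22"],
}

print(unreachable_pairs(whitelists, site_cidrs))  # [('cloudhub-dr', 'dr-center')]
```

A gap like the one reported here would only surface when both the primary CloudHub region and the primary data center are down at once, which is exactly the kind of surprise a periodic drill is meant to catch.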

Conclusion

A robust disaster recovery and high availability strategy for the API platform is absolutely necessary to make business offerings resilient. Setting up a disaster recovery strategy and architecture is complex and requires a lot of planning. MuleSoft disaster recovery planning must involve the network architecture, security, storage, business, architecture, and C4E teams, along with other key stakeholders. Planning for MuleSoft disaster recovery to establish a resilient API platform across the enterprise is not an easy task. This article is an attempt to provide a starting point for planning a MuleSoft DR strategy. Please refer to the links in the reference section for more details on this topic. I hope this will help you get started on DR planning for the MuleSoft runtime.

Reference:

Mule Runtime High Availability (HA) Cluster Overview

For an equivalent to clustering in CloudHub, see CloudHub HA for details about how workers can be shared or doubled to…

docs.mulesoft.com

Introduction to MuleSoft High Availability and Multi-Region Disaster Recovery in CloudHub

To learn more about this topic, we recommend this training course: https://sfdc.co/bn4TGG.

training.mulesoft.com

MuleSoft multi-region deployment deep dive | Friends of Max Overview

To learn more about this topic, we recommend this training course: https://sfdc.co/caSCUZ.