App Resiliency with Disaster Recovery using Cloud Platform Services

Published in Appranix, Apr 5, 2019

“Within the next five years, there will be a major internet outage that impacts more than 100 million users for longer than 24 hours” — Gartner

Resilient cloud infrastructure and services, infrastructure-as-code for immutability, logging, and smart threshold-based alerting have advanced to the point where we can put the best of the cloud to work for autonomous app resiliency.

Application Resiliency on Cloud Platforms

Organizations are at different stages of migration to cloud platforms. Some already operate a large set of applications in the cloud, some are fully committed to moving everything to AWS, GCP, or Azure, and others are experimenting with Kubernetes-based immutable containers to speed up application development and deployment. Many organizations are migrating to the cloud in the hope of also changing their culture with DevOps, and many legacy IT operations teams are moving up to become Site Reliability Engineers.

Some organizations are hoping to achieve disaster recovery compliance for the first time ever by using cloud infrastructure. With on-demand compute combined with unified networking and storage, it is now possible to drastically cut risk and reduce non-compliance, all at the lowest possible cost.

A Roadmap for Achieving Autonomous App Resilience

Infrastructure-as-Code (IaC) for Immutable Cloud Assemblies

IaC has become the default mechanism for many well-known IT operations on cloud platforms. It is used not only for the initial deployment of infrastructure but also for subsequent configuration changes. Even traditional datacenter vendors like VMware are introducing Cloud Assembly concepts (https://blogs.vmware.com/management/2018/08/introducing-cloud-automation.html) as a way to simplify the ever-growing set of software-defined services and dynamic infrastructures and to speed application delivery.

Cloud Assemblies State Management

Just as with application code, once infrastructure is created with IaC in the form of dynamic cloud assemblies, the state of the launched infrastructure becomes crucial to manage. State management has become a key aspect of managing complexity. With Terraform, for example, properly managing state is crucial (see https://tech.ovoenergy.com/complexity-in-infrastructure-as-code/). Provider-native IaC template languages such as CloudFormation and Azure Resource Manager take some of the pain of state management away from users, but not all of it (see https://aws.amazon.com/blogs/mt/recovering-aws-cloudformation-stacks-using-continueupdaterollback/). You have to have state management under control to have a resilient cloud infrastructure for your applications.
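To make the idea of state management concrete, here is a minimal sketch of drift detection: comparing the state an IaC tool has recorded with what is actually running. It is not how Terraform or CloudFormation implement it. The `detect_drift` function, the `Drift` record, and the resource names and attributes are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drift:
    resource: str
    kind: str        # "missing", "unmanaged", or "changed"
    detail: str = ""


def detect_drift(desired: dict, actual: dict) -> list:
    """Compare the state recorded by IaC with what is actually running."""
    drifts = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            # Recorded in state, but no longer running.
            drifts.append(Drift(name, "missing"))
        elif name not in desired:
            # Running, but created outside of IaC.
            drifts.append(Drift(name, "unmanaged"))
        elif desired[name] != actual[name]:
            changed = sorted(
                key
                for key in desired[name].keys() | actual[name].keys()
                if desired[name].get(key) != actual[name].get(key)
            )
            drifts.append(Drift(name, "changed", ", ".join(changed)))
    return drifts


# Hypothetical example data: names and attributes are made up.
recorded_state = {
    "web-server": {"size": "medium", "zone": "zone-a"},
    "app-db": {"size": "large", "zone": "zone-a"},
}
running_state = {
    "web-server": {"size": "large", "zone": "zone-a"},  # resized by hand
    "debug-vm": {"size": "small", "zone": "zone-b"},    # created outside IaC
}

report = detect_drift(recorded_state, running_state)
for drift in report:
    print(drift.resource, drift.kind, drift.detail)
```

In this toy example the report flags a database that has disappeared, a VM nobody declared, and a hand-resized web server. These are the kinds of surprises that make recovery unreliable when state is not kept under control.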

Monitoring, Alerting, and Threshold Management for Resiliency

While provisioning complex stacks of infrastructure has been simplified and codified with IaC, monitoring, alerting, and threshold management have become a lot more complex. Cloud providers have introduced several services to make this easier. https://aws.amazon.com/cloudtrail/ and https://cloud.google.com/stackdriver/ are examples of native services that help set up the monitoring and logging that threshold management depends on, so that applications can scale out and back in automatically. (CloudTrail records API activity; on AWS, metric thresholds and alarms are handled by CloudWatch.) Without such services, achieving application resiliency, even with all the goodies of the cloud platforms, would be a far-fetched dream.
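As a rough illustration of threshold management, the sketch below turns a window of CPU readings into a scaling decision. It is not any provider's autoscaling algorithm. The function name, the 75% and 25% thresholds, and the one-instance step size are assumptions chosen for the example.

```python
from statistics import mean


def scaling_decision(cpu_samples, current, minimum, maximum,
                     scale_out_above=75.0, scale_in_below=25.0):
    """Return the desired instance count for a window of CPU readings (percent)."""
    if not cpu_samples:
        return current  # no data: hold steady rather than guess
    average = mean(cpu_samples)
    if average > scale_out_above:
        return min(current + 1, maximum)  # scale out, capped at the maximum
    if average < scale_in_below:
        return max(current - 1, minimum)  # scale in, never below the minimum
    return current


# Hypothetical readings for a group that currently runs 2 instances.
busy = scaling_decision([80, 90, 85], current=2, minimum=1, maximum=4)
quiet = scaling_decision([10, 15, 20], current=2, minimum=1, maximum=4)
steady = scaling_decision([50, 55, 45], current=2, minimum=1, maximum=4)
print(busy, quiet, steady)
```

The gap between the two thresholds is deliberate: it keeps the group from flapping between sizes when load hovers around a single cutoff.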

Need for Configuring Availability Zones and Cross-Regions

Cloud service failures are expected, and some of them are well known. Microsoft Azure documents its incidents well: https://azure.microsoft.com/en-us/status/history/. Many traditional data center IT ops teams are still learning about cloud provider SLAs, and many seem surprised to learn that the cloud provider is not responsible for many availability scenarios, including backup and recovery. It is important to get the architecture right while migrating and re-platforming to the cloud, because many IT organizations find it hard to refactor after the migration is done. Consulting firms and system integrators recommended by the cloud providers are a good place to start. However, ongoing management is up to the cloud operations teams within the organization. Check out the GCP documentation on Live Migration of VMs and how to set policies for your application infrastructure: https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options
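A back-of-the-envelope calculation shows why spreading an application across availability zones or regions matters. The sketch below assumes zone failures are independent, which is rarely fully true in practice. The 99% figure is illustrative only and is not taken from any provider's SLA.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The application is down only when every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


def downtime_minutes_per_year(availability: float) -> float:
    """Expected unavailable minutes in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Illustrative comparison: one zone versus two or three.
for zones in (1, 2, 3):
    availability = combined_availability(0.99, zones)
    print(zones, round(downtime_minutes_per_year(availability), 1))
```

Under these assumptions, each additional zone cuts expected downtime by a factor of one hundred. This only holds if the application is actually deployed and able to serve traffic in every zone.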

App Resiliency with Autonomous App Level DR

With all the cloud platform advancements described above, it is now possible not only to reduce DR costs drastically but also to increase application-level resiliency tremendously. Smart IT/cloud operations teams and new SRE teams can combine all of these capabilities, given the right tools and enough time. They must also keep up with the ever-growing list of services coming from cloud platform providers and with application teams that want to use more infrastructure every day. However, organizational compliance policies for protection and disaster recovery that have been around for decades are usually neglected when a particular business group decides to move to the cloud for agility reasons. Cloud operations teams are then forced to reckon with the non-compliance risks afterward, when applications disappear, are attacked by ransomware, or break because of misconfigurations caused by complexity. It is worthwhile to look at managed services that offer completely automated app-level disaster recovery for cloud applications.

Govind Rangasamy, a serial entrepreneur, is the founder and CEO of Appranix. With extensive experience in building products in the cloud automation and enterprise IT management space, Govind founded Appranix with the belief that existing infrastructure-centric cloud and IT automation solutions are completely inadequate to handle application resiliency. Prior to Appranix, Govind was the CEO of FogPanel, a multi-cloud service management company that was sold to UST Global. Before starting FogPanel, Govind led Actifio’s cloud and resiliency solutions group. He also led products at Eucalyptus Systems, an AWS-compatible open source private cloud leader. Prior to Eucalyptus, he successfully transformed HP’s Storage Management product line into a leader in the Gartner Magic Quadrant.

Find More Blogs at: https://www.appranix.com/resources/blogs/index.html

Contact Appranix

Email: sales@appranix.com

Website: www.appranix.com

Phone: +1 508-656-0756

Written by Appranix