Are your cloud applications really resilient?

Appranix
4 min read · Jul 13, 2021

Are you trying to achieve cloud application resilience using your existing backup and recovery mechanisms? We are not saying that backup and recovery are unnecessary. In fact, you must have backup and recovery built on the newer capabilities of the cloud platforms themselves, especially for critical databases, where you need a way to go back and recover to any point in time. The same goes for critical VMs, which should be able to recover from regional failures using the cloud platform's built-in functionality rather than third-party data management platforms, which inevitably introduce lock-in. However, it is just as important to be able to protect and recover all the cloud resources of an entire distributed application environment (load balancers, gateways, security groups, configurations, and their dependencies) so you can be confident that your cloud applications are truly resilient.

For instance, look at the cloud well-architected frameworks, which are cloud provider-specific best practices such as https://docs.microsoft.com/en-us/assessments/?mode=questionnaire&question=0&category=Reliability, and you will see that achieving application reliability takes more than protecting and recovering VMs and databases. We are not going to address every well-architected pillar here, such as operational excellence, security, and cost optimization. We are talking about achieving better reliability, one of the core pillars of a well-architected application on the cloud platforms.

Achieving better reliability on the cloud is only possible if you approach application protection from an application-centric perspective rather than an infrastructure-centric one, especially when the infrastructure is auto-scaled behind a load balancer. This matters even more because that infrastructure changes continuously and dynamically through a variety of mechanisms. For example, your development groups might change your cloud infrastructure through multiple DevOps pipelines using infrastructure as code. Your site reliability engineers, looking at individual services or microservices, might stabilize them using their own tools. Other applications or systems that try to optimize application environments for security, reliability, and cost, while meeting reliability SLAs, change the infrastructure using a multitude of tools: cloud CLIs, consoles, or third-party management systems.

Now, the biggest problem is knowing all your cloud services. It is very difficult to control what you cannot see. When multiple entry points dynamically change your cloud environment, it is very difficult to know all the services that are in play, all their configurations, and all the dependencies of the individual services that make up your system. The real question is: how do you achieve the system SLAs that your business leaders, your customers, and your partners expect? And how do you achieve them on shared public cloud infrastructure, where a system can fail for many different reasons?

When we talk to customers, we hear about various types of system failures. These customers include companies operating large financial services systems on a public cloud like AWS, a project management SaaS platform company that operates many different project management systems on GCP, and a consumer retail company that operates a retail partner e-commerce SaaS system on Azure public cloud infrastructure, a distributed system for hundreds and thousands of its partners that manufacture contact lenses. A failure could be introduced by a single bad deployment; by a misconfiguration of a cloud service after multiple configuration changes are applied on top of your nicely packaged and deployed cloud infrastructure environment; by the failure of a single cloud service that your system critically depends on; by a natural disaster or an entire cloud region failure; by a ransomware attack; or by any of the many other reasons why cloud applications fail. In these situations, you have to ask whether existing backup and recovery mechanisms, configuration management systems, or cloud management systems can really achieve the high SLA levels that businesses now demand.
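To make the misconfiguration case above a little more concrete, here is a minimal sketch of drift detection: comparing a saved snapshot of resource configurations against the current state. Everything in it is an assumption for illustration. The `Snapshot` shape, the `detect_drift` function, and the resource names are hypothetical and are not taken from any particular cloud API or from Appranix itself.

```python
from typing import Any

# Hypothetical shape: resource ID -> that resource's configuration.
Snapshot = dict[str, dict[str, Any]]


def detect_drift(baseline: Snapshot, current: Snapshot) -> dict[str, list[str]]:
    """Report resources added, removed, or changed since the baseline."""
    added = sorted(current.keys() - baseline.keys())
    removed = sorted(baseline.keys() - current.keys())
    changed = sorted(
        rid
        for rid in baseline.keys() & current.keys()
        if baseline[rid] != current[rid]
    )
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Illustrative data only; real snapshots would come from the cloud provider.
    baseline = {
        "lb-frontend": {"port": 443},
        "sg-web": {"ingress": ["443"]},
        "gw-api": {"tier": "standard"},
    }
    current = {
        "lb-frontend": {"port": 443},
        "sg-web": {"ingress": ["443", "22"]},  # a port opened outside the pipeline
        "vm-debug": {"size": "small"},         # a resource created by hand
    }
    print(detect_drift(baseline, current))
```

A real environment would also need to track dependencies between resources, which this sketch deliberately leaves out.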

Rapid recoveries from application downtimes to increase SLAs!

Recover fast from cloud misconfigurations and drifts, bad deployments, ransomware attacks, cloud services/region failures, or natural disasters!

Your business leaders do not care where you operate your infrastructure, whether it is on the cloud, on-prem, or a combination. Your customers do not care how you run your systems; all they want is an “Always-on application”, which means that you cannot afford to be down for even five minutes in a month. Can you achieve that level of stability with existing management, backup and recovery, or even DR systems? This is why we created Appranix, which is dedicated to achieving cloud application resilience for applications that are dynamic and distributed and that try to take advantage of all the power of the cloud platforms. We are not the ones saying that we are leading the market in this space; analysts like Gartner, Enterprise Management Associates, and TAG CYBER, along with a number of enterprise customers, are supporting us.
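To put the five-minute figure above in perspective, here is a small sketch of the arithmetic. It assumes a 30-day month and a simple uptime-over-total-time definition of availability; the function names are ours, chosen for illustration. Under those assumptions, five minutes of downtime in a month works out to roughly 99.988% availability, slightly short of "four nines" (99.99%), which allows only about 4.32 minutes.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Downtime budget in minutes for an availability target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def availability_pct(downtime_minutes: float, days: int = 30) -> float:
    """Availability percentage implied by a given amount of downtime."""
    total_minutes = days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


if __name__ == "__main__":
    # Five minutes of downtime in an assumed 30-day month.
    print(f"5 min/month -> {availability_pct(5):.3f}% availability")

    # Downtime budgets for common availability targets.
    for target in (99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% -> {budget:.2f} minutes of downtime per month")
```

Real SLAs often define the measurement window and what counts as downtime differently, so treat these numbers as a rough guide rather than a contractual calculation.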

Original Blog Link: https://www.appranix.com/resources/blogs/2021/07/are-your-cloud-applications-really-resilient.html

Come and join us for a discussion on July 29th about achieving distributed application resilience on cloud-native systems like Kubernetes — https://dashboard.gotowebinar.com/webinar/4669190892595660560
