Assess and Improve Application Resilience Using “AWS Resilience Hub”
Is your application Resilient?
Imagine a sports goods retailer “BiRE Ltd” is expanding its business into fashion merchandising by acquiring the company FreshSoles (fictional) based in Italy.
FreshSoles’ website was down for a couple of days due to the recent floods in Italy, resulting in losses for the business. The disaster hit the business hard; they had data loss with significant downtime because of the long recovery time.
The team at BiRE needed to understand their application’s resiliency in order to avoid downtime in the future. Their prime focus was the question “how quickly can the application be recovered if a disaster strikes?” They addressed it by configuring auto-scaling, high availability, and failover. All of these approaches aim to improve the RTO (Recovery Time Objective) & RPO (Recovery Point Objective), but they do not by themselves show whether the targets are met.
Although every application’s architecture has an RTO & RPO (or should have!), many people are unsure how these objectives are measured and monitored in production. While monitoring of application and infrastructure metrics is implemented almost everywhere, in my experience RTO & RPO values are often not assessed, even when they specify the required resiliency of the application. The following sections cover how the RTO & RPO targets for an application can be validated.
What is Resiliency?
The Resiliency of an application is defined as the ability of a workload to withstand partial or intermittent failures across components.
Resiliency targets are expressed using the metrics below:
- Recovery Time Objective (RTO): the maximum acceptable time to restore the application after a failure.
- Recovery Point Objective (RPO): the maximum window of time in which data might be lost after an incident.
Depending on business needs, these two can be measured in seconds, minutes, hours, or days.
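To make the two objectives concrete, here is a minimal Python sketch that checks a single incident against RTO and RPO targets. The `Incident` fields, the timestamps, and the 4-hour and 1-hour targets are illustrative assumptions, not values from the FreshSoles outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_backup: datetime   # most recent recoverable data point
    outage_start: datetime       # when the service went down
    service_restored: datetime   # when the service was usable again


def check_targets(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    """Compare the observed recovery time and data-loss window with the targets."""
    recovery_time = incident.service_restored - incident.outage_start
    data_loss_window = incident.outage_start - incident.last_good_backup
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "rto_met": recovery_time <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical incident: two days of downtime, last backup 9.5 hours before the outage
incident = Incident(
    last_good_backup=datetime(2023, 6, 1, 0, 0),
    outage_start=datetime(2023, 6, 1, 9, 30),
    service_restored=datetime(2023, 6, 3, 9, 30),
)
result = check_targets(incident, rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(result["rto_met"], result["rpo_met"])  # False False
```

A check like this only works after an incident has happened; the rest of the article is about estimating the same thing from the architecture beforehand.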
How does AWS Resilience Hub work?
The architects at “BiRE Ltd” used AWS Resilience Hub to evaluate the RTO & RPO of the application’s architecture. Let’s see what the service is capable of and the steps to configure it.
AWS Resilience Hub is a managed service in which we define RTO & RPO targets and then validate and track the resiliency of applications against them. Teams can use it to discover potential resilience improvements and to support business continuity.
The targets are defined in a resiliency policy, which is associated with the application and its infrastructure. The service evaluates the configuration against the policy and provides recommendations for meeting the targets.
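As a rough illustration of what such a policy looks like outside the console, the sketch below builds the target structure and passes it to the `create_resiliency_policy` call of the boto3 `resiliencehub` client. The 4-hour RTO, 1-hour RPO, the `"Important"` tier, and the policy name are assumptions chosen for the example; the article's actual policy values appear only in Figure 6.

```python
def build_policy(rto_secs: int, rpo_secs: int) -> dict:
    """Apply the same RTO/RPO target to each disruption type in the policy."""
    return {
        disruption: {"rtoInSecs": rto_secs, "rpoInSecs": rpo_secs}
        for disruption in ("Software", "Hardware", "AZ")
    }


def create_policy(name: str, policy: dict) -> str:
    """Create the policy in Resilience Hub and return its ARN (needs AWS credentials)."""
    import boto3  # imported lazily so build_policy works without the AWS SDK

    client = boto3.client("resiliencehub")
    response = client.create_resiliency_policy(
        policyName=name,
        tier="Important",  # assumed tier, pick the one matching your workload
        policy=policy,
    )
    return response["policy"]["policyArn"]


# Hypothetical targets: 4-hour RTO, 1-hour RPO for every disruption type
policy = build_policy(rto_secs=4 * 3600, rpo_secs=3600)
print(policy["AZ"])  # {'rtoInSecs': 14400, 'rpoInSecs': 3600}

# With credentials configured, the policy could then be created with:
# policy_arn = create_policy("freshsoles-policy", policy)
```

Keeping the dictionary construction separate from the API call makes the targets easy to review and version alongside the infrastructure code.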
The service’s key capabilities include:
- Assess:
a. Assessing whether the architecture meets the target RPO & RTO.
b. An assessment that can be triggered as part of CI/CD (Continuous Integration/Continuous Deployment) pipelines and can be scheduled to run daily.
- Resilience recommendations:
a. Assessing application components, with suggestions on how to enhance the RTO & RPO and the costs with minimal changes.
- Operational recommendations:
a. Providing a guide to configure monitoring alarms, SOPs (Standard Operating Procedures), and AWS Fault Injection Simulator experiments through CloudFormation templates and Terraform scripts.
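The CI/CD trigger mentioned under "Assess" could look roughly like the following pipeline gate. It is a sketch that assumes a boto3 `resiliencehub` client is passed in and uses its `start_app_assessment` and `describe_app_assessment` calls; the function name, the assessment name, and the polling settings are hypothetical.

```python
import time


def policy_is_met(client, app_arn: str, poll_seconds: float = 10, max_polls: int = 60) -> bool:
    """Start an assessment of the released app version and wait for the verdict."""
    started = client.start_app_assessment(
        appArn=app_arn,
        appVersion="release",
        assessmentName="pipeline-check",  # hypothetical name
    )
    assessment_arn = started["assessment"]["assessmentArn"]

    for _ in range(max_polls):
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn
        )["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            # complianceStatus is expected to be PolicyMet or PolicyBreached
            return assessment.get("complianceStatus") == "PolicyMet"
        if status == "Failed":
            raise RuntimeError("Resilience Hub assessment failed")
        time.sleep(poll_seconds)

    raise TimeoutError("assessment did not finish in time")


# In a pipeline step (requires AWS credentials and a published app version):
# import boto3, sys
# client = boto3.client("resiliencehub")
# sys.exit(0 if policy_is_met(client, "<application ARN>") else 1)
```

Passing the client in as a parameter keeps the gate easy to exercise with a stub before it is connected to a real account.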
Supported resources
AWS offers many services, but only some of them are supported by Resilience Hub at present (as of June 05, 2023). The supported resources are illustrated in Figure 2.
As the service evolves, more services may be supported. The current list can be found on the official page: https://docs.aws.amazon.com/resilience-hub/latest/userguide/supported-resources.html
Disruption types — resiliency scores
AWS Resilience Hub tracks infrastructure and applications through a ‘resiliency score’. The score reflects how closely the application follows AWS recommendations for meeting the resiliency policy, alarms, SOPs, etc., and it indicates the ability of the application to withstand disruption.
Based on the type of resources each application uses, Resilience Hub recommends alarms and SOPs for each disruption type. BiRE Ltd.’s architects used the service to evaluate FreshSoles’ application (see “FreshSoles’ Application Architecture” in Figure 5 below) and compared the resiliency score before and after implementing the recommendations to gauge the application’s readiness.
Details of the disruption types and the possible disruptions of each type are shown in Figure 3. To improve the resiliency score, implement the recommendations regularly.
Resilience Hub assigns a weight to each recommendation type when calculating the total resiliency score. Tables 1 & 2¹ show the weights for alarms, SOPs, tests, meeting the resiliency policy, and disruption types.
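The weighting idea can be sketched as a simple weighted sum. In the snippet below, only the 40% weight for meeting the resiliency policy is taken from this article (it is the score reported later once the policy is met); the three 20% weights are placeholders for illustration, and the per-disruption-type weighting is left out entirely. Refer to the weights page [1] for the real values.

```python
# Percent weights per recommendation type.
# Only "policy" (40) is stated in the article; the others are placeholder values.
WEIGHTS = {"policy": 40, "alarms": 20, "sops": 20, "tests": 20}


def resiliency_score(completed: dict) -> float:
    """Weighted score in percent.

    `completed` maps a recommendation type to the fraction implemented (0.0 to 1.0);
    missing types count as not implemented.
    """
    return sum(weight * completed.get(kind, 0.0) for kind, weight in WEIGHTS.items())


print(resiliency_score({}))                 # 0.0
print(resiliency_score({"policy": 1.0}))    # 40.0
```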
Using Resilience Hub
This flow diagram describes how one may use the AWS Resilience Hub service:
Pricing details
Like most cloud services, AWS Resilience Hub charges for consumption. It bills based on the number of applications assessed.
The free trial covers the first 6 months of usage for up to 3 applications. After that, each application costs $15.00 per month. Pricing details are shown in Table 3² (as of June 05, 2023).
For example, the monthly cost of assessing FreshSoles’ application (see “FreshSoles’ Application Architecture”) would be calculated as follows:
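A minimal sketch of that arithmetic, using the $15.00 per application per month rate quoted above. The number of applications covered by the free allowance is passed in explicitly, because it depends on whether the free period still applies to the account; the function and parameter names are hypothetical.

```python
def monthly_cost(num_apps: int, free_apps: int = 0, rate_usd: float = 15.00) -> float:
    """Monthly Resilience Hub charge for the applications not covered by a free allowance."""
    if num_apps < 0 or free_apps < 0:
        raise ValueError("application counts must be non-negative")
    billable = max(num_apps - free_apps, 0)
    return billable * rate_usd


# FreshSoles is a single application
print(monthly_cost(1, free_apps=3))  # 0.0  (covered by the free allowance)
print(monthly_cost(1))               # 15.0 (no free allowance)
print(monthly_cost(5, free_apps=3))  # 30.0 (two billable applications)
```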
FreshSoles’ Application Architecture
To understand how the team at BiRE improved their application resiliency, let’s check the architecture of their application. The diagram in Figure 5 shows the application’s AWS architecture:
As shown, the application consists of a web server running on EC2, with static data served from an S3 bucket. The EC2 server uses a NAT Gateway to make calls to the internet. As illustrated, the server currently runs in a single Availability Zone (AZ). The architecture includes Elastic Load Balancing (ELB) and Auto Scaling to provision more servers according to load and to distribute the traffic.
This application is now assessed by Resilience Hub against the policy shown in Figure 6, which sets the targets (RPO & RTO values) for the application and its supporting infrastructure, to find areas for improvement in the architecture.
The assessment result (shown below) raises multiple red flags, indicating that the policy (detailed in Figure 6) has been breached: the application does not meet the required RTO & RPO goals. Resilience Hub assesses several failure types, including failures in the application, the infrastructure, and AZ availability.
The resiliency score generated for the application is 0% (details in Figure 8), which means plenty of scope for improvement!
The recommendations below illustrate the improvements needed in the architecture to address the red flags. They include enabling versioning on the S3 bucket, backing up S3 objects, and changing the configuration of the Auto Scaling group and ELB.
The operational recommendations are presented in Figure 10. They recommend configuring CloudWatch alarms for the infrastructure: for example, alarms that trigger on high CPU or memory utilization of the server, or on status check failures of the web server, with the operations team notified through an SNS topic.
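Resilience Hub delivers these alarms as templates, but the shape of one such alarm can be sketched by hand. The snippet below builds the arguments for the CloudWatch `put_metric_alarm` call for a high-CPU alarm on an Auto Scaling group; the alarm name, the 80% threshold, the evaluation window, and the ARN are assumptions for illustration, not the values Resilience Hub generates.

```python
def cpu_alarm_params(asg_name: str, sns_topic_arn: str, threshold_percent: float = 80.0) -> dict:
    """Keyword arguments for CloudWatch put_metric_alarm: average CPU above the threshold."""
    return {
        "AlarmName": f"{asg_name}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": asg_name}],
        "Statistic": "Average",
        "Period": 300,              # seconds per datapoint
        "EvaluationPeriods": 2,     # two consecutive breaching datapoints
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notify the operations team via SNS
    }


params = cpu_alarm_params(
    "freshsoles-web-asg",                                # hypothetical group name
    "arn:aws:sns:eu-south-1:123456789012:ops-alerts",    # hypothetical topic ARN
)
print(params["AlarmName"])  # freshsoles-web-asg-high-cpu

# With credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**params)
```

Memory utilization is not among the metrics EC2 publishes by default, so a memory alarm would additionally need the CloudWatch agent on the instances.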
Figure 11 shows the updated architecture after implementing the Resilience Hub recommendations from Figure 9. The changes include modifying the Auto Scaling group to provision servers in multiple Availability Zones and adding those servers to the target group behind the load balancer. In addition, versioning is enabled on the S3 bucket and a backup plan with point-in-time recovery is configured.
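Two of these changes map onto single API calls, sketched below with boto3-style clients passed in as parameters: `put_bucket_versioning` on the S3 client and `update_auto_scaling_group` on the Auto Scaling client. The function name and all resource names are hypothetical, the subnets are assumed to sit in different Availability Zones, and the backup plan is not covered here.

```python
def apply_recommendations(s3, autoscaling, bucket: str, asg_name: str, subnet_ids: list) -> None:
    """Enable S3 versioning and spread the Auto Scaling group across several subnets."""
    if len(subnet_ids) < 2:
        raise ValueError("need subnets in at least two Availability Zones")

    # Keep previous object versions so overwritten or deleted data can be restored
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # VPCZoneIdentifier takes a comma-separated list of subnet IDs
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        VPCZoneIdentifier=",".join(subnet_ids),
    )


# With credentials configured (all names are placeholders):
# import boto3
# apply_recommendations(
#     boto3.client("s3"),
#     boto3.client("autoscaling"),
#     bucket="freshsoles-static",
#     asg_name="freshsoles-web-asg",
#     subnet_ids=["subnet-aaaa", "subnet-bbbb"],
# )
```

In practice the same changes would more likely be made in the CloudFormation or Terraform definitions so that they survive the next deployment.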
The resiliency of the architecture is then reassessed. As seen in Figure 12, the red flags reported earlier (refer to Figure 7) have been addressed and the resiliency policy is now met. Finally, as seen in Figure 13, the resiliency score is now improved from 0% to 40%.
Conclusion
As per the weighting mentioned in Table 1, the resiliency score of the application is 40% after the resiliency policy is met. Once the operational recommendations are implemented, the score will improve further. It’s clear that the application’s readiness is enhanced.
This document gives a walkthrough of how the team at BiRE Ltd used the AWS Resilience Hub service to assess FreshSoles’ application, how they identified areas for improvement, and how they implemented the recommendations to meet the target RPO & RTO goals.
The benefit of using this service is a reduced risk of an outage, because weaknesses in the architecture are found by assessment rather than by an incident. Further, when assessments are triggered from the deployment pipeline or scheduled, each change rolled out to the application or infrastructure produces a fresh assessment and report.
In this scenario, AWS Resilience Hub provided a comprehensive view of the application. It helped reduce complexity and clearly displayed the current risk state in a multi-faceted hub.
Overall, it was a painless process to understand and implement the changes to the architecture and enhance the application’s readiness to withstand a disaster.
In my next article, I will aim to cover the difference between Disaster Recovery and High Availability.
References
[1]: Amazon Web Services. AWS Resilience Hub User Guide. Available at: https://docs.aws.amazon.com/resilience-hub/latest/userguide/weight.html (Accessed: 04 July 2023).
[2]: Amazon Web Services. AWS Resilience Hub: Pricing. Available at: https://aws.amazon.com/resilience-hub/pricing/ (Accessed: 04 July 2023).
[3]: Amazon Web Services. AWS Resilience Hub. Available at: https://aws.amazon.com/resilience-hub/ (Accessed: 04 July 2023).