Assess and Improve application resilience using “AWS Resilience Hub”

Is your application Resilient?

Published in

Deloitte UK Engineering Blog

8 min read · Aug 2, 2023

AWS Resilience Hub. How Secure is your cloud?

Imagine a sports goods retailer “BiRE Ltd” is expanding its business into fashion merchandising by acquiring the company FreshSoles (fictional) based in Italy.

FreshSoles’ website was down for a couple of days due to the recent floods in Italy, resulting in losses for the business. The disaster hit the business hard; they had data loss with significant downtime because of the long recovery time.

The team at BiRE had to find answers about their application’s resiliency to avoid downtime in the future. Their prime focus was to find out “how quickly can the application be recovered if a disaster strikes?” The usual answers are auto-scaling, high availability, and failover. However, these approaches only aim for a better RTO (Recovery Time Objective) & RPO (Recovery Point Objective); they do not show whether the targets are actually achieved.

Although every application’s architecture has an RTO & RPO (or should have!), many people are unsure how these objectives are measured and monitored in production. Monitoring of application and infrastructure metrics is implemented almost everywhere but, in my experience, RTO & RPO values are often not assessed, even though they specify the required resiliency of the application. The following sections cover how the RTO & RPO targets for an application can be validated.

What is Resiliency?

The Resiliency of an application is defined as the ability of a workload to withstand partial or intermittent failures across components.

Resiliency targets are expressed using the metrics below:

Recovery Time Objective (RTO): the maximum acceptable time to restore the application after a failure.
Recovery Point Objective (RPO): the maximum window of time in which data might be lost after an incident.

Depending on business needs, these two can be measured in seconds, minutes, hours, or days.
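To make the two definitions concrete, here is a minimal Python sketch that derives the data-loss window and the downtime of a single incident from three timestamps and compares them with the targets. The function names, timestamps, and target values are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_backup, failure, restored):
    """Return (data-loss window, downtime) for a single incident."""
    data_loss_window = failure - last_backup  # compared with the RPO
    downtime = restored - failure             # compared with the RTO
    return data_loss_window, downtime


def meets_targets(last_backup, failure, restored, rpo, rto):
    """True only if both the data-loss window and the downtime are within target."""
    data_loss_window, downtime = recovery_metrics(last_backup, failure, restored)
    return data_loss_window <= rpo and downtime <= rto


# Hypothetical incident: last backup at 02:00, failure at 09:30, restored at 13:30.
last_backup = datetime(2023, 5, 17, 2, 0)
failure = datetime(2023, 5, 17, 9, 30)
restored = datetime(2023, 5, 17, 13, 30)

loss, downtime = recovery_metrics(last_backup, failure, restored)
print("data-loss window:", loss)
print("downtime:", downtime)
print("targets met:", meets_targets(last_backup, failure, restored,
                                    rpo=timedelta(hours=1), rto=timedelta(hours=4)))
```

In this made-up incident the downtime is within a 4-hour RTO, but the last backup is too old for a 1-hour RPO, so the targets are not met.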

A diagram visualising RPO & RTO over a timeline — Figure 1. Illustration of RPO & RTO

How does AWS Resilience Hub work?

The architects at “BiRE Ltd” used AWS Resilience Hub to evaluate the RTO & RPO of the application’s architecture. Let’s see what the service is capable of and the steps to configure it.

AWS Resilience Hub is a managed service where we can define RPO & RTO objectives, then validate & track the resiliency of applications against those targets. Teams can discover potential resilience enhancements using this service. Further, this service helps optimise business continuity.

The user defines RTO & RPO targets in a policy and associates that policy with the application & infrastructure. The service then evaluates the configuration and provides recommendations to ensure the target requirements are met.
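As a rough illustration of what such a policy looks like in code, the sketch below builds a request with per-disruption-type targets in seconds. The key names follow my understanding of the boto3 `resiliencehub` client’s `create_resiliency_policy` call and should be checked against the current API reference; the policy name, tier, and target values are hypothetical and are not the values used in this article.

```python
def build_policy_request(name, tier, targets_in_minutes):
    """Build a policy request; targets_in_minutes maps disruption type -> (RTO, RPO)."""
    policy = {}
    for disruption_type, (rto_minutes, rpo_minutes) in targets_in_minutes.items():
        if rto_minutes < 0 or rpo_minutes < 0:
            raise ValueError(f"negative target for {disruption_type}")
        # The service expresses targets in seconds.
        policy[disruption_type] = {
            "rtoInSecs": rto_minutes * 60,
            "rpoInSecs": rpo_minutes * 60,
        }
    return {"policyName": name, "tier": tier, "policy": policy}


# Hypothetical targets for application (Software), infrastructure (Hardware) and AZ.
request = build_policy_request(
    name="freshsoles-web-policy",  # hypothetical name
    tier="Critical",
    targets_in_minutes={"Software": (60, 30), "Hardware": (60, 30), "AZ": (120, 60)},
)
print(request["policy"]["AZ"])

# Assumed usage with real credentials (not executed here):
#   import boto3
#   boto3.client("resiliencehub").create_resiliency_policy(**request)
```

Keeping the request as a plain dictionary makes it easy to review the targets in version control before anything is sent to AWS.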

The service’s key capabilities include:

1. Assess:
a. Assessing whether the architecture meets the target RPO & RTO.
b. Running assessments as part of CI/CD (Continuous Integration/Continuous Deployment) pipelines, or on a daily schedule.
2. Resilience recommendations:
a. Assessing application components and suggesting how to improve RTO, RPO, and cost with minimal changes.
3. Operational recommendations:
a. Providing guidance for configuring monitoring alarms, SOPs (Standard Operating Procedures), and AWS Fault Injection Simulator experiments through CloudFormation templates and Terraform scripts.
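The CI/CD capability could be wired in along the following lines. This is a sketch, not a definitive pipeline step: it assumes a boto3 `resiliencehub` client whose `start_app_assessment` and `describe_app_assessment` calls return the fields used below, and that the published application version is called "release". The function name and polling defaults are my own.

```python
import time


def run_assessment_gate(client, app_arn, assessment_name,
                        app_version="release", poll_seconds=15, max_polls=80):
    """Start an assessment, wait for it to finish, and return True if the policy is met."""
    started = client.start_app_assessment(
        appArn=app_arn, appVersion=app_version, assessmentName=assessment_name)
    assessment_arn = started["assessment"]["assessmentArn"]

    for _ in range(max_polls):
        assessment = client.describe_app_assessment(
            assessmentArn=assessment_arn)["assessment"]
        status = assessment["assessmentStatus"]
        if status == "Success":
            # The compliance status says whether the RTO & RPO targets are met.
            return assessment.get("complianceStatus") == "PolicyMet"
        if status == "Failed":
            raise RuntimeError(f"assessment {assessment_arn} failed")
        time.sleep(poll_seconds)
    raise TimeoutError(f"assessment {assessment_arn} did not finish in time")


# Assumed usage in a pipeline step (not executed here; the ARN is a placeholder):
#   import boto3, sys
#   hub = boto3.client("resiliencehub")
#   if not run_assessment_gate(hub, "<application ARN>", "ci-run"):
#       sys.exit("Resiliency policy breached")
```

Passing the client in as a parameter keeps the function easy to exercise with a stub before pointing it at a real account.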

Supported resources

AWS offers many services, but only some of them are currently supported by Resilience Hub (as of June 05, 2023). Details of the supported resources are illustrated in Figure 2.

Image showing the list of AWS services supported by AWS Resiliency Hub — Figure 2. AWS services supported by AWS Resiliency Hub

As the service continues to evolve, more services may be supported by AWS Resilience Hub. The current list of supported services can be found on the official page, here: https://docs.aws.amazon.com/resilience-hub/latest/userguide/supported-resources.html

Disruption types — resiliency scores

AWS Resilience Hub tracks infrastructure and applications through a ‘resiliency score’. This score reflects how closely the application follows AWS recommendations for meeting the resiliency policy, alarms, SOPs, and tests. AWS Resilience Hub uses it as a metric to indicate the ability of an application to withstand disruption.

Based on the type of resources each application uses, Resilience Hub recommends alarms and SOPs for each disruption type. BiRE Ltd.’s architects used this service to evaluate FreshSoles’ application (see “FreshSoles’ Application Architecture” in Figure 5 below), comparing the resiliency score before and after implementing the recommendations to gauge the application’s readiness.

Details of disruption types and possible disruptions of each type are shown in Figure 3. To improve resiliency scores, it is recommended to regularly implement recommendations.

This image illustrates the possible types of disasters — Figure 3. Types of Disaster

Resilience Hub assigns a weight to each recommendation type when calculating the total resiliency score. Tables 1 & 2¹ show the weights for alarms, SOPs, tests, meeting the resiliency policy, and disruption types.

Image talks about the weight of alarms, SOPs, tests and resilient policy — Table 1. Weightage of alarms, SOPs, tests, policy target

Image talks about the weight of disruption types — Table 2. Weightage of each disruption type
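The weighting idea can be sketched as a simple weighted sum. Only the 40% weight for meeting the resiliency policy is taken from this article (it matches the score reported later once the policy is met); the even split of the remaining 60% across alarms, SOPs, and tests is an assumption for illustration, and the per-disruption-type weights from Table 2 are left out. Refer to Table 1 and the AWS documentation¹ for the real values.

```python
# Weights in percentage points. "policy" comes from the article; the other
# three values are placeholders and may not match Table 1.
WEIGHTS = {"policy": 40, "alarms": 20, "sops": 20, "tests": 20}


def resiliency_score(completion, weights=WEIGHTS):
    """completion maps a category to the fraction (0..1) of its recommendations implemented."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100")
    score = 0.0
    for category, weight in weights.items():
        fraction = completion.get(category, 0.0)
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {category} must be between 0 and 1")
        score += weight * fraction
    return score


print(resiliency_score({}))                # nothing implemented yet
print(resiliency_score({"policy": 1.0}))   # policy met, no operational recommendations
```

Under these assumed weights, an application that meets its policy but has none of the operational recommendations in place scores 40.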

Using Resilience Hub

This flow diagram describes how one may use the AWS Resilience Hub service:

Image gives a walkover of a flowchart explaining the steps to follow to configure Resiliency Hub — Figure 4. A flowchart explaining the steps to follow to configure Resiliency Hub

Pricing details

Like most cloud services, AWS Resilience Hub charges for consumption. It bills based on the number of applications assessed.

The free tier covers the first 3 applications for 6 months. Outside the free tier, each application costs $15.00 per month. Pricing details are shown in Table 3² (as of June 05, 2023).

Image illustrates the pricing details of the service — Table 3. Pricing details of Resiliency Hub

For example, for assessing FreshSoles’ application (see “FreshSoles’ Application Architecture”), the monthly cost would be calculated as follows:

Image gives the pricing details for FreshSole’s application — Table 4. Pricing details for FreshSoles’ application
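The arithmetic behind such an estimate can be sketched as below. This is not a reproduction of Table 4: it assumes a free tier that covers up to 3 applications during the first 6 months, and $15.00 per application per month afterwards. All parameters are adjustable, so check them against the pricing page² before relying on the numbers.

```python
def monthly_cost(applications, month, free_months=6, free_applications=3,
                 price_per_application=15.00):
    """Cost in USD for one billing month; month counts from 1 at the first assessment."""
    if applications < 0 or month < 1:
        raise ValueError("applications must be >= 0 and month must be >= 1")
    if month <= free_months:
        # Assumed free tier: only applications beyond the free allowance are billed.
        billable = max(0, applications - free_applications)
    else:
        billable = applications
    return billable * price_per_application


# A single application, such as the one assessed in this article.
for month in (1, 6, 7):
    print(month, monthly_cost(applications=1, month=month))
```

For a single application this gives no charge during the assumed free period and $15.00 per month from month 7 onwards.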

FreshSoles’ Application Architecture

To understand how the team at BiRE improved their application resiliency, let’s check the architecture of their application. The diagram in Figure 5 shows the application’s AWS architecture:

This image presents the single zone architecture of FreshSoles’ application — Figure 5. Single zone architecture of FreshSoles’ application

As shown, the application consists of its web server running on EC2 with static data served from an S3 bucket. A NAT Gateway is used by the EC2 server to make calls to the internet. As illustrated, the server is currently running in a single Availability Zone (AZ). The architecture includes Elastic Load Balancing (ELB) & Auto-scaling to provision more servers as per the load and distribute the traffic.

This application is now assessed by Resilience Hub using the policy shown in Figure 6, which holds the targets (RPO & RTO values) for the application and its supporting infrastructure, to find areas for improvement in the architecture.

Image mentions the details of resiliency policy used to assess the application — Figure 6. Details of resiliency policy used to assess the application

The assessment result (shown below) raised multiple red flags, indicating that the policy (detailed in Figure 6) has been breached: the application doesn’t meet the required RTO & RPO goals. Resilience Hub assesses several failure types, including failures in the application, the infrastructure, and AZ availability.

Image gives an overview of the assessment results — Figure 7. Overview of assessment results

The resiliency score for the application is generated and is seen as 0% (details in Figure 8) — which means plenty of scope for improvement!

Image gives the resiliency Score of the application — Figure 8: Resiliency Score of the application

The recommendations below illustrate the improvements that are needed in the architecture to address the red flags. They include enabling versioning on the S3 bucket, backing up S3 objects, and changing the configuration of the Auto-scaling group and ELB.

Figure 9.1: Resiliency recommendations for the application’s components (Networking)

Figure 9.2: Resiliency recommendations for the application’s components (Compute)

Figure 9.3: Resiliency recommendations for the application’s components (Storage)

The operational recommendations are presented in Figure 10. They recommend configuring CloudWatch alarms for the infrastructure, for example alarms that are triggered by high CPU/memory utilisation on the server or by failed status checks of the web server, with the operations team notified through an SNS topic.
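As an illustration of one such alarm, the sketch below builds the parameters for a CloudWatch high-CPU alarm that notifies an SNS topic. The parameter names are those of the CloudWatch `put_metric_alarm` call; the alarm name, threshold, evaluation window, Auto Scaling group name, and topic ARN are hypothetical. Resilience Hub supplies its own templates for the recommended alarms, so treat this only as a picture of what those templates configure.

```python
def cpu_alarm_request(auto_scaling_group, sns_topic_arn, threshold_percent=80.0):
    """Parameters for an alarm on the average CPU of an Auto Scaling group."""
    return {
        "AlarmName": f"{auto_scaling_group}-high-cpu",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": auto_scaling_group}],
        "Statistic": "Average",
        "Period": 300,            # seconds per data point
        "EvaluationPeriods": 2,   # two consecutive breaches before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # the operations team subscribes to this topic
    }


alarm = cpu_alarm_request(
    "freshsoles-web-asg",                                # hypothetical group name
    "arn:aws:sns:eu-south-1:123456789012:ops-alerts",    # hypothetical topic ARN
)
print(alarm["AlarmName"], alarm["Threshold"])

# Assumed usage with real credentials (not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Memory utilisation is not among the metrics EC2 publishes by default, so a memory alarm would also need the CloudWatch agent on the servers.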

Image presents the operational recommendations for the application’s components — Figure 10: Operational recommendations for the application’s components

Figure 11 shows the updated architecture after implementing the Resilience Hub recommendations from Figures 9.1 to 9.3. The changes include modifying the auto-scaling group to provision servers in multiple Availability Zones and adding those servers to the target group behind the load balancer. Further, versioning is enabled for the S3 bucket and a backup plan with point-in-time recovery is configured.
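Two of these changes can be sketched with boto3-style clients as follows. The calls used are `update_auto_scaling_group` on the Auto Scaling client and `put_bucket_versioning` on the S3 client; the function name, group name, subnet IDs, and bucket name are hypothetical. The sketch assumes the caller passes subnets that sit in different Availability Zones (it cannot verify that), and it does not cover the backup plan.

```python
def apply_resilience_changes(autoscaling, s3, group_name, subnet_ids, bucket):
    """Spread the Auto Scaling group over several subnets and enable S3 versioning."""
    if len(set(subnet_ids)) < 2:
        raise ValueError("pass subnets from at least two Availability Zones")

    # The group launches instances in whichever of these subnets it is given.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        VPCZoneIdentifier=",".join(subnet_ids),
    )

    # Versioning keeps earlier copies of overwritten or deleted objects.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )


# Assumed usage with real credentials (not executed here; all names are placeholders):
#   import boto3
#   apply_resilience_changes(
#       boto3.client("autoscaling"), boto3.client("s3"),
#       group_name="freshsoles-web-asg",
#       subnet_ids=["subnet-aaa111", "subnet-bbb222"],
#       bucket="freshsoles-static-assets",
#   )
```

In practice these settings would more likely live in the CloudFormation or Terraform definitions of the stack, so that the next assessment evaluates what is actually deployed.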

Image gives the improved architecture of FreshSoles’ application — Figure 11: Improved architecture of FreshSoles’ application

The resiliency of the architecture is then reassessed. As seen in Figure 12, the red flags reported earlier (refer to Figure 7) have been addressed and the resiliency policy is now met. Finally, as seen in Figure 13, the resiliency score is now improved from 0% to 40%.

Image shows the improved architecture assessment results — Figure 12: Improved architecture assessment results

Image shows the Improved architecture’s resiliency score — Figure 13: Improved architecture’s resiliency score

Conclusion

As per the weighting mentioned in Table 1, the resiliency score of the application is 40% after the resiliency policy is met. Once the operational recommendations are implemented, the score will improve further. It’s clear that the application’s readiness is enhanced.

This article walked through how the team at BiRE Ltd used the AWS Resilience Hub service: how they assessed FreshSoles’ application, identified areas for improvement, and implemented the recommendations to meet the target RPO & RTO goals.

The benefit of using this service is a reduced risk of an outage, because the architecture is assessed against its targets. Further, when the assessment is part of the delivery pipeline, each change rolled out to the application or infrastructure produces a fresh assessment and report.

In this scenario, AWS Resilience Hub provided a comprehensive view of the application. It helped reduce complexity and clearly displayed the current risk state in a multi-faceted hub.

Overall, it was a painless process to understand and implement the changes to the architecture and enhance the application’s readiness to withstand a disaster.

In my next article, I will aim to cover the difference between Disaster Recovery and High Availability.

References

[1]: Amazon Web Services. AWS Resilience Hub User Guide. Available at: https://docs.aws.amazon.com/resilience-hub/latest/userguide/weight.html (Accessed: 04 July 2023).

[2]: Amazon Web Services. AWS Resilience Hub pricing. Available at: https://aws.amazon.com/resilience-hub/pricing/ (Accessed: 04 July 2023).

[3]: Amazon Web Services. AWS Resilience Hub. Available at: https://aws.amazon.com/resilience-hub/ (Accessed: 04 July 2023).