Reliability Pillar: Well-Architected Framework
This post continues the Well-Architected Framework blog series; here I discuss the Reliability pillar.
The Well-Architected Framework from Amazon Web Services was developed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. The framework has five pillars: operational excellence, security, reliability, performance efficiency, and cost optimization. It helps cloud architects and customers alike evaluate architectures and implement designs that will scale over time.
Let’s talk about the reliability pillar in depth.
The Reliability pillar covers the ability of a system to recover from infrastructure or service disruptions, to dynamically acquire computing resources to meet real-time demand, and to mitigate the effect of disruptions such as infrastructure or network failures.
The Reliability pillar of the Well-Architected Framework focuses on three areas: Foundations, Change Management, and Failure Management. In short, a reliable system needs a well-planned foundation, monitoring from the bottom to the top level, and mechanisms to handle planned changes; it should also detect failures and recover from any unexpected failure automatically.
Let’s look at each of these three areas in a bit more detail.
Before we design any system, extra care must be given to the foundational parameters that affect the functioning of the whole system. For example, before laying fiber-optic cable for an internet connection, the estimated number of customers and their locations must be kept in mind; with that planning, the correct length of cable can be used and there is no guesswork. With AWS, most foundational parameters are addressed automatically as needed. We still have the freedom to select compute power and storage size, so care must be taken when choosing them for different projects.
In traditional environments, changes have to be carefully monitored and coordinated across all the affected teams. With AWS, we can monitor changes in real time and respond to them proactively without affecting system functions. We can also automate the system’s response, for example, decreasing the number of servers when traffic to the website is low.
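As a sketch of that kind of automated response, the snippet below builds a target-tracking scaling policy that lets an Auto Scaling group shrink (and grow) to hold average CPU load, a common proxy for traffic, at a target value. The group name and target value are hypothetical; actually applying the policy requires boto3 and valid AWS credentials.

```python
def build_target_tracking_policy(asg_name, target_cpu_percent):
    """Build the parameters for the Auto Scaling put_scaling_policy call.

    With target tracking, AWS adds instances when average CPU rises above
    the target and removes them when it falls below -- no manual alarms.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": float(target_cpu_percent),
        },
    }


def apply_policy(params):
    """Apply the policy for real (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above stays testable offline

    boto3.client("autoscaling").put_scaling_policy(**params)
```

For example, `build_target_tracking_policy("web-asg", 40)` produces a policy that keeps the hypothetical `web-asg` group around 40% average CPU, scaling in automatically during quiet periods.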
Any reasonably complex system will face failures at some point. In a traditional IT environment, it is expensive and time-consuming to set up a duplicate system just to find the level at which failures occur. With AWS, we can spin up the exact system configuration we wish to test and check it for failures. Monitoring can be automated at every level; for example, the system can be set to rectify an error on its own once a threshold that we define manually is crossed.
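One concrete example of this kind of self-healing is CloudWatch’s built-in EC2 recovery action: an alarm on the instance’s system status check can automatically recover the instance onto healthy hardware. The sketch below builds the parameters for such an alarm; the instance ID and thresholds are illustrative, and creating the alarm requires boto3 and credentials.

```python
def build_recovery_alarm(instance_id, region="us-east-1"):
    """Parameters for CloudWatch put_metric_alarm: auto-recover an EC2
    instance when its system status check fails for two minutes."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,               # one-minute datapoints
        "EvaluationPeriods": 2,     # two consecutive failures
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # Built-in CloudWatch action that recovers the instance.
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_alarm(params):
    """Create the alarm for real (needs boto3 and AWS credentials)."""
    import boto3  # deferred so the builder stays testable offline

    boto3.client("cloudwatch").put_metric_alarm(**params)
```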
The AWS service most essential to reliability is Amazon CloudWatch, which monitors AWS resources and the applications we run on AWS in real time. We can use CloudWatch to collect and track metrics: measurable variables for our resources and applications. Other options in this space include Datadog, Dynatrace, and New Relic.
The CloudWatch home page automatically displays metrics about every AWS service you use. You can additionally create custom dashboards to display metrics about your custom applications and display custom collections of metrics that you choose.
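Beyond the metrics AWS emits automatically, applications can publish their own custom metrics via the `put_metric_data` API, which is what feeds those custom dashboards. A minimal sketch, assuming a hypothetical namespace and metric name:

```python
def build_metric_datum(name, value, unit="Count", **dimensions):
    """One datapoint in the shape CloudWatch put_metric_data expects."""
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Dimensions": [
            {"Name": k, "Value": v} for k, v in dimensions.items()
        ],
    }


def publish(namespace, data):
    """Ship datapoints to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # deferred so build_metric_datum stays testable offline

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=data
    )
```

Usage might look like `publish("MyApp", [build_metric_datum("OrdersProcessed", 3, Service="checkout")])`, where `MyApp`, `OrdersProcessed`, and `checkout` are all placeholder names for your own application.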
Limit management — This deals with the physical and service limits the system might hit once it is running, so before building the architecture, factors such as network bandwidth and storage capacity must be accounted for. We can use AWS Trusted Advisor checks to test the architecture against performance, security-group, and service-limit recommendations.
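Service limits can also be inspected programmatically through the Service Quotas API. The sketch below computes how much headroom remains under a quota; the `ServiceCode`/`QuotaCode` pair passed to the live call is an assumption you would look up for your own account, and the call itself needs boto3 and credentials.

```python
def quota_headroom(quota_value, current_usage):
    """Fraction of a service quota still available, clamped to [0, 1]."""
    if quota_value <= 0:
        raise ValueError("quota must be positive")
    return max(0.0, (quota_value - current_usage) / quota_value)


def fetch_quota(service_code, quota_code):
    """Fetch the current quota value (needs boto3 and AWS credentials).

    service_code/quota_code (e.g. for EC2 instance limits) must be looked
    up for your account; the values are not hard-coded here on purpose.
    """
    import boto3  # deferred so quota_headroom stays testable offline

    resp = boto3.client("service-quotas").get_service_quota(
        ServiceCode=service_code, QuotaCode=quota_code
    )
    return resp["Quota"]["Value"]
```

An alerting job could call `fetch_quota`, compare it against current usage, and page someone when `quota_headroom` drops below, say, 0.2.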
Network topology planning — This covers topology factors such as the number of IP addresses we might need in the future and the systems and networks we want to integrate with. The architecture should also be safe from possible attacks and misconfigurations, and resilient in the face of an unexpected surge in traffic.
Monitoring — We should know how to access the logs of any service whenever checks need to be performed. The key AWS service that supports monitoring is Amazon CloudWatch (discussed above), which can be used to create alarms that automatically trigger scaling actions.
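A sketch of such an alarm: the parameters below wire average CPU across an Auto Scaling group to a scaling policy, so the policy fires when the metric stays above a threshold. The group name, policy ARN, and threshold are all hypothetical placeholders, and creating the alarm requires boto3 and credentials.

```python
def build_scaling_alarm(asg_name, policy_arn, threshold_percent=70):
    """CloudWatch put_metric_alarm parameters: invoke the given scaling
    policy when the group's average CPU exceeds the threshold for
    two consecutive five-minute periods."""
    return {
        "AlarmName": f"{asg_name}-high-cpu",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "AutoScalingGroupName", "Value": asg_name}
        ],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 2,
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        # ARN of an existing scale-out policy (placeholder).
        "AlarmActions": [policy_arn],
    }


def create_alarm(params):
    """Create the alarm for real (needs boto3 and AWS credentials)."""
    import boto3  # deferred so the builder stays testable offline

    boto3.client("cloudwatch").put_metric_alarm(**params)
```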
AWS Trusted Advisor — This service lets us check the complete architecture in one go and look for warnings and failures. Although the full service is not free, AWS offers some Trusted Advisor checks at no cost, namely service limit checks, security groups (specific ports unrestricted), IAM use, and MFA on the root account. Given below is the Trusted Advisor dashboard for a sample system with some warnings and errors.
As you can see above, Trusted Advisor shows areas of concern related to cost, performance, security, and fault tolerance. The tool can be of great help to any system architect who wants a detailed look into the complete system.
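Trusted Advisor results can also be pulled programmatically through the AWS Support API (which itself requires a Business or Enterprise support plan). The sketch below tallies check summaries by status so a script can flag accounts with warnings or errors; the check IDs you would pass to the live call are account-specific assumptions.

```python
def count_by_status(check_summaries):
    """Tally Trusted Advisor check summaries by status ('ok', 'warning',
    'error', 'not_available'), matching the shape returned by the
    Support API's describe_trusted_advisor_check_summaries."""
    counts = {}
    for summary in check_summaries:
        counts[summary["status"]] = counts.get(summary["status"], 0) + 1
    return counts


def fetch_summaries(check_ids):
    """Fetch live summaries (needs boto3, credentials, and a support plan).

    The Support API is only served from us-east-1.
    """
    import boto3  # deferred so count_by_status stays testable offline

    client = boto3.client("support", region_name="us-east-1")
    resp = client.describe_trusted_advisor_check_summaries(checkIds=check_ids)
    return resp["summaries"]
```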
So far we have discussed the importance of the reliability pillar of the Well-Architected Framework in any given AWS architecture. I would also like to mention that we have developed an open-source tool that scans your AWS cloud resources and generates reports. Do give it a try to get deep insight into your AWS environment.