As part of the Software Development Lifecycle, developers focus mostly on coding, optimising and testing the business logic. However, while writing the application code, or even before it, not much attention is given to what will happen when the application is up & running in production as a dependent or independent system. This leads to a lot of complexity, inefficiency, unwanted surprises and challenges when the code is pushed to production, most of which could have been eliminated or reduced had we thought of them a little earlier. That is where the need for an Operability Review comes in. By Software Operability, we mean the readiness of new software to be deployed in production and start handling real traffic. The whole purpose of an Operability Review of a service is to ensure that it runs well once deployed, operates as smoothly as possible, and needs minimal manual intervention to get up & running.
At Indix HQ, we take operability seriously and engage everyone in regular reviews of the operability items so that we don’t miss anything important or discover it late. In this blog, we will discuss the process we follow, the areas we focus on during such reviews and how useful this has proved for our product life cycle.
Elements of Operability Review
Before we discuss how we perform operability reviews at Indix & which areas we cover, let’s first classify the various aspects of a system running in production. At a very high level, the health of any system/ application can be categorized as follows:
- System Architecture
- Configuration management & deployment
- Data storage
These are the areas which we should further break up into specific components & review separately. Also note that Cost is a critical area which is not mentioned above, since it is not really an attribute of the system. But in everything we do, we try to optimise the cost incurred. The following diagram shows the coverage of all the aspects under Operability:
Architecture & Deployment diagram:
The first thing is to have the deployment & architecture diagrams. While the architecture diagram depicts the logical flow of requests/ responses inside the system, the deployment diagram focuses on the actual infrastructure. If the system is not too complex, it makes sense to combine them in a single diagram. These diagrams help people unfamiliar with the application to get a quick understanding of what is inside. They also help in technical discussions around the other areas under operability.
Our entire infrastructure is in AWS so we really don’t have options in terms of hosting. Though we plan to evaluate other Cloud providers like Azure or GCE in future, EC2 is the choice at this point.
We run both Docker & non-Docker services (on EC2; we are not using ECS), & we have applications running directly on EC2 as well as on Kubernetes & Mesos. We use Mesos heavily: several internal applications as well as some user-facing services are deployed on Mesos. Based on the nature of a new service, we decide whether it is the right candidate to be deployed on Mesos or whether it should run on EC2 instances.
- If the service runs on multiple instances & needs load-balancing, then some of the important things to decide for the ELB are:
- Is the service user-facing? If not, make sure to use an internal ELB (routing via private IPs); otherwise network packets will traverse the public internet, which doesn’t make sense & might raise security concerns.
- Make sure to enable Cross-AZ load balancing. This distributes requests across the registered nodes irrespective of whether they are evenly spread across the AZs or not.
- Enable Connection Draining with an appropriate Timeout setting. This ensures in-flight requests are not interrupted when an instance is taken out of rotation (OOR) for maintenance or similar.
- Set Idle Timeout appropriately. Idle Timeout is the duration a connection can stay idle before the ELB closes the socket. This applies to both client-side and EC2-side sockets.
- Enable Access Logs for the ELB. This is very important: without access logs, it becomes very difficult to troubleshoot issues in production systems when requests come through the load balancer. We have an S3 bucket designated to store the logs of every ELB under top-level folders/ paths. It might also be useful to push these logs to your centralized logging system for further analytics.
- Next are the Stickiness settings. Generally speaking, Stickiness goes against the very concept of load balancing, but sometimes one may need session affinity to ensure that every client request in a session is routed to a particular instance. This is generally implemented using AWSELB/ application cookies.
- Finally, decide what kind of load balancer is needed. If we need features like content-based routing or multiple listener ports per EC2 node, ALB is the choice. But note that ALB doesn’t support TCP listeners yet, so only HTTP/HTTPS protocols can be used. Use ELB otherwise.
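The ELB checklist above lends itself to a small automated check. What follows is a minimal sketch rather than our actual tooling: `review_elb` is a hypothetical helper, and the dictionary layout is assumed to mirror the classic ELB attributes (`CrossZoneLoadBalancing`, `ConnectionDraining`, `ConnectionSettings`, `AccessLog`) as boto3 reports them. Fetching the attributes from AWS is left out so the example runs without credentials.

```python
from typing import Any, Dict, List


def review_elb(attrs: Dict[str, Any], scheme: str, user_facing: bool,
               expected_idle_timeout: int) -> List[str]:
    """Return review findings for one load balancer; an empty list means nothing was flagged.

    attrs is assumed to have the shape of classic ELB load balancer attributes;
    scheme is assumed to be "internal" or "internet-facing".
    """
    findings = []
    if not user_facing and scheme != "internal":
        findings.append("non user-facing service should use an internal ELB")
    if not attrs.get("CrossZoneLoadBalancing", {}).get("Enabled", False):
        findings.append("cross-AZ load balancing is disabled")
    if not attrs.get("ConnectionDraining", {}).get("Enabled", False):
        findings.append("connection draining is disabled")
    if not attrs.get("AccessLog", {}).get("Enabled", False):
        findings.append("access logs are disabled")
    idle = attrs.get("ConnectionSettings", {}).get("IdleTimeout")
    if idle != expected_idle_timeout:
        findings.append(f"idle timeout is {idle}, expected {expected_idle_timeout}")
    return findings


# Hypothetical attribute set for an internal service.
good = {
    "CrossZoneLoadBalancing": {"Enabled": True},
    "ConnectionDraining": {"Enabled": True, "Timeout": 300},
    "ConnectionSettings": {"IdleTimeout": 60},
    "AccessLog": {"Enabled": True, "S3BucketName": "example-elb-logs"},
}
print(review_elb(good, scheme="internal", user_facing=False, expected_idle_timeout=60))
print(review_elb({}, scheme="internet-facing", user_facing=False, expected_idle_timeout=60))
```

Stickiness and the ELB vs ALB choice are left out on purpose, since they depend on the application rather than on a fixed rule.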
This is a very broad area & we generally take decisions based on the nature of the service & its SLA committed to customers or internally (for internal systems). There are different areas to look into to ensure the resiliency of your application or achieve High Availability:
- Deploy your instances in multiple regions & in multiple AZs in each of the regions.
- Figure out any SPOF (single point of failure) in the deployment. If it blocks the real-time response of the app, it is not acceptable. We need to ensure that a replica exists & is distributed across independent network infra. If the service uses an RDBMS, we generally use RDS with Multi-AZ deployments, which ensures synchronous replication of the data to a standby instance in a different Availability Zone (AZ).
- Last but probably most important is the choice of EC2 purchase type. We use Spot instances heavily to reduce our cost, but that brings the additional difficulty of maintaining high availability. Depending on the nature of the service, it may not be possible to use Spot instances at all, but most of the time a hybrid model works.
- We imitate real production scenarios & try to estimate the numbers for MTTD (Mean-time-to-detect)/ MTTR (Mean-time-to-recover)/ RPO (recovery point objective)/ RTO (recovery time objective) for our services.
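To make the MTTD/ MTTR estimate concrete, here is a minimal sketch of how the numbers could be computed from a handful of simulated failures. The `Incident` record and its field names are hypothetical, and the sketch assumes both metrics are measured from the moment the failure starts; some teams measure MTTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime    # when the failure was injected or began
    detected: datetime   # when monitoring raised the alert
    recovered: datetime  # when the service was healthy again


def _mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time to detect, measured from failure start (an assumption)."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time to recover, measured from failure start (an assumption)."""
    return _mean([i.recovered - i.started for i in incidents])


# Two made-up game-day failures.
drills = [
    Incident(datetime(2017, 1, 10, 10, 0), datetime(2017, 1, 10, 10, 4), datetime(2017, 1, 10, 10, 30)),
    Incident(datetime(2017, 1, 17, 12, 0), datetime(2017, 1, 17, 12, 6), datetime(2017, 1, 17, 12, 50)),
]
print("MTTD:", mttd(drills))
print("MTTR:", mttr(drills))
```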
Scalability is extremely important as it lets you avoid over-provisioning any resource & scale out as & when necessary. That is an important requirement for optimising the cost of your infra. The one thing to ensure is that the design of the service is horizontally scalable & not bound to a single machine due to local data storage, in-memory data etc. Also, Auto-Scaling is a great feature in AWS which allows you to define CloudWatch-based criteria to scale out or scale in automatically.
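The idea behind threshold-based scaling can be illustrated in a few lines. This is only a toy model of the decision, not the AWS Auto-Scaling API: the function name, the CPU thresholds and the group limits are all assumptions picked for illustration.

```python
def next_capacity(current: int, avg_cpu: float,
                  scale_out_above: float = 70.0, scale_in_below: float = 30.0,
                  minimum: int = 2, maximum: int = 10) -> int:
    """Return the desired instance count after one evaluation period.

    Adds one instance when average CPU is above the upper threshold, removes
    one when it is below the lower threshold, and always stays within
    [minimum, maximum]. All default values are illustrative assumptions.
    """
    if avg_cpu > scale_out_above:
        current += 1
    elif avg_cpu < scale_in_below:
        current -= 1
    return max(minimum, min(maximum, current))


for cpu in (85.0, 50.0, 10.0):
    print(cpu, "->", next_capacity(3, cpu))
```

The gap between the two thresholds matters: if they sit too close together, the group can flap between scale-out and scale-in.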
Configuration Management & Deployment
CM Tool & Model:
We used Chef for configuration management for a long time & even today many of the systems are deployed using Chef client/server or chef-solo. But about two years back, we decided to move to Ansible due to its simplicity & developer-friendly architecture. Today, any new infra is deployed using Ansible (and in a few cases we have started using Terraform as well, for the provisioning part). The important thing to ensure here is that the Ansible roles/ playbooks are robust enough that we don’t need any manual intervention to bring up a system. Generally, we keep separate playbooks for provisioning & configuration management, as provisioning is generally a one-time thing whereas configuration management runs regularly on the production systems. You can find some of our Ansible roles published here.
The next important thing is to define the management model. We generally have CI/CD pipelines for deploying configuration changes in push mode, but there are cases where pull mode is a must. Consider an Auto-Scaling group: instances can come up at any time due to a scale-out activity & need CM to run right after they come alive. This is taken care of by running the same set of configuration management playbooks as part of UDF. Planned changes are still done via the pipeline, where a code commit/ merge to the master branch triggers the appropriate pipeline/s, which then execute the changes on all the relevant nodes in the infra.
We use ThoughtWorks GoCD as the only CI/CD tool at Indix. Any app should have separate Staging & Production pipelines which automate the testing/ deployment of the Staging & Production clusters.
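One common way to get pull mode on a freshly launched instance is to have the boot script call `ansible-pull`. The sketch below generates such a script. It is an assumption, not a description of our setup: the repository URL, playbook name and the `build_user_data` helper are hypothetical, and the mechanism behind the UDF step mentioned above is not specified here.

```python
import shlex


def build_user_data(repo_url: str, playbook: str = "local.yml", branch: str = "master") -> str:
    """Build a boot script that pulls the CM repo and applies a playbook locally.

    repo_url, playbook and branch are placeholders; ansible-pull must already
    be installed on the image for the script to work.
    """
    command = " ".join([
        "ansible-pull",
        "--url", shlex.quote(repo_url),
        "--checkout", shlex.quote(branch),
        shlex.quote(playbook),
    ])
    return "\n".join(["#!/bin/bash", "set -euo pipefail", command, ""])


script = build_user_data("https://git.example.com/infra/cm.git")
print(script)
```

Because the instance applies the same playbooks the pipeline pushes, a node that joins during a scale-out ends up configured like its older siblings.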
There are a few standard security measures in AWS, like using a VPC (Classic is no longer an option anyway), using appropriate ACLs on the subnets & opening only the appropriate inbound/outbound ports in the Security Groups. Generally ACLs are not touched and most of the Allow rules are applied to SecGroups. We use separate SecGroups for separate services so that modifying one doesn’t impact anything else. Port requirements are reviewed & SecGroups are created accordingly.
The next thing is to ensure proper management of secrets like database passwords, which should never be stored in plaintext when used by the CM tool or deployment pipelines. For example, the encryption key in Chef (used for encrypted data bags) or the Vault key (used by Ansible Vault) should be used to encrypt those passwords, which can then be stored in version control.
The next tricky part is how to manage those encryption keys. Of course they should never be committed to version control, but they have to be transferred to an instance while bootstrapping. For this, we use IAM roles, which allow AWS resources to be accessed without using Access/Secret keys. So the strategy is to store the encryption secrets in an S3 bucket & then launch instances with an IAM role with appropriate policies, so that they can download those keys during bootstrapping.
Finally, HTTPS is becoming more & more important with HTTP/2, even for pages without any sensitive content. At present, we use certificates from both ACM (AWS Certificate Manager) and Let’s Encrypt. Since ACM is limited to ELB or CloudFront, we use LE certificates for externally hosted Indix end-points.
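A port review like the one described above can be partly automated. The following is a minimal sketch that flags Security Group rules open to the whole internet. The rule layout is assumed to follow the `IpPermissions` structure EC2 returns for a security group, and the set of ports considered acceptable for public access (80 and 443) is purely an assumption for the example.

```python
from typing import Any, Dict, FrozenSet, List

PUBLIC_CIDR = "0.0.0.0/0"


def open_to_world(ip_permissions: List[Dict[str, Any]],
                  allowed_public_ports: FrozenSet[int] = frozenset({80, 443})) -> List[str]:
    """Return findings for inbound rules that expose unexpected ports publicly.

    Only tcp/udp rules and the all-traffic protocol "-1" are inspected in
    this sketch; IPv6 ranges and group-to-group rules are ignored.
    """
    findings = []
    for rule in ip_permissions:
        is_public = any(r.get("CidrIp") == PUBLIC_CIDR for r in rule.get("IpRanges", []))
        if not is_public:
            continue
        if rule.get("IpProtocol") == "-1":
            findings.append("all traffic open to the internet")
            continue
        if rule.get("IpProtocol") not in ("tcp", "udp"):
            continue
        from_port, to_port = rule["FromPort"], rule["ToPort"]
        if any(p not in allowed_public_ports for p in range(from_port, to_port + 1)):
            findings.append(f"ports {from_port}-{to_port} open to the internet")
    return findings


# Hypothetical inbound rules for one service.
rules = [
    {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432, "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
]
print(open_to_world(rules))
```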
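For illustration, the policy attached to such an instance role could be as narrow as read-only access to the prefix holding the keys. The sketch below only builds the policy document; the bucket name, prefix and the `key_download_policy` helper are hypothetical, and creating the role itself is left out.

```python
import json
from typing import Any, Dict


def key_download_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """Build an IAM policy document allowing read-only access to one S3 prefix.

    bucket and prefix are placeholders; nothing here talks to AWS.
    """
    resource = f"arn:aws:s3:::{bucket}/{prefix.strip('/')}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [resource],
            }
        ],
    }


policy = key_download_policy("example-secrets-bucket", "bootstrap-keys/my-service")
print(json.dumps(policy, indent=2))
```

Scoping the resource to a per-service prefix means an instance of one service cannot download the keys of another.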
At Indix, we deal with a massive amount of data & generally every system in the infrastructure has to deal with some sort of relational or unstructured data. Depending on the requirements of the app, we decide which storage mechanism out of S3/ Ephemeral/ EBS/ EFS fits the bill. This generally depends on the retrieval latency, the durability requirement of the data etc. While using S3, we further decide whether it should be standard S3, RRS or IA. Similarly for EBS, we have multiple options like the old magnetic volumes or newer-generation storage like io1/ gp2/ st1 or sc1. If properly chosen, these decisions can have a significant impact on your cost while maintaining the latency/ durability requirements of the data. Finally, in a few cases we use EFS as well, which is appropriate as shared storage with lower latency than S3 & can be mounted on more than one EC2 instance (unlike EBS).
Having some idea about the growth of your data & its retention is also important. Accordingly, one can define a proper Life Cycle policy for S3 buckets to avoid wasting dollars on storing old & unused data. Increasing the capacity of EBS on the fly is possible now, but it nevertheless helps to allocate the right capacity from the beginning. EFS scales automatically, so that’s not a concern there.
Backup & Recovery is naturally very crucial; we need to ensure that the right backup strategy is in place. For SQL data stores in RDS, it’s important to ensure a daily backup is in place. For data stored in filesystems, regular snapshots of the EBS volumes need to be ensured. We use cronjobs & Lambda functions to schedule such EBS snapshots. The concepts of MTTD/ MTTR/ RPO/ RTO apply here similarly.
Monitoring requirements are analysed from both the System & Application aspects. While every service should have standard system monitoring like LoadAverage/ Memory/ Disk-Space/ IO etc, application-layer monitoring is unique to every app & needs to be identified accordingly. Apart from monitoring the subsystems, an end-to-end check is always a must.
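As an example of what such a Life Cycle policy might look like, the sketch below builds a single rule that moves objects to the IA storage class after a while and deletes them later. The rule layout is assumed to follow the S3 lifecycle configuration format (`Filter`, `Transitions`, `Expiration`); the prefix, the day counts and the `lifecycle_rule` helper are made up for illustration.

```python
from typing import Any, Dict


def lifecycle_rule(prefix: str, ia_after_days: int, expire_after_days: int) -> Dict[str, Any]:
    """Build one lifecycle rule: transition to STANDARD_IA, then expire.

    The day counts are inputs chosen by the reviewer; the only check made
    here is that expiry comes after the transition.
    """
    if expire_after_days <= ia_after_days:
        raise ValueError("objects must expire after they transition to IA")
    return {
        "ID": f"tier-and-expire-{prefix.strip('/')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": ia_after_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": expire_after_days},
    }


# Hypothetical retention for a log prefix.
rule = lifecycle_rule("logs/", ia_after_days=30, expire_after_days=365)
configuration = {"Rules": [rule]}
print(configuration)
```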
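Whether a snapshot schedule actually satisfies the RPO is easy to verify once the snapshot timestamps are known. Here is a minimal sketch of that check; `rpo_met` is a hypothetical helper, and the timestamps are hard-coded rather than fetched from AWS.

```python
from datetime import datetime, timedelta
from typing import List


def rpo_met(snapshot_times: List[datetime], now: datetime, rpo: timedelta) -> bool:
    """Return True if the newest snapshot is no older than the RPO.

    With no snapshots at all, the objective is considered missed.
    """
    if not snapshot_times:
        return False
    return now - max(snapshot_times) <= rpo


# Made-up daily snapshots and a 24-hour RPO.
snapshots = [datetime(2017, 3, 1, 2, 0), datetime(2017, 3, 2, 2, 0), datetime(2017, 3, 3, 2, 0)]
print(rpo_met(snapshots, now=datetime(2017, 3, 3, 20, 0), rpo=timedelta(hours=24)))
print(rpo_met(snapshots, now=datetime(2017, 3, 5, 2, 0), rpo=timedelta(hours=24)))
```

A check like this is a natural candidate for an alert: a silently failing cronjob shows up as a missed RPO long before anyone needs the backup.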
- For the internal monitoring platform, we have been using a combination of Sensu & CloudWatch, but we are now migrating to Riemann, which is better in terms of real-time monitoring. We also use Pingdom to monitor our services from outside our infrastructure. The new monitoring platform is going to use Telegraf/Riemann/InfluxDB/Grafana: every system sends metrics via the Telegraf agent to the Riemann server, which has the alerts configured & also forwards metrics to InfluxDB. Grafana dashboards on top of InfluxDB then provide visualization for analysing trends. This also helps with capacity planning for your systems. The alert notification channels are mainly Email/ Slack & PagerDuty, chosen based on the severity of the issue.
- For logging, we use the hosted service from Loggly. We have a Chef cookbook & an Ansible playbook, customized for every app, which push the data to Loggly. Dashboards with an appropriate set of filters then need to be created in Loggly. Not every service is integrated with Loggly yet, but irrespective of that, it’s important to have a proper log rotation strategy so that disks are not unnecessarily filled up by old & useless logs.
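Log rotation can be handled on the host or inside the application. As one illustration of the latter, the sketch below uses Python’s standard `RotatingFileHandler` to cap both the size of a log file and the number of old copies kept. This is a swapped-in example, not necessarily how rotation is done on our systems; the logger name, file name and limits are arbitrary, and the tiny size limit is there only to make rotation visible in a demo.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler


def build_logger(path: str, max_bytes: int = 1_000_000, backups: int = 5) -> logging.Logger:
    """Create a logger whose file is rotated once it reaches max_bytes.

    At most `backups` rotated files are kept; older ones are deleted.
    """
    logger = logging.getLogger("operability-demo")
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # keep demo output out of the root logger
    return logger


log_dir = tempfile.mkdtemp()
logger = build_logger(os.path.join(log_dir, "app.log"), max_bytes=200, backups=2)
for i in range(50):
    logger.info("request %d handled", i)

files = sorted(os.listdir(log_dir))
print(files)
```

However many lines are written, disk usage stays bounded by roughly `max_bytes * (backups + 1)`.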
Performance/ load testing an application is clearly necessary in order to identify which AWS instance type is the right fit. There are various tools out there & we generally use Locust, Siege or Apache Benchmark for HTTP stress testing. Both vertical & horizontal scaling need to be tuned based on the load test results until latency meets the SLA committed to customers. Another useful tool here is gor, which helps to run performance tests with live production requests with various kinds of filtering.
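Whichever tool produces the numbers, the final step is comparing a latency percentile against the SLA. Below is a minimal sketch of that comparison using the nearest-rank percentile method. The sample latencies, the chosen percentile and the SLA value are invented for the example, and the helper names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(latencies_ms: Sequence[float], pct: float, sla_ms: float) -> bool:
    """Return True if the pct-th percentile latency is within the SLA."""
    return percentile(latencies_ms, pct) <= sla_ms


# Made-up response times (ms) from one load test run.
latencies = [120, 80, 95, 110, 300, 90, 100, 105, 85, 115]
print(percentile(latencies, 90), meets_sla(latencies, 90, sla_ms=150))
```

Looking at a high percentile rather than the average keeps a single slow outlier from hiding behind many fast requests, or the other way round.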
So this summarises the areas we analyse to make sure that an app going to production is operable. The points discussed above are not limited to new software; they can be applied in more or less the same fashion to systems currently running in production as well. This is a simple checklist we follow at Indix & we initiate the process during the very early phase of application development. There is always room for improvement and this kind of process matures over time, but based on our experience so far, this approach has proved extremely useful & has helped us figure out a lot of problems both proactively, before systems are deployed, & reactively, once they have started running in production.
Thanks to Arijit Bhattacharyya, our Director of Engineering, DevOps, for laying the foundation of the Operability Review process at Indix. All the practices mentioned in this blog are implemented as a standard checklist for any system (internal or customer-facing) as part of his review process :)