Operational readiness in complex systems
Shaila discusses deploying complex changes to complex systems and the value a structured readiness approach can bring.
By: Shaila Abraham
“If I had nine hours to chop down a tree, I’d spend the first six sharpening my axe.”
- Abraham Lincoln
We start with a story from the early days of Toyota. Toyota focused on manufacturing high-quality vehicles that are reliable, safe, and technologically advanced. To deliver that quality, the company empowered team members on the manufacturing floor to raise their hand and pull a cord whenever they spotted an issue. Any team member could halt the line when a defect was identified and address it on the spot, without first asking for permission, rather than letting it progress down the manufacturing line and reach customers.
“For much of Toyota’s history, we have ensured the quality and reliability of our vehicles by placing a device called an Andon Cord on every production line — and empowering any team member to halt production if there is an assembly problem. Only when the problem is resolved does the line begin to move again.”
- Akio Toyoda
With the above inspiration, we have similarly empowered our implementation teams to raise their hands on any issues they identify while building Aladdin Products and Services. In this blog post we are going to learn about the Operational Readiness Assessment Framework that helps us prevent adverse events and deliver high quality products to clients.
Why is Operational Readiness important for Aladdin Products and Services?
“Operational Readiness” is a certification by the implementation team indicating the system is ready to handle any failures in line with the defined performance and client requirements. An Operational Readiness Assessment ensures the operating environment is prepared to effectively support and accept the changes resulting from the new build out. It also defines how close the environment is to the desired readiness state that customers can use.
Implementing Operational Readiness Assessment in Aladdin Products and Services
To build a resilient system, we need to protect its critical capabilities from harm caused by adverse events and conditions. Before going over the details of the framework let us see what the current operational readiness practices are that most teams follow. The following are the checkouts that most teams perform:
- Infrastructure build-outs are complete
- Application development is complete
- Testing is complete
This practice does not consider how to act when adverse events occur, who the responsible parties are, or whether all the required alerting and monitoring is in place. Any issues that arise after deployment are handled as production issues, so teams operate in firefighting mode rather than planning in advance. Franklin Covey’s time management quadrants offer a useful analogy. As shown in the diagram below, we were in Quadrant 1 (urgent and important) and, with this framework, we are moving to Quadrant 2 (important but not urgent), which is the most effective place to manage operational readiness because the work is proactive.
When to perform Operational Readiness Assessment?
The implementation team should complete the operational readiness assessment before the “pre-release” deployment to production for any green-field development or new feature development on existing complex products or services. As part of the assessment, the implementation team works through the framework, considering all the non-functional operational requirements and identifying risks.
High Performing Operational Readiness Assessment Framework
1. Identifying the error scenarios in the product or service architecture.
The success of any software architecture is defined by how well operational errors are managed in a system. Great software engineers understand that errors are part of the software development life cycle, and they know how to find the solution to each type of error when building the system.
The following are the error categories we have identified as part of operational readiness assessment for Aladdin Products and Services.
- Runtime error — A runtime error is a program error that occurs while the program is running. Examples include crashes caused by memory leaks or other programming errors.
- Shared component error — An error that occurred in a shared component and has the potential to have a wider outage.
- Network error — A network error is an error condition that caused a network request to fail.
- Application error — An application error occurs when a web-based or desktop application fails to work according to the agreed-upon expected behavior.
Sample Microservice Architecture with Error Categories
Once we have categorized the errors in the system architecture, our next step is to have the implementation team simulate the various errors identified. By simulating the error scenarios in the lower environments, the software engineers in the implementation team are empowered to “pull the cord” if there are any red flags. The team also records the following details for every error scenario identified:
- Likelihood of the error (low, medium, or high — frequency)
- Client impact
- System impact
- Data loss incurred
- Failover scenario
- Recovery steps
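The details above could be captured in a small record so that incomplete scenarios are easy to spot. The following is only a sketch: the class names, field names, and example values are hypothetical and are not part of the framework itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Likelihood(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass
class ErrorScenario:
    """One simulated error scenario and the details recorded for it."""

    name: str
    category: str  # runtime, shared component, network, or application
    likelihood: Likelihood
    client_impact: str
    system_impact: str
    data_loss: bool
    failover: str
    recovery_steps: List[str] = field(default_factory=list)

    def missing_details(self) -> List[str]:
        """Return the names of the free-text details that are still empty."""
        details = {
            "client_impact": self.client_impact,
            "system_impact": self.system_impact,
            "failover": self.failover,
            "recovery_steps": self.recovery_steps,
        }
        return [name for name, value in details.items() if not value]


# Hypothetical example values, for illustration only.
complete = ErrorScenario(
    name="pricing service loses its database connection",
    category="network",
    likelihood=Likelihood.MEDIUM,
    client_impact="stale prices shown until the connection recovers",
    system_impact="request queue grows",
    data_loss=False,
    failover="switch to the read replica",
    recovery_steps=["restart the connection pool", "verify the queue drains"],
)
incomplete = ErrorScenario(
    name="shared cache outage",
    category="shared component",
    likelihood=Likelihood.LOW,
    client_impact="slower page loads",
    system_impact="higher database load",
    data_loss=False,
    failover="",
)

print(complete.missing_details())
print(incomplete.missing_details())
```

A scenario with a non-empty `missing_details()` result is a natural candidate for "pulling the cord" before release.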
2. Identify the teams or parties responsible for support when adverse events and conditions happen
Another consideration when assessing operational readiness is making sure the responsible teams are prepared. The implementation engineers identify the responsible teams, which should include members from the various components of the architecture. We must keep a list of the teams responsible for each adverse scenario the implementation team has identified and review those scenarios with them so that they are prepared.
3. List the alerting mechanism used for the identified error scenarios
Our next consideration for an operational readiness assessment is the alerting we have in place to notify engineers when an error scenario happens, and the monitors we have in place to continuously watch for adverse events. Some of these alerts and monitors are shared by multiple applications within complex systems, so the implementation teams need to consider whether their applications are onboarded to them.
Alerts
Alerting helps the product owners to be aware of the state of the system. There are two types of alerting: proactive alerting and reactive alerting. Proactive alerting encompasses health check alerts. Reactive alerts involve automated ways to deliver notifications to product owners to bring attention to adverse events or conditions in the system.
Health check alerts are important, but should a brief network connectivity blip raise an alarm? For minor connectivity issues, we should define a failure threshold, for example a number of consecutive failed checks. Only when that threshold is breached does a health check failure become an alert on an adverse event.
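One way to picture a failure threshold is a counter of consecutive failed health checks. The sketch below assumes that interpretation; the class name, method name, and default threshold of 3 are hypothetical, not a description of any particular alerting tool.

```python
class HealthCheckAlert:
    """Fire an alert only after several consecutive failed health checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if healthy:
            # A single success resets the count, so short blips never alert.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


alert = HealthCheckAlert(failure_threshold=3)
# One blip, a recovery, then three failures in a row.
results = [alert.record(ok) for ok in [False, True, False, False, False]]
print(results)  # [False, False, False, False, True]
```

With this design the isolated failure is absorbed, and the alert keeps firing on every further failed check until a healthy result resets the counter.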
It is the responsibility of the engineers in the implementation team to ensure all the necessary alerts are set up and the alerts are functioning as desired before deploying to production.
Monitors
Monitoring software not only tracks the operation of users and network services but can also help us to raise flags on adverse events. The implementation team is responsible for identifying the appropriate monitors required as well as ensuring the monitors are set up before production deployment.
Typically, complex systems have sophisticated infrastructure with loads of devices connected. These monitors enable the team to identify the exact component that has caused the issue in an adverse event.
4. Define automated recovery for the identified error scenarios where applicable
Aladdin products and services involve complex integrations and require a high degree of precision. As the complexity of a task increases, it becomes increasingly unlikely that it can always be executed without errors, and beyond a certain point preventing every error is infeasible. For every error scenario identified by the implementation team, we therefore investigate whether the recovery can be automated. In practice, many failures can be recovered from automatically. For example, a crashed container can be handled through an automated restart policy. By default, a pod’s restartPolicy is set to “Always”, which means the kubelet running on the node automatically restarts the pod’s containers whenever they exit. If the pod is managed by a controller such as a Deployment, a lost pod is also replaced with a new one.
POD failed scenario
POD restarts automatically
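To make the restart behavior concrete, here is a minimal sketch of a Pod manifest built as a Python dictionary and printed as JSON, a format `kubectl apply -f` also accepts. The pod name, image, port, and probe path are hypothetical placeholders. `restartPolicy: Always` is already the Kubernetes default and is spelled out only for clarity; the liveness probe is an addition for illustration that lets the kubelet restart a container that is running but unresponsive.

```python
import json

# Hypothetical names and values; adjust to the service being deployed.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-service"},
    "spec": {
        # Default value, shown explicitly: restart containers whenever they exit.
        "restartPolicy": "Always",
        "containers": [
            {
                "name": "app",
                "image": "registry.example.com/example-service:1.0.0",
                # Restart the container if the health endpoint stops responding.
                "livenessProbe": {
                    "httpGet": {"path": "/healthz", "port": 8080},
                    "periodSeconds": 10,
                    "failureThreshold": 3,
                },
            }
        ],
    },
}

print(json.dumps(pod_manifest, indent=2))
```

Here `failureThreshold` plays the same role as the alerting threshold discussed earlier: the container is restarted only after three consecutive failed probes.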
Another example to consider is to prevent capacity-related failures (i.e., ensuring that the application always has the infrastructure resources it needs) by enabling the auto-scaling feature in Kubernetes. There are also scenarios where we considered writing custom scripts to automatically recover from failure, by converting the manual steps to be performed to scripts.
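For intuition about auto-scaling, the Kubernetes Horizontal Pod Autoscaler documents its core calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below implements only that formula plus clamping to minimum and maximum replica counts; the function name, parameter names, and default bounds are hypothetical, and the real controller's tolerance and stabilization behavior is deliberately left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to metric load, within bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% average CPU with a 60% target: scale out.
print(desired_replicas(4, current_metric=90, target_metric=60))  # 6
# 4 replicas at 30% average CPU with a 60% target: scale in.
print(desired_replicas(4, current_metric=30, target_metric=60))  # 2
```

Working through numbers like these during the assessment helps confirm that the configured maximum leaves enough headroom for the load scenarios the team has identified.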
5. Execute Disaster Recovery (DR) Plan before go-live
Disaster Recovery tests involve simulating the failure of an Aladdin environment or an entire data center and recovering services in the secondary location. Disaster Recovery tests are critical for any application that involves changes to infrastructure. When teams build applications on existing infrastructure, it is good practice to check whether a Disaster Recovery tracker is available and whether the newly onboarded application will be included in the periodic Disaster Recovery tests. If a Disaster Recovery plan is not available, the implementation team lead is responsible for making sure a Disaster Recovery tracker is in place and that the Disaster Recovery scenarios for business continuity, as stated in the Business Continuity Plan, have been executed before the products or services are released to production.
Sample Disaster Recovery (DR) Tracker with DR scenarios executed before go-live:
6. Identify known risks and draft mitigation plan
An operational readiness assessment also includes identifying any risks that the implementation team considers high. The implementation team documents these risks along with their mitigation steps.
For example, most companies have worked on a Business Continuity Plan (BCP) for years without ever imagining that it would be put to use. With Covid-19, we have seen BlackRock’s BCP work successfully, while companies that did not have a BCP have had to scramble.
The known risks can be categorized into the following:
- Operational Risk
- Reputational Risk
- Financial Risk
- Security Risk
Operational Risk — Complete recommended checklists that meet Operational targets
An operational readiness test for the environment should also analyze the performance and load requirements so that we meet our operational targets.
The following are some of the checklists that help mitigate operational risk:
- API Testing Checklist
- Release Checklist Pre and Post
- Production Readiness Checklist
- Deployment Readiness Checklist
In addition to reviewing and adhering to these checklists, the framework calls for any known operational risk identified by the implementation team to be reviewed with the risk management team, and obtaining the risk team’s signoff is recommended.
Reputational Risk — Protection against Reputational Risk
We need to effectively manage any reputational risks as part of operational readiness assessment when identified by the implementation team. It is necessary that we protect ourselves against any reputational risk identified since it is the key to maintaining relationships with our clients. This is also critical to achieve our core principles. It is always the best practice to get any reputational risk reviewed by the risk management team and receive their signoff.
Financial Risk — Prevent Financial Risk
To prevent potential financial risks to the company or to our clients, we need to identify them as part of the operational readiness assessment. When the implementation team identifies a financial risk, we need a detailed plan for preventing it, with the corresponding measures in place. Preventing financial risk not only gives our customers a sense of safety; as a fiduciary, it is our core responsibility. Any identified financial risks are to be reviewed with the risk management team and signed off.
Security Risk — Minimize Security Risk
To improve application security across Aladdin Products and Services, minimize application security risks, and comply with security standards, we should follow the best practices and policies for the various scans executed by the security team and obtain their signoff as part of the operational readiness assessment. A successful production deployment includes minimizing security risks, which helps us deliver quality products and services to our clients.
7. Complete check-outs in lower environments
Our engineers are not the sole members responsible for testing products or services before we release them into production. There is a high probability that many issues might not stand out to them. It’s much more efficient to have a set of automated scripts simulating the actual users so that we can identify any hidden issues with every deployment rather than relying on manual testing. These scripts will act as a check-out before every production deployment.
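A check-out script can be as simple as requesting a handful of key endpoints and comparing the responses with what is expected. The following is a minimal sketch under stated assumptions: the URLs are placeholders, the function names are hypothetical, and a real check-out would exercise richer user journeys than status codes alone. The `fetch` parameter is injectable so the example runs without a network.

```python
from typing import Callable, Dict, List, Tuple
from urllib.error import HTTPError
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if it cannot be reached."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # 4xx and 5xx responses are raised as HTTPError
    except OSError:
        return 0  # connection refused, DNS failure, timeout, and similar


def run_checkout(
    checks: Dict[str, int],
    fetch: Callable[[str], int] = http_status,
) -> List[Tuple[str, int, int]]:
    """Return (url, expected, actual) for every check that did not match."""
    failures = []
    for url, expected in checks.items():
        actual = fetch(url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures


# Placeholder endpoints with stubbed responses, for illustration only.
checks = {
    "https://example.invalid/health": 200,
    "https://example.invalid/login": 200,
}
stubbed_responses = {
    "https://example.invalid/health": 200,
    "https://example.invalid/login": 500,
}
failures = run_checkout(checks, fetch=lambda url: stubbed_responses[url])
print(failures)
```

In a deployment pipeline, a non-empty `failures` list would block the release; dropping the `fetch` argument makes the same function call the real endpoints.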
8. Have an operational runbook with failure recovery details ready
An operational runbook is a detailed “how-to” guide for completing a commonly repeated task or procedure within a company’s IT operations process. In the context of an operational readiness assessment, the operational runbook should be designed to be as simple as possible with key information required by the operations team to address the issue reported. A sample operational runbook template includes the following fields:
- Alert type (50X, 40X, …)
- Operational procedure
- Reason
- Client Impact
Once we have the runbook ready, it should be reviewed and signed off by the responsible operation team(s).
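To illustrate how the template fields fit together, the sketch below stores runbook entries keyed by alert type and looks one up from an HTTP status code. The entry contents, class name, and `lookup` helper are hypothetical examples, not the actual runbook.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RunbookEntry:
    """One row of the runbook template described above."""

    alert_type: str
    operational_procedure: List[str]
    reason: str
    client_impact: str


# Hypothetical entries, for illustration only.
RUNBOOK: Dict[str, RunbookEntry] = {
    entry.alert_type: entry
    for entry in [
        RunbookEntry(
            alert_type="50X",
            operational_procedure=[
                "Check the service health dashboard",
                "Restart the failing instance",
                "Escalate to the owning team if errors persist",
            ],
            reason="The service failed while handling a valid request",
            client_impact="Requests fail until the service recovers",
        ),
        RunbookEntry(
            alert_type="40X",
            operational_procedure=["Inspect recent client or configuration changes"],
            reason="The request was rejected as invalid or unauthorized",
            client_impact="Affected client calls are rejected",
        ),
    ]
}


def lookup(status_code: int) -> Optional[RunbookEntry]:
    """Map a status code to its alert type (503 becomes "50X") and look it up."""
    return RUNBOOK.get(f"{status_code // 10}X")


entry = lookup(503)
print(entry.alert_type, entry.operational_procedure[0])
```

Keeping the entries this small reflects the goal stated above: the operations team should find the procedure, the reason, and the client impact at a glance.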
Is the framework working?
The following metrics show the impact of the operational readiness assessment when conducted on a product. In the example below, the assessment was performed in June 2020, and following the assessment we see a drastic reduction in production priority 1 support tasks that the implementation engineers must firefight, compared to previous years.
We have seen a similar drastic reduction in production incidents that involve the Command Center. The following graph shows the reduction in production incidents after we implemented the framework: there are far fewer high-criticality production incidents than in previous years.
Summary
Performing an operational readiness assessment and certification by the implementation team is critical for all Aladdin products and services. We may have all the resources lined up for building Aladdin products and services and we may be trying hard to meet the schedule, but if any of the software engineers in the implementation team notice red flags, they must be empowered to “pull the cord” and address their concerns during implementation rather than letting them go through and addressing them in production. It is much easier to address a known issue during the implementation phase than to launch a product that is not ready. Launching a product that is not ready not only leads to customer dissatisfaction but is also costlier. Checking for operational readiness saves time, money, and reputation. One of the best ways to implement an operational readiness assessment for any complex product or service is to follow the framework detailed in this post. Implementation teams have good reason to adopt it, given the benefits we have achieved with this framework: less firefighting, fewer production incidents, mitigation of various risks, and team empowerment.