Operational readiness in complex systems

Shaila discusses deploying complex changes to complex systems and the value a structured readiness approach can bring.

“If I had nine hours to chop down a tree, I’d spend the first six sharpening my axe.”

- Abraham Lincoln

We will start with a story from the early days of Toyota as an automotive manufacturer. Toyota focused on manufacturing high-quality vehicles that are reliable, safe, and technologically advanced. To deliver high-quality products, the company empowered team members on the manufacturing floor to raise their hand and pull a cord when there is an issue. Team members could thus halt the line as soon as a defect was identified and address it, rather than letting it progress through the manufacturing line. Doing so prevents the defect from moving down the line and impacting customers, and the operator does not need to ask for permission to stop the line.

“For much of Toyota’s history, we have ensured the quality and reliability of our vehicles by placing a device called an Andon Cord on every production line — and empowering any team member to halt production if there is an assembly problem. Only when the problem is resolved does the line begin to move again.”

- Akio Toyoda

With the above inspiration, we have similarly empowered our implementation teams to raise their hands on any issues they identify while building Aladdin Products and Services. In this blog post we are going to learn about the Operational Readiness Assessment Framework that helps us prevent adverse events and deliver high quality products to clients.

Why is Operational Readiness important for Aladdin Products and Services?

“Operational Readiness” is a certification by the implementation team indicating the system is ready to handle any failures in line with the defined performance and client requirements. An Operational Readiness Assessment ensures the operating environment is prepared to effectively support and accept the changes resulting from the new build out. It also defines how close the environment is to the desired readiness state that customers can use.

Implementing Operational Readiness Assessment in Aladdin Products and Services

To build a resilient system, we need to protect its critical capabilities from harm caused by adverse events and conditions. Before going over the details of the framework, let us look at the operational readiness practices that most teams currently follow. Most teams perform the following check-outs:

  • Application development is complete
  • Testing is complete

When to perform Operational Readiness Assessment?

The operational readiness assessment should be completed by the implementation team before “pre-release” deployment to production, both for green-field development and for new feature development on existing complex products or services. As part of the assessment, the implementation team works through the framework, considers all the non-functional operational requirements, and identifies risks.

High Performing Operational Readiness Assessment Framework

1. Identifying the error scenarios in the product or service architecture.

The success of any software architecture is defined by how well operational errors are managed in a system. Great software engineers understand that errors are part of the software development life cycle, and they know how to find the solution to each type of error when building the system.

Error scenarios typically fall into the following types:

  1. Shared component error: an error that occurs in a shared component and has the potential to cause a wider outage.
  2. Network error: an error condition that causes a network request to fail.
  3. Application error: a web-based or desktop application fails to work according to the agreed-upon expected behavior.

For each error scenario identified, the implementation team documents the following:

  • Client impact
  • System impact
  • Data loss incurred
  • Failover scenario
  • Recovery steps
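
As a minimal sketch of how such a catalog of error scenarios could be recorded, the Python below models one entry with the attributes listed above. The class names, field names, and the missing_fields helper are hypothetical and chosen for illustration; they are not part of any Aladdin tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class ErrorType(Enum):
    """The three error types named in the framework."""
    SHARED_COMPONENT = "shared component"
    NETWORK = "network"
    APPLICATION = "application"


@dataclass
class ErrorScenario:
    """One entry in a hypothetical error-scenario catalog."""
    name: str
    error_type: ErrorType
    client_impact: str = ""
    system_impact: str = ""
    data_loss: str = ""
    failover_scenario: str = ""
    recovery_steps: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return the attributes that have not been filled in yet."""
        required = {
            "client_impact": self.client_impact,
            "system_impact": self.system_impact,
            "data_loss": self.data_loss,
            "failover_scenario": self.failover_scenario,
            "recovery_steps": self.recovery_steps,
        }
        return [name for name, value in required.items() if not value]


scenario = ErrorScenario(
    name="cache outage",
    error_type=ErrorType.SHARED_COMPONENT,
    client_impact="slow page loads",
)
print(scenario.missing_fields())
```

A scenario with an empty result from missing_fields has every attribute documented, which gives the team a simple completeness check before review.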

2. Identify the teams responsible or parties to support when adverse events and conditions happen

When assessing operational readiness, another consideration is making sure the responsible teams are prepared. The implementation engineers identify the responsible teams, which should include members from the various components of the architecture. We must have a list of the teams responsible for each adverse scenario the implementation team has identified, and review the scenarios with those teams so that they are prepared.
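
One way to keep this list honest is to check it mechanically. The sketch below assumes a simple mapping from scenario name to a responsible team and a flag recording whether the scenario has been reviewed with that team; the Ownership type and the readiness_gaps function are illustrative names, not an existing tool.

```python
from typing import Dict, List, NamedTuple


class Ownership(NamedTuple):
    """Hypothetical record: who owns a scenario and whether it was reviewed."""
    team: str
    reviewed: bool


def readiness_gaps(scenarios: List[str], owners: Dict[str, Ownership]) -> Dict[str, str]:
    """Return a reason for every scenario that is unowned or not yet reviewed."""
    gaps = {}
    for name in scenarios:
        entry = owners.get(name)
        if entry is None:
            gaps[name] = "no responsible team assigned"
        elif not entry.reviewed:
            gaps[name] = f"not yet reviewed with {entry.team}"
    return gaps


scenarios = ["cache outage", "network partition", "bad deployment"]
owners = {
    "cache outage": Ownership(team="platform", reviewed=True),
    "network partition": Ownership(team="network", reviewed=False),
}
print(readiness_gaps(scenarios, owners))
```

An empty result means every identified scenario has an owner who has seen it.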

3. List the alerting mechanism used for the identified error scenarios

Our next consideration for an operational readiness assessment is the level of monitoring we have in place: the alerts that notify the engineers when an error scenario happens, and the monitors that continuously watch for adverse events. Some of these alerts and monitors are actively used by multiple applications within complex systems, and the implementation teams need to check whether their applications are onboarded to them.

Alerts

Alerting helps the product owners to be aware of the state of the system. There are two types of alerting: proactive alerting and reactive alerting. Proactive alerting encompasses health check alerts. Reactive alerts involve automated ways to deliver notifications to product owners to bring attention to adverse events or conditions in the system.

Monitors

Monitoring software not only tracks the operation of users and network services but can also help us to raise flags on adverse events. The implementation team is responsible for identifying the appropriate monitors required as well as ensuring the monitors are set up before production deployment.
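
To make the distinction between proactive and reactive alerting concrete, here is a minimal sketch in Python. It assumes health checks are plain callables that return True when healthy, and that a reactive alert fires when an error rate crosses a threshold; the function names and the 5% default threshold are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Optional


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Proactive: run every health check and return the names that fail.

    A check that raises is treated as unhealthy.
    """
    failing = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False
        if not healthy:
            failing.append(name)
    return failing


def error_rate_alert(errors: int, requests: int, threshold: float = 0.05) -> Optional[str]:
    """Reactive: return an alert message when the error rate exceeds the threshold."""
    if requests == 0:
        return None
    rate = errors / requests
    if rate > threshold:
        return f"error rate {rate:.1%} exceeds threshold {threshold:.1%}"
    return None


print(run_health_checks({"database": lambda: True, "queue": lambda: False}))
print(error_rate_alert(errors=10, requests=100))
```

In a real system the returned names and messages would be routed to whatever notification channel the product owners already use.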

4. Define automated recovery for the identified error scenarios where applicable

Aladdin products and services involve complex integrations and require a high degree of precision. As the complexity of a task increases, it becomes increasingly unlikely that it can always be executed without errors, and preventing errors beyond a certain point is infeasible. For every error scenario identified by the implementation team, we therefore investigate whether the recovery can be automated. Many operations can be reversed or recovered automatically. For example, errors can be handled through an automated restart policy: if a container in a pod fails, Kubernetes will bring it back up. By default, pods have a restart policy of “Always”, which means the kubelet running on the node will automatically restart a container in the pod whenever it exits.
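
The same idea can be applied inside an application. The sketch below is not how Kubernetes implements its restart policy; it is a small Python illustration of the pattern, assuming a task that can simply be re-run after a failure. The function name, the restart budget, and the doubling delay are illustrative choices.

```python
import time
from typing import Callable


def run_with_restarts(
    task: Callable[[], None],
    max_restarts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run task, restarting it after a failure with a doubling delay.

    Returns the number of restarts that were needed. If the task still
    fails once the restart budget is used up, the last error is re-raised.
    """
    restarts = 0
    while True:
        try:
            task()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            # Wait 1x, 2x, 4x ... the base delay before the next attempt.
            sleep(base_delay * 2 ** restarts)
            restarts += 1


attempts = []


def flaky_task() -> None:
    """Fails on the first two attempts, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")


print(run_with_restarts(flaky_task, base_delay=0.01))
```

Capping the number of restarts matters: an error that never clears should surface to the responsible team instead of looping silently.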

5. Execute Disaster Recovery (DR) Plan before go-live

Disaster Recovery tests involve simulating the failure of an Aladdin environment or an entire data center and recovering services in the secondary location. Disaster Recovery tests are critical for any application that involves changes to infrastructure. When teams are building applications on existing infrastructure, it is good to check whether a Disaster Recovery tracker is available and whether the newly onboarded application will be included in the periodic Disaster Recovery tests. If a Disaster Recovery plan is not available, the implementation team lead is responsible for making sure that a Disaster Recovery tracker is in place and that the Disaster Recovery scenarios for business continuity, as stated in the Business Continuity Plan, have been executed before the products or services are released to production.

6. Identify known risks and draft mitigation plan

An operational readiness assessment also includes the identification of any risks that the implementation team considers high. The implementation team details these risks along with mitigation steps. The risks fall into the following categories:

  1. Operational Risk
  2. Reputational Risk
  3. Financial Risk
  4. Security Risk

Operational Risk — Complete recommended checklists that meet Operational targets

An operational readiness test for the environment should also analyze the performance and load requirements so that we meet our operational targets. The recommended checklists are:

  • Release Checklist Pre and Post
  • Production Readiness Checklist
  • Deployment Readiness Checklist

Reputational Risk — Protection against Reputational Risk

We need to effectively manage any reputational risks that the implementation team identifies as part of the operational readiness assessment. Protecting ourselves against these risks is key to maintaining relationships with our clients and to upholding our core principles. It is a best practice to have any reputational risk reviewed by the risk management team and to receive their signoff.

Financial Risk — Prevent Financial Risk

To prevent any potential financial risks to the company or to our clients, we need to identify them as part of the operational readiness assessment. When the implementation team identifies financial risks, we need a detailed plan for preventing them, with the corresponding measures in place. Preventing financial risk not only gives our customers a sense of safety; as a fiduciary, it is our core responsibility. Any identified financial risks are to be reviewed with the risk management team and signed off.

Security Risk — Minimize Security Risk

To improve application security across Aladdin Products and Services, minimize application security risks, and comply with security standards, we should follow the best practices and policies for the various scans executed by the security team and receive their signoff as part of the operational readiness assessment. A successful production deployment includes steps to minimize security risks, which helps us deliver quality products and services to our clients.

7. Complete check-outs in lower environments

Our engineers are not the sole members responsible for testing products or services before we release them into production. There is a high probability that many issues might not stand out to them. It’s much more efficient to have a set of automated scripts simulating the actual users so that we can identify any hidden issues with every deployment rather than relying on manual testing. These scripts will act as a check-out before every production deployment.
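
A check-out script can be as simple as a list of named steps that exercise the system the way a user would. The Python sketch below assumes each step is a callable that raises when the behavior is wrong; FakeOrderService is a stand-in invented for this example so that the script runs without a real environment.

```python
from typing import Callable, List, Tuple


def run_checkout(steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every step and collect failure messages; an empty list means the check-out passed."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # includes AssertionError from failed checks
            failures.append(f"{name}: {exc}")
    return failures


class FakeOrderService:
    """Hypothetical in-memory stand-in for the service under test."""

    def __init__(self) -> None:
        self.orders: List[str] = []

    def place_order(self, symbol: str) -> None:
        if not symbol:
            raise ValueError("order rejected: empty symbol")
        self.orders.append(symbol)


service = FakeOrderService()


def user_places_order() -> None:
    service.place_order("ABC")
    assert service.orders == ["ABC"], "order was not recorded"


def user_places_invalid_order() -> None:
    service.place_order("")


failures = run_checkout([
    ("place order", user_places_order),
    ("place invalid order", user_places_invalid_order),
])
print(failures)
```

Because the runner continues after a failure, a single deployment check reports every broken flow at once instead of stopping at the first one.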

8. Have an operational runbook with failure recovery details ready

An operational runbook is a detailed “how-to” guide for completing a commonly repeated task or procedure within a company’s IT operations process. In the context of an operational readiness assessment, the operational runbook should be designed to be as simple as possible with key information required by the operations team to address the issue reported. A sample operational runbook template includes the following fields:

  • Operational procedure
  • Reason
  • Client Impact
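
As a small illustration, a runbook entry with the fields listed above could be kept as structured data and rendered for the operations team. The RunbookEntry class below is a hypothetical sketch in Python; only the three field names come from the template.

```python
from dataclasses import dataclass


@dataclass
class RunbookEntry:
    """Hypothetical runbook entry using the fields of the sample template."""
    operational_procedure: str
    reason: str
    client_impact: str

    def render(self) -> str:
        """Return the entry as plain text, refusing entries with blank fields."""
        fields = {
            "Operational procedure": self.operational_procedure,
            "Reason": self.reason,
            "Client Impact": self.client_impact,
        }
        blank = [label for label, value in fields.items() if not value.strip()]
        if blank:
            raise ValueError(f"runbook entry is missing: {', '.join(blank)}")
        return "\n".join(f"{label}: {value}" for label, value in fields.items())


entry = RunbookEntry(
    operational_procedure="Restart the pricing service",
    reason="Service stops responding after a cache outage",
    client_impact="Prices are stale until the restart completes",
)
print(entry.render())
```

Rejecting blank fields keeps entries from reaching the operations team half-written.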

Is the framework working?

The following metrics show the impact of the operational readiness assessment when conducted on a product. In the example below, the assessment was performed in June 2020. Following the assessment, we see a drastic reduction in the production priority 1 support tasks that the implementation engineers must firefight, compared to previous years.

Summary

Performing an operational readiness assessment and certification by the implementation team is critical for all Aladdin products and services. We may have all the resources lined up for building Aladdin products and services and we may be trying hard to meet the schedule, but if any of the software engineers in the implementation team notice red flags, they must be empowered to “pull the cord” and address their concerns during implementation rather than letting them go through and addressing them in production. It is much easier to address a known issue early, during the implementation phase, than to launch a product that is not ready. Launching a product that is not ready not only leads to customer dissatisfaction, but is also costlier. Checking for operational readiness saves time, money, and reputation. One of the best ways to implement an operational readiness assessment for any complex product or service is to follow the framework detailed in this post. Implementation teams have good reason to adopt it, given the benefits we have achieved with this framework: less firefighting for their teams, fewer production incidents, mitigation of various risks, and team empowerment.

BlackRockEngineering

Official BlackRock Engineering Blog. From the designers & developers of industry-leading platform Aladdin®. Important disclosures: http://bit.ly/17XHCyc