Operational Readiness Review Template

Towards Operational Excellence

Adrian Hornsby
Nov 11, 2020

I want to express my gratitude to my colleagues and friends Ricardo Sueiras, Matt Fitzerald, and Boaz Ziniman for their valuable feedback.

Since I published my blog series, I have received a relatively large amount of feedback and requests. One, in particular, stood out:

“Can you share an operational excellence review template?”

Operational Readiness Review

In this blog post, I will share with you my “lightweight” (but not so lightweight) Operational Readiness Review (ORR) template.

An ORR is a rigorous, evidence-based assessment that evaluates a particular service’s operational state. It is often specific to a company, its culture, and its tools. Yet, ORRs all have the same goal: help you find blind spots in your operations.

This template, which I hope will help you get started, is based on my two decades of experience writing application software, deploying servers, and managing large-scale architectures. I have refined it over the years while helping customers operate software systems in the AWS cloud.

This ORR template is by no means complete. Instead, treat it as a starting point for you and your company to get the ball rolling. Most importantly, it should make you think about the different aspects of software operations so that you minimize the risk of failure once the code hits production.

How to use this ORR template?

As mentioned previously, this is not THE template — it is A template — so treat it more as a mechanism for regularly evaluating your workloads, identifying high-risk issues, and recording your improvements.

More importantly, make it yours. Add your own experience to it. Adapt it to your culture, to your needs.

Can you have the right answers to all questions?

Very unlikely at first, but over time it should be your goal. Again, an ORR is more of a learning path that supports continuous improvement. Recording each ORR makes it easy to save point-in-time milestones and track improvements to your operations.

Who should do an ORR?

An ORR should preferably be done with the entire service team: the product owner, the technical product manager, backend and frontend developers, designers, architects, and anyone else who was involved in one way or another with the service. The more diversity, the better. We want to avoid confirmation bias as much as possible.

When should you do an ORR?

A formal ORR should be done before the initial service launch and after any significant technological change. It should be repeated periodically (about once per year) to ensure that things haven’t drifted away from operational expectations but instead improved over time.

How does an ORR differ from an AWS Well-Architected review?

While there are some overlaps, the AWS Well-Architected review gives customers and partners a means to evaluate and implement designs that can scale over time. It describes the key concepts, design principles, and architectural best practices for designing and running workloads in the cloud. An ORR, by contrast, focuses on the operational aspects of a particular service.

Operational Readiness Review Template

The ORR template is organized as follows:

1 — Service Definition and Goals

(e.g., number of users, sales, marketing, ad-hoc, …)

2 — Architecture

Call out the critical functionalities. Identify the different components of the system and how they interact with one another.

Describe the mechanisms and expectations.

(discuss bulkheads, cells, shards, etc.)

If you do, explain why and what is done to minimize the impact of failure.

3 — Failures and Impact

(fail-open vs. fail-closed)
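To make the fail-open vs. fail-closed distinction concrete, here is a minimal Python sketch. The names `guarded_call`, `DependencyError`, and `broken_dependency` are hypothetical and only stand in for whatever downstream call your service makes; they are not part of the template.

```python
class DependencyError(Exception):
    """Raised when a downstream dependency cannot be reached."""


def guarded_call(check, *, fail_open):
    """Run `check()`; if the dependency fails, allow (fail-open) or deny (fail-closed)."""
    try:
        return check()
    except DependencyError:
        # Fail-open: keep serving when the dependency is down (availability first).
        # Fail-closed: refuse when the dependency is down (safety first).
        return fail_open


def broken_dependency():
    # Hypothetical dependency that is currently unavailable.
    raise DependencyError("dependency unreachable")


# A non-critical check (for example, a rate-limit lookup) might fail open...
rate_limit_ok = guarded_call(broken_dependency, fail_open=True)
# ...while an authorization check would typically fail closed.
authorized = guarded_call(broken_dependency, fail_open=False)
```

Which behavior is right depends on the call: the point of the ORR question is to make the choice explicit for each dependency rather than leave it to whatever the error handling happens to do.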

(discuss in particular multi-AZ, self-healing, retries, timeouts, back-off, throttles, and limits put in place)

(ref. static stability)
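As an illustration of the retries, timeouts, and back-off mentioned in this section, the following Python sketch retries a call with capped exponential back-off and full jitter. It is one possible shape, not a prescribed implementation: `call_with_retries`, its parameters, and the default values are assumptions made for this example, and `operation` is expected to enforce its own timeout.

```python
import random
import time


def call_with_retries(operation, *, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, rng=random.random):
    """Call `operation()`, retrying on `retry_on` errors with capped back-off and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Retry budget exhausted: surface the failure to the caller.
            # Cap the exponential delay, then wait a random fraction of it (full jitter).
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(cap * rng())


# Hypothetical flaky operation: fails twice, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# Record the delays instead of sleeping so the demo runs instantly.
delays = []
result = call_with_retries(flaky, sleep=delays.append, rng=lambda: 1.0)
```

Bounding the number of attempts and the maximum delay matters as much as the retry itself: unbounded retries can amplify load on a dependency that is already struggling.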

4 — Risk Assessment

5 — Monitoring, Metrics & Alarms

List all of your alarms, with period and threshold, and the severity of each.
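One way to keep such an alarm list reviewable is to store it as structured data. The Python sketch below is only an assumption about how that could look: the `Alarm` fields, the alarm names, the thresholds, and the "page"/"ticket" severity scale are all illustrative, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    name: str
    metric: str
    threshold: float
    period_seconds: int
    severity: str  # "page" or "ticket" here; the scale itself is an assumption


# Illustrative inventory; the values are made up for the example.
ALARMS = [
    Alarm("api-5xx-rate", "http_5xx_percent", 1.0, 60, "page"),
    Alarm("api-p99-latency", "latency_p99_ms", 800.0, 300, "ticket"),
]


def alarms_by_severity(alarms):
    """Group alarm names by severity so reviewers can see what pages a human."""
    grouped = {}
    for alarm in alarms:
        grouped.setdefault(alarm.severity, []).append(alarm.name)
    return grouped


summary = alarms_by_severity(ALARMS)
```

Keeping the inventory in version control alongside the service also makes it easier to compare alarms between two ORRs.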

(discuss in particular if it is shallow or deep, if it uses cache, async vs. sync, etc., and the risks associated)
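The difference between a shallow and a deep health check can be sketched in a few lines of Python. The function names and the probe dictionary are hypothetical; a real check would sit behind whatever endpoint your load balancer or orchestrator calls.

```python
def shallow_health_check():
    """Report healthy if the process itself can answer; touches no dependencies."""
    return {"status": "ok"}


def deep_health_check(dependency_probes):
    """Run every dependency probe and report degraded if any of them fails."""
    results = {}
    for name, probe in dependency_probes.items():
        try:
            results[name] = bool(probe())
        except Exception:
            # A probe that raises is treated the same as one that reports failure.
            results[name] = False
    status = "ok" if all(results.values()) else "degraded"
    return {"status": status, "dependencies": results}


# Hypothetical probes: the database answers, the cache does not.
report = deep_health_check({"database": lambda: True, "cache": lambda: False})
```

The associated risk is worth discussing in the review: a deep check catches more problems, but if every host probes the same shared dependency, one dependency outage can mark the whole fleet unhealthy at once.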

6 — Testing

Do you have tests before and after conducting code review? Do they run automatically, or are developers running tests manually?

What assumptions do you make on these?

7 — Deployment

List the actions in the deployment pipeline and the estimated time for each.

Why aren’t they automated? What are the risks associated with each of the touch-points?

How do these changes get approved? Do you have several people approving changes?

Does your deployment update/upgrade software in-place?

8 — Operations

(include timing expectations).

9 — Disaster Recovery

(e.g., war rooms, isolation, calls, internal & external communication)

(e.g., postmortem, correction-of-error, etc.)

That’s all for now, folks. If you want to download, fork, or suggest some changes, this template is on my GitHub account here. Please contribute and help me improve it.

I hope you’ve enjoyed this post. Thanks a lot for reading :-)

— Adrian
