Product Reliability — Is it Just a Matter of Perspective?
Think back to a time when you bought an expensive item on an internet shopping site. You add your items to the cart, hit the “Pay” button and then, Boom! An ugly page error occurs. In a panic you hit the back button but the page shows no indication of whether you’ve been charged or not. You think for a minute, hope it was just a mistake and click “Pay” again. Now you get a different error message! Thoughts immediately race through your head:
- Is the site secure?
- What if my credit card details are floating around on some foreign server?
- Has my personal information been compromised?
- Have I been charged twice?
- Will I even get the items I wanted or something else?
You then spend the next two days calling customer service and checking your bank accounts, and if you don’t get any answers you’ll most likely unleash a cathartic social media storm of rage and swear never to use the site again.
Hardly a great customer experience. But what if behind the confusing pages, everything is fine, the system handled the duplicate order gracefully and dutifully initiated the order process? What is your perspective in this case? You might still think the site’s unreliable and can’t be trusted — especially if the same thing happens again (even if everything works fine in the background).
Now imagine if instead of ugly and confusing pages, you receive:
- a thank you message for your order
- some information that the systems are experiencing heavy loads
- a message saying your information is safe
- and a promise that you’ll receive an order confirmation within a few hours
Your heart rate will probably stay lower and you’ll make a mental note to check your mailbox.
That’s the perception of Reliability: your systems are in a failed state, but they adapt and protect the customer experience.
No company wants customers to have a poor experience using its Products, especially after it has invested in building, marketing and releasing them. A poor experience leads to a decline in engagement with your other Products and your brand, or worse, customers leave for your competitors. Reliability of your Products is important: if your Products are not Reliable, your organisation won’t be as successful as it can be.
For the sake of context, I’ll focus on digital Products made using software e.g. a website or mobile application, and not material goods. I’ll explain how you can understand Reliability both as a concept and a perspective, what influences its effectiveness, and propose ways to increase the Reliability of the Products you build in your organisation.
So, what do we mean by a Reliable Product?
A Reliable Product is something that provides a consistent, predictable experience when used or observed. If failure or unforeseen issues occur, the user is not left with a poor experience such as an ugly “Page not Found” error, or a blank screen with an endlessly spinning loading image. The Product should provide positive feedback in cases of failure, such as a reassuring message with a link to your transaction that you can check later, or a promise to follow up via another communication medium (e.g. SMS or email) once everything’s returned to normal.
Reliability can be a subjective concept and highly contextual, so let’s compare a simple tool with a modern website to illustrate the differences and in the process introduce another term — Resiliency.
The simple versus not so simple view of Reliability
A simple, single-function tool can be considered Reliable if it does what it should do, each and every time you use it, within the parameters of what it was designed for. It is dependable and, in a very human context, trustworthy. It’s like an old hammer or workshop clamp your grandfather made: it’s going to be useful long after you are gone and will keep doing what it is supposed to do.
My grandfather made this clamp over 60 years ago, primarily for the purpose of clamping pieces of timber and steel he worked on at home. This was always its intended purpose. No more, no less. It won’t need modifications or changes to its form or function for the rest of its existence. It works strictly as designed, and will just keep clamping pieces of timber and steel whenever I need it to.
But there is one quality that this clamp doesn’t have. It’s not very Resilient. It is certainly tough (made of forged steel) and the screw mechanism can generate a lot of force. However, it can’t adapt to clamping other objects all that well. If it breaks, bends, stretches or gets left out in the rain and rusts, it’s going to be useless i.e. it won’t just recover and return to its previous form.
In other words, a simple object that is used for a simple purpose, and that yields the same predictable result every time, can be considered quite Reliable — even if it is not very Resilient.
But a digital Product built by a company like SEEK is not simple like a G-clamp. Even Products that seem trivial on the surface are actually quite complex once you consider the designing, engineering and testing that goes into them. Once they are deployed into a production environment for Customers to use, their performance is affected by fluctuating loads placed on them by Customer and Non-Customer traffic alike. In addition, other SEEK Products and Services depend on them too. In other words, think of a G-clamp trying to clamp two objects that can morph into different materials, weights and even shapes at will, while those objects are being used by several hundred to maybe a thousand people at any one time!
As digital Products are created with software, they are not inherently Resilient either (software code can be as brittle or robust as the author likes). Moreover, the infrastructure configuration and runtime environments software Products execute on have a multiplying effect on their Resiliency too (both positive and negative), e.g. you can have the best written code in the world, but it won’t run well if your infrastructure and runtimes are stuck in a 1990s time-warp.
Based on these examples you can see the Reliability of a complex object, like a software Product (which needs to constantly adapt to its environment), is dependent on how Resilient you engineer it to be. I’ll unpack that hypothesis a bit further as we go, but first let’s understand a little bit more about Resiliency as we’ve used that term frequently so far.
Resiliency under the covers
Resiliency can be generally thought of as a subset of Reliability and is defined as:
…a system’s capability to build and sustain capacity for continuous adaptability; it will possess capabilities that enable it to adapt, synchronise, respond and learn¹
There is a lot packed into that statement — too much for this post — but it does serve as an aspirational target for any system critical to the operation of a business. In truth few Products built by software engineers will ever qualify as an archetypal case study for a Resilience Engineering paper, mainly because most software is written for business and not life-support systems. The significant engineering effort required to make them totally bullet-proof always has financial, time and resourcing trade-offs that require appropriate prioritisation.
Continual development of technology, such as Cloud Computing, means a base level of Resiliency has become easier to achieve out-of-the-box (or at least easier to configure) in recent times. What started at the infrastructure level with virtual servers and software-defined networks 5–10 years ago has increasingly become available at the executable-code level (commonly known as Serverless or Function-as-a-Service technology), where all compute infrastructure and runtime environments are managed by cloud vendors and charged back like a utility such as gas or electricity. In other words, the Resiliency of your solutions can focus much more on your actual code and how well you configure the cloud platforms it utilises; it no longer needs to include how good you are at racking, cabling and configuring physical hardware.
So how are Reliability and Resiliency related?
Achieving good levels of Resiliency in your software Products means engineering and configuring for it, and factoring it into the design from the ground up. Any measure of Reliability of a Product must therefore consider the individual Resilience capabilities of its constituent and externally dependent parts, and their ability to adapt to the forces acting upon them, environmental or otherwise.
From a systems perspective, we can reason that the Reliability of a Product equals the combined Resiliency of the systems required to deliver that Product to your customers. But the systems perspective is just one side of the Reliability coin. It’s fine for people writing and maintaining systems, but if your customers don’t perceive your products as Reliable, they’re not going to engage with them.
The customer’s view
In any digital business, understanding what makes a customer perceive a Product to be Reliable (and ensures they will keep coming back) must involve people who are focused on solving customer problems every single day. At SEEK this role is entrusted to the Product Manager².
Involve your Product Managers when you define a Product’s Reliability; they’ll ensure that the Software Engineering teams understand the Customer success measures. The outcome of involving Product Management in this process is a set of Service Level Objectives (or SLOs)³. Getting into the practice of establishing SLOs helps the Engineers building the Product understand Reliability from the Customer’s viewpoint. It establishes a common set of guiding principles for building new Products, and you can use it to decide when to invest more in improving system Resiliency.
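To make the SLO idea concrete, here is a minimal sketch of how an SLO and its error budget could be evaluated over a window of request data. The 99.5% target, the 3-second latency budget and the record fields are illustrative assumptions, not actual SLOs:

```python
# Hypothetical SLO: 99.5% of requests succeed within 3 seconds,
# measured over a rolling window. Target and field names are illustrative.
SLO_TARGET = 0.995
LATENCY_BUDGET_SECONDS = 3.0

def slo_compliance(requests: list) -> float:
    """Fraction of requests that were both successful and fast enough."""
    if not requests:
        return 1.0  # no traffic, nothing violated
    good = sum(1 for r in requests
               if r["ok"] and r["latency_s"] <= LATENCY_BUDGET_SECONDS)
    return good / len(requests)

def error_budget_remaining(requests: list) -> float:
    """Remaining error budget as a fraction of the allowed failure rate.

    Near 1.0 means little budget spent; at or below 0.0 the SLO is
    breached, and Resiliency work should take priority over features.
    """
    allowed_failure = 1.0 - SLO_TARGET
    actual_failure = 1.0 - slo_compliance(requests)
    return (allowed_failure - actual_failure) / allowed_failure

# A sample window: 997 good requests, one error, one slow, one both.
window = [{"ok": True, "latency_s": 0.4} for _ in range(997)]
window += [{"ok": False, "latency_s": 0.4},
           {"ok": True, "latency_s": 5.0},
           {"ok": False, "latency_s": 9.0}]
compliance = slo_compliance(window)          # 997 of 1000 requests were good
budget_left = error_budget_remaining(window) # positive: still within budget
```

The error budget is the decision-maker mentioned above: while budget remains, teams ship features; once it is spent, Resiliency work takes priority.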
Tip: Don’t blindly establish a single set of SLO rules to govern everything and then manage it all by a centralised committee!⁴
Different Products have different Customer requirements. Whilst a simple macro rule stating that requests should not take more than a 3–4 second round trip is a reasonable non-negotiable for all customer interactions, additional micro-level SLOs should be defined within the teams. For this to happen, Product Managers within each team must partner with Engineering leads to establish and agree on them, commit to reviewing them regularly, and ensure they receive equal importance with other work priorities and objectives. One of my Product Manager colleagues made it very clear to me when I asked her what makes a great Technology Lead:
Great Technology Leads and Engineers are those who will always try to understand and appreciate the customer experience and their journey
Ensuring you are developing Reliability measures within the teams improves autonomy of decision-making, gives your individual Reliability measures more purpose⁴ and, from a systems-thinking discipline, reinforces the value of focusing on work-as-done versus work-as-imagined⁵. Your Products will evolve independently more easily and be more adaptive and responsive to changes.
Lastly, it is important for teams to socialise and understand how Reliability is measured and engineered across teams, as this improves knowledge and understanding and helps teams collectively learn from each other⁶.
Guiding principles for building Reliable Products
Fortunately SEEK has a sizeable Product team with a strong customer focus. The customer’s view of Reliability is subjective and depends on the Product, of course, but by asking Product Managers questions and, more importantly, getting a spot in their busy calendars, you can uncover some interesting observations.
We’ve conducted a number of interviews over the last few months and found that Reliability is important, and that the customer’s perception of a performant product that doesn’t fail with cryptic technical messages or confusing error pages really matters.
Some insights we gained from these conversations were:
- Everyone can accept that failure is inevitable, even Product Managers, but we absolutely do not want the customer to perceive downtime or degradation when it occurs, or to have a negative experience of our product offerings
- Slow responses and occasional drop outs are just as bad as full outages, if not more so, as they aren’t easily detected and the blast radius of who is affected can be difficult to determine
- A “Page takeover” on failure is not a good look e.g. “Something went wrong, hit your back button and try again!”
- Not knowing whether something has failed, or is failing, is bad. We don’t like the concept of a “Customer Monitoring System”, i.e. when the customer finds out something is broken before we do and rings Customer Service⁷
- Product Managers are not IT experts, so we’re relying on the engineers to put in the right monitoring and metrics to tell us when our Customers are being negatively impacted
- A good incident management and post-incident-review process is a great idea, we should be doing these all the time, collating the data and using it to inform our decision making
- Hanging on to legacy systems that compromise or cause trade-offs when building new products sucks. If we need to get rid of something, tell us, but do it in language we can understand, aligned with processes we can follow
We asked Product Managers from different areas and got some excellent information unique to each of our Products. Some of this information also consisted of which sites and systems are important from a monetisation perspective, how technology people could better engage and lead engineering teams and much more.
With this information we can establish some broad guiding principles about what makes a good Product. This leads to all of us having better discussions with our teams around incident management and remediation and work prioritisation.
So how can we effectively measure the Reliability of our Products knowing what these guiding principles are?
Gathering metrics such as incident rates, customer service calls and reporting on the decline or increase in them is a common practice used to measure levels of Reliability. Unfortunately, this is a poor way to measure Reliability, because it gives no indication of how your Products may respond to future or unforeseen events. Metrics such as these are considered lagging indicators⁸ because they are only obtained after an event occurs.
The sign at the front gate boasting the number of days since the last recordable accident is more likely a sign of good luck than good safety management!⁹
Attempting to improve Reliability by basing your reporting and follow-up actions on lagging indicators will lead to reactive action cycles. Furthermore, teams acting on such indicators can lose sight of bigger issues as they optimise just to improve incident reporting. This lack of big-picture focus, coupled with reactive actions from lagging indicators over time, can cause a needlessly complex system to emerge: one that is never consolidated and optimised properly, causing greater issues for your business as patches and band-aid solutions further obscure fundamental flaws in the system.
Lagging indicators are not all bad though, they are good for highlighting patterns and should indicate the need to establish more Leading indicators in areas of concern when they start to become frequent.
Here’s an example: consider a safety inspector arriving at the scene of a plane crash. There could be a number of reasons why the plane crashed, but the data suggests that an engine fuel pump failed. The safety inspector recommends grounding the fleet and replacing all the fuel pumps. In theory it’s a good idea, but it is just a reaction to a single incident. What if there were other Reliability issues that caused the crash, or better yet, indicators that could have warned that the fuel pump was about to fail and averted disaster in the first place? Ideally, you want to act before-the-fact.
Establishing a mix of Leading and Lagging indicators¹⁰ is a better course of action, with priority given to proactive Leading indicators, especially in systems critical to the function of your business. There are many ways to define Leading indicators. In our plane example, indicators could include measuring fluctuations in fuel pressure over time, identifying excessive electrical faults within fuel components, leanness/richness of exhaust fumes, stalling and so on. Increasing Reliability by acting on these Leading indicators might involve reviewing and improving maintenance schedules, adding redundant fuel pumps, improving in-flight engine diagnostic tooling and so on.
But as we’re talking about building software Products and not flying planes, here are some proven software Leading Indicators¹¹:
- Lead Time — The time it takes from idea to delivery or issue identification to resolution. The shorter the better
- Deployment Frequency — How often software gets deployed. Regular deployments, irrespective of the size of the changes, ensure that untracked changes do not grow over time
- Mean Time to Restore (MTTR) — The time taken to restore a system from failure. The shorter the better. Note: this is not about the number of failures, just how fast you resolve them
- Change Fail Percentage — The rate of changes that cause failure as a percentage of deployment rates. The lower the better — high values indicate quality measures need improvement
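Under stated assumptions about how delivery data is recorded, these four indicators can be computed with a few lines of Python (the record shapes and values are hypothetical):

```python
from datetime import datetime
from statistics import mean

# Hypothetical delivery records; shapes and values are illustrative.
deployments = [
    {"at": datetime(2024, 1, 1), "failed": False},
    {"at": datetime(2024, 1, 2), "failed": True},
    {"at": datetime(2024, 1, 3), "failed": False},
    {"at": datetime(2024, 1, 4), "failed": False},
]
# (started, resolved) pairs for incidents.
incidents = [
    (datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 45)),
]
# (idea raised, delivered) pairs for work items.
work_items = [
    (datetime(2024, 1, 1), datetime(2024, 1, 5)),
    (datetime(2024, 1, 2), datetime(2024, 1, 4)),
]

def lead_time_days(items) -> float:
    """Average time from idea to delivery, in days."""
    return mean((done - start).total_seconds() for start, done in items) / 86400

def deployment_frequency(deps, period_days: int) -> float:
    """Deployments per day over the reporting period."""
    return len(deps) / period_days

def mttr_minutes(incs) -> float:
    """Mean time to restore, in minutes."""
    return mean((end - start).total_seconds() for start, end in incs) / 60

def change_fail_percentage(deps) -> float:
    """Share of deployments that caused a failure, as a percentage."""
    return 100 * sum(d["failed"] for d in deps) / len(deps)
```

Trending these numbers per team, rather than comparing teams against each other, keeps them Leading indicators instead of performance scorecards.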
You might find these useful too:
- Unplanned versus Planned Work — Unplanned work is a silent productivity killer; it slows change rates and diverts people’s attention away from valuable work¹²
- Technical Debt Backlog Pay-down — Are you heading toward Technical Bankruptcy or are you making quantifiable resiliency improvements?
- Chaos Engineering¹³ — The practice of proactively causing failure in live systems to measure resiliency
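As a toy illustration of the Chaos Engineering idea, the sketch below deliberately fails a fraction of calls to a hypothetical dependency and checks that the caller degrades gracefully instead of surfacing an error to the customer (all names are invented for illustration):

```python
import random

class FlakyDependency:
    """Wraps a callable and injects failures at a configurable rate."""

    def __init__(self, func, failure_rate: float, rng=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.func(*args, **kwargs)

def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a real downstream service call."""
    return [f"job-{user_id}-1", f"job-{user_id}-2"]

# Inject failures into roughly 30% of calls; a fixed seed keeps the
# experiment repeatable.
flaky = FlakyDependency(fetch_recommendations, 0.3, random.Random(42))

def recommendations_with_fallback(user_id: str) -> list:
    """Degrade gracefully: an empty list, not an error page."""
    try:
        return flaky(user_id)
    except ConnectionError:
        return []  # the customer sees no recommendations, not a failure

results = [recommendations_with_fallback("u1") for _ in range(100)]
# Every call returns a list; some are empty due to injected failures.
```

Real chaos tooling runs experiments like this against live systems with safeguards; the principle is the same, which is to learn where the fallbacks are missing before your customers do.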
Complex systems will always operate with some degree of flaws. The potential for failure is a constant, ever-present factor in day-to-day operations. “Normal Accidents” are therefore an inherent part of the system¹⁵
In software engineering things are going to fail; that’s just reality. Sometimes it’s small, sometimes it’s a nasty anomaly, like a tree suddenly falling in front of you while you’re driving down a freeway at top speed. Failure is unpredictable because scale and system complexity continually combine to throw up new challenges. At SEEK we have collected detailed incident notes for two years whenever things have failed. We’ve identified common failure patterns, most attributed to design and system complexity, and even a few nasty Dark Debt¹⁴ incidents that caused widespread damage.
What this data tells us with absolute certainty is that we are not infallible. More importantly it tells us that when we don’t expect failure, and it happens, we don’t respond anywhere near as well as we do when we have monitoring on known failure indicators.
Recognising hazards and successfully manipulating system operations to remain inside the tolerable performance boundaries requires intimate contact with failure. More robust system performance is likely to arise in systems where operators can discern the “edge of the envelope”.¹⁶
Given we know that things are going to fail, we can change our perspective of this outcome from a negative to a positive experience and learn from failure. Treat incidents as a data-gathering process: use the Lagging indicators they produce to establish more Leading indicators, and engineer Resiliency into our software systems. Measure customer experiences against your Reliability principles to stay focused and, most importantly, share learnings and proactively find out what we do well¹⁷.
Software systems fail. The larger and more complex they become, the greater the effort required to adapt, anticipate and proactively design them so that your Products deliver a reliable experience to your customers. In systems where change is constant, you can’t engineer only for best-case scenarios, nor assume no failure is possible. Engineering for Resiliency is the best way to ensure the perception of Reliability remains high, building trust and customer engagement, and ultimately making your organisation successful along the way. To summarise:
- Establish guiding principles based on the customer journey and experience (gained through engaging with your Product Teams)
- Use Leading indicators to drive Resilience Engineering efforts
- Establish SLOs within individual Product teams
- And finally, embrace failure as a method of learning to drive continuous improvement within your software delivery cycles.
Here is a simple but aspirational vision statement to get you started — good luck:
“<insert your company name here> customers will not experience any downtime or degradation of our sites and services when using our Products”
1. D. Woods: Resilience is a Verb
2. M. Cagan: INSPIRED: How to Create Tech Products Customers Love
3. B. Beyer et al.: Site Reliability Engineering
4. S. Dekker: The Safety Anarchist: Relying on human expertise and innovation, reducing bureaucracy and compliance
5. Field Expert Involvement: Systems Thinking for Safety
6. S. Dekker: Why do things go right?
7. In cases like these our incident analysis information has shown our Mean-Time-To-Recovery is significantly worse than when our pre-configured monitors tell us there is a problem. Oops!
8. Lead and Lag indicators
9. T. Mathis: Fallacies in the Safety Fable
10. D. Heuther: Leading, Lagging indicators
11. N. Forsgren, J. Humble: Accelerate: The Science of Lean Software and DevOps
12. E. Goldratt: The Goal: A Process of Ongoing Improvement
13. Principles of Chaos Engineering
14. J. Allspaw: Dark Debt — extract from the http://stella.report
15. C. Perrow: Normal Accident Theory
16. R. Cook: How Complex Systems Fail
17. Appreciative Enquiry