Reliability in Software Engineering
Building Software and Processes for Unreliable Scenarios
Even though the software industry has produced many innovations, achievements, complex systems, and experienced veterans, it is still young compared to other fields and practices.
Building software itself might be trivial in some circumstances, but to deliver true state-of-the-art quality, we first have to engineer it, and that's not only about the services but also about the teams and human factors behind them.
Without that, it would be questionable whether we can build quality systems.
Reliability is one of the characteristics describing quality software and outlines the difference between the good and the bad.
Reliable software is a system that is tolerant to failures, or even failure-free.
Reliability is the probability of failure-free operation of a component for a given period in a given environment.
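That probability is often formalized with the exponential reliability model, R(t) = e^(-λt), where λ is the failure rate (the inverse of the Mean Time Between Failures). A minimal sketch, with made-up numbers:

```python
import math

def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of failure-free operation for t_hours,
    assuming a constant failure rate (exponential model)."""
    failure_rate = 1.0 / mtbf_hours
    return math.exp(-failure_rate * t_hours)

# A component with an MTBF of 10,000 hours, run for 720 hours (one month):
print(round(reliability(720, 10_000), 3))  # → 0.931
```

The constant-failure-rate assumption is a simplification; real systems often follow a "bathtub curve" with higher failure rates early and late in life.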
There’s one saying that still sticks with me when thinking about reliable systems:
Reliability is flipping the switch and knowing the lights will come on.
Reliability is not just another metric; it is the way customers experience the system.
It’s not a static characteristic. It’s dynamic in the sense that a system can withstand failures during various circumstances, both expected and unexpected.
Resilience is the way and reliability is the outcome.
A system cannot be reliable without being resilient, and vice versa.
Depending on what part of the world you reside in, you might often hear these two words used interchangeably. Still, there is some difference to it.
The resilience of a software system is measured by how much it can withstand, how well it can mitigate threats, and how fast it can recover.
Labeling a software system as reliable requires us to ensure both the operation and the design of the components in place.
Regardless of the industry, company size, priorities, and customers, everyone wants their software components and platforms to work at all times.
Design for Reliability
Designing and building reliable software is the responsibility of everyone involved: UI/UX designers, product managers, engineers, architects, and the underlying infrastructure and hardware providers.
For software to be reliable, it first has to be designed to:
- Minimize failures
- Minimize the impact of failures
- Ensure uptime
Components have to stay functional during both predicted and unpredicted scenarios.
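Minimizing the impact of failures often means tolerating transient ones instead of surfacing them to the user. A minimal retry-with-backoff sketch (the function and parameter names here are illustrative, not from any particular library):

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=0.5):
    """Retry a flaky operation with exponential backoff and jitter,
    so transient failures never reach the user."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: let the caller handle the failure
            # Exponential backoff plus jitter to avoid thundering herds
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
```

Retries only help for transient faults; pairing them with timeouts and circuit breakers keeps a persistent failure from being amplified.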
Reliable systems are built to be trusted.
During the design phase of a service, component, or product — we already have to start striving for reliability.
Design imperfections often lead to confusion and unnecessary complexity, and provide fertile ground for brittle solutions.
In the concept phase, there are two Reliability models that you can make use of:
- Software Reliability Goal Setting
- Software Reliability Program Plan
Set reliability as a goal, review and analyze decisions during design, conception, and development, and results will improve greatly before you even start shipping a product.
Make use of other techniques for designing reliability, such as:
- Team Design Reviews
- Software Failure Modes and Effects Analysis
- Software Fault Tree Analysis
- Software Fault Tolerance Analysis
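Software Fault Tree Analysis, for instance, boils down to combining the probabilities of basic events through AND/OR gates to estimate the probability of a top-level failure. A minimal sketch, with made-up event probabilities:

```python
def p_or(*probs):
    """Top event occurs if ANY input event occurs (OR gate),
    assuming the events are independent."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

def p_and(*probs):
    """Top event occurs only if ALL input events occur (AND gate)."""
    p = 1.0
    for q in probs:
        p *= q
    return p

# Outage = (primary DB fails AND replica fails) OR network partition
outage = p_or(p_and(0.01, 0.05), 0.002)
print(round(outage, 4))  # → 0.0025
```

Even this toy tree makes the design lesson visible: the redundant database pair contributes far less to the outage probability than the single network dependency.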
“Design for Reliability” itself is a model, often part of the Design for Excellence.
We have to recognize the common threats in order to improve a system and design it to be adaptable and ready to mitigate issues.
A common set of threats for every system can be the following:
- Security threats
- Heavy dependency on other services
- Inter-service communication
- Unhandled issues and edge cases
- Environmental changes
- Hardware failures
- Inconsistency in performance and behavior
This list is a rough one and could go on much further.
Reliability is often about everything right and wrong in a software system. All factors that can cause it to fail are threats in essence, from security, bad design, and limited hardware resources, to incorrectness and unpredicted issues.
From experience, failures are caused by:
- Neglected threat analysis
- Technical incorrectness
- Overengineering and overdesign
- Infrastructure errors
- Lack of testing
- Low resilience to development changes
There is a reason why great software is easily maintained, simple, effective, and robust. But it has to be designed and built like so.
The inability to recover from failures is another major factor, and it frequently undermines a system's reliability.
All of us have encountered, thousands of times, software that simply doesn't work.
It doesn't break, it doesn't raise a flag notifying you that there's an error, but for some reason it just doesn't do what it is clearly intended to do. This implies bad design and zero anticipation of its behavior, and screams "I'm bad at what I'm supposed to do".
Reliability is designed into the product and processes as well.
Perspective is everything. First off, look at your software from the user's perspective. Use it, try it out — fill the consumer's shoes.
Improve your product in a technical sense, but then work on your processes and look into what can be improved there as well.
Always analyze your software, features, threats, bugs, incidents, and issues.
Technically, the following practices and concepts all need to be incorporated for us to improve our software:
- Infrastructure Metrics
- Hardware Resources Metrics
- Custom User Metrics
- Real-Time Alerting
- Postmortem Analysis
- Incident Reporting
- Incident Management
Reliable software isn't built or proven in a day. Reliability has to be achieved, and the way you design, build, and improve your systems is all part of it.
Failure Modes and Effects Analysis
There are many frameworks and models that help you with overcoming issues, analyzing, and improving your solutions.
One of these is Failure Modes and Effects Analysis. FMEA is a systematic and proactive framework to identify possible failures and their impact.
It’s a process for reviewing the components and parts that make up your system, and lets you identify failure causes and effects.
FMEA is a core task for reliability engineering.
Many companies and communities do not have such models in place, yet strive for high quality, stability, and reliability.
It simply lets you recognize:
- What could go wrong
- Why would the failure occur
- What are the impact and consequences of a failure
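In classic FMEA, those three questions are scored numerically: each failure mode is rated 1-10 for severity, occurrence, and detection difficulty, and the product of the three is its Risk Priority Number (RPN). The highest RPNs get addressed first. A minimal sketch with illustrative failure modes and ratings:

```python
# Each entry: (failure mode, severity, occurrence, detection), rated 1-10.
# The failure modes and scores below are made up for illustration.
failure_modes = [
    ("DB connection pool exhausted", 8, 6, 4),
    ("Stale cache served to users",  5, 7, 3),
    ("Payment webhook dropped",      9, 3, 8),
]

# RPN = severity x occurrence x detection; sort highest risk first.
ranked = sorted(
    ((name, sev * occ * det) for name, sev, occ, det in failure_modes),
    key=lambda item: item[1],
    reverse=True,
)
for name, rpn in ranked:
    print(f"{rpn:4d}  {name}")
```

Note how the dropped webhook ranks highest despite occurring rarely: its severity and poor detectability dominate, which is exactly the kind of insight raw incident counts would miss.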
Glitches, outages, and errors all cause downtime and affect how reliable a system is.
Tracking metrics related to incidents is a must. Incident management, detection, diagnosis, resolution, and prevention are all KPIs for your engineering department.
As far as KPIs go, just a few of them are the following:
- Number of alerts created in a given time period
- Number of incidents that occur in a given time period
- Mean Time Between Failures
- Mean Time To Acknowledge
- Mean Time To Detect
- Mean Time To Resolve
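These mean-time KPIs fall out of little more than incident timestamps. A minimal sketch over a hypothetical incident log (the dates and schema are invented for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (started, acknowledged, resolved)
incidents = [
    (datetime(2023, 1, 2, 10, 0), datetime(2023, 1, 2, 10, 5), datetime(2023, 1, 2, 11, 0)),
    (datetime(2023, 1, 9, 14, 0), datetime(2023, 1, 9, 14, 2), datetime(2023, 1, 9, 14, 30)),
]

def mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)

mtta = mean([ack - start for start, ack, _ in incidents])
mttr = mean([end - start for start, _, end in incidents])
# MTBF here: average gap between the starts of consecutive failures
starts = sorted(s for s, _, _ in incidents)
mtbf = mean([b - a for a, b in zip(starts, starts[1:])])

print(f"MTTA: {mtta}, MTTR: {mttr}, MTBF: {mtbf}")
```

Definitions vary between teams (MTBF is sometimes measured between resolution and the next failure), so agree on one convention before comparing numbers across departments.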
Only once you have these in place can you actually look into which segments can be improved, whether per team, product, or service, or across your entire engineering department.
Whether you have a dedicated support team, or your developers are supporting internal components, you need on-call rotation.
It’s proven to benefit your software quality and reliability.
Tracking the time spent handling incidents during on-call is also a very useful way to yield more insight.
Still, all the KPIs, measurements, and metrics in the world can’t replace context. Incidents are unique, and statistics are sometimes traps. Context-aware analysis of an incident is a must and often requires only a pinch of common sense.
Having metrics in order to analyze your components is great, having real-time alerting even before a customer notices an issue is amazing.
Reliability isn't baked into the software alone, but into the teams behind it as well.
Alerting and responding to issues is required in order to provide reliability to consumers. Mitigation is key — detecting an issue before it blows up, and your customers even notice it, lets you paint a whole different picture, and enables you to prevent incidents.
Alerting is a must-have. From experience, you’d be surprised how many benefits alerting brings. Try out and see what channels work best for you:
- Real-Time Status Pages
- Slack, Teams, or whatever you use
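Wiring a metric check to one of those channels can be very little code. A minimal sketch of an error-rate alert posted to a Slack/Teams-style incoming webhook (the threshold and webhook payload shape are assumptions; check your provider's webhook documentation for the exact format):

```python
import json
import urllib.request

def check_error_rate(error_rate: float, threshold: float = 0.05) -> bool:
    """Return True when the error rate crosses the alerting threshold."""
    return error_rate >= threshold

def send_alert(webhook_url: str, message: str) -> None:
    """POST a JSON alert to an incoming-webhook URL.
    Slack-style webhooks accept a {"text": ...} payload."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# Example wiring (webhook URL is a placeholder):
# if check_error_rate(current_error_rate):
#     send_alert("https://hooks.example.com/...", "Error rate above 5%!")
```

In practice you would also add deduplication and cooldowns, so a sustained incident produces one alert instead of a page every polling interval.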
Many successful companies have all of these in place. Alerting, together with monitoring, is worth making a company-wide initiative.
Design for Six Sigma, or DFSS, is a systematic approach and model in software engineering whose goal is to achieve reliability. It's a process for producing high and continuously improving quality.
Six Sigma will enable you to have:
- Statistical Quality Control
- Methodical Approach
- Fact and Data-Based Approach
- Project and Objective-Based Focus
- Customer Focus
All are prerequisites for reliable software. This in turn allows you to incorporate a DMAIC loop, short for Define-Measure-Analyze-Improve-Control.
There are many concepts, factors, and aspects to software reliability, but hopefully the above summarizes a good part of it. The goal of this story is to shed some light on already well-established concepts, but also on some less-known ones. Software reliability is only one segment of engineering, but in essence, it's a science of its own.
I hope you enjoyed and found the story interesting, whether you are an engineer, architect, manager, or designer. Follow and subscribe to my newsletter to stay tuned in for more stories like this one.
Thank you for reading! 🎉