So, you think you do quality assurance? Part 2: Advanced Quality.
Part one of this mini-series deals with what quality assurance is and how the QA process is intertwined with the SDLC process to proactively minimize the chances of bugs/defects ever occurring. There is a very strict definition of what a bug is: a bug is a violation of an assumption, meaning code doesn’t behave according to expectations. This difference between expected and actual behavior doesn’t point to the origin of the problem. In other words, what is being observed is code that does not behave as expected, not how such code came to be.
What is quality assurance?
Quality: the standard of something when it is compared to other things like it; how good or bad something is; a high standard.
synonym → excellence
Assurance: a statement that something will certainly be true or will certainly happen, particularly when there has been doubt about it.
synonyms → guarantee, promise
In other words, “quality assurance” is a “guarantee for excellence.” To have a QA role on the team means commitment to a high standard. Read more in part one of the series.
Measure all the things!
To quote the late Peter Drucker, “If you can’t measure it, you can’t improve it.” While it is important to fix a defect identified in the system in the short term, it is also important to understand whether it could have been prevented from occurring in the first place. The system used to develop software should be put under the same kind of engineering practices and scrutiny as is expected of its output. This means that to guarantee quality in a system or a product, there is a need to establish metrics and monitor the SDLC process itself. Well-defined checkpoints where assumptions and expectations can be verified and checked for defects (also called non-conformities) should be established. All the steps in the SDLC should have well-defined hand-off points that verify the state and the quality of the output before continuing to the next stage. The goal is to catch issues as soon as possible and to minimize the impact of any that escape the verification process. There are some very useful KPIs and metrics we could set up to start measuring the SDLC.
This chapter focuses on the metrics used to assess and fine-tune an SDLC. Its aim is to answer the following questions:
- What are lagging and leading indicators?
- What metrics are used to evaluate the whole process?
- How could cyclomatic complexity, code coverage, code churn, defect density, and others be used to fine-tune and improve your process?
Let’s start by making a clear distinction between KPIs measuring quality of a particular product or item, and KPIs measuring a process. The definition of a quality product depends on many factors, one of which is a high quality process. As already discussed in Part 1, quality assurance is about process and is the responsibility of the entire team!
Before moving on, a quick word on types of indicators!
A good assessment of a process requires us to measure a combination of lagging indicators (looking back) and leading indicators (looking forward). A simple example is estimating how much fuel is needed to go from point A to point B. The max speed, terrain, and weather conditions all affect how much fuel the trip needs, and along the way the situation can be monitored with leading indicators such as current fuel consumption, A/C on, windows down, traffic, etc. Total fuel consumption will only be known upon arrival at point B, which makes it the lagging indicator in this example. Choosing good leading indicators gives the ability to project the results of the lagging indicators with sufficient accuracy. To measure the effectiveness of an SDLC process, proper indicators should be chosen with a clear understanding of how they correlate to each other.
In a typical software development scenario, a leading indicator is something we can observe at any point in time. Code coverage reports are such a thing. We may pull one out at any point in time and have an accurate number of code coverage for the current state of the system. However, this indicator on its own is a very poor predictor of the quality of a software system. Quality of the system is indicated by lagging indicators such as escaped bugs and the number of production incidents that can only be observed after the system is live. Code coverage can become useful once it is correlated with escaped bugs and other issues in a way where we can observe how changes in the leading indicators affect the lagging ones.
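To make the idea of correlating a leading indicator with a lagging one more concrete, here is a minimal sketch. The per-release numbers are invented for illustration, and Pearson's coefficient is only one possible way to express the relationship; it is an assumption of this example, not a prescribed method.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical data, one entry per release (not taken from any real project).
coverage_pct = [55.0, 60.0, 68.0, 75.0, 82.0]  # leading indicator
escaped_bugs = [9.0, 8.0, 6.0, 5.0, 3.0]       # lagging indicator

r = pearson(coverage_pct, escaped_bugs)
print(f"coverage vs. escaped bugs correlation: {r:.2f}")
```

A value close to -1 would suggest that, in this made-up data set, releases with higher coverage tended to ship with fewer escaped bugs. Correlation alone does not show that coverage caused the improvement, which is why the trend should be reviewed together with the rest of the process.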
Measure the SDLC Process
Why the need to measure and re-evaluate the process constantly (even in cases of a successful product launch)? The answer is both very simple and yet complicated. For a business to be sustainable, it must be efficient. Everything done must make economic sense. Sometimes things that happen along the way are easier to digest in hindsight. The fact that the product was completed on time and within budget doesn’t mean it was done optimally. Maybe it could have been done cheaper, faster, or with fewer defects. Waste that could have been avoided might have accumulated along the way; eliminating it means the next product could be delivered with fewer resources or in less time. As an example, is delivering a product with 3 major bugs better or worse than delivering one with 0 major but 10 minor bugs, and is the tradeoff worth it? The process and the output must be measured with a variety of KPIs to be able to make such decisions and keep the business efficient.
The following is a non-exhaustive list of metrics and KPIs to get you started:
Lead time and total wait time
Lead time is the latency between the initiation and completion of a process. As a simple example, the lead time of a software project is measured from the first line of text written about the idea until it’s delivered to the end users. Total wait time, on the other hand, is how long the project was stalled for a variety of reasons: resource availability, waiting for third parties, etc. Calculating these times could be almost entirely automated with modern tooling (at DraftKings we use the Atlassian suite).
Cycle time is a lagging indicator that, in contrast to lead time, only tracks the time a team spends actively working on a project. It is used to identify the average development times for a team or a given request type. This assumes some kind of categorization of the projects/tasks so they could be compared over time. A big thing to note here is that work is not always constant and comparable in terms of absolute numbers. Cycle times are a sequence of data points that help us identify trends.
Achieving maximum productivity with minimum wasted effort or expense is crucial for a business in a competitive landscape. If you want to be highly efficient, you must do things once and only once! Defects that force you to go back and forth on the process for re-works incur additional costs that could damage the long-term ability of a company to invest in capital expenditures, in addition to slowing the time to market (TTM).
These three metrics (lead time, cycle time, and total wait time, which is the sum of all the wait times in the process) give an understanding of the efficiency of the SDLC. The design, development, test, and rollout stages could be influenced by applying a variety of metrics and quality gates. A few examples:
- involving an architect and QA in the requirements gathering process could shorten all the following phases of the development
- proper architecture and testing strategy reviews could increase the design cycle but could also significantly shorten the development and testing ones
- more rigorous quality gates in the development phase could result in longer cycle time, but could also shorten the test and rollout phases
Finding the right balance can be tricky, which is why both leading and lagging indicators are measured and correlated.
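As a rough illustration of how lead time, cycle time, and total wait time relate to each other, the sketch below derives all three from a ticket's status history. The `StatusChange` type and the status names (`in_progress`, `waiting`, `done`) are hypothetical; a real tracker exports its own fields and workflow states.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusChange:
    at: datetime
    status: str  # hypothetical states: "in_progress", "waiting", "done"


def summarize(created: datetime, history: list[StatusChange]) -> dict[str, timedelta]:
    """Derive the three timings from a chronologically ordered status history."""
    if not history or history[-1].status != "done":
        raise ValueError("history must end with a 'done' entry")
    totals = {"in_progress": timedelta(), "waiting": timedelta()}
    # Each status lasts until the next status change.
    for current, following in zip(history, history[1:]):
        if current.status in totals:
            totals[current.status] += following.at - current.at
    return {
        "lead_time": history[-1].at - created,   # idea written down -> delivered
        "cycle_time": totals["in_progress"],     # time actively worked on
        "total_wait_time": totals["waiting"],    # sum of all stalled periods
    }


# Invented example ticket.
created = datetime(2024, 1, 1)
history = [
    StatusChange(datetime(2024, 1, 3), "in_progress"),
    StatusChange(datetime(2024, 1, 8), "waiting"),
    StatusChange(datetime(2024, 1, 10), "in_progress"),
    StatusChange(datetime(2024, 1, 15), "done"),
]
timings = summarize(created, history)
```

In this made-up ticket the two days between creation and the first `in_progress` entry count toward lead time only. Whether that backlog time should be treated as wait time is a decision each team has to make for itself.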
Number of production incidents
Production incidents are not to be confused with escaped bugs. An escaped bug could be something as minor as a color difference between buttons on a page, which would not impair the users’ ability to use the system. Incidents are unplanned interruptions of a service. They are categorized by their severity. As an example, a bug could cause one or more incidents, which would give it a high priority for a fix (or even trigger a rollback to a previous known good state of the system). Severity is a measure of the potential impact on the system, and it influences the priority assigned to eliminating the root cause. Bugs could be escalated to incidents and incidents could be downgraded to bugs.
Number of production incidents tracks imperfections or abnormalities that impair the quality, function, or utility of our product. Metrics, such as the cost of fixing bugs in production, mean time between failures, mean time to recovery, and others, are related to this number. It also helps to estimate the cost of lost business and should trigger a root cause analysis process of how we delivered a defective product to our customers. This is one of the cases where code bugs or system misconfiguration is just a symptom of the real problem, which could be a conflicting requirement, design that wasn’t validated, code or test review that was skipped, missing handover step between teams, faulty quality control, etc.
The “time frame” of measurement for number of production incidents could be defined based on the business and release cycle. Some examples: per release, per quarter, per year.
Mean time to recovery (MTTR)
How fast can a system get back to working capacity once a failure occurs? This includes the full time from the system going down to the time it becomes fully operational again. It is calculated by adding up all the downtime in a specific period and dividing it by the number of production incidents. If a system was down for a total of 60 minutes over the past year, and there were 4 incidents in the same period → 60 divided by 4 gives us 15, so the MTTR is 15 minutes for the past year. There is an important distinction to be made though → one 60-minute incident and 60 one-minute incidents, spread over one year, could have completely different operational outcomes for a business.
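The calculation above is simple enough to sketch in a few lines. The individual incident durations below are invented; only the totals (60 minutes, 4 incidents) come from the example in the text.

```python
def mean_time_to_recovery(downtimes_minutes: list[float]) -> float:
    """Total downtime divided by the number of incidents in the period."""
    if not downtimes_minutes:
        raise ValueError("no incidents recorded in this period")
    return sum(downtimes_minutes) / len(downtimes_minutes)


# Four incidents adding up to 60 minutes of downtime, as in the example above.
past_year = [10, 20, 5, 25]
mttr = mean_time_to_recovery(past_year)  # 60 / 4 = 15.0 minutes

# Same total downtime, very different operational picture.
one_long_outage = mean_time_to_recovery([60])
many_short_outages = mean_time_to_recovery([1] * 60)
```

Both of the last two scenarios add up to 60 minutes of downtime, yet one produces an MTTR of 60 minutes and the other of 1 minute, which is why the number should be read together with the incident count.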
This is a measure of the processes (detection, investigation, recovery, cost, severity):
- Is it as fast as it should be?
- How do you fare against your competitors?
- Are you efficient?
- How much does it cost to fix an incident? How much business did you lose?
- How affected was the system?
This metric could be used to identify whether a problem exists; pinpointing the problem, however, is a different subject:
- Is the alert system measuring only averages? Maybe there is a need for p95/p99 as well?
- Is the team slow to respond? If so, why? Are they missing a checklist or lacking in skill or knowledge of the problem domain?
- Are fixes taking too long? Is there proper automation and code coverage?
- Is there enough data to find the issue?
Downtime measurements should be split into measuring the different phases too: identification, correction, rollout, prevention.
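To illustrate why averages alone can hide a problem, here is a small sketch comparing a mean with p95 and p99 values. The response times are invented, and the nearest-rank percentile used here is just one of several common definitions; a monitoring system may compute percentiles differently.

```python
def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile for p between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) with integer math
    return ordered[rank - 1]


# Hypothetical response times in milliseconds: mostly fast, a few very slow.
response_times_ms = [100] * 97 + [3000] * 3

average = sum(response_times_ms) / len(response_times_ms)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
```

In this made-up sample the average and p95 both look healthy, while p99 exposes the slow requests that a small share of users would actually experience.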
Number of escaped bugs
An escaped bug or defect is one that was not prevented or found by the quality assurance process but by our customers (or even worse, abusers). How exactly this should be measured is important, but it could be highly affected by the business niche and the way a company builds its software. It could be measured over time, per release, or even per KLOC (or a combination of all). Tracking this metric clearly states to the teams that the company cares about quality and tracks it in a variety of ways. Fewer escaped bugs are directly correlated with QA process efficacy.
As already covered in part one, code bugs are symptoms of a problem. While all bugs that were found should be categorized and analyzed, with escaped bugs that is mandatory. They must also be sorted into different categories. A good start would be to split them into: missing requirement; missing UI design; missing architecture; technical debt. This simple categorization of the bugs gives a team the ability to decide where to focus.
To estimate what really went wrong there needs to be a root cause analysis about how and why faulty code got developed and released:
- Requirements review
- Software design document review
- Test plan review
- Implementation review
- Quality control procedures review
The output of the RCA should guide you to what procedures should be enhanced (or were not followed, in which case → “why”) to prevent future occurrences of non-conformities.
Escaped bugs rate is another metric that should be tracked. It is expressed as a percentage based on how many the team found and prevented versus how many escaped.
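One possible way to express this rate is sketched below. The exact formula is an assumption of this example: it treats the rate as the share of escaped bugs among all bugs known for the period (found internally plus escaped).

```python
def escaped_bug_rate(found_internally: int, escaped: int) -> float:
    """Percentage of all known bugs that were found only after release."""
    if found_internally < 0 or escaped < 0:
        raise ValueError("bug counts cannot be negative")
    total = found_internally + escaped
    if total == 0:
        return 0.0  # nothing found and nothing escaped in this period
    return 100.0 * escaped / total


# Invented numbers: 45 bugs caught before release, 5 reported by customers.
rate = escaped_bug_rate(found_internally=45, escaped=5)
```

As with the other metrics, the absolute value matters less than how it moves from one release to the next.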
Cost of fixing bugs in production
There is a lot of nuance in this one! This metric can’t be estimated with high precision, as different defects require different solutions and there is a high probability they affect customers disproportionately. However, every organization must be aware of bug cost and track it over time so as to establish a trend line. Any increase could indicate issues with the SDLC process and must be investigated.
The cost of fixing a bug grows non-linearly the later in the process it is found. As an illustration, if it takes $10 to identify and fix a conflicting requirement during requirements gathering, it could take $100 during design, $1,000 during development, $10,000 in the test phase, and $100,000 in production.
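The escalation in this example follows a simple tenfold rule per phase, which can be written down as a small sketch. The phase names and the fixed multiplier are taken from the illustration only; real cost curves vary by organization and should be measured rather than assumed.

```python
PHASES = ["requirements", "design", "development", "test", "production"]


def cost_by_phase(base_cost: float, multiplier: float = 10.0) -> dict[str, float]:
    """Illustrative cost of fixing the same defect, depending on where it is found."""
    return {phase: base_cost * multiplier ** step for step, phase in enumerate(PHASES)}


costs = cost_by_phase(10)  # $10 when caught during requirements gathering
```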
What could be the components of cost for fixing bugs in production?
When a fatal flaw is affecting the ability of the end users to access services, the business is paying a very high price in brand damage, maybe even penalties in some niches. The team needs to react quickly to get back in business, which is never the most effective and efficient way to solve any issue. Even after the system is patched up to resume its operation, the team still needs to do an RCA and fix the problem where it actually occurred. During that time, the company isn’t adding new value to its product as the team is busy keeping the system afloat, and the competitors get the chance to catch up or even surpass it. Large companies tend to have engineers whose sole responsibility is reliability (SREs) and problem management (PMGs).
A few examples:
- A misinterpreted regulatory requirement could have implications for the entire system’s design and topology. Discovering this when the system is supposed to go live could damage the company’s reputation or even get its license to operate withdrawn.
- A missed requirement could render an entire feature unusable, and discovering this in the test or rollout phases of the SDLC means hundreds to thousands of man hours wasted.
- A mistake in the design could overlook that users’ requests might traverse the entire globe before returning to the user, thus rendering the feature useless due to high delays. Case in point: data related to trading platforms.
- A bug in the code could lead to the discovery of a whole set of edge cases that got overlooked in the requirements phase which might even put the project on hold and back to the drawing board.
Let’s say the cost of fixing a bug in production is $50,000 USD. What is included in that number?
- Lost profit
- PR damage
- Future penalties
- R&D to fix
- Lost opportunities
So, is that number good or bad? Is this the entire company’s margin for the year? The answer always depends on the business itself.
Bug density (a.k.a. defect density, not to be confused with escaped bugs) is a metric that is tracked over time, and it is a lagging indicator. Historically this has been calculated based on KLOC, which is proving to be a challenge in distributed systems. In modern distributed systems, the majority of bugs are a result of interaction between services. Individual services could be small and well tested, but their integration with other services is where things become tricky.
Based on the business, architecture, development methodology and SDLC this metric could be defined in terms of:
- How many of the change requests are legitimate changes versus how many are for defects.
- Bugs per sprint
- Bugs per feature
- Bugs per cycle
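Two of these definitions can be sketched as simple ratios: the historical bugs-per-KLOC form and the share of change requests that are defects. The function names and sample numbers are made up for illustration.

```python
def defects_per_kloc(defects: int, lines_of_code: int) -> float:
    """Classic defect density: defects per thousand lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return defects / (lines_of_code / 1000)


def defect_share_of_changes(defect_requests: int, legitimate_changes: int) -> float:
    """Percentage of all change requests that were raised because of defects."""
    total = defect_requests + legitimate_changes
    if total <= 0:
        raise ValueError("no change requests recorded")
    return 100.0 * defect_requests / total


density = defects_per_kloc(defects=12, lines_of_code=48_000)
share = defect_share_of_changes(defect_requests=15, legitimate_changes=45)
```

Bugs per sprint, feature, or cycle follow the same pattern: a defect count divided by whichever unit of work suits the team's methodology.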
By analyzing bugs and bug-density increase/decrease, one could improve a variety of steps in the SDLC process:
- definition of done
- test cases
- testing process
It is important to note that both an increase and a decrease in defect density should be investigated. Good practices should be extracted and introduced where applicable.
One may start by applying tags to bugs and tasks opened during a sprint such as (but not limited to):
- missing UI/UX design
- missing architecture
- missing requirements
- missing or poorly defined SLAs
- technical debt
Besides presenting a clearer picture on why a defect was reported, it is going to help in pinpointing issues with the process.
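A minimal sketch of how such tags could be tallied at the end of a sprint is shown below. The ticket structure and identifiers are hypothetical; in practice the data would come from an export of the team's issue tracker.

```python
from collections import Counter


def tally_tags(tickets: list[dict]) -> list[tuple[str, int]]:
    """Count how often each tag appears, most frequent first."""
    counts = Counter(tag for ticket in tickets for tag in ticket["tags"])
    return counts.most_common()


# Invented sprint data.
sprint_tickets = [
    {"id": "BUG-1", "tags": ["missing requirements"]},
    {"id": "BUG-2", "tags": ["missing requirements", "technical debt"]},
    {"id": "BUG-3", "tags": ["missing UI/UX design"]},
    {"id": "BUG-4", "tags": ["missing requirements"]},
    {"id": "BUG-5", "tags": ["technical debt"]},
]

ranking = tally_tags(sprint_tickets)
```

In this made-up sprint, requirements gathering would be the first place to look for a process improvement.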
Bug density should be correlated to: cyclomatic complexity, code churn, code coverage.
Mean time to fix red build
Red builds happen (the cause could be a compilation error, a failing test, or something else in the CI/CD pipeline); that’s part of the development cycle. There is nothing inherently bad about breaking the build or the main branch. If it never happens, there is probably a top secret main branch for integration. The need to measure this is not because it shouldn’t happen, but because of the need to understand how often it happens and how fast the team can resolve such issues. The reason for the broken build should also be tracked. This metric is no different from “mean time to recovery”, but it is focused on the build pipeline and not on the production software. If the team knows that they are being measured on how fast they identify and fix broken builds, they are going to be much more open to automation and CI/CD, including more rigorous standards for code quality and code coverage.
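One way this could be measured is sketched below, assuming the CI system can export a list of finished builds with a timestamp and a pass/fail flag. The data shape is hypothetical, and the sketch measures each red period from the first failing build to the next passing one.

```python
from datetime import datetime, timedelta
from typing import Optional


def mean_time_to_fix_red_build(
    builds: list[tuple[datetime, bool]],
) -> Optional[timedelta]:
    """Average duration from the first red build to the next green build."""
    durations = []
    red_since = None
    for finished_at, passed in sorted(builds):
        if not passed and red_since is None:
            red_since = finished_at  # the build just turned red
        elif passed and red_since is not None:
            durations.append(finished_at - red_since)  # back to green
            red_since = None
    if not durations:
        return None  # the build never went from red back to green
    return sum(durations, timedelta()) / len(durations)


# Invented build history for a single day: (finished at, passed?).
day = datetime(2024, 1, 1)
builds = [
    (day.replace(hour=9, minute=0), True),
    (day.replace(hour=10, minute=0), False),
    (day.replace(hour=10, minute=20), False),
    (day.replace(hour=10, minute=40), True),
    (day.replace(hour=13, minute=0), False),
    (day.replace(hour=13, minute=20), True),
]
mean_fix_time = mean_time_to_fix_red_build(builds)
```

The number of red periods per week or per release can be derived from the same data, which covers the "how often" half of the question.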
Code coverage is a leading indicator and, as such, it must be tracked together with a lagging one. It could be an important metric that shows if all code paths can be taken via the public interface of our entities (class/function/object/package/library/etc.). Engineering should strive towards 100% code coverage via integration/e2e tests, meaning all code paths are taken via calling the public interfaces and based on the product requirements.
What should be done if there isn’t 100% coverage?
- Are there test cases for all business requirements?
- Are there test cases for all use, abuse, and edge cases?
If the answers to the above questions are both “yes” and there isn’t 100% code coverage, it could mean the code does more than it should.
When coverage is lacking, the first thing to understand is whether the uncovered code can be removed:
- Can’t remove it → is it implementation specific, or does it result from an attempt to combine use cases not predicted by the business requirements?
- If it is implementation specific: refactor to remove it.
- If it is not implementation specific: ask the product team to specify the behavior in that situation and whether this is a valid business case or not, then amend the implementation to comply with the requirements (and add tests).
It is possible that an “implementation specific” detail can’t be refactored, which means that new cases are opened for the product team to consider and provide feedback on. Maybe they are valid for the business or maybe not.
Code coverage could be correlated with (but not limited to): escaped bugs, lead time, cycle time, number of production bugs, mean time to recovery, cost of fixing bugs in production.
Cyclomatic complexity is one of the most important leading indicators for the complexity of software. High complexity leads to unpredictable results and code behavior. Complex code is harder to cover with tests, as the number of test cases required grows together with complexity. This quantitative measure of the number of linearly independent paths through a program’s source code was developed by Thomas J. McCabe, Sr. in 1976.
One way to test by using this complexity estimate is basis path testing. Compared to code size in KLOC, code complexity is a better predictor for: quality of the codebase, cost of change, and time required to implement new requirements. While not always directly related, code complexity may also indicate poor design.
This metric could be a quality gate enforced by the CI/CD process with a specific acceptable complexity per function/method/codebase/etc. Since complexity correlates to unpredictability, there should be an on-going goal to reduce it.
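As an illustration of what such a gate might look like, the sketch below approximates cyclomatic complexity for Python functions by counting branching constructs with the standard `ast` module and adding one. This is a simplification, not McCabe's full graph-based definition (it ignores `match` statements, for instance), and the limit of 4 is an arbitrary value chosen for the example.

```python
import ast

# Constructs that add one independent path each (a simplifying assumption).
DECISION_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.IfExp, ast.ExceptHandler)


def approximate_complexity(func: ast.AST) -> int:
    """Decision points + 1; nested functions are counted toward the outer one."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def functions_over_limit(source: str, limit: int) -> dict[str, int]:
    """Names and scores of functions whose approximate complexity exceeds the limit."""
    offenders = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = approximate_complexity(node)
            if score > limit:
                offenders[node.name] = score
    return offenders


SAMPLE = '''
def classify(order):
    if order.total > 100 and order.is_member:
        return "priority"
    for item in order.items:
        if item.fragile:
            return "careful"
    return "standard"

def greet(name):
    return "hello " + name
'''

offenders = functions_over_limit(SAMPLE, limit=4)
```

A CI step could fail the build whenever `offenders` is not empty. Established tools exist for this purpose; the sketch only shows the principle.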
Cyclomatic complexity should be correlated with: lead time, cycle time, escaped bugs, mean time to recovery, mean time to fix red build, and cost of fixing bugs in production.
Code that is frequently changing should be re-evaluated. Code churn is a lagging indicator. A high frequency of change could point towards mixed presentation, business, and data access logic; too many responsibilities in the same source code (a violation of the single responsibility principle, SRP); or a violation of the open-closed principle (OCP).
How often is too often? If code is re-written within one release cycle of being merged (excluding planned extensions), it should probably be re-evaluated. This metric should be presented as a trend rather than a concrete number to aim for. If it goes up → why did it go up? If it goes down → why did it do that?
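A rough sketch of presenting churn as a trend is shown below. It assumes a list of merged changes, each labeled with the release it landed in, and counts the files touched more than once per release. The data shape and file names are invented, and the sketch makes no attempt to exclude planned extensions, which would need extra labeling.

```python
from collections import Counter


def churn_trend(changes: list[tuple[str, str]]) -> dict[str, int]:
    """Map each release to the number of files changed more than once in it.

    `changes` holds (release, file_path) pairs, one per merged change.
    """
    per_release: dict[str, Counter] = {}
    for release, path in changes:
        per_release.setdefault(release, Counter())[path] += 1
    return {
        release: sum(1 for count in counts.values() if count > 1)
        for release, counts in per_release.items()
    }


# Invented change log.
changes = [
    ("1.0", "billing.py"), ("1.0", "billing.py"), ("1.0", "auth.py"),
    ("1.1", "billing.py"), ("1.1", "billing.py"),
    ("1.1", "auth.py"), ("1.1", "auth.py"), ("1.1", "ui.py"),
]
trend = churn_trend(changes)
```

In this made-up log the count rises from one release to the next, which is the kind of movement that should prompt the "why did it go up?" question.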
Code churn in itself isn’t bad. Software applications are changing and evolving all the time. Reworking to accommodate new requirements is a valid case for changing the code.
What could code churn mean for the system?
- too much prototyping and POC
- perfectionism: code is working and performing within SLA, but developers are still not happy with how it looks or how clean it is
- bad design
- violation of the open-closed principle
- violation of the single responsibility principle
- too much responsibility put on the same piece of software
- many teams need to make adjustments to meet their goals
Summary on metrics and KPIs
Metrics could be used to influence behavior, thus choosing the correct combination of them is very important. A bad example would be for a software company to measure code coverage alone. This would inevitably lead to very high (sometimes even 100%) code coverage. Is that real coverage? Code coverage is a leading indicator and as such it must be paired with the correct lagging indicators to be useful. Correlating code coverage with escaped bugs and lead time could give insight into the efficiency of the tests that are being performed to cover the code, plus what kind of impact it has on the delivery.
People adjust based on how they are being measured. If development is measured in terms of lines of code or added new features those things would be added ad nauseam just to meet the definition of development. To what end though? Complexity is what kills software systems. As such it needs to be measured and managed.
For every behavior that you see in a team, the environment around them is perfectly designed for that behavior to happen in. To change the output, one needs to fine-tune the measurements. To continue with the code coverage example, if people are incentivized to find bugs, there will be more opportunities to increase code coverage with tests designed to test the code, rather than solidifying buggy code with tests designed merely to “cover it.”
All measurements must be based on a “why”: the desire to change X, which metrics can help follow through on.
Let’s address a simple conflict of interest software companies have:
- What is the best way to mitigate failures in production? → Once everything is stable, don’t deploy anymore.
- What do organizations want to do? → Deploy new content/features as fast as possible to stay ahead of the competition.
Often people are collecting metrics and analyzing them, but failing to act on them. It is in human nature to be passive and act only in reaction to something.
The previous section lists metrics that aid the measurement of the process. This section covers a few important quality gates and how they could help influence the team and process to produce better results with fewer bugs. Quality gates are not a panacea, and one would need to implement and fine-tune them in accordance with their own reality. They are not ordered in any particular way and are mostly about the design and development phases of the SDLC. This is by no means an exhaustive list!
Architecture review is related to the quality of the output of the SDLC. It is important because at that stage there is still time to discover and fix problems with the product at a relatively low cost compared to discovering them in the later phases.
Before any such review can really take place, there must be a software design document (SDD). To get consistently good results the architecture must be documented. The documentation itself should include (but not be limited to):
- project overview
- assumptions, sequences of execution
- services and interfaces, detailed designs where applicable
- scenarios and interactions
- NFRs: concurrency, testability, maintainability, extensibility, security, scaling, etc.
- assembly instructions — how to construct it and get it live, rollout and rollback plans
- POCs, sample code, experiments
Otherwise there would be nothing to review. Random pictures of boxes with arrows between them are not architecture! Moreover, once the architecture is reviewed, it needs to be presented to and signed off by the team who would be building it. If the team doesn’t have the necessary skills and competencies to execute it (for example, a Ruby on Rails solution for a .NET team), there are two main paths to take:
- Plan and execute training for the team to align with the technological needs of the project.
- Re-work and align the architecture with the teams who are going to build it.
Either one or a combination of both could be a valid outcome in different business niches.
After the development is done, there should be a step to verify whether the design was followed correctly and, if not, to find the answer to the question “why?” Then, fine-tune the process to avoid future discrepancies.
Effective code reviews are hard and one needs to establish a proper process for doing them!
- Keep reviews short. Send PRs for less than 500 LOC.
- Take your time when doing a review, read the requirements before reading the source. Review the test plan too.
- Try to fit the review within 1 hour. If it takes longer than that, it should probably be done together with the author. If the changes are that complex, it is worth considering refactoring.
- Authors should provide guidance and annotate the source code before the review. The authors’ reasoning on the decisions they made is important to conducting the review.
- Use checklists as a template for the code review. Define checklists for a variety of cases: security, quality, complexity, etc.
- Reviews are tough; keep a positive tone and don’t nitpick (there is a linter for that). If you see something you’d want to avoid, coordinate an update to the style guide with your team.
- Try to do on-the-fly code reviews during pair programming sessions or code walkthroughs.
Reviewing code is a process itself and, as such, it must be measured. Some useful metrics to gather: inspection rate, defect rate, bug density.
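A small sketch of how these review metrics might be computed follows. The definitions used here are assumptions of the example, since the text does not spell them out: inspection rate as lines reviewed per hour, defect rate as defects found per hour, and density as defects found per thousand lines reviewed. The `ReviewSession` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewSession:
    lines_reviewed: int
    hours_spent: float
    defects_found: int


def review_metrics(sessions: list[ReviewSession]) -> dict[str, float]:
    """Aggregate review sessions into the three metrics (assumed definitions)."""
    total_lines = sum(s.lines_reviewed for s in sessions)
    total_hours = sum(s.hours_spent for s in sessions)
    total_defects = sum(s.defects_found for s in sessions)
    if total_lines <= 0 or total_hours <= 0:
        raise ValueError("need at least one session with lines and time recorded")
    return {
        "inspection_rate_loc_per_hour": total_lines / total_hours,
        "defect_rate_per_hour": total_defects / total_hours,
        "defect_density_per_kloc": 1000 * total_defects / total_lines,
    }


# Invented review sessions.
metrics = review_metrics([
    ReviewSession(lines_reviewed=400, hours_spent=1.0, defects_found=4),
    ReviewSession(lines_reviewed=200, hours_spent=0.5, defects_found=2),
])
```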
Formal peer reviews are very important in cases of escaped bugs. In such cases, the review must be expanded to include:
- requirements review
- test plan review
- design review
- code review
The code review should not be performed for things that could be enforced automatically such as a style guide. In fact, strict and comprehensive coding standards are a necessity and should be enforced by the CI/CD process.
During a review, tests should be verified as relevant and not blindly covering code. At a high level, a test should be:
- connected to a business requirement
- clear on the impact of the broken code
- one failing test = one business problem
Unit tests as tools to reach high code coverage should be avoided as they often test implementation rather than business requirements. In such cases, it is very hard to understand if there is a false negative or a false positive result. If a team is measured on how many tests they write, they will write a lot of tests for the sake of writing tests. Tests created just to satisfy a code coverage metric are, at best, solidifying existing code implementation. It is advised to provide integration and/or end-to-end tests where applicable, that would treat the different components of the system as black boxes and verify the expected behavior and communication patterns between them.
Unit testing still has its uses when something needs to be validated in isolation. The goal of the test reviews is to make sure the tests are relevant and high quality.
Tests should cover known desired behavior and known undesired behavior too:
- use cases
- edge cases
- abuse cases
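The three kinds of cases can be illustrated with a deliberately tiny example. The `withdraw` function and its requirement are hypothetical; the point is that each test goes through the public interface and maps to a single business rule.

```python
def withdraw(balance: int, amount: int) -> int:
    """Hypothetical requirement: a withdrawal reduces the balance, may empty the
    account, and is rejected if it is not positive or exceeds the balance."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


def test_use_case_regular_withdrawal():
    # Known desired behavior.
    assert withdraw(100, 30) == 70


def test_edge_case_withdraw_entire_balance():
    # Boundary of the desired behavior.
    assert withdraw(100, 100) == 0


def test_abuse_case_negative_amount_is_rejected():
    # Known undesired behavior: a negative withdrawal must not add money.
    try:
        withdraw(100, -50)
    except ValueError:
        return
    raise AssertionError("negative withdrawal was accepted")
```

Because each test name states the business rule it protects, a failure points to one business problem rather than to an implementation detail.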
Things that shouldn’t be tested (exceptions are possible of course):
- Third party libraries
- OS level I/O
- Automatically generated code (example: bundling for the web, tsc emit)
Functionality should be tested and if that functionality is handled by a third-party there should be a layer of integration tests for this aspect of the system.
Guaranteeing quality is a tough job! It is everybody’s job. One can’t let others take care of quality on their behalf. A zero-tolerance mentality toward bugs is required across the whole organization to make this work. Zero tolerance doesn’t mean bugs are not happening. It means that when expectations are not met, the team won’t move forward before they understand where the system broke and why. Otherwise, there can be no assurance that the cause of the bugs observed now is not going to produce more problems in the future.
And don’t forget the daily smoke test!
You either own the quality or the lack of quality will own you!
- https://en.wikipedia.org/wiki/Cyclomatic_complexity
- https://www.amazon.com/Change-Anything-Science-Personal-Success/dp/0446573906
- SDD Wiki, SDD IEEE Standard
- Righting Software (Book)
- McCabe’s Cyclomatic complexity and Structured Testing: A Testing Methodology Using the Cyclomatic Complexity Metric
- TPS handbook
- Lagging and leading indicators
- Economic indicators
- Lead time
- MTBF, MTTR, MTTA, and MTTF
- Extreme programming explained
- Code quality standards
- Software quality
- Framework for managing system of systems