Building high-quality software

Published in Nerd For Tech · 8 min read · Jul 16, 2021

I have interviewed many people (both engineers and managers) lately, and one of the standard questions I ask is how to build high-quality software. Of course, I provide more context and explanations, but the gist is the same.

I heard all kinds of answers. However, I was puzzled that almost none were systematic, and people immediately jumped to a specific pet peeve.

As part of this exercise, I felt that I had to crystallize my answer to this question and write it down.

Let me start with high-level thoughts (specifically to make it systematic). First of all, I want to concentrate on software code quality (vs. larger topics, including problem definition, documentation, UX, design, etc.).

High-quality software is software that works as expected and has fewer bugs (and a shorter tail of fixing remaining issues). There are a bunch of other things like code readability, maintainability, debuggability, and so on, which can easily be swept under the quality umbrella. Let’s concentrate on the core: the product operates as expected.

I visualize the software development process as a pipeline going from idea to the product used by customers. There are other ways to imagine it. However, the bottom line is that it goes through multiple steps to get to the usable/delivered product.

And the way to get high quality is reasonably straightforward. We need to catch issues as the work goes through these steps. (Yep, Captain Obvious is reporting for duty.)

I don’t believe that there is one magical approach that will capture all problems. As a result, we need to have defense in depth. We need to have multiple gates through this pipeline which should gradually filter out issues. The more gates you have, the higher the probability that you end up with fewer issues.

The gates that catch problems earlier in the process are better because it is cheaper to fix them earlier. Automated gates are better than manual ones. Blocking gates that prevent you from moving to the next stage are more effective than gates that sit on the side. Gates which catch a higher percentage of problems are better too.
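To make the defense-in-depth argument a bit more tangible, here is a minimal sketch of the arithmetic. It assumes that every gate catches a given defect independently of the others (a simplification, since real gates overlap), and the catch rates are hypothetical numbers picked purely for illustration, not measurements.

```python
from math import prod


def escape_probability(catch_rates):
    """Probability that a single defect slips past every gate.

    Assumes each gate catches the defect independently with the given
    probability. Real gates overlap, so treat this as a rough model.
    """
    return prod(1.0 - rate for rate in catch_rates)


# Hypothetical catch rates, chosen only for illustration.
gates = {
    "design review": 0.30,
    "unit tests": 0.50,
    "code review": 0.40,
    "manual testing": 0.40,
}

if __name__ == "__main__":
    remaining = 1.0
    for name, rate in gates.items():
        # Each gate removes its share of whatever is still left.
        remaining *= 1.0 - rate
        print(f"after {name:<15} {remaining:.1%} of defects remain")
```

Under these assumptions, no single gate is impressive on its own, yet each added gate multiplies down the share of defects that reach the customer.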

Ok. All of the above is a very generic/vague description useful for a theoretical book about quality. However, it is useless without specifics.

Let me get down to the real stuff applicable to most of the bread-and-butter software that companies build.

Must have list

Start as early as possible

It’s better to add these gates as early as possible. It’s much better to build your process around quality checks than to retrofit these checks into an existing process. NIST did classic research to show that catching bugs at the beginning of the development process could be more than ten times cheaper than if a bug reaches production. If you start catching bugs early, it will save you tons of time fixing them later.

Design review

It’s a very powerful tool when used in a good way. It sits at the very beginning of the process before the code is written and can save an immense amount of time down the road (of somebody spending tons of time just to get to a dead-end). It really helps to talk through the problem, the solution, alternative ideas, corner cases, and so on. I really like what one of the smartest people with whom I worked said: “A good design is a design where you can see the code”. It’s like working with the code without writing it.

Unfortunately, I know multiple very senior engineers who really like to go with the “fire, aim, ready” approach. Let’s put together a prototype (even before thinking about different alternatives), let’s call this prototype an alpha version, and fix bugs and limitations in it for years to come. Saving several hours by skipping the preparation and the design review will cost hundreds (if not thousands) of hours of fixing issues down the road.

Unit tests

I can’t believe that I have to say this in 2021, but I have never seen a quality product without unit tests. Period. There are so many benefits. Unit tests help to prove that your code does what it should do, and they remove the simple problems. They help to get rid of a lot of flaky behavior, and this list goes way beyond catching bugs. Yeah, they are not a silver bullet, but they can easily catch a very high percentage of all your bugs.
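For readers who want a picture of what such a gate looks like, here is a small sketch using Python’s standard `unittest` module. The `apply_discount` function and its rules are hypothetical, invented only for this example. The point is that the tests pin down the typical case, the boundaries, the rounding, and the invalid input.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, rounded down to whole cents.

    Hypothetical function used only to illustrate unit testing.
    """
    if price_cents < 0:
        raise ValueError("price must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(1000, 25), 750)

    def test_boundaries(self):
        self.assertEqual(apply_discount(1000, 0), 1000)
        self.assertEqual(apply_discount(1000, 100), 0)

    def test_rounds_down(self):
        self.assertEqual(apply_discount(999, 50), 499)

    def test_rejects_invalid_input(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 101)
        with self.assertRaises(ValueError):
            apply_discount(-1, 10)


if __name__ == "__main__":
    unittest.main()
```

Tests like these run in milliseconds, so they can act as a blocking, automated gate on every change.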

Code review

Again, nothing new here. Somebody looking at your code and saying “WTF?” is a great way to see where your code is overly complex, brittle, or doesn’t handle some scenarios. An important note: as with any non-automatic check, you get as much from it as you invest (rubber-stamping PRs won’t add any value).

Monitoring

We (humans) are terrible at imagining all possible permutations of a system with billions and billions of possible states. All of our testing (both unit tests and integration tests) covers a tiny sliver of all states. And, unfortunately, the only place where you can see everything that can happen is production.

It’s incredible how many people entirely ignore it. You may think that you know how the system works. In the best case, you know only how the system was designed to work. Many more complex and subtle problems emerge only in production and could be caught via monitoring/alerting/analysis.

This is probably the newest addition to my list. Like everything else on this list, I had to learn it the hard way. After several outages which could have been prevented by trivial monitoring/alerting/analysis, you start treating your monitoring as a first-class citizen.
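By “trivial monitoring/alerting” I mean something on the level of the sketch below: track the outcome of the most recent requests and raise a flag when the error rate crosses a threshold. The class name, the window size, and the threshold are all assumptions made for illustration. In practice you would most likely rely on your monitoring stack rather than hand-rolled code.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the most recent request outcomes and flags a high error rate.

    Hypothetical helper: the window size and threshold are illustrative defaults.
    """

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        # True means the request failed; old entries fall off automatically.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Wait for a full window so one early failure does not page anyone.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.error_rate() > self.threshold
```

A caller would invoke `record(...)` after every request and check `should_alert()` periodically. Even a check this crude notices a failure mode that no test anticipated.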

Manual testing

Yep. I said it. We live in a time when everybody is irked by manual testing. I tend to agree that you don’t want to spend tons of time doing only manual testing. However, it’s a must-have for most products to work well. Automated testing catches predicted problems but is almost useless for unpredicted issues.

There were so many times when one of the best QA people I worked with came to me saying something like: “I don’t know. It works, but there is something funky in there”. This sentence is not a binary test result, and if it were reported by some automated tool, people would easily ignore it as a false positive. However, as soon as I hear it from this QA person, it raises a huge red flag.

Root cause

You don’t need to analyze each tiny bug. However, as soon as you have some severe bugs escaping, you need to figure out whether you need to beef up one of the gates (which should have caught it) or whether you need to introduce additional gates to detect such types of bugs.

Nice to have

Static code analysis (and similar tools)

The effectiveness of these tools depends a lot on the language and the tool. The beauty of them is that they are completely automatic and, as a result, very cheap. There are some languages (like C++) where this should be on the must-have list. Other languages may be harder to handle with such tools.

End-to-end (integration) tests

Some level of integration testing is helpful to see that your system works as a whole. However, it’s useful as a seasoning for unit tests and not as the main dish.

It’s great to have maybe one or two end-to-end tests for some major features. However, it’s not a unit test. You can’t cover everything, and more importantly, supporting it will cost you, so you don’t even want to try to cover everything.

Anti-Patterns

Excessive manual regression testing

On the one hand, I understand where it’s coming from. As the company’s customer base grows, the impact of bugs becomes more significant. As a result, there is a desire to catch all regression bugs. However, excessive regression testing usually shows a lack of other gates that catch the problems. As a result, there is an overemphasis on the last regression verification.

End-to-end tests as a replacement for unit tests

As a counter-reaction to manual regression testing, which takes more and more time, companies will try to replace it with excessive automated end-to-end tests. Unfortunately, this happens especially often for code with poor quality and low unit test coverage. It almost always ends up being a costly endeavor (even more expensive than regression testing), resulting in many very fragile tests that fail left and right.

I saw a company that tried to retrofit quality like that and created a set of 8000 copy/paste end-to-end tests. Last time I heard, about 80% pass and 20% fail on each run. This 20% is more or less ignored because trying to analyze 1600 failed tests is pretty much impossible. They are rerun (in the best case), thus defeating the whole purpose of the exercise while spending tons and tons of time/money/energy on it.
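A bit of back-of-the-envelope arithmetic shows why a suite of that size stops being a useful signal. The sketch below assumes each test fails spuriously and independently with the same probability. That is my simplification, not how that company measured anything, and the flake rate in the example is hypothetical.

```python
def green_run_probability(num_tests: int, flake_rate: float) -> float:
    """Chance that every test passes when each one flakes independently."""
    return (1.0 - flake_rate) ** num_tests


def expected_false_failures(num_tests: int, flake_rate: float) -> float:
    """Average number of spurious failures somebody has to triage per run."""
    return num_tests * flake_rate


if __name__ == "__main__":
    suite_size = 8000  # the suite size from the story above
    flake_rate = 0.001  # hypothetical: each test flakes once in a thousand runs
    print(f"chance of a fully green run: {green_run_probability(suite_size, flake_rate):.4%}")
    print(f"spurious failures per run:   {expected_false_failures(suite_size, flake_rate):.0f}")
```

Under these assumptions, even a tiny per-test flake rate makes a fully green run rare at this scale, so people learn to ignore red runs long before the failure rate reaches 20%.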

Manage quality purely via metrics

Making high-quality products requires a lot of attention to detail (understanding where the problems are, the best way to catch them, where the strong spots are, and so on). Metrics abstract you away from all details. You can gauge metrics fast, but you can’t (read: shouldn’t) make a decision purely based on them.

To be honest, this concentration on metrics boggles my mind. I saw a company spending a nontrivial amount of time gathering all these statistics, asking people to constantly fill out a gazillion JIRA fields, Google spreadsheets, and so on, just to say at the end, “This component is in good shape, and this one is in bad.” The funny thing is that any SRE working in the company for more than a year could have provided this info in 10 minutes without wasting the time of half of the engineering team.

BTW. A side note. As soon as some process (like gathering metrics) becomes a goal (vs. being a tool), you will see more of these time-wasting activities with little or no output.

Summarizing. As you can see, nothing is magical, and very little is unconventional here. However, as I mentioned initially, the thing I see missing in a lot of these discussions is this systematic analysis: defense-in-depth, choosing the proper gates, being retrospective and detail-oriented. And what is even more sobering is that many companies have very few people who have a clear mental model for building high-quality software.

P.S. The list above is obviously not exhaustive. It covers high-level items which can be easily plugged into the development process and applied to the whole team. There are tons and tons of different practices which can improve quality on a personal level (e.g., TDD, thinking through edge cases, code conciseness, and so on).

P.P.S. If you enjoyed this article, please follow me on Medium or subscribe via email.

Written by Victor Ronin