If All You Know Is Testing, Every Bug Is a Test Gap

Heemeng Foo · Published in The Startup · Jan 25, 2021

A test professional’s take on “if all you have is a hammer, every problem looks like a nail”


What made me write this

One of the things I do at work is attend a weekly engineering post-mortem for high-severity production issues, lending my experience in assessing whether there were gaps in testing that could have prevented the issue. As any experienced engineer knows, not every production issue could have been avoided "if we just had that test at pre-deploy", nor is it always cost-effective to implement that test (by cost, I include the opportunity cost of delaying the release of the feature). At one such session, the team was struggling to find the test gap, and I stepped in to explain that the gap was not in testing but in monitoring.

That incident brought me back to all the times that, as a Test/QA/QE Manager, I have had to confront this situation: when a bug showed up in production, I was asked "where was the test gap?". While well-intentioned, this question leads the test team to adopt a defensive posture: OK, so there was an "escapee"; that means the set of test cases is incomplete; that means we have to add the missing case.

The problems with this are the following:

  • The reality is that the process of developing test cases is inherently incomplete and fundamentally flawed. (That is not to say it is not useful: lots of tools we use, e.g. CPI or GDP, are flawed measures but nevertheless useful.) I explain this in "Code reviews and unit tests: the bedrock of software quality" [1]. In a nutshell, there is no guarantee your test cases are complete, because there is no guarantee that the requirements the test cases are based on are complete.
  • This defensive posture, and the subsequent adding of new test cases, leads to "Test Case Hoarding". In "Why test suites fail — and how to avoid breakdowns" [2], Erik Fogg writes about what happens when this strategy is taken to its logical conclusion: (a) test suites whose results contain so many false positives that they become meaningless, and (b) a pile of test cases so large that nobody wants to update it, so it becomes less and less useful as a source of truth over time.

The root cause of this approach is a lack of awareness of the other aspects of engineering and of which one is best suited to each problem. In "What every test engineer needs to know about Observability" [3], I related my own experience of how an engineering team had used telemetry signals and CI/CD to test and validate fixes, and how much that had impressed me.

Hence the point of this article: I hope that as test engineers, you will expand your horizons and learn more about other aspects of engineering so that we can achieve better software quality. We need to enlarge our toolbox, or at least gain a decent understanding of these tools, so that we can pick the most appropriate one for each task.

"What? We have to learn so much? We already have exploratory testing, load testing, performance testing, mobile testing, security testing, testing-in-production etc. to learn, and you're advocating we pick up dev and DevOps skills as well?" Well, yes. But to be fair, I also advocate that developers pick up testing (see [4]). Also, as I described in [3], cloud-based SaaS tools are getting easier and easier to configure and use, and there are a ton of resources online, e.g. Stack Overflow and Medium, from which you can quickly learn how to work with these capabilities.

In this article, I list some of the other skills I think are useful for all test engineers to pick up, and very briefly explain why. These are:

  1. Development
  2. System architecture and design
  3. Design for testability
  4. Observability
  5. Synthetic monitoring
  6. Alerting
  7. Chaos engineering

Development

I have always been a strong advocate of the following: (a) test engineers are engineers first, with a specialization in test; (b) if you're going to test something, you need to know how to build it first. In [4] I echo these same principles. By understanding how the product is built, and the people and processes building it, we are better able to see around corners and advise our teams on potential issues that may crop up. Indeed, I encourage test engineers to take a sprint or two to work on a small feature to get a sense of how the dev team works. It will be very enlightening.

System architecture and design

As a test engineer you should have a decent grounding in sound system architecture and design. Some examples include client-server architecture, microservices architecture and MVC (Model-View-Controller) architecture. You should also take a healthy interest in how common systems are designed and built, e.g. a simple tinyURL service, a mobile app, a social media application like Facebook, or even a file-sharing service like Dropbox, as well as in the technologies currently used to build them, e.g. NoSQL databases, RDBMSs, containers, caching systems, object stores and serverless (i.e. lambda) technologies.

The rationale for this is the same as for Development. One of the biggest values you bring to the team as a test engineer is your ability to ask the question “what could possibly go wrong?”. By understanding the architecture and overall system design you will then be well equipped to ask the right questions and that leads to better risk identification.

In this age of the cloud, the barrier to entry is really low: (a) you can get an account on AWS, GCP or Azure and fiddle around with these services (but watch your running costs!), and (b) most cloud providers invest heavily in sandboxes and self-paced tutorials with hands-on labs so you can pick up the key aspects of their offerings.

Design for testability

You would think that, with the root word "test" in the name, test professionals would be experts in this, but no. In my years in the QA/QE/Test space I have hardly heard it mentioned. In essence, design for testability means that each component in the system is testable at its most basic level, and that each level of composition of those components is itself testable. This calls for good design, so that components are highly cohesive and have low coupling with other components.

One good example I can think of (and, to my horror, a pretty common occurrence) is how web or mobile clients are built. The typical anti-pattern is business logic tightly coupled to UI code: in iOS that means business logic spread across ViewControllers (or, in Android, across Activities). This makes it really difficult to write stable test code, because you have to test functionality through the UI, and that leads to flaky tests. The better way is to abstract out the business logic so that it is testable on its own, and then build a mock of it for testing the UI separately, as in the sketch below.
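Here is a minimal, language-agnostic sketch of that idea (the names CartLogic and CheckoutScreen are hypothetical, not from any particular framework): the business logic lives in a plain class with no UI dependencies, so it can be unit-tested directly and swapped for a fake when testing the UI layer.

```python
# A minimal sketch (hypothetical names) of decoupling business logic from UI code.
# The same principle applies to ViewControllers, Activities, or web components.

class CartLogic:
    """Pure business logic: no UI dependencies, so it is unit-testable in isolation."""

    def total(self, prices, discount_pct=0):
        subtotal = sum(prices)
        return round(subtotal * (1 - discount_pct / 100), 2)


class CheckoutScreen:
    """Thin UI layer: delegates all calculations to the injected logic object."""

    def __init__(self, logic):
        self.logic = logic  # inject the dependency so a fake can replace it in UI tests

    def render_total(self, prices):
        return f"Total: ${self.logic.total(prices):.2f}"


# Unit test for the logic, with no UI involved:
def test_total_applies_discount():
    assert CartLogic().total([10.0, 20.0], discount_pct=10) == 27.0


# UI test with the logic mocked out, so the test cannot flake on business rules:
class FakeLogic:
    def total(self, prices, discount_pct=0):
        return 42.0

def test_render_total_formats_amount():
    assert CheckoutScreen(FakeLogic()).render_total([1.0]) == "Total: $42.00"
```

Because CheckoutScreen only touches the logic through an injected object, the UI tests never depend on pricing rules, and the pricing rules never depend on a running UI.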

Observability

In a nutshell, Observability is the ability to infer the inner workings of a system by looking at its outputs, typically via monitoring tools and logs.

As I explained in the earlier article "What every test engineer needs to know about Observability" [3], there is no way we can test every single code path and data combination; it is combinatorial madness. Once our code is in the wild, i.e. in production, users will uncover flows and data combinations through our systems that cause problems. Hence, rather than contorting the brains of the test team to come up with extremely contrived test scenarios, it is far more efficient to add monitoring and logging to detect such scenarios when they happen and fix the issues accordingly.
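A minimal sketch of what that instrumentation can look like is below. The function name, events and the emit_metric stand-in are hypothetical; a real system would use a metrics/tracing library such as OpenTelemetry or statsd, but the idea is the same: surprising inputs surface in production telemetry instead of having to be enumerated in a test plan.

```python
# A minimal sketch of instrumenting a code path so unexpected flows show up in telemetry.
import json
import logging
import time

logger = logging.getLogger("orders")
logging.basicConfig(level=logging.INFO)

def emit_metric(name, value, tags=None):
    # Stand-in for a real metrics client (statsd, OpenTelemetry, etc.); here we just log it.
    logger.info(json.dumps({"metric": name, "value": value, "tags": tags or {}}))

def process_order(order):
    start = time.monotonic()
    try:
        if order.get("quantity", 0) <= 0:
            # Log the surprising input rather than trying to predict it in a test case.
            logger.warning(json.dumps({"event": "invalid_quantity", "order": order}))
            emit_metric("orders.rejected", 1, {"reason": "invalid_quantity"})
            return False
        # ... fulfil the order ...
        emit_metric("orders.accepted", 1)
        return True
    finally:
        emit_metric("orders.latency_ms", (time.monotonic() - start) * 1000)

process_order({"id": "A-1", "quantity": 0})  # shows up as a warning plus a rejected-order metric
```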

Synthetic Monitoring

This is right up any test engineer's alley. Synthetic monitors are essentially scripts that mimic a user path through a system, exactly like an end-to-end test. But instead of running the script once per test pass, it is run every few hours or even minutes in order to reduce the MTTD (Mean Time To Detect) of a failure. Most APM and Observability tool-sets include this feature, e.g. Dynatrace, New Relic, AppDynamics. Because the script runs so frequently, special care must be taken to ensure it is not flaky and does not create false positives.
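As a rough illustration, here is a bare-bones synthetic check. The URL, the /dashboard path and the 5-minute interval are hypothetical; in practice the APM suite handles scheduling, geographic distribution and alert routing for you, and you would add retries and generous timeouts so transient blips do not page anyone.

```python
# A minimal sketch of a synthetic check: a scripted user journey run on a schedule.
import time
import urllib.request

BASE_URL = "https://example.com"   # hypothetical system under test
CHECK_INTERVAL_SECS = 300          # run every 5 minutes to keep MTTD low

def home_and_dashboard_journey():
    # Step 1: home page reachable
    with urllib.request.urlopen(f"{BASE_URL}/", timeout=10) as resp:
        assert resp.status == 200, f"home page returned {resp.status}"
    # Step 2: a key user path (hypothetical dashboard page) also reachable
    with urllib.request.urlopen(f"{BASE_URL}/dashboard", timeout=10) as resp:
        assert resp.status == 200, f"dashboard returned {resp.status}"

if __name__ == "__main__":
    while True:
        try:
            home_and_dashboard_journey()
            print("synthetic check passed")
        except Exception as exc:
            # In a real setup this would notify the alerting system / on-call rotation.
            print(f"synthetic check FAILED: {exc}")
        time.sleep(CHECK_INTERVAL_SECS)
```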

Alerting

These are systems that integrate with your APM, monitoring and logging systems to send out alerts based on rules and thresholds. Most also provide ways to schedule your on-call rotation. Probably the most popular of these tools is PagerDuty, but there are also paid alternatives, e.g. AlertOps and ZenDuty, and free/open-source ones such as Cabot and OpenDuty.
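To make "rules and thresholds" concrete, here is a tiny sketch of how such a rule might be evaluated. The rule, the window of five samples and the notify() stub are all hypothetical; in reality the alerting tool evaluates rules like this and routes the page to whoever is on call.

```python
# A minimal sketch of a threshold-based alert rule (hypothetical rule and values).
from statistics import mean

RULE = {"metric": "checkout.error_rate", "threshold": 0.05, "window": 5}  # last 5 samples

def notify(message):
    # Stand-in for a PagerDuty/Slack/webhook integration.
    print(f"ALERT: {message}")

def evaluate(rule, samples):
    window = samples[-rule["window"]:]
    if len(window) == rule["window"] and mean(window) > rule["threshold"]:
        notify(f"{rule['metric']} averaged {mean(window):.2%} over the last {rule['window']} samples")

evaluate(RULE, [0.01, 0.02, 0.08, 0.09, 0.07, 0.06])  # fires: mean of last 5 samples > 5%
```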

Chaos engineering

In [6], I briefly mentioned the term "Dark Debt". It shows itself when, due to business needs, systems become very complex (sound familiar?) and their interactions expose defects that do not show up in the individual components or in the design. As John Allspaw explains in [7]:

“The challenge of dark debt is a difficult one. Because it exists mainly in interactions between pieces of the complex system, it cannot be appreciated by examination of those pieces. After anomalies have revealed the relationships they appear obvious but the appearance is mainly hindsight bias.”

In other words, you only know of its existence when it shows up, often in catastrophic ways. Have we seen this somewhere before? Yes: in warfare. War is chaotic; warfare is complex, with many parameters, many of them unknown. How do armed forces around the world deal with this complexity and expose problems in their SOPs, organizational structures, weapon systems or leadership? They frequently conduct exercises, i.e. simulated warfare, and the more realistic, the better. It is never foolproof, and nothing beats a determined enemy to force you to change your approach (see [8]), but it is the best we have.

Similarly, for complex systems, the way to expose such hidden problems is to subject the system to simulated but unusual circumstances. This is where Chaos Engineering comes in. [5] describes it as:

“Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

By introducing additional load, latencies, aberrant payloads etc., we push the system to expose flaws in production (not in some sandbox). A lot of folks balk at doing this in a production environment, but it is exactly the production environment that needs to be exercised. Some ways to address the risk include setting aside an off-peak preventive-maintenance window and performing the experiments on a "blue" environment (as in blue/green deployment, see [9]).
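At its simplest, a chaos experiment injects a fault and checks that a steady-state hypothesis still holds. The sketch below is hypothetical application-level code (function names, latency range and SLO are made up for illustration); real tools such as Gremlin or Chaos Monkey inject these faults at the infrastructure level instead.

```python
# A minimal sketch of one chaos experiment: inject latency into a dependency call
# and check whether a steady-state hypothesis (requests complete within the SLO) holds.
import random
import time

def flaky_dependency():
    # Chaos injection: add 0-2 seconds of artificial latency to the downstream call.
    time.sleep(random.uniform(0.0, 2.0))
    return "ok"

def handle_request(timeout_secs=1.5):
    start = time.monotonic()
    result = flaky_dependency()
    elapsed = time.monotonic() - start
    # Steady-state hypothesis: the request still completes within the (hypothetical) SLO.
    return result == "ok" and elapsed <= timeout_secs

# Run the experiment a few times and report how often the hypothesis held.
outcomes = [handle_request() for _ in range(20)]
print(f"steady state held in {sum(outcomes)}/{len(outcomes)} requests")
```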

As for tools, Gremlin maintains a very good list.

Conclusion

At the beginning of this article I mentioned the age-old adage: if all you have is a hammer, everything looks like a nail. I sincerely hope that as test engineers we do not limit ourselves to just that one domain, but embrace engineering as a whole. I hope this article encourages you to take the time to pick up new technologies and skills and become better at what you do.

References

[1] Code reviews and unit tests: the bedrock of software quality, Heemeng Foo, Sept 2020, Medium, https://medium.com/dev-genius/unit-tests-and-code-reviews-the-bedrock-of-software-quality-9a23cd24558b

[2] Why Test Suites Fail — and how to avoid breakdowns, Erik Fogg, Apr 2020, ProdPerfect blog, https://prodperfect.com/blog/end-to-end-testing/why-test-suites-fail-and-how-to-avoid-breakdowns/

[3] What every test engineer needs to know about Observability, Heemeng Foo, Dec 2020, Medium, https://medium.com/swlh/what-every-test-engineer-needs-to-know-about-observability-654f757f6622

[4] Who should do software testing? Dev or Test?, Heemeng Foo, Jun 2020, Medium, https://medium.com/dev-genius/who-should-do-software-testing-dev-or-test-41c7ea39ee83

[5] Principles of Chaos Engineering, https://principlesofchaos.org/

[6] You can’t fix quality just by catching bugs, Heemeng Foo, Oct 2020, Medium, https://medium.com/swlh/you-cant-fix-quality-just-by-catching-bugs-ddc01d900474

[7] Dark Debt, John Allspaw, Nov 2018, Medium, https://medium.com/@allspaw/dark-debt-a508adb848dc

[8] Team of teams: new rules of engagement for a complex world, Gen. Stanley McChrystal et al, May 2015, Portfolio

[9] Blue/Green Deployment, Martin Fowler, Mar 2010, martinfowler.com, https://martinfowler.com/bliki/BlueGreenDeployment.html
