Gaining and retaining customer trust is critical for SaaS businesses to control churn. Many factors contribute to gaining or losing that trust, and software quality is one of the most important. Poor software quality, especially regressions in workflows that end users already rely on, quickly erodes customers' trust in a vendor's ability to meet their needs. Engineering teams therefore invest significant time and effort in maintaining suites of automated and manual tests to detect regressions before release. Despite all these processes, customers routinely report regressions after each update.
At one of my previous companies, we rigorously followed the traditional test-and-release process for software updates. We wrote extensive functional tests to ensure that future development would not cause undetected regressions. In one instance, we pushed an update that broke a workflow at one of our important customers due to a severe undetected regression. We promptly launched a fire drill, and a small team of seasoned engineers spent more than a week investigating and reproducing the issue. The root cause was that new feature development had introduced a regression in a feature I had worked on several months earlier: the newer combination of data parameters was not protected by our tests. With this regression, we lost part of that customer's trust and satisfaction.
Almost all software regressions are the result of test gaps, often "hidden" because engineers and product managers did not anticipate all use cases and parameter combinations. With current testing processes, where test scenarios are handcrafted, it is practically impossible to protect every aspect of the software that end users actually rely on.
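To make the "hidden test gap" concrete, here is a minimal, hypothetical sketch (the function and its parameters are invented for illustration): handcrafted tests cover the parameter combinations the team thought of, while an untested combination silently goes unprotected.

```python
# Hypothetical subscription-pricing function used to illustrate a hidden test gap.
def compute_discount(plan: str, annual: bool) -> float:
    """Return the discount rate for a subscription order."""
    if plan == "enterprise" and annual:
        return 0.20
    if plan == "enterprise":
        return 0.10
    if annual:
        return 0.05
    return 0.0

# Handcrafted tests protect the combinations the team anticipated...
assert compute_discount("enterprise", annual=True) == 0.20
assert compute_discount("team", annual=False) == 0.0

# ...but no test covers ("team", annual=True). A future refactor can break
# that branch without failing the suite -- the gap stays hidden until a
# customer who uses exactly that combination hits the regression.
```

The gap is invisible precisely because nothing in the test suite fails; only production usage exercises the missing combination.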
Further, product usage changes as the product and user base evolve, which can expose problems in existing code that were not encountered or anticipated earlier. Yet no one ever goes back to proactively add more tests for an old feature.
Net-net, such known and hidden test gaps cannot be avoided with current processes, and they cause significant losses to businesses, both in direct costs and through loss of customer trust.
- The cost of bugs found by customers is very high for enterprise software. Engineers must be diverted from their primary focus of creating new business value to investigate and fix customer issues. Often, a good chunk of engineering bandwidth is spent on root-causing and fixing regressions.
- Beyond engineering time, customer success managers spend substantial time managing customers' frustration over regressions, and customers themselves must spend valuable time helping reproduce issues. All of this results in a poor customer experience and erodes users' trust.
What is a good solution/process to prevent such ticking time bombs?
The goal of functional regression testing is to protect existing functionality and usage as the application evolves with new feature development. It is therefore not required to have comprehensive automated tests in place before a new feature is released, but it is critical to significantly increase the feature's regression test coverage before future development can introduce regressions. Given this requirement, there are two possible approaches, discussed below.
Note that, independent of the approach, test coverage rarely gets close to 100%, because of a) the lack of time to enumerate all cases and automate them, and b) the difficulty of anticipating the data characteristics the application will encounter in production.
Test Upfront: Comprehensive Testing before Release
The first approach is to have developers, product managers, and testers spend sufficient time imagining all possible user flows, scenarios, and corner cases, and protect them with automated test cases. The goal is to prevent future changes from disrupting any of the existing feature set. However, this approach suffers from the following issues.
- Slower development velocity: Teams have to spend much more time testing each new feature before release. Often, new features are postponed to subsequent releases simply because there are not enough testing resources.
- Cultural resistance: Developers and product managers dislike spending large amounts of time anticipating test scenarios and cases. Combined with the trend of asking developers to automate all tests, they would have to spend even more time on testing, so this approach requires a significant cultural shift.
Moreover, the upfront comprehensive testing approach does not fit current agile development processes, where we launch features and iteratively improve them based on user feedback.
Release and Catchup: Increase Test Coverage after Release
A more commonly adopted approach is to release with a limited set of tests and then increase automated test coverage after release. This approach has the following issues.
- Resource contention between testing new and old functionality: Maintaining the discipline to continuously devote a large share of engineering effort to closing test-coverage gaps in prior functionality is difficult when there is constant pressure to develop and test new functionality. Engineers and managers tend to focus on new development, so the majority of testing effort also follows newer features rather than the older feature set.
- Increased hidden test gap: Given the resource constraints, product teams often limit themselves to enumerating only a limited set of scenarios. This increases the “hidden” test gap and the risk of regression.
Desired Automatic Backfilling Capability for Catchup
What would be a desired super ability to make the Release and Catchup process work effectively for us?
We need the ability to automatically backfill all test gaps, known and hidden, with zero or very limited engineering resources. This ability would free us from having to make hard choices between spending precious engineering effort on increasing coverage for old functionality and building new business value.
We will achieve this super capability soon
I sincerely believe that the following trends will combine to enable this capability.
- Enterprises are very interested in optimizing their processes with AI. Given the importance and massive cost of current testing processes, we will see many AI-driven efforts targeting software testing.
- The migration of applications to service mesh architectures enables standardized approaches to capturing application and service traffic. AI techniques can then learn from the captured data.
- Advances in AI technology make it possible to mine large amounts of data efficiently.
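One way captured traffic can backfill test gaps is record-and-replay: requests and responses observed in production become regression checks for the next version. The sketch below is a simplified illustration of that general idea, not the Mesh Dynamics product or API; all names in it are invented.

```python
# Hypothetical record-and-replay sketch: captured production traffic
# becomes an automatic regression suite for a new service version.

def record(traffic_log, request, handler):
    """Capture a request/response pair while the current version serves traffic."""
    response = handler(request)
    traffic_log.append({"request": request, "response": response})
    return response

def replay(traffic_log, new_handler):
    """Replay captured requests against a candidate version; report any drift."""
    regressions = []
    for entry in traffic_log:
        new_response = new_handler(entry["request"])
        if new_response != entry["response"]:
            regressions.append((entry["request"], entry["response"], new_response))
    return regressions

# v1 serves traffic, which is recorded as it flows through the mesh...
log = []
v1 = lambda req: {"total": req["qty"] * 10}
record(log, {"qty": 3}, v1)

# ...then a candidate v2 is checked against the recorded traffic.
v2 = lambda req: {"total": req["qty"] * 10 + 1}  # introduces a regression
assert len(replay(log, v2)) == 1
```

Because the "test cases" come from real usage rather than handcrafted scenarios, coverage tracks what end users actually do, including the parameter combinations no one thought to enumerate.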
At Mesh Dynamics, we are developing a solution that enables teams to achieve both high velocity and high quality without having to invest significant engineering effort.