Most software products today are extremely complex, and testing updates comprehensively before pushing to production is a challenge. Applications are assembled from many independent pieces: multiple open source frameworks, a service mesh connecting services built by different teams, and multiple data stores holding state. Code paths depend on the data, which makes it hard to get adequate coverage with handcrafted data sets, and the growing complexity of applications makes manually creating and maintaining test cases and test data increasingly difficult. At the same time, the industry is moving towards faster release cycles. These two trends, growing complexity and shrinking release cycles, are placing new demands on the release process for ensuring quality.
When I was at Google on the Search Ads team, I saw them push updates frequently without ever disrupting the operations of that incredibly complex system. In contrast, at another organization I worked at, also with extremely complex software, I saw serious problems surface in production repeatedly following code pushes, sometimes a couple of days after the push. It wasn’t that they were pushing untested software; they were running into corner cases that are nearly impossible to find with traditional testing methods.
Common testing approaches today
Three approaches are commonly used today to validate new releases of software.
Conventional testing: This approach has served the software industry well for many years. Test cases and data are handcrafted, and the goal of each test is well understood. The limitations of the process are:
- Significant human effort is needed to create and maintain tests and test data
- Lack of coverage due to time and cost constraints, and growing complexity of applications
- Test coverage decays over time: as tests start failing, they are usually ignored or disabled
The following two approaches were developed in the context of consumer internet applications to address the limitations of handcrafted test cases.
Comparing outcomes for the same traffic: This approach runs production traffic through both the production system and a test setup in parallel, and compares the outcomes from the two. Twitter’s Diffy is an open source tool that provides this capability. It is a resource-intensive approach, and I will cover it in a separate article.
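The core of the mirror-and-compare idea can be sketched in a few lines: replay the same request against both deployments and report any fields whose values differ, while skipping fields that are expected to vary between runs (Diffy calls this noise). This is a simplified illustration, not Diffy’s actual implementation, which is a Scala service that cancels nondeterministic noise by comparing two production instances; the hardcoded `ignore_fields` set below is a stand-in for that noise filtering.

```python
def diff_responses(primary, candidate,
                   ignore_fields=frozenset({"timestamp", "request_id"})):
    """Recursively compare two JSON-like responses and return a list of
    (path, primary_value, candidate_value) tuples for every difference."""
    diffs = []

    def walk(a, b, path):
        if isinstance(a, dict) and isinstance(b, dict):
            for key in sorted(set(a) | set(b)):
                if key in ignore_fields:
                    continue  # expected to differ between runs; not a defect
                if key not in a:
                    diffs.append((path + key, "<missing>", b[key]))
                elif key not in b:
                    diffs.append((path + key, a[key], "<missing>"))
                else:
                    walk(a[key], b[key], path + key + ".")
        elif a != b:
            diffs.append((path.rstrip("."), a, b))

    walk(primary, candidate, "")
    return diffs

# Same request sent to both versions; only the real regression is reported.
prod = {"total": 42, "items": [1, 2], "timestamp": 1700000001}
cand = {"total": 43, "items": [1, 2], "timestamp": 1700000009}
print(diff_responses(prod, cand))  # [('total', 42, 43)]
```

In a real deployment the two response dicts would come from HTTP calls to the production and candidate services, and differences would be aggregated across many mirrored requests to separate systematic regressions from one-off noise.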
Testing during canary: Both the current production version and the new version are deployed in the production environment. The bulk of the live traffic goes to the production version, while a small fraction goes to the new version. The outcomes from the new version are observed for some period of time before deciding whether to roll the software out to a broader set of users. Observation includes looking at the logs for errors and comparing metrics that proxy for the functional correctness of the software. For example, you can look at user metrics and decide whether the new version behaves close enough to the production version from the users’ perspective. The exact process and metrics should be selected based on the application.
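In practice the traffic split is handled by a load balancer or service mesh (Istio, for example, supports weighted routing between versions), but the essential mechanism is easy to sketch: hash a stable request attribute into a bucket so that each user is consistently assigned to one version. The function name and the 5% default below are illustrative assumptions, not a specific product’s API.

```python
import hashlib

def choose_version(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically assign a user to 'canary' or 'production'.

    Hash-based assignment is sticky: the same user always lands on the
    same version, so their experience is consistent during the canary.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "canary" if bucket < canary_fraction else "production"

print(choose_version("user-42"))  # same answer on every call for this user
```

A stable cryptographic hash is used instead of Python’s built-in `hash()`, which is randomized per process and would break stickiness across servers and restarts.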
What is common across these two new testing approaches?
- The test model shifts from being test case oriented to identifying differences between the production version and the new release using production traffic. One approach (Diffy) mirrors the production traffic, and compares the outputs. The other approach (canary) splits the production traffic, and relies on metrics to identify shifts in the user behavior as a result of the changes in the software.
- Both methods shift the source of truth from handcrafted test cases to production traffic.
- Both do away with the conventional notion of test cases in favor of actual usage patterns.
Can canary replace functional testing?
The canary-based approach has several key benefits. It allows the entire application to be exercised with real traffic, and the results can be observed through the logs and appropriate metrics in near-real-time. This can reduce or completely eliminate the need for traditional testing processes.
Canary can almost always be used to test for system level issues. However, for many applications, functional correctness must be established before starting a canary. The scope of canary as a partial or complete replacement for functional testing must be determined on an application by application basis. Some of the factors to consider in the decision process include:
- Volume of traffic: You need sufficient traffic to get meaningful results in a relatively short amount of time, yet you do not want to expose a significant fraction of your traffic to the new code until you have confidence in the new release. Relatively few enterprise applications have enough traffic to run a meaningful canary for establishing functional correctness metrics.
- User tolerance for errors: How sensitive are your users to functional errors in the application? If the users are browsing a news feed and a few stories go missing as a result of a defect, they are unlikely to notice or care. However, if your users are sensitive to the correctness of the software, for example when viewing financial transactions, then you probably do not want to canary code whose functionality has not already been verified.
- Read-only vs read/write workload: Applications with primarily read-only functionality are better suited to canary. If the software performs write operations and there are defects in the code path, it can corrupt the database, and any rollback will likely involve extensive corrective steps to repair the data.
- Metrics: Establishing the right metrics to determine whether the new version is functionally correct is not simple. Failures are easy to detect in a canary environment using any of the popular log browsing tools, but establishing functional correctness requires a set of metrics that can surface potential problems by identifying differences in user behavior between the production version and the test version. The metrics differ widely across applications, and the appropriate analytical approaches must be developed for each. Google perfected such analytical approaches to achieve predictable rollouts, but not all applications are amenable to this kind of analysis.
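As one concrete example of such a guardrail metric, a standard two-proportion z-test can flag when a rate (error rate, click-through rate) in the canary diverges from production by more than sampling noise would explain. This is a textbook statistic offered as an illustration, not Google’s actual rollout analysis; real systems typically combine many such checks and correct for repeated looks at the data.

```python
import math

def two_proportion_z(success_a: int, total_a: int,
                     success_b: int, total_b: int) -> float:
    """Z statistic for the difference between two proportions,
    e.g. error rate in production (a) vs the canary (b).
    |z| > 1.96 indicates a difference significant at the 5% level."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    # Pooled proportion under the null hypothesis that the rates are equal.
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_a - p_b) / se

# Example: 0.5% errors in production vs 1.2% in the canary.
z = two_proportion_z(50, 10_000, 12, 1_000)
print(abs(z) > 1.96)  # True: the canary's error rate is significantly worse
```

Note that the canary side has far fewer samples than production, which is exactly the small-traffic problem described above: with too little canary traffic, the standard error grows and even real regressions fail to reach significance quickly.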
Canary was developed to automate the continuous deployment (CD) process, not to address functional testing. The consumer internet giants have successfully leveraged the canary process to test their releases effectively and efficiently. Whether you can use canary as a substitute for conventional functional testing depends on your application. If your application does not meet the criteria above, traditional testing is still required to ensure the functional correctness of the software before it is canaried.
Testing for functional correctness remains a critical requirement for most software release processes. The growing complexity of modern software, coupled with rising quality expectations and the increasing demands for velocity, is posing challenges for the conventional human-intensive testing processes. These are still open problems that need more effective and cost-efficient solutions.
As microservices adoption grows within enterprise apps, and engineering teams adopt CI/CD processes to push out updates at a rapid pace, it becomes important for enterprise app developers to achieve the efficiencies of software validation that the consumer giants have developed. We are developing such a solution at Mesh Dynamics, which enables teams to achieve high velocity and high quality without having to invest significant engineering effort.