The Pain of Test Data Creation and Maintenance

Mar 21, 2019

Critical functionality and business logic in many enterprise applications depend on data to appropriately guide the experience of end users. The infrastructure for testing such enterprise applications relies heavily on test databases. Many egregious customer-facing bugs (both functionality and performance) surface when test data does not comprehensively simulate scenarios encountered in production. Such bugs, when undetected, erode customer experience and trust.

When I worked at companies developing enterprise applications, I apologized to our customers and users on several occasions for our inability to prevent a serious regression. I have talked to several engineering managers, and they also agonize over releasing regressions. As a result, the testing and release process was a heavy lift requiring almost half the engineering effort. Engineering teams had to create comprehensive test databases to validate all the data scenarios they might encounter. Because truly comprehensive test data is impossible to create, we would still encounter bugs after release and could only hope that they were not very serious.

Common Approaches Today & Challenges

Test database preparation and maintenance are costly and take significant human effort. According to a study conducted by IBM, managing, maintaining, and generating test data account for 30%-60% of the time engineers spend on testing. There are two common approaches for generating test data, and software development teams currently must invest a significant amount of effort with either of them.

Test data could be manually created. This is the most common approach where engineers handcraft the data to support the test cases appropriately.

Handcrafting test data is time consuming. Engineers must spend a lot of time to create appropriate datasets for testing. Further, data changes and additions made to support a change in one test can impact other tests in unforeseen ways, so engineers must iterate through several tests before being done.
Despite the effort, handcrafting results in limited coverage, partly because engineers don’t have complete visibility into all data characteristics seen in production or into the real usage of the application. Further, engineers don’t have the time and motivation needed to persist with this effort.

Test data could be a snapshot of production data. This is a good approach but has the following challenges.

It is resource-intensive. First, execution of tests is often parallelized to complete within a reasonable amount of time. Hence, we need multiple independent test environments, each with its own copy of the production data. Hosting multiple large production data snapshots for testing is costly. Second, significant DevOps tooling and maintenance are needed to instantiate snapshots on demand for testing.
Even when organizations take the trouble of making production data available for testing, the test strategies are often not aligned to take advantage of the large scale and continuously updated data. Unless the overall test strategy utilizes the changing production data snapshots, engineering teams won’t reap the benefit of using production data snapshots for testing.
Sensitive/PII data in production often prevents engineers from obtaining access to production data snapshots because of privacy and confidentiality constraints. Unless such sensitive data is cleansed and obfuscated, teams won’t even be able to access production data snapshots.

Test Data should be Derived from Production Usage

We should all derive and refresh test data from production usage of an application. Leveraging production usage and data for testing will significantly improve validation of software, leading to higher quality and velocity. Any implementation enabling this approach will, however, have to satisfy the following requirements.

Resource efficient: The implementation should consume only a fraction of the resources needed to host live production database snapshots. Otherwise, hosting multiple copies of production data snapshots to run tests at scale would be extremely costly.
On-demand: Test databases should be available on demand for use by any developer, so they can easily run tests on their branch for quick and comprehensive validation.
Scalability: The implementation should scale to many developers and to parallel execution of large numbers of test cases.
Privacy preserving: We should automatically cleanse and obfuscate sensitive/PII data so the organization does not violate privacy regulations such as GDPR and CCPA.
Continuously refreshing: Testing must adapt to changes in application usage so we can protect against interruptions in usage and deterioration in customer experience. Therefore, the derived test database should refresh to reflect changes in usage of the application.
Simple tooling: We should not require significant additional DevOps tooling for the implementation and maintenance over time.
Integration with existing tools and workflows: Current developer tools and release pipelines have evolved significantly. It is important that any tools and infrastructure should integrate seamlessly with current SDLC and developer workflows.
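
To make the privacy-preserving requirement a little more concrete, here is a minimal sketch of one possible way to obfuscate sensitive fields while deriving test rows from production rows. It is an illustration under stated assumptions, not a description of any particular product: the column names in `PII_COLUMNS`, the helper names `pseudonymize` and `scrub_row`, and the sample rows are all hypothetical, and keyed hashing (HMAC-SHA256 from the Python standard library) is just one of several obfuscation techniques a team might choose.

```python
import hashlib
import hmac

# Hypothetical column names treated as sensitive; a real schema would differ.
PII_COLUMNS = {"email", "full_name", "phone"}


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable token for a sensitive value using a keyed hash."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    # Truncated only to keep tokens short in this illustration.
    return digest.hexdigest()[:16]


def scrub_row(row: dict, secret_key: bytes) -> dict:
    """Copy a row, replacing non-null values in PII columns with tokens."""
    return {
        column: pseudonymize(str(value), secret_key)
        if column in PII_COLUMNS and value is not None
        else value
        for column, value in row.items()
    }


if __name__ == "__main__":
    # Assumed example key; a real key would come from a secret store.
    key = b"example-key-keep-the-real-one-secret"
    production_rows = [
        {"id": 1, "email": "alice@example.com", "plan": "pro"},
        {"id": 2, "email": "alice@example.com", "plan": "free"},
    ]
    for row in production_rows:
        print(scrub_row(row, key))
```

Because the same input and key always produce the same token, duplicate values and joins on the obfuscated column keep their shape in the test database, which matters when bugs depend on data relationships. Keyed hashing is pseudonymization rather than full anonymization, so whether it is sufficient for a given compliance regime is something each organization would need to confirm.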

Who this is Most Important for

We should derive and refresh test data from production usage of an application. Is this approach only useful for teams that build applications on large-scale databases and release new versions at a rapid velocity? Which types of software development teams would benefit from such an ability?

At one of my previous companies, we released a new version per quarter. The application was driven by a small (relative to current standards) single-tenant database. Even in this context, comprehensive test databases were critical to ensuring robust releases. We could have achieved higher quality and higher velocity at a lower cost if we had been able to derive and refresh test data from production usage of our application. Therefore, we believe that such a strategy and tool should benefit any team developing an enterprise application:

Driven by a rich database
Whose users are not tolerant of errors. That is, the cost of releasing a bug in software is high because it disrupts end-user flows and erodes users’ experience and trust in the development team.

Motivated by this critical need, we are working on a solution at Mesh Dynamics.

Written by Venky Ganti