All about test data

Team Merlin
Government Digital Services, Singapore
Apr 14, 2023

Test data is the data needed to test software systems. It is like the ingredients of a sumptuous meal: high-quality ingredients raise the standard of the end product.

Image Credit: Unsplash

Having the right test data is an integral part of software quality assurance, helping to ensure the accuracy, reliability, and consistency of software products. During testing, relevant test data needs to be available to simulate the different test scenarios effectively.

Quality Test Data

Poor-quality test data can lead to inaccurate test results, which can let critical defects go undetected and cause financial or reputational losses for the organisation. It is therefore essential that the test data used is of high quality so that the test results can be trusted. Note that there are two broad groups of test data: data that already exists in the application, and data used as test case inputs. Depending on the nature of the system, a combination of both types is usually needed.

Now, you may wonder how to determine if the test data is of good quality.

Test data should be accurate and representative of real-world scenarios. As the objective of test data is to simulate real-life behaviour in the test environment, it should be as realistic as possible. For instance, a user registration page collects an address and a date of registration. Using a single address and date for all tests is unrealistic, because in real life issues may arise from the interaction of records with different addresses and registration dates.
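
As a minimal sketch of the registration example, the snippet below builds records with varied addresses and registration dates using only the Python standard library. The street names, field names, date range, and the make_registrations helper are all hypothetical and only serve to illustrate the idea.

```python
import random
from datetime import date, timedelta

# Hypothetical sample values; a real project would draw on its own domain data.
STREETS = ["Ang Mo Kio Ave 3", "Jurong West St 41", "Tampines St 21", "Bukit Timah Rd"]


def make_registrations(count, seed=42):
    """Return `count` registration records with varied addresses and dates."""
    rng = random.Random(seed)  # a fixed seed keeps the data reproducible across runs
    start = date(2020, 1, 1)
    records = []
    for i in range(count):
        records.append({
            "user_id": i + 1,
            "address": f"Blk {rng.randint(1, 999)} {rng.choice(STREETS)}",
            # Spread registration dates over roughly three years from the start date.
            "registered_on": start + timedelta(days=rng.randint(0, 3 * 365)),
        })
    return records


registrations = make_registrations(50)
```

Seeding the random generator is a deliberate choice here: the data is varied, yet a failing test can be reproduced with exactly the same records.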

Test data should be up to date, reflecting the current state of the system being tested so that the test cases using it are relevant at the point of testing. Outdated data in which newer fields are missing or empty may cause tests to miss bugs related to those fields.
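
One way to spot outdated records is to check each one against the fields the current system expects. The sketch below assumes a hypothetical schema in which preferred_language is a newly added field; the field names and the find_stale_records helper are illustrative only.

```python
# Hypothetical current schema: "preferred_language" stands in for a newly added field.
REQUIRED_FIELDS = {"user_id", "address", "registered_on", "preferred_language"}


def find_stale_records(records, required_fields=REQUIRED_FIELDS):
    """Return (index, missing_fields) for each record with a missing or empty required field."""
    stale = []
    for index, record in enumerate(records):
        missing = sorted(
            field for field in required_fields
            if record.get(field) in (None, "")
        )
        if missing:
            stale.append((index, missing))
    return stale


sample = [
    {"user_id": 1, "address": "Blk 1 Example St", "registered_on": "2023-01-05",
     "preferred_language": "en"},
    # This record predates the new field, so it would not exercise it.
    {"user_id": 2, "address": "Blk 2 Example St", "registered_on": "2021-07-19"},
]
stale = find_stale_records(sample)
```

A check like this could run as part of a periodic test data review to flag records that need refreshing.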

In addition, test data should be of sufficient volume to ensure that all system functionalities are tested thoroughly. What counts as sufficient depends largely on the complexity of the system: a more complex system may require more test data to cover its multiple flows. In general, the test data should be able to cover all the identified test case scenarios in one execution cycle. Real-world data is vast and contains all kinds of combinations and edge cases, so test data should likewise attempt to cover as many permutations as possible, subject to the complexity, risk requirements, and timeline of the application under test.
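
To get a feel for how quickly permutations add up, the sketch below enumerates every combination of a few input dimensions with itertools.product. The dimensions and their values are hypothetical examples for a registration form.

```python
from itertools import product

# Hypothetical input dimensions for a registration form.
DIMENSIONS = {
    "account_type": ["individual", "corporate"],
    "address_kind": ["local", "overseas"],
    "registered": ["today", "leap_day", "ten_years_ago"],
}


def all_combinations(dimensions):
    """Return one dict per combination of the given dimension values."""
    names = list(dimensions)
    return [dict(zip(names, values)) for values in product(*dimensions.values())]


cases = all_combinations(DIMENSIONS)  # 2 * 2 * 3 = 12 combinations
```

The number of combinations is the product of the sizes of each dimension, which is why the full set usually has to be trimmed according to risk and timeline.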

Generating Test Data in the System

There are a few ways of generating the test data that exists in the system: manual creation, automation, and importing data from the production environment. The method is chosen according to the testing needs. For a completely new system without any data yet, it is recommended to identify and generate a base set of test data that covers the various flows and data types in the system.

Manual data generation involves creating test data by hand directly in the application under test, which is generally time-consuming and error-prone. It suits smaller, less complex systems that do not need much test data. It is also a quick way to set up specific edge cases for investigation.

Automation involves using automated test data generation tools to create test data without any UI interaction (i.e. test data is fed into the system via test automation scripts). More efficiently, test data can be created, restored, or seeded directly in the test environment database, bringing the environment to a state that is ready for testing. Libraries such as fakerjs can be used to include randomised values in the test data automatically. This method is also suitable for quickly populating a completely new system that does not have any data yet.
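
As a rough sketch of seeding a database directly, the snippet below resets and fills a table with randomised rows. It uses Python's built-in sqlite3 and random modules rather than fakerjs so that it runs without extra dependencies. The in-memory database, the users table, its columns, and the seed_users helper are assumptions made for illustration.

```python
import random
import sqlite3


def seed_users(conn, count, seed=7):
    """Reset the hypothetical `users` table and fill it with randomised rows."""
    rng = random.Random(seed)  # fixed seed so every run produces the same data
    conn.execute("DROP TABLE IF EXISTS users")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    rows = [
        (i, f"user_{rng.randrange(10**6):06d}", rng.randint(18, 90))
        for i in range(1, count + 1)
    ]
    conn.executemany("INSERT INTO users (id, name, age) VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # an in-memory database stands in for the test environment
seed_users(conn, 100)
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Because the function drops and recreates the table, calling it again restores the same known state, which is the "restore or seed" idea described above.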

There is a limit to the test data that we can create on our own. When production-equivalent test data is needed in terms of diversity and volume, importing data from the production environment is one possible approach. It involves copying real-world data into a test environment, and depending on your needs, this can be the full production dataset or just a subset. A major consideration for this method is the extra effort needed to mask sensitive real-world data in order to protect it. Another caveat is that this method only works for a system that is not completely new, since a new system does not have any live data yet.
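
Masking can take many forms. The sketch below shows one simple option, replacing sensitive values with salted hash tokens so that the same input always maps to the same token. The record, field names, salt, and helper functions are made up for illustration, and the masking actually required should follow your organisation's data protection rules.

```python
import hashlib


def pseudonymise(value, salt):
    """Replace a sensitive value with a stable token derived from a salted hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_record(record, sensitive_fields, salt):
    """Return a copy of `record` with the listed sensitive fields masked."""
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field) is not None:
            masked[field] = pseudonymise(str(masked[field]), salt)
    return masked


# Made-up row standing in for a record copied from production.
production_row = {"id": 17, "name": "Tan Ah Kow", "email": "ahkow@example.com", "plan": "premium"}
masked_row = mask_record(production_row, ["name", "email"], salt="test-env-salt")
```

Stable tokens keep relationships between records intact (the same email is masked the same way everywhere), while non-sensitive fields such as the plan are left untouched for testing.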

Managing Test Data

During the test planning phase, it is recommended to define test data requirements by identifying the test data needed. With this information, a suitable test data generation method can be chosen and incorporated into the testing cycle.

Automated test data generation steps can be integrated with DevOps processes to generate test data on-demand or refresh test data when needed. Test data can be set to be generated automatically before the automated test suite is run, or via a trigger whenever test execution is needed.
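
As a small illustration of generating test data automatically before a suite runs, the sketch below uses the setUpClass hook of Python's unittest. The generate_test_data function and the order records are hypothetical stand-ins for a real generation step, which might seed a database or call a pipeline job instead.

```python
import unittest


def generate_test_data():
    """Hypothetical generation step; a real one might seed a database instead."""
    return [{"order_id": i, "status": "pending"} for i in range(1, 4)]


class OrderTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once before any test in this class, so the data is always fresh.
        cls.orders = generate_test_data()

    def test_orders_start_pending(self):
        self.assertTrue(all(o["status"] == "pending" for o in self.orders))

    def test_three_orders_available(self):
        self.assertEqual(len(self.orders), 3)


# Load and run the suite programmatically, as a pipeline step might do.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(OrderTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The same generation function could also be called from a separate pipeline step, which matches the trigger-based option mentioned above.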

Periodic reviews should ideally be scheduled to ensure that test data for new features is made available and that outdated data is replaced or retired. There is no hard rule on the review period as each project is unique, but it is generally a good idea to revisit the test data used and needed whenever a major change or new feature is implemented, or at least once a year, to avoid surprises.

Keeping the required test dataset as lean as possible lowers maintenance effort and cost. The key is to balance the volume of data against the scenarios it needs to cover, so that there is a sufficient but not overwhelming amount of test data. Moreover, the test environment is usually sized smaller than production to save cost, and may struggle to handle large amounts of test data, especially when data accumulates over time without maintenance.

It is a good idea to ring-fence test data for different test cases so that tests can run independently and in parallel with consistent results. This prevents other test cases or testers from changing the test data assigned to specific tests and thereby affecting the intended test scenarios. Also, portions of the test data can be identified and segregated so that they are generated on an as-needed basis for tests that focus on specific modules of the application. In this way, time and effort are saved by not regenerating the full set of test data when only a subset is required.
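
One simple way to ring-fence data, sketched below, is to give each test its own deep copy of a shared template with a name unique to that test. The template, its fields, and the ring_fenced_copy helper are hypothetical. In a real system the same idea would apply to database rows or accounts reserved per test.

```python
import copy

# Hypothetical shared template; tests never receive this object directly.
BASE_ACCOUNT = {"username": "alice", "balance": 100, "tags": ["active"]}


def ring_fenced_copy(test_name, template=BASE_ACCOUNT):
    """Return a private deep copy of the template, keyed to one test."""
    data = copy.deepcopy(template)
    data["username"] = f"{template['username']}__{test_name}"
    return data


withdraw_data = ring_fenced_copy("test_withdraw")
close_data = ring_fenced_copy("test_close_account")

# Changes made by one test do not leak into another test or into the template.
withdraw_data["balance"] -= 40
withdraw_data["tags"].append("flagged")
```

A deep copy is used rather than a shallow one so that nested values, such as the tags list, are not shared between tests.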

Test data is crucial to ensuring quality in a development team because it enables complete test coverage of the various test scenarios. It is therefore important for the team to plan and implement a process that makes adequate test data available whenever test cases need to be executed. With a more structured approach to managing test data, testing will certainly be smoother and more effective.

We hope the above information is insightful to you. Every project is different and the suggestions are certainly non-exhaustive and never a one-size-fits-all solution. Do reach out and share with us the approaches that work in your project. 😊

🧙🏼‍♀Team Merlin 💛
Application security is not any individual’s problem but a shared responsibility.
