Test Data Management

Piyush Johar
Published in Globant
6 min read · Sep 11, 2020

Introduction

Today, time to market plays a critical role in making or breaking a software solution. Any product or application, be it for personal use or for an enterprise, can lose its competitive edge and market share if it is not launched at the right time. And while chasing that speed, the quality of the solution cannot be compromised. Having the required test data for the application under test (AUT) therefore becomes pivotal and one of the key factors in assuring software quality.

Importance Of Test Data & Need For Test Data Management

Historically, application teams created data for testing in a siloed, unstructured fashion, and that approach was sufficient at the time.

Today, applications and solutions are more complex: they integrate with multiple third-party systems and draw on multiple data sources across a range of interconnected systems. With data entry required at multiple layers, there is far greater emphasis and pressure on testing.

Further, in enterprise-level applications, any defect or production failure can be devastating. It can not only lead to financial losses but also badly damage the enterprise’s image and the trust it has earned. In cases where a software failure impacts regulatory compliance, companies can be subject to severe financial penalties. So having the “right” test data becomes essential in supplementing the testing effort to ensure the highest level of success, and this is where test data management (TDM) can be adopted.

TDM is the process of collecting or creating, and then managing, the data that teams use for testing. The process should ensure that the data provisioned is suitable, of the right quantity, in the required format and, more importantly, available at the appropriate time.
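
As a small illustration of that last point, the sketch below (Python, purely illustrative) checks a provisioned CSV data set for the required columns and a minimum row count before it is handed over to the test team. The file name, column names and threshold are assumptions, not part of any specific TDM tool.

    import csv

    # Illustrative expectations for a provisioned test data set (hypothetical values).
    EXPECTED_COLUMNS = {"customer_id", "name", "email", "signup_date"}
    MIN_ROWS = 100  # enough rows to cover the planned test cases

    def validate_test_data(path: str) -> None:
        """Check that a provisioned CSV data set has the required format and quantity."""
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"Data set is missing required columns: {missing}")
            row_count = sum(1 for _ in reader)
        if row_count < MIN_ROWS:
            raise ValueError(f"Only {row_count} rows provisioned, expected at least {MIN_ROWS}")
        print(f"{path}: format and volume checks passed ({row_count} rows)")

    validate_test_data("customers_test_data.csv")  # hypothetical file name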

Benefits Of TDM

Having a robust and comprehensive TDM process in place can be a critical business enabler. Below are a few of the benefits it brings:

  • Faster test data preparation.
  • Better test data quality.
  • Wider test coverage.
  • Quicker data refresh.
  • Repeatable data quality.
  • Lower test data preparation costs.
  • Secure and compliant test data.
  • Cost reduction by finding the defects early.

The Starting Point

Understanding the requirements is the single most important enabler of collecting the best possible, high-quality test data. Input from business users and business analysts can further help channel this collection effort. Additionally, having insight and clarity around the following helps:

  • The business relevance of data being collected.
  • The amount of data needed. Collecting too much might lead to test inefficiency, while having too little might make testing ineffective.
  • At what point the data will be needed.
  • If the data can be reused.
  • Data that would need to be masked, if being cloned from the production environment.

Standard test design techniques like boundary value analysis and equivalence class partitioning can further help optimize and refine this collection effort.
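
For example, the minimal sketch below shows how boundary value analysis and equivalence class partitioning translate a hypothetical business rule (“age must be between 18 and 65”) into a compact set of test data values; the rule and the class representatives are assumptions chosen purely for illustration.

    # Hypothetical business rule: "age" must be between 18 and 65 (inclusive).
    MIN_AGE, MAX_AGE = 18, 65

    # Boundary value analysis: values at and just around the boundaries.
    boundary_values = [MIN_AGE - 1, MIN_AGE, MIN_AGE + 1, MAX_AGE - 1, MAX_AGE, MAX_AGE + 1]

    # Equivalence class partitioning: one representative value per class.
    equivalence_classes = {
        "below_range": 10,  # invalid: below the minimum
        "in_range": 40,     # valid: a typical accepted value
        "above_range": 70,  # invalid: above the maximum
    }

    test_ages = sorted(set(boundary_values) | set(equivalence_classes.values()))
    print(test_ages)  # [10, 17, 18, 19, 40, 64, 65, 66, 70]

A handful of well-chosen values like these usually gives better coverage than a large volume of near-identical records.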

Knowing The Data

With a good understanding of the requirements, the next step is to define a systematic approach to collecting the test data. The TDM process should ensure that the data being collected does not end up as a heap of scattered, unrelated data. As this framework is built, a good understanding of the following data attributes can aid the definition stage:

Data’s target consumption area: where will the data be used?

  • Will it be consumed at the environment level, say at the base system configuration level (operating system, application server, databases)?
  • Is it the baseline/default data, i.e. the minimum prerequisite data that has to be available in the application by default?
  • Is it the input data needed to exercise the application’s behaviour, whose observed output is compared against the expected result to conclude whether the application is correct?

What is the source of this data?

  • Will it be from a production environment?
  • Will the data be artificially created/synthetic data?

Will the data need to maintain relationships or adhere to constraints?

  • Will it need to fall within a certain range or boundaries, e.g. min and max values?
  • Will it need to belong to a certain domain and adhere to a specific format?

Other data attributes could be:

  • Its expected size/length.
  • Its age/period.
  • Its accuracy.
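
Pulling these questions together, the answers could be recorded per data element. The sketch below is one illustrative way of capturing such attributes in Python; the field names and example values are assumptions, not a standard schema.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TestDataAttributes:
        """Illustrative record of the attributes discussed above for one data element."""
        name: str
        consumption_area: str                  # "environment", "baseline" or "input"
        source: str                            # "production" or "synthetic"
        min_value: Optional[int] = None        # lower boundary, if any
        max_value: Optional[int] = None        # upper boundary, if any
        format_pattern: Optional[str] = None   # e.g. a regex the value must adhere to
        max_length: Optional[int] = None       # expected size/length
        max_age_days: Optional[int] = None     # how old the data may be

    # Example: a hypothetical "order_amount" input field.
    order_amount = TestDataAttributes(
        name="order_amount",
        consumption_area="input",
        source="synthetic",
        min_value=1,
        max_value=10_000,
        max_length=7,
    )

Capturing these attributes up front also makes it easier to decide later whether a value should be cloned from production or generated synthetically.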

Building The Data Set

With good insight into the data’s consumption area and its attributes, the next step is to build the test data sets. Depending on the requirements, the data could be built in one of these ways:

  • From the production environment:
    This source is often preferred for the majority of test data needs, as it is the most authentic data, representing real-world characteristics. However, while using it, the tester needs to ensure that the data set is properly sanitized to minimize the risk of data security breaches. A comprehensive understanding of the data set and the business domain establishes the criteria needed to properly optimize and mask the data, and ensures compliance with regulations such as GDPR, PCI DSS and HIPAA (whichever are applicable); a masking sketch follows this list. The data from production can typically be obtained by:
    1. Cloning the production data.
    2. Selecting data on a sampling basis.
    3. Copying a specific functional subset from production.
  • Artificial synthetic data:
    This is useful where the production data sets do not contain the values of interest for a test, when working with a new system, or when the data fields have no historical or existing production equivalents. The use of synthetic data is considered best practice for unit or white-box testing. For integration, end-to-end, or other complex types of testing, relying on synthetic data can be time-consuming and not very cost-effective. There are exceptions, though: this kind of data can be the best source in scenarios where production data samples are not available or are impractical to obtain.
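
As an illustration of the masking mentioned for production-sourced data, the sketch below samples rows from a production-like extract and de-identifies the sensitive columns before they reach the testers. The column names and masking rules are assumptions; a real project would normally apply the rules mandated by the applicable regulations, often through a dedicated masking tool.

    import hashlib
    import random

    def mask_record(record: dict) -> dict:
        """De-identify the sensitive fields of one production record (illustrative rules)."""
        masked = dict(record)
        masked["name"] = "Test User"
        # Hash the email so related records still join on the same masked value.
        masked["email"] = hashlib.sha256(record["email"].encode()).hexdigest()[:12] + "@example.com"
        # Keep only the last four digits of the card number.
        masked["card_number"] = "****-****-****-" + record["card_number"][-4:]
        return masked

    def sample_and_mask(production_rows: list[dict], sample_size: int) -> list[dict]:
        """Select a sample of production rows and mask them before use in testing."""
        sample = random.sample(production_rows, min(sample_size, len(production_rows)))
        return [mask_record(row) for row in sample]

Hashing the email rather than blanking it keeps masked values consistent across tables, which helps preserve referential integrity in the resulting test data set.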

In scenarios where the volume of test data required is low, the data can be built directly by hitting the APIs or the database.
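
A minimal sketch of such low-volume seeding, using SQLite purely for illustration; the table, columns and rows are hypothetical and would be replaced by the AUT’s actual database or API calls.

    import sqlite3

    # Low-volume seeding directly against the database (illustrative only).
    connection = sqlite3.connect("aut_test.db")
    connection.execute(
        "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    seed_rows = [
        (1, "Test User A", "IN"),
        (2, "Test User B", "US"),
    ]
    connection.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", seed_rows)
    connection.commit()
    connection.close()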

The Challenges With TDM

Alongside the benefits that the TDM process offers, there are a few considerations and challenges that should be taken into account rather than ignored. Having insight into these beforehand helps manage the process properly. A few of these include:

  • The additional time that would be required for data identification and its setup.
  • Additional administrative efforts in test data management.
  • The sensitivity of private information (PII, credit cards, etc.) when using data from production environments.
  • The complexity of data in end-to-end testing scenarios, and ensuring data consistency.
  • Timely data revisions.
  • Storage required for test data.
  • Continuous cleaning of test data whenever its state changes.
  • Potential for data loss.
  • Identification of data anomalies.
  • Conflicting test priorities.

Managing The Challenges

With insights into these challenges, appropriate checkpoints and solutions can be designed to help mitigate the impact. Some of them can be:

  • Capture the time and effort associated with TDM during test estimation, and ensure that this process is covered as part of the test strategy.
  • Incorporate a robust change management process.
  • Collect smaller data sets that accurately sample full data coverage.
  • Ensure that testing environments and data requirements are well defined.
  • Back up data and assign versions.
  • Log the versions with relevant details for quick reference and conversions (a minimal sketch follows this list).
  • Mask and de-identify sensitive information; masking tools can be leveraged for this.
  • Maintain records of data distribution.
  • Refresh test data as needed, including periodic updates with new extracts, to accurately cover customer data.
  • Schedule maintenance regularly.
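
As a simple illustration of the backup-and-versioning points above, the sketch below logs which version of a data set was provisioned, when, and with what checksum, so it can be referenced quickly later. The manifest format, file name and version string are assumptions made for illustration, not a prescribed tool or schema.

    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def log_data_set_version(data_file: str, version: str, manifest_file: str = "tdm_manifest.json") -> None:
        """Append an entry describing a provisioned test data set to a simple JSON manifest."""
        checksum = hashlib.md5(Path(data_file).read_bytes()).hexdigest()
        entry = {
            "data_file": data_file,
            "version": version,
            "checksum": checksum,
            "provisioned_at": datetime.now(timezone.utc).isoformat(),
        }
        manifest_path = Path(manifest_file)
        manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
        manifest.append(entry)
        manifest_path.write_text(json.dumps(manifest, indent=2))

    log_data_set_version("customers_test_data.csv", version="2020.09.1")  # hypothetical data set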

Conclusion

Establishing a solid test data management process continues to be a challenge, given the ever-evolving software implementation techniques and the testing approaches associated with them. Creating a process that can manage all of the data is quite an undertaking, but considering the advantages TDM brings, it is definitely worth the effort. This overview of TDM would not be complete without touching upon the multiple automated solutions available in the market today. Automation can be used to expedite processes, lower resource cost, and provide a mechanism for ensuring repeatability and scalability. Selecting appropriate automation tools can help reduce and better manage the challenges, enabling test teams to focus their effort and time on uncovering defects before they slip into production.
