Big Data & QA — A Concise Overview

Adil Qayyum
5 min read · Jul 4, 2019


Assume you must test 100 TB of unstructured, unindexed data with no cache attached. Can you feel the panic rising from all the things that could go wrong with this project? Are bottlenecks and slow processing the first things that come to mind? Add uncleaned data, unknown errors, transmission faults, and the need to ensure that operations are applied to the whole volume, and you are still not even close to what Big Data testing means.

What Is Big Data?

First, let’s get our definition straight on what constitutes big data. A common approach is to define big data (or the lack thereof) in terms of the three Vs of data: Volume, Velocity, and Variety. High volume is the biggest clue, but not the only one. Velocity, the speed at which data is created, is critical, as is the wide variety of data types that unstructured data brings.

Unlike structured data, unstructured data does not have a defined data model. Unstructured data includes social media content from platforms like Twitter and Facebook, email and chat messages, video and audio files, digital photos, voicemail, and call center records. And these are just human-generated files. Once you get into machine-generated files, you’re talking about massive and fast-growing volumes of data.

Key Components of Big Data Application Testing

As Big Data is described through the three Vs above, you need to know how to process all of this data, in its various formats, at high speed. This processing can be split into three basic components, and to be successful, QA engineers have to be aware of each of them.

1. Data Validation: Understandably, this is one of the most important components of data collection. To ensure the data is accurate and not corrupted, it must be validated. For this purpose, the sources are checked, and the information procured is validated against actual business requirements. The initial data is fed into the Hadoop Distributed File System (HDFS) and validated there as well. The file partitions are checked thoroughly before being copied into different data units. Tools like Datameer, Talend, and Informatica are used for step-by-step validation.

Data validation is also known as pre-Hadoop testing, and it ensures that the collected data comes from the right sources. Once that step is completed, the data is pushed into the Hadoop testing system for tallying against the source data.
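To make this concrete, here is a minimal sketch of a pre-Hadoop validation check, assuming a PySpark environment; the source and HDFS paths are purely illustrative. It compares record counts and per-column null profiles between the raw source extract and the copy landed in HDFS. The dedicated tools mentioned above wrap checks like these in far more complete workflows.

```python
# Pre-Hadoop validation sketch (illustrative paths and format; adapt to your pipeline).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pre-hadoop-validation").getOrCreate()

# Hypothetical raw source extract and the copy ingested into HDFS.
source_df = spark.read.json("s3a://raw-bucket/events/2019-07-04/")
landed_df = spark.read.json("hdfs:///data/landing/events/2019-07-04/")

# 1. Volume check: no records lost or duplicated during ingestion.
assert source_df.count() == landed_df.count(), "Record counts differ between source and HDFS"

# 2. Completeness check: per-column null counts should not drift during the copy.
def null_profile(df):
    return {c: df.filter(F.col(c).isNull()).count() for c in df.columns}

assert null_profile(source_df) == null_profile(landed_df), "Null profile drifted during ingestion"
```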

2. Process Validation: Once the data and the source are matched, they are pushed to the right location. This is Business Logic validation, or Process Validation, where the QA engineer verifies the business logic node by node and then verifies it across different nodes. Business Logic Validation is essentially the validation of MapReduce, the heart of Hadoop.

The QA engineer validates the MapReduce process and checks whether the key-value pairs are generated correctly. Through the “reduce” operation, the aggregation and consolidation of data is checked.
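The key-value mechanics the tester verifies can be illustrated with a toy, pure-Python stand-in for a MapReduce job; the log-record format and the expected counts below are assumptions, not a real Hadoop API:

```python
# Toy model of the map and reduce stages a QA engineer checks: the mapper must
# emit well-formed key-value pairs, and the reducer must aggregate them correctly.
from collections import defaultdict

records = ["error db timeout", "info user login", "error db timeout"]

def mapper(record):
    # Emit (log_level, 1) for each record.
    level = record.split()[0]
    yield level, 1

def reducer(key, values):
    # Consolidate all values observed for one key.
    return key, sum(values)

# Shuffle phase: group mapper output by key.
grouped = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        grouped[key].append(value)

output = dict(reducer(key, values) for key, values in grouped.items())
assert output == {"error": 2, "info": 1}  # the aggregation the tester verifies
```

In a real test, the same idea scales up: feed a known sample through the actual MapReduce job and assert that the aggregated output matches independently computed expectations.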

3. Output Validation: Output validation is the next important component. Here the generated data is loaded into the downstream system, which could be a data repository where the data undergoes analysis and further processing. The output is then checked to make sure the data has not been distorted, by comparing the HDFS files with the target data.
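A hedged sketch of such a comparison, assuming the job writes Parquet to HDFS and the downstream system is reachable over JDBC (the connection details and table name are illustrative, not a recommendation for any particular warehouse):

```python
# Output-validation sketch: confirm the data loaded downstream matches what the
# Hadoop job wrote to HDFS. Paths, JDBC URL, table name and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("output-validation").getOrCreate()

hdfs_result = spark.read.parquet("hdfs:///data/output/daily_aggregates/")
target = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://warehouse:5432/analytics")
          .option("dbtable", "daily_aggregates")
          .option("user", "qa_user")
          .option("password", "****")
          .load())

# Rows present on one side but not the other indicate distortion during the load.
# exceptAll assumes both DataFrames share the same schema.
missing_in_target = hdfs_result.exceptAll(target)
extra_in_target = target.exceptAll(hdfs_result)
assert missing_in_target.count() == 0 and extra_in_target.count() == 0, \
    "HDFS output and target table diverge"
```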

Architecture testing is another crucial part of Big Data testing, as poor architecture will render the whole effort wasted. Hadoop is highly resource intensive and processes huge amounts of data, so architecture testing becomes mandatory. It is also important to ensure there is no data corruption by comparing the HDFS data with the target UI or business intelligence system.

Big Data Testing Challenges

This process requires a high level of automation, given the massive data volumes and the speed at which unstructured data is created. However, even with automated tool-sets, big data testing isn’t easy.

  • Good source data and reliable data insertion: “Garbage in, garbage out” applies. You need good source data to test, and a reliable method of moving the data from the source into the testing environment.
  • Test tools require training and skill: Automated testing for unstructured data is highly complex, with many steps. In addition, problems will always pop up during a big data test phase, and QA engineers need to know how to solve them despite the complexity of unstructured data.
  • Setting up the testing environment takes time and money: Hadoop eases the pain because it was created as a commodity-based big data analytics platform. However, IT still needs to buy, deploy, maintain, and configure Hadoop clusters as needed for testing phases. Even with a Hadoop cloud provider, provisioning the cluster requires resources, consultation, and service level agreements.
  • Virtualization challenges: Most business application vendors now develop for virtual environments, so virtualized testing is a necessity. Virtualized images can introduce latency into big data tests, and managing virtual images in a big data environment is not a straightforward process.
  • No end-to-end big unstructured data testing tools: No vendor tool-set can run big data tests on all unstructured data types. QA engineers need to invest in and learn multiple tools depending on the data types they need to test.

Big Data Automation Testing Tools

Testing big data applications is significantly more complex than testing regular applications. Big data automation testing tools help in automating the repetitive tasks involved in testing.

Any tool used for automation testing of big data applications must fulfill the following needs:

  • Allow automation of the complete software testing process
  • Support tracking the data as it is transformed from source data to target data after being processed through the MapReduce algorithm and other ETL transformations, since database testing is a large part of big data testing
  • Scale, yet remain flexible enough to incorporate changes as the application’s complexity increases
  • Integrate with disparate systems and platforms such as Hadoop, Teradata, MongoDB, AWS, and other NoSQL products
  • Integrate with DevOps solutions to support continuous delivery (see the sketch after this list)
  • Provide good reporting features that help you identify bad data and defects in the system
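As a sketch of the DevOps point above, a validation check can be wrapped in an ordinary pytest test so that any continuous delivery pipeline able to run pytest can gate a release on it. The paths are hypothetical and the check is deliberately minimal:

```python
# Minimal sketch of wiring a big data check into a continuous delivery pipeline
# as a pytest test. Paths are hypothetical; real suites would cover data
# validation, process validation and output validation in separate tests.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    return SparkSession.builder.appName("bigdata-qa").getOrCreate()

def test_no_records_lost_during_ingestion(spark):
    source = spark.read.json("s3a://raw-bucket/events/")
    landed = spark.read.json("hdfs:///data/landing/events/")
    assert source.count() == landed.count()
```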

Conclusion

Transforming data with intelligence is a huge concern. With big data now integral to a company’s decision-making strategy, it is hard to overstate the importance of arming yourself with reliable information.

Big Data processing is a very promising field in today’s complex business environment. Applying the right dose of test strategies and following best practices will help ensure quality software testing. The idea is to identify defects in the early stages of testing and rectify them, which helps reduce costs and better realize company goals. Because these testing approaches are driven by data, they address many of the problems QA engineers have traditionally faced during quality assurance planning and software verification.
