“Big Data Testing Challenges”

Tahmina Naznin
Published in Oceanize Lab Geeks
Sep 17, 2018 · 4 min read

Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It has many different uses: real-time fraud detection, web display advertising and competitive analysis, call center optimization, social media and sentiment analysis, intelligent traffic management, smart power grids, and so on.

Big Data Testing Strategy:

Testing a Big Data application is more about verifying its data processing than about testing the individual features of the software product. When it comes to Big Data testing, performance and functional testing are key. In Big Data testing, QA engineers verify the successful processing of terabytes of data using a commodity cluster and other supporting components. It demands a high level of testing skill because the processing is very fast. Processing may be of three types: batch, real-time, and interactive.

Along with this, data quality is also an important factor in Big Data testing. Before testing the application, it is necessary to check the quality of the data, and this check should be considered part of database testing. It involves checking various characteristics such as conformity, accuracy, duplication, consistency, validity, and data completeness.
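As a rough illustration, here is a minimal data-quality sketch in plain Python (standard library only). The file name, the required columns, and the email rule are all hypothetical; the point is simply how completeness, duplication, and conformity checks can be scripted before the data enters the pipeline.

```python
import csv
import re
from collections import Counter

# Hypothetical input file and columns, used only for illustration.
SOURCE_FILE = "customers.csv"
REQUIRED_COLUMNS = ["customer_id", "email", "country"]
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def profile_data_quality(path):
    """Collect simple data-quality metrics: completeness, duplication, conformity."""
    missing = Counter()      # blank values per required column (completeness)
    duplicate_ids = 0        # repeated business keys (duplication)
    bad_emails = 0           # values that fail the format rule (conformity)
    seen_ids = set()
    total = 0

    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            total += 1
            for column in REQUIRED_COLUMNS:
                if not (row.get(column) or "").strip():
                    missing[column] += 1
            cid = row.get("customer_id", "")
            if cid in seen_ids:
                duplicate_ids += 1
            seen_ids.add(cid)
            if row.get("email") and not EMAIL_PATTERN.match(row["email"]):
                bad_emails += 1

    return {"rows": total, "missing": dict(missing),
            "duplicate_ids": duplicate_ids, "non_conforming_emails": bad_emails}

if __name__ == "__main__":
    print(profile_data_quality(SOURCE_FILE))
```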

Big Data Testing can be categorized into three stages:

Step 1: Data Staging Validation

The first stage of Big Data testing, also known as the pre-Hadoop stage, comprises process validation.

  1. Data validation is very important: the data collected from various sources such as RDBMS, weblogs, etc. is verified and then added to the system.
  2. To ensure the data matches, compare the source data with the data added to the Hadoop system (a minimal reconciliation sketch follows this list).
  3. Make sure that the right data is extracted and loaded into the correct HDFS location.
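A minimal reconciliation sketch, assuming the source extract and the staged copy are both available locally as delimited text (in practice the staged file might be pulled down with `hdfs dfs -get`); the file names and the key column are hypothetical.

```python
import csv

# Hypothetical files: an export from the source RDBMS and the copy staged in HDFS.
SOURCE_EXPORT = "rdbms_orders.csv"
STAGED_COPY = "hdfs_orders.csv"
KEY_COLUMN = "order_id"

def load_keys(path, key_column):
    """Return the record count and the set of business keys in a delimited file."""
    keys = set()
    count = 0
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            count += 1
            keys.add(row[key_column])
    return count, keys

source_count, source_keys = load_keys(SOURCE_EXPORT, KEY_COLUMN)
staged_count, staged_keys = load_keys(STAGED_COPY, KEY_COLUMN)

print(f"source rows: {source_count}, staged rows: {staged_count}")
print(f"missing in staging: {len(source_keys - staged_keys)}")
print(f"unexpected in staging: {len(staged_keys - source_keys)}")
```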

Step 2: “Map Reduce” Validation

Validation of "Map Reduce" is the second stage. The tester performs business logic validation on a single node and then validates it by running the jobs against multiple nodes, to make sure that:

  • The Map Reduce process works correctly.
  • The data aggregation or segregation rules are applied to the data.
  • Key-value pairs are generated as expected (see the sketch after this list).
  • The data is validated after the Map Reduce process.
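For instance, a MapReduce job written for Hadoop Streaming is just a mapper and a reducer that read records and emit tab-separated key-value pairs, which makes the business logic easy to exercise on a single node before running it against the cluster. The sketch below uses a hypothetical aggregation (summing an order amount per country) and mimics the shuffle locally with a sort.

```python
# Mapper/reducer logic for a hypothetical Hadoop Streaming job that sums an
# order amount per country; key-value pairs are tab separated.
from itertools import groupby

def mapper(lines):
    """Emit 'country<TAB>amount' key-value pairs from 'country,amount' records."""
    for line in lines:
        country, amount = line.strip().split(",")
        yield f"{country}\t{amount}"

def reducer(lines):
    """Aggregate amounts per country; assumes input sorted by key (the shuffle)."""
    parsed = (line.strip().split("\t") for line in lines)
    for country, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{country}\t{sum(float(amount) for _, amount in group)}"

if __name__ == "__main__":
    # Local single-node check of the business logic, equivalent to:
    #   cat data | mapper | sort | reducer
    data = ["BD,10.0", "US,5.5", "BD,2.5"]
    shuffled = sorted(mapper(data))
    print(list(reducer(shuffled)))   # ['BD\t12.5', 'US\t5.5']
```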

Step 3: Output Validation Phase

The output validation process is the third and final stage of Big Data testing. The output data files are created and are ready to be moved to an EDW (Enterprise Data Warehouse) or any other such system as per requirements. The third stage consists of:

  • Checking that the transformation rules are applied accurately.
  • Ensuring that the data is loaded successfully into the target system and that data integrity is maintained.
  • Checking that there is no data corruption by comparing the target data with the HDFS file system data (a small comparison sketch follows this list).
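One way to check for corruption is to hash every record on both sides and compare an order-independent fingerprint. The sketch below assumes the MapReduce output pulled from HDFS and an extract of the target table are both available locally as text; the file names are hypothetical.

```python
import hashlib

# Hypothetical extracts: the job output fetched from HDFS and the same rows
# exported from the target warehouse table.
HDFS_OUTPUT = "part-r-00000.txt"
TARGET_EXTRACT = "edw_orders_extract.txt"

def dataset_fingerprint(path):
    """Order-independent fingerprint: XOR of per-record SHA-256 digests."""
    fingerprint = 0
    rows = 0
    with open(path, "rb") as handle:
        for line in handle:
            record = line.rstrip(b"\r\n")
            digest = hashlib.sha256(record).digest()
            fingerprint ^= int.from_bytes(digest, "big")
            rows += 1
    return rows, fingerprint

hdfs_rows, hdfs_fp = dataset_fingerprint(HDFS_OUTPUT)
edw_rows, edw_fp = dataset_fingerprint(TARGET_EXTRACT)

if hdfs_rows == edw_rows and hdfs_fp == edw_fp:
    print("PASS: row counts and content fingerprints match")
else:
    print(f"FAIL: HDFS rows={hdfs_rows}, target rows={edw_rows}; counts or contents differ")
```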

Architecture Testing

Hadoop processes very large volumes of data and is highly resource intensive, so architectural testing is crucial to the success of your Big Data project. A poorly or improperly designed system may lead to performance degradation, and the system could fail to meet the requirements. At a minimum, Performance and Failover test services should be performed in a Hadoop environment.

Performance testing covers job completion time, memory utilization, data throughput, and similar system metrics, while failover testing verifies that data processing continues seamlessly if data nodes fail.

Performance Testing

Performance Testing for Big Data includes three main actions:

  • Data Ingestion and Throughput: in this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying the number of messages the queue can process in a given time frame. It also covers how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database (a timing sketch follows this list).
  • Data Processing: this involves verifying the speed with which the queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce jobs on the underlying HDFS.
  • Sub-Component Performance: these systems are made up of multiple components, and it is essential to test each of them in isolation, for example how quickly messages are indexed and consumed, MapReduce job runtimes, query performance, search, and so on.
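As an illustration of the ingestion measurement, the sketch below times batch inserts into a local MongoDB instance with pymongo and reports documents per second. The database, collection, batch size, and document shape are all hypothetical, and the same timing loop could wrap a Cassandra or message-queue client instead.

```python
import time
from pymongo import MongoClient  # assumes pymongo is installed and MongoDB runs locally

BATCH_SIZE = 1_000
BATCHES = 50

client = MongoClient("mongodb://localhost:27017")
collection = client["perf_test"]["events"]   # hypothetical database/collection
collection.drop()                            # start from an empty collection

start = time.perf_counter()
for batch_number in range(BATCHES):
    documents = [
        {"event_id": batch_number * BATCH_SIZE + i, "payload": "x" * 256}
        for i in range(BATCH_SIZE)
    ]
    collection.insert_many(documents)         # timed bulk insert
elapsed = time.perf_counter() - start

total_docs = BATCH_SIZE * BATCHES
print(f"inserted {total_docs} documents in {elapsed:.2f}s "
      f"({total_docs / elapsed:.0f} docs/sec)")
```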

Performance Testing Approach

Performance testing for a Big Data application involves testing huge volumes of structured and unstructured data, and it requires a specific testing approach to handle such massive data.

Performance testing is executed in the following order:

  1. The process begins with setting up the Big Data cluster that is to be tested for performance.
  2. Identify and design the corresponding workloads.
  3. Prepare individual clients (custom scripts are created; a minimal example follows this list).
  4. Execute the test and analyze the results (if the objectives are not met, tune the component and re-execute).
  5. Arrive at the optimum configuration.
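Step 3 usually boils down to small custom scripts like the one sketched below: it launches the workload command (the example job shown is hypothetical), records job completion time over several runs, and prints the numbers that drive the tune-and-re-execute loop in step 4.

```python
import statistics
import subprocess
import time

# Hypothetical workload command; in a real test this might submit a Hive query,
# a MapReduce job, or a Spark application to the cluster under test.
WORKLOAD = ["hadoop", "jar", "hadoop-mapreduce-examples.jar", "wordcount",
            "/data/input", "/data/output"]
RUNS = 3

durations = []
for run in range(1, RUNS + 1):
    start = time.perf_counter()
    result = subprocess.run(WORKLOAD, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    durations.append(elapsed)
    print(f"run {run}: exit={result.returncode} completion_time={elapsed:.1f}s")

print(f"mean completion time: {statistics.mean(durations):.1f}s "
      f"(min {min(durations):.1f}s, max {max(durations):.1f}s)")
```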

Tools used in Big Data Scenarios

Challenges in Big Data Testing

  • Automation

Automation testing for Big Data requires someone with technical expertise. Also, automated tools are not equipped to handle the unexpected problems that arise during testing.

  • Virtualization

Virtualization is one of the integral phases of testing. Virtual machine latency creates timing problems in real-time Big Data testing, and managing images in Big Data is also a hassle.

  • Large Datasets
      • Need to verify more data, and to do it faster
      • Need to automate the testing effort
      • Need to be able to test across different platforms

Resources: Guru99 & CABOT.
