Checklist for Testing Big Data Systems

Nayan Gaur
Jan 10

Big Data has long been a buzzword in the software industry, primarily because of the sheer amount of data being generated that cannot be processed with traditional computing methods. Big Data is characterised by the three Vs: Volume (the amount of data), Velocity (the speed of data in and out) and Variety (the range of data types and sources).

Having said that, testing big data systems brings altogether new complexity and demands a broad set of skills. There is no single structured strategy for testing such systems, since your scripts, checks and validations will depend on the business logic; however, there are certain methods and checklists to take care of while testing big data systems. This article primarily revolves around those checklists and strategies.

Note — This article will NOT cover the HOWs and implementations in depth; the short code sketches below are illustrative only.

Before diving into the checklist, let’s also quickly discuss the components of any application handling big data —

  • Data Ingestion is the layer where large amounts of data are injected into the big data system. This data can be structured, semi-structured or unstructured, and the storage can be Hadoop, MongoDB or any other store. It is also called the pre-processing stage, and testing here is critical: if we go wrong at this stage, the whole pipeline is affected and the downstream analysis will be incorrect.
  • Data Processing is, as the name suggests, the layer where the ingested data is processed. Processing means aggregating the data based on business rules and eventually forming the key-value pairs that are processed by map-reduce jobs (a minimal mapper sketch follows the figures below).
  • Data Visualisation is the final stage. Once the data is processed as per the business rules, it is ETL-ed either directly into a data warehouse, or, in some systems, into an intermediate target store from which it is then loaded into the warehouse, so that meaningful information can be extracted through business intelligence and analytics.
(Figure: Big Data pipeline)
(Figure: Map-Reduce data flow)
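
To ground the processing layer, here is a minimal word-count style mapper in Java, the kind of job the checklist below exercises. The class name and whitespace tokenisation are illustrative assumptions rather than part of any specific pipeline; the point is only how input lines become key-value pairs.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: turns each input line into (token, 1) key-value pairs,
// which the shuffle phase groups and a reducer would then aggregate.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // one key-value pair per token
            }
        }
    }
}
```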

Let’s begin discussing the checklist —

  • Setting up the test environment is critical. Ensure the environment has enough space to process and store large amounts of data for the application under test, and that it is set up as a distributed cluster for each component.
  • Prepare both positive and negative test data to cover all business scenarios. Initially, use a small sample of data (in KBs) while verifying things; this makes it much easier to verify the correctness of the ingested data against the source data, to check that the correct business rules were applied to the ingested data, and to confirm that the data aggregation was done correctly by comparing output files with input files (a small comparison sketch appears after this checklist).
  • Customised alerting and logging at every stage of the data flow is critical, especially when running executions with large volumes of test data, as it assists in debugging and in catching system bugs.
  • Performance testing of big data systems is very important; it involves activities such as setting up load-test scripts and analysing the resulting metrics.
  • Check metrics such as throughput, error rate and the time taken for data to be ingested into the data store: the insertion rate, the number of messages the queue can process, and map-reduce jobs executed in isolation on HDFS data (a bare-bones throughput probe is sketched after this checklist).
  • Since the system is made up of different components, it is critical to load-test each component in isolation.
  • While performing load tests, consider factors such as how much the application logs will grow, how caching behaves (both row cache and key cache), timeouts (query timeouts, connection timeouts), the message queue, and so on.
  • During data ingestion, data arrives from multiple sources (RDBMS, social media, logs and so on) in different formats, so preparing test data in every possible format is key. Also verify that the data is ingested according to the defined schema. Tools such as QuerySurge, Datameer and Talend can be used.
  • Ensure the correctness of the data: compare the source data against the ingested data, and check that the data is loaded into the correct HDFS location. Use custom scripts, alerting and logging to debug ingestion and to verify the correctness of large amounts of data. Some common tools and libraries used in ingestion are Kafka, Zookeeper, Sqoop, Flume, Storm and Amazon Kinesis.
  • During data processing, make sure the ingested data is processed correctly, as per the business logic, by the map-reduce jobs. Writing unit tests with MRUnit is a good way to validate that the generated key-value pairs are correct and that the business logic is applied correctly after a map-reduce job executes (see the MRUnit sketch after this checklist).
  • Ensure proper logging and exception handling are implemented for each map-reduce job, and validate that the data aggregation or segregation rules are applied to the data. Some common tools used in the processing layer are Hadoop (Map-Reduce), Cascading, Oozie, Hive and Pig.
  • Once the data is processed, it is ETL-ed into the target HDFS or data warehouse. Testing this ETL step, i.e. the correctness and integrity of the data loaded into the target system, should definitely be on the checklist.
  • Verify that all transformation rules were applied correctly to the data in the target system, and check that there is no data corruption by comparing the target data with the HDFS file-system data.
  • Exceptions and errors should be logged with clear messages so that debugging and fixing are easy; you will thank yourself for getting this right when testing with large amounts of test data.
  • Don’t miss integration testing, from data ingestion through to data visualisation, that is, testing the system as a whole.
  • Chaos testing becomes very critical for such big systems: verify that data keeps processing seamlessly end to end when a node dies or fails during execution.
  • The system should recover through its recovery mechanisms, such as switching to other data nodes to continue processing the data.
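
As promised in the checklist, here is a minimal sketch of the source-versus-ingested comparison, useful with the small KB-sized samples mentioned above. The file paths and the one-record-per-line assumption are hypothetical; substitute your own locations and record format.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Compares a small (KB-sized) source sample against what actually landed in HDFS.
// Assumes one record per line; both paths below are placeholders.
public class IngestionCheck {

    public static void main(String[] args) throws Exception {
        Set<String> source = new HashSet<>(
                Files.readAllLines(Paths.get("samples/source.csv"), StandardCharsets.UTF_8));

        Set<String> ingested = new HashSet<>();
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                fs.open(new Path("/data/ingested/part-00000")), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                ingested.add(line);
            }
        }

        source.removeAll(ingested); // whatever is left never made it into HDFS
        if (source.isEmpty()) {
            System.out.println("PASS: every source record was ingested");
        } else {
            System.out.println("FAIL: " + source.size() + " records missing from HDFS");
            source.forEach(System.out::println);
        }
    }
}
```

The same diff pattern applies to the ETL items above: swap the HDFS side for an export from the data warehouse and compare again.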
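
For the performance items, a bare-bones probe that measures insertion rate and error rate. The `insertRecord` method is a placeholder for the real write path (an HBase put, a Kafka send, a Mongo insert), and the record count is arbitrary.

```java
// Bare-bones probe: times N writes against the store under test and reports
// throughput and error rate. insertRecord() stands in for your client call.
public class InsertionRateProbe {

    public static void main(String[] args) {
        final int total = 100_000;
        long errors = 0;

        long start = System.nanoTime();
        for (int i = 0; i < total; i++) {
            try {
                insertRecord("key-" + i, "value-" + i);
            } catch (Exception e) {
                errors++; // count failures rather than aborting: we want the error rate
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;

        System.out.printf("throughput : %.0f inserts/s%n", total / seconds);
        System.out.printf("error rate : %.2f%%%n", 100.0 * errors / total);
    }

    private static void insertRecord(String key, String value) {
        // Replace with the actual write path of the system under test.
    }
}
```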
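
Finally, the MRUnit idea from the processing items, shown against the illustrative WordCountMapper from earlier. MRUnit’s MapDriver runs the mapper in isolation, so the generated key-value pairs can be asserted without a cluster.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

// Unit-tests the mapper in isolation: feed one input record, assert the exact
// key-value pairs that come out.
public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerToken() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data big"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .runTest();
    }
}
```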

Hope the above checklist adds value when testing systems involving big data.
