BDD tests for Hadoop with Cucumber. Part I

Bogdan Frankovskyi
SavvyClutch Engineering Club
3 min read · Oct 7, 2016

Big Data means processing huge amounts of more or less structured data that has to be filtered, sorted, or otherwise prepared for future analysis, or analyzed outright. There is so much data that you can’t process it on one computer in a reasonable time. Typical sources are logs from services (like a call service) or logs from web servers containing billions of records, so the data has to be processed in parallel on a bunch of computers. To do this, we use software built on the MapReduce pattern. In our case, we use Hadoop with the Cascading framework. Hadoop implements MapReduce, and Cascading provides a lot of useful tools on top of it, for example an abstraction over Amazon S3: we can use S3 buckets as input/output folders for processing jobs, while still using the regular file system on developer machines for development and testing.
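To make this concrete, here is a hypothetical snippet (not code from the original project; the paths and the CallTaps class are made up for illustration). A Cascading tap pairs a scheme with a path, so pointing a job at a local folder or at an S3 bucket is just a matter of the path string:

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CallTaps {
    static final Fields CALL_FIELDS = new Fields("id", "country", "time", "duration");

    // In development, the tap reads from the local file system...
    static Tap devSource() {
        return new Hfs(new TextDelimited(CALL_FIELDS, ","), "file:///tmp/calls/input");
    }

    // ...in production, the same kind of tap reads from S3
    // (the exact URL scheme, s3/s3n/s3a, depends on the Hadoop setup).
    static Tap prodSource() {
        return new Hfs(new TextDelimited(CALL_FIELDS, ","), "s3://calls-bucket/input");
    }
}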

Like any other piece of software, it should be tested before being used in production, especially because of the high cost of logical mistakes when processing large amounts of data. Unfortunately, there is not much information about BDD testing of Hadoop and Hive jobs, so I decided to write up how we did it in our Cascading project at Intelliarts.

What do we have?

Assume we have a lot of data about phone calls spread across a bunch of files. Each file contains lines in the following format: id,country,time,duration, where id is a unique call id, country is the caller’s country, time is the time the call started (a timestamp), and duration is the call duration in milliseconds.

For example:

0000,UA,1433998201,60000
0001,US,1433998201,30000
0002,GB,1433998301,30000
...

What do we want to do?

Let’s sort the calls from these records by country (each country goes into a separate folder) and change the record format, replacing the , delimiter with the tab symbol \t. From the data in the previous example we should get the folders US, GB, and UA, and the UA folder will contain a file with records that look like this:

id	country	time	duration
0000	UA	1433998201	60000

And we want to be sure that we did not make any mistakes during the implementation, so we want to test the results. What will we use?

First of all, we should isolate our testing environment and make it portable between dev machines and CI. We use Docker for this purpose. Inside the container we will:

  1. Install Java
  2. Install Hadoop
  3. Install Gradle
  4. Add codebase

Here is a sketch of the Dockerfile content (the base image and the Hadoop and Gradle versions below are illustrative assumptions, not necessarily what the original project used):
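FROM ubuntu:16.04

# 1. Install Java (both Hadoop and the application need a JDK)
RUN apt-get update && apt-get install -y openjdk-8-jdk wget unzip

# 2. Install Hadoop (version is illustrative)
RUN wget -q https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz \
    && tar -xzf hadoop-2.7.3.tar.gz -C /opt \
    && rm hadoop-2.7.3.tar.gz
ENV HADOOP_HOME /opt/hadoop-2.7.3
ENV PATH $PATH:$HADOOP_HOME/bin

# 3. Install Gradle (version is illustrative)
RUN wget -q https://services.gradle.org/distributions/gradle-3.1-bin.zip \
    && unzip -q gradle-3.1-bin.zip -d /opt \
    && rm gradle-3.1-bin.zip
ENV PATH $PATH:/opt/gradle-3.1/bin

# 4. Add the codebase
COPY . /app
WORKDIR /app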

Okay, now we have the environment. Let’s create a Gradle config for the project to download all the dependencies of the Hadoop application. A build.gradle could look roughly like this sketch (the Cascading, Hadoop, and Cucumber coordinates and versions are assumptions):
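apply plugin: 'java'

repositories {
    mavenCentral()
    // Cascading artifacts were historically hosted on the Conjars repository
    maven { url 'https://conjars.org/repo/' }
}

dependencies {
    compile 'cascading:cascading-core:2.7.1'
    compile 'cascading:cascading-hadoop2-mr1:2.7.1'

    // Provided by the cluster at runtime, so it stays out of the fat jar,
    // but the tests still need it on their classpath
    compileOnly 'org.apache.hadoop:hadoop-client:2.7.3'
    testCompile 'org.apache.hadoop:hadoop-client:2.7.3'

    testCompile 'info.cukes:cucumber-java:1.2.4'
    testCompile 'info.cukes:cucumber-junit:1.2.4'
    testCompile 'junit:junit:4.12'
}

// Runs the Cucumber acceptance tests from src/test/resources
task cucumber(type: JavaExec) {
    dependsOn testClasses
    main = 'cucumber.api.cli.Main'
    classpath = sourceSets.test.runtimeClasspath
    args = ['--glue', 'com.processing', 'src/test/resources']
}

// Packages the application together with its runtime dependencies
// into a single jar that can be submitted to Hadoop
jar {
    manifest {
        attributes 'Main-Class': 'com.processing.CallStream'
    }
    from {
        configurations.compile.collect { it.isDirectory() ? it : zipTree(it) }
    }
}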

As you can see, there are two tasks wired up for this: cucumber and jar. The first one runs the Cucumber tests, and the second one compiles the source into the production jar file. Now we can build the image:

$ docker build -t bdd_hadoop .

Time to build the processing application. Let’s create the directory structure src/main/java/com/processing and add a file CallStream.java with the job code.
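A minimal Cascading implementation of the job might look like the sketch below. Treat it as an illustration, not the original source: the connector and tap choices are assumptions, and the PartitionTap used here writes the country value into the folder name rather than repeating it in every row (keeping the column in each line, as in the example above, would take an extra duplicated field).

package com.processing;

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tap.hadoop.PartitionTap;
import cascading.tap.partition.DelimitedPartition;
import cascading.tuple.Fields;

public class CallStream {

    public static void main(String[] args) {
        String inputPath = args[0];   // e.g. data/input
        String outputPath = args[1];  // e.g. data/output

        // Comma-delimited source with the four call fields
        Fields callFields = new Fields("id", "country", "time", "duration");
        Tap source = new Hfs(new TextDelimited(callFields, ","), inputPath);

        // Tab-delimited sink; PartitionTap puts each country into its own
        // folder, so the parent scheme declares only the remaining fields
        Hfs parent = new Hfs(
            new TextDelimited(new Fields("id", "time", "duration"), true, "\t"),
            outputPath, SinkMode.REPLACE);
        Tap sink = new PartitionTap(parent, new DelimitedPartition(new Fields("country")));

        // No transformation is needed beyond re-delimiting and partitioning,
        // so a bare pipe connects source to sink
        Pipe pipe = new Pipe("calls");

        Properties properties = new Properties();
        AppProps.setApplicationJarClass(properties, CallStream.class);

        FlowDef flowDef = FlowDef.flowDef()
            .setName("call-stream")
            .addSource(pipe, source)
            .addTailSink(pipe, sink);

        Flow flow = new Hadoop2MR1FlowConnector(properties).connect(flowDef);
        flow.complete();
    }
}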

Let’s try to build a jar file:

$ docker run --rm bdd_hadoop gradle jar
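Note that every docker run starts from a clean container, so to actually execute the job you would build and run it in the same container, along these lines (the jar name and the data paths are assumptions):

$ docker run --rm bdd_hadoop sh -c 'gradle jar && hadoop jar build/libs/bdd_hadoop.jar data/input data/output'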

Now we need to create some acceptance tests. That will be covered in Part II. Stay tuned!

Originally published at www.savvyclutch.com on October 7, 2016.
