Apache Spark with Cucumber

Rohit Phadtare
5 min read · May 22, 2024


This blog provides a simple illustration of the Cucumber framework with Apache Spark, using Scala to develop a sample program.

Before beginning, let's cover a few basic concepts around BDD and how it works.

What is BDD?

  • It is a way for software teams to minimise the gap between business people and technical people while working together
  • It is an acronym for 'Behaviour-Driven Development'

How does BDD work?

  • By using three practices in an iterative process: Discovery, Formulation and Automation

In a nutshell, on a day-to-day basis business users and technical teams discuss the expected behaviour of the system using real examples. These examples are then articulated in documents that can be understood by humans as well as by computers, i.e. '.feature' files. These feature files are written using Gherkin syntax and act as a common agreement between end users and the software development team. Once they are accepted by the end users, the software development team provides the implementation, using the feature files as a guide.

So, what is Cucumber then? It is a tool which supports BDD.

Let's not waste time and create a sample Spark program using IntelliJ, Scala and Maven.

First, we will create a project named 'spark-cucumber' with the usual directory structure for a Spark application, as follows:
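The original screenshot of the layout is not reproduced here; a typical Maven-style structure for such a project (including the test directories added later in this post) would look something like this:

```
spark-cucumber/
├── pom.xml
└── src/
    ├── main/
    │   └── scala/
    │       └── MySpark.scala
    └── test/
        ├── resources/
        │   └── features/
        │       └── Employee_And_Department.feature
        └── scala/
            ├── StepDefinitions.scala
            └── TestRunner.scala
```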

Before proceeding with the next steps, make sure to add the required libraries for Spark and the Cucumber framework to the project's pom.xml. Below is the list of required libraries:
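The exact versions are not shown in the original post; an illustrative pom.xml dependency section (pick versions that match your Scala and Spark setup) could look like this:

```xml
<dependencies>
  <!-- Spark core and SQL for the data frame logic -->
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.5.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.5.1</version>
  </dependency>
  <!-- Cucumber: Scala DSL for step definitions, JUnit runner for the TestRunner -->
  <dependency>
    <groupId>io.cucumber</groupId>
    <artifactId>cucumber-scala_2.12</artifactId>
    <version>8.20.0</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>io.cucumber</groupId>
    <artifactId>cucumber-junit</artifactId>
    <version>7.15.0</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.13.2</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```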

Let's add a sample code snippet, which I have already developed, to a Scala singleton object named 'MySpark'. The MySpark object file will look like this:
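The original snippet is shown as an image in the post; a sketch consistent with the description below (the column names and sample values here are illustrative assumptions, not the author's exact code) might be:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.avg

object MySpark {

  val spark: SparkSession = SparkSession.builder()
    .appName("spark-cucumber")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Employee data frame with employee id, name, department id and salary
  def employeeDF: DataFrame = Seq(
    (1, "Alice", 10, 5000.0),
    (2, "Bob",   10, 7000.0),
    (3, "Carol", 20, 6000.0)
  ).toDF("emp_id", "emp_name", "dept_id", "salary")

  // Average salary of employees per department
  def avgSalaryPerDept(employees: DataFrame): DataFrame =
    employees.groupBy("dept_id").agg(avg("salary").alias("avg_salary"))

  def main(args: Array[String]): Unit =
    avgSalaryPerDept(employeeDF).show()
}
```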

This sample code simply creates an employee data frame, with the attributes employee id, name, department id and salary, from sample data.

It then applies aggregation logic on top of the employee data frame to get the average salary of employees per department.

Let's test whether our Spark job works. Run it by executing the main method and you will see a result like this:

Now, create a feature file named 'Employee_And_Department' in the resources/features directory. Add a scenario to this feature file with a 'Given' step, where the current state of the system is described by the employee data frame with sample data. Add a 'When' step to provide the event details. Finally, provide sample data in a 'Then' step, which compares that sample data with the actual result from the developed MySpark job.

The feature file now looks like this:
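The original feature file is shown as a screenshot; a plausible reconstruction following the Given/When/Then structure described above (step wording and sample values are illustrative) is:

```gherkin
Feature: Employee and Department

  Scenario: Average salary of employees per department
    Given employee data frame
      | emp_id | emp_name | dept_id | salary |
      | 1      | Alice    | 10      | 5000.0 |
      | 2      | Bob      | 10      | 7000.0 |
      | 3      | Carol    | 20      | 6000.0 |
    When average salary per department is calculated
    Then the result should be
      | dept_id | avg_salary |
      | 10      | 6000.0     |
      | 20      | 6000.0     |
```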

Please note: if you cannot see the green '>>' icons on the feature file, it means the Cucumber plugin is not installed in your IntelliJ/IDE. You can install it from the marketplace.

Now, add a 'scala' directory under the project's test directory. This will hold all the test step definitions. Then create the step definitions using the auto-completion feature of the IDE (just hover over the feature file and it will offer the option to create step definitions), as follows:

This will create a file as follows:
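The generated file is shown as an image; with the cucumber-scala DSL, the skeleton would look roughly like this (each step is declared with a string expression matching the feature file, and the Given/Then bodies receive a DataTable):

```scala
import io.cucumber.datatable.DataTable
import io.cucumber.scala.{EN, ScalaDsl}

class StepDefinitions extends ScalaDsl with EN {

  Given("employee data frame") { (table: DataTable) =>
    // TODO: implement
  }

  When("average salary per department is calculated") { () =>
    // TODO: implement
  }

  Then("the result should be") { (table: DataTable) =>
    // TODO: implement
  }
}
```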

As we can see, the StepDefinitions class contains three methods, one for each of the three steps in the feature file, i.e. Given, When and Then.

Now we need to provide implementations for these three methods.

Before providing the definitions of these methods, let's understand what input they receive.

For example, if we take a close look at the Given method, we can see it receives two parameters: a String and a data table. Looking at the feature file, the Given keyword states the current system condition with values, i.e. in this case the state is provided by the 'employee' data frame with the help of a list of values in tabular format.

This list of values is nothing but a data table, which we need to convert into a Spark data frame. Hence, we add one more helper class in the test sources which provides the functionality to convert these data tables into Spark data frames.
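The helper itself is not reproduced in the text; a hypothetical version that treats the first row of the data table as the header and builds a DataFrame of string columns (adapt the types to your needs) could be:

```scala
import scala.collection.JavaConverters._

import io.cucumber.datatable.DataTable
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DataTableConverter {

  // Convert a Cucumber DataTable (first row = column names) into a Spark DataFrame
  def toDataFrame(table: DataTable, spark: SparkSession): DataFrame = {
    val rows   = table.asLists(classOf[String]).asScala.map(_.asScala.toSeq)
    val header = rows.head
    val schema = StructType(header.map(StructField(_, StringType, nullable = true)))
    val data   = rows.tail.map(Row(_: _*)).asJava
    spark.createDataFrame(data, schema)
  }
}
```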

Now we can provide the definitions of the three methods in the StepDefinitions class.
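One possible implementation is sketched below; it assumes the data-table-to-data-frame helper and the aggregation method on the MySpark object described in this post exist under the names `DataTableConverter.toDataFrame` and `MySpark.avgSalaryPerDept` (both are my naming assumptions):

```scala
import io.cucumber.datatable.DataTable
import io.cucumber.scala.{EN, ScalaDsl}
import org.apache.spark.sql.DataFrame

class StepDefinitions extends ScalaDsl with EN {

  private var employees: DataFrame = _
  private var result: DataFrame = _

  Given("employee data frame") { (table: DataTable) =>
    // Build the input data frame from the Gherkin data table
    employees = DataTableConverter.toDataFrame(table, MySpark.spark)
  }

  When("average salary per department is calculated") { () =>
    result = MySpark.avgSalaryPerDept(employees)
  }

  Then("the result should be") { (table: DataTable) =>
    val expected = DataTableConverter.toDataFrame(table, MySpark.spark)
    // Compare as sets of stringified rows so that row order does not matter
    val actual = result.collect().map(_.toSeq.map(String.valueOf).toList).toSet
    val exp    = expected.collect().map(_.toSeq.map(String.valueOf).toList).toSet
    assert(actual == exp, s"expected $exp but got $actual")
  }
}
```

Comparing stringified rows is a deliberately simple choice for a demo; in a real suite you might compare schemas and use a data-frame comparison library instead.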

Finally, let's add a TestRunner class which will execute this feature file.
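A typical JUnit 4 Cucumber runner for this setup would look like the following; the feature file path is my assumption based on the layout described earlier:

```scala
import io.cucumber.junit.{Cucumber, CucumberOptions}
import org.junit.runner.RunWith

// Runs the scenarios in the named feature file via JUnit
@RunWith(classOf[Cucumber])
@CucumberOptions(
  features = Array("src/test/resources/features/Employee_And_Department.feature")
)
class TestRunner
```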

By default, the TestRunner will run all the feature files under the features folder, so we provide options via @CucumberOptions to run a specific one.

Excited to run our first Spark application with the Cucumber framework!

Just execute the TestRunner class and you will get a result as follows:
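The original screenshot of the run is not reproduced; a passing Cucumber run typically ends with a summary along these lines (counts match one scenario with three steps; timings will vary):

```
1 Scenarios (1 passed)
3 Steps (3 passed)
0m3.412s
```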

If you want to check out the full code, you can get it from here.
