If you are wondering how to test with Apache Spark, or if you are curious about how other projects deal with tests, this article is for you. I will show examples in Scala with Specs2, but the global idea can work with any language or test framework.
What to test with Apache Spark?
This is the first problem.
Suppose that we have data about diamond sales. We want to extract only one field (the diamond color) from all the information.
Here, we are working with Datasets. We could extract a function that takes a diamond as input and returns its color as output:
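Here is a minimal sketch of what this could look like (the Diamond case class, its fields and the DiamondJob object are illustrative, not the real schema):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative schema, kept minimal for the examples.
case class Diamond(carat: Double, color: String, price: Double)

object DiamondJob {
  // A plain Scala function: it needs nothing from Spark.
  def selectColor(diamond: Diamond): String = diamond.color

  // Applied to a whole Dataset with a simple map.
  def allColors(diamonds: Dataset[Diamond])(implicit spark: SparkSession): Dataset[String] = {
    import spark.implicits._
    diamonds.map(selectColor)
  }
}
```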
Our function selectColor is like any other function in Scala. We can test it the same way.
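For instance, a Specs2 test for it does not need any SparkSession:

```scala
import org.specs2.mutable.Specification

class SelectColorSpec extends Specification {
  "selectColor" should {
    "return the color of a diamond" in {
      val diamond = Diamond(carat = 0.23, color = "E", price = 326)
      DiamondJob.selectColor(diamond) must beEqualTo("E")
    }
  }
}
```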
But most of the time, what we want to test is more complex.
Let’s make up some naive relationships between diamond prices and trendy colors.
We have data about diamonds sales on one hand:
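For instance, a few rows of sales, reusing the illustrative Diamond case class above:

```scala
// Illustrative sample of diamond sales (inside a job or a test).
val diamondSales = Seq(
  Diamond(carat = 0.23, color = "E", price = 326),
  Diamond(carat = 0.21, color = "G", price = 327),
  Diamond(carat = 0.29, color = "D", price = 334)
)
```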
and trendy colors on the other hand:
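A hypothetical representation could be:

```scala
// Hypothetical representation of a trendy color.
case class TrendyColor(color: String)

val trendyColors = Seq(
  TrendyColor(color = "E"),
  TrendyColor(color = "D")
)
```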
And then our code is going to be more complex:
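Something along these lines, assuming a SparkSession named spark is already in scope:

```scala
// Inside the job, assuming a SparkSession named `spark` is in scope.
import spark.implicits._

val diamondsDS     = diamondSales.toDS()
val trendyColorsDS = trendyColors.toDS()

// Keep only the prices of the diamonds whose color is currently trendy.
val pricesOfTrendyDiamonds = diamondsDS
  .join(trendyColorsDS, "color")
  .select($"color", $"price")
```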
Again, we can extract a function here like this:
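A sketch of the extracted function (the fields of Result and the TrendyDiamondsJob object are illustrative):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Illustrative output type: the price of a diamond whose color is trendy.
case class Result(color: String, price: Double)

object TrendyDiamondsJob {
  def priceOfDiamondsWithTrendyColors(
      diamonds: Dataset[Diamond],
      trendyColors: Dataset[TrendyColor]
  )(implicit spark: SparkSession): Dataset[Result] = {
    import spark.implicits._
    diamonds
      .join(trendyColors, "color")
      .select($"color", $"price")
      .as[Result]
  }
}
```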
Here, to test, we need a SparkSession (the SparkSession is the entry point to all of Spark's functionality).
We need to create our Datasets and join them.
We also need a SparkSession to import the Spark implicits. You can see that in our method signature, which takes a SparkSession. The implicits are needed for the transformations Spark performs; in this example, they let us cast our final Dataset to Result.
Some people use a mini cluster to get a SparkSession in their tests. It leads to slow tests.
Spark Session in our tests
We found another solution. We decided to create a SparkSession directly in our tests.
You could do something like this:
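For instance, a sketch of a Specs2 specification that creates its own local SparkSession (reusing the illustrative classes above):

```scala
import org.apache.spark.sql.SparkSession
import org.specs2.mutable.Specification
import TrendyDiamondsJob.priceOfDiamondsWithTrendyColors

class PriceOfDiamondsWithTrendyColorsSpec extends Specification {

  // A local SparkSession: everything runs inside the JVM of the test, no cluster needed.
  implicit val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("spark-tests")
    .getOrCreate()

  import spark.implicits._

  "priceOfDiamondsWithTrendyColors" should {
    "keep only the prices of diamonds with a trendy color" in {
      val diamonds     = Seq(Diamond(0.23, "E", 326), Diamond(0.21, "J", 327)).toDS()
      val trendyColors = Seq(TrendyColor("E")).toDS()

      val result = priceOfDiamondsWithTrendyColors(diamonds, trendyColors).collect().toList

      result must beEqualTo(List(Result("E", 326)))
    }
  }
}
```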
We do not want to test Spark itself; we are pretty sure it works. What we want to test is whether we are using Spark correctly.
Spark Session in a wrapper
We create a SparkSession for our test, but we are going to need it for almost all our tests. We found a solution (thanks to the web and its bloggers).
We are going to use a wrapper for all our tests that need a Spark session.
We build a trait with a Spark session this way:
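For example (the trait name is just a convention you will find on several blogs):

```scala
import org.apache.spark.sql.SparkSession

// Every specification that mixes in this trait gets the same lazily created SparkSession,
// because getOrCreate reuses the session already started in the JVM.
trait SparkSessionTestWrapper {
  implicit lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("spark-tests")
    .getOrCreate()
}
```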
And use it this way:
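Something like this, where the specification only has to mix in the trait:

```scala
import org.specs2.mutable.Specification
import TrendyDiamondsJob.priceOfDiamondsWithTrendyColors

class PriceOfDiamondsWithTrendyColorsSpec extends Specification with SparkSessionTestWrapper {
  import spark.implicits._

  "priceOfDiamondsWithTrendyColors" should {
    "keep only the prices of diamonds with a trendy color" in {
      val diamonds     = Seq(Diamond(0.23, "E", 326), Diamond(0.21, "J", 327)).toDS()
      val trendyColors = Seq(TrendyColor("E")).toDS()

      priceOfDiamondsWithTrendyColors(diamonds, trendyColors).collect().toList must
        beEqualTo(List(Result("E", 326)))
    }
  }
}
```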
Now we can use the same Spark session for all our tests.
External files vs. DataFrames/Datasets creation
At the beginning, I used files as inputs to stay closer to our real inputs. But sometimes perfectionism can be your enemy. Now I prefer creating DataFrames or Datasets directly with the “toDF” or “toDS” functions.
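For instance, inside a test that has imported spark.implicits._:

```scala
// Build the test inputs in memory instead of reading fixture files.
val diamonds = Seq(
  Diamond(0.23, "E", 326),
  Diamond(0.21, "G", 327)
).toDS()

// The same idea works for DataFrames.
val trendyColorsDF = Seq("E", "D").toDF("color")
```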
That was a good start. I was very glad to have found all these solutions. But I realized that it was not enough.
We faced performance issues.
For instance, we had code like the following:
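Sketched with Specs2 matchers, inside a test like the ones above:

```scala
val result = priceOfDiamondsWithTrendyColors(diamonds, trendyColors)

// Two Spark actions on the same Dataset: the whole computation runs twice.
result.count() must beEqualTo(1L)
result.collect().toList must beEqualTo(List(Result("E", 326)))
```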
The problem is that “count” and “collect” are two Spark actions, and only actions launch the computation. By calling “count” and then “collect”, we launch the “priceOfDiamondsWithTrendyColors” function twice.
To avoid that, we decided to call “collect” only once and work with a Scala array instead of Spark. We then run our assertions against this array.
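The same test, with a single action:

```scala
// A single Spark action: collect once, then assert on the plain Scala array.
val result = priceOfDiamondsWithTrendyColors(diamonds, trendyColors).collect()

result.length must beEqualTo(1)
result.toList must beEqualTo(List(Result("E", 326)))
```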
Run once globally, test individually
To speed up my tests, I run a single global Spark action and then test its results individually.
Suppose I have a function that takes a Dataset of diamonds as input and returns a median price by color.
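One possible way to write it (the MedianPriceByColor case class, the MedianJob object and the use of Spark SQL's exact percentile aggregate are illustrative choices, not necessarily the original code):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.expr

// Illustrative output type: one median price per color.
case class MedianPriceByColor(color: String, medianPrice: Double)

object MedianJob {
  def medianPriceByColor(diamonds: Dataset[Diamond])(implicit spark: SparkSession): Dataset[MedianPriceByColor] = {
    import spark.implicits._
    diamonds
      .groupBy($"color")
      .agg(expr("percentile(price, 0.5)").as("medianPrice"))
      .as[MedianPriceByColor]
  }
}
```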
A classical way is to run the process once with an even number of values and another time with an odd number of values.
But Spark is slow when testing.
Another way is to run this process only once. We put an even number of values for the color “green” and an odd number of values for the color “red”. We can then check the results in different tests.
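A sketch of what such a specification could look like, assuming the exact-median implementation above (so an even count averages the two middle prices):

```scala
import org.specs2.mutable.Specification
import MedianJob.medianPriceByColor

class MedianPriceByColorSpec extends Specification with SparkSessionTestWrapper {
  import spark.implicits._

  // The Spark job runs only once, lazily, the first time an example needs the results.
  lazy val medians: Map[String, Double] = {
    val diamonds = Seq(
      // an even number of green prices: the median is the mean of the two middle values
      Diamond(0.3, "green", 100), Diamond(0.3, "green", 200),
      Diamond(0.3, "green", 300), Diamond(0.3, "green", 400),
      // an odd number of red prices: the median is the middle value
      Diamond(0.3, "red", 100), Diamond(0.3, "red", 200), Diamond(0.3, "red", 300)
    ).toDS()

    medianPriceByColor(diamonds).collect()
      .map(m => m.color -> m.medianPrice).toMap
  }

  "medianPriceByColor" should {
    "compute the median of an even number of prices" in {
      medians("green") must beEqualTo(250.0)
    }
    "compute the median of an odd number of prices" in {
      medians("red") must beEqualTo(200.0)
    }
  }
}
```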
In this article, you can find the solutions I use to test with Apache Spark.
I found these ones thanks to several blogs and discussions. I am open to other ways to do it. Feel free to share yours.
I have included some links where you can find more information about this subject.
Resources in English:
Resources in French: