Designing Easily Testable Spark Code

Well designed code is reusable, modular, maintainable… and easily testable.

It’s hard to quantify the modularity or maintainability of code, but it’s easy to measure the code’s testability. Just take a chunk of code and try to test it 😜

This blog post demonstrates how to write DataFrame transformations and user defined functions that are easy to test, so you can enjoy the benefits of beautiful Spark code.

Why should we design code that’s testable?

Sandi Metz wrote a masterful description of how intentional testing reduces costs in her book Practical Object Oriented Design in Ruby.

The most common arguments for having tests are that they reduce bugs and provide documentation, and that writing tests first improves application design.
These benefits, however valid, are proxies for a deeper goal. The true purpose of testing, just like the true purpose of design, is to reduce costs. If writing, maintaining, and running tests consumes more time than would otherwise be needed to fix bugs, write documentation, and design applications, tests are clearly not worth writing and no rational person would argue otherwise.
It is common for programmers who are new to testing to find themselves in the unhappy state where the tests they write do cost more than the value those tests provide, and who therefore want to argue about the worth of tests. These are programmers who believed themselves highly productive in their former test-not lives but who have crashed into the test-first wall and stumbled to a halt. Their attempts at test-first programming result in less output, and their desire to regain productivity drives them to revert to old habits and forgo writing tests.
The solution to the problem of costly tests, however, is not to stop testing but instead to get better at it. Getting good value from tests requires clarity of intention and knowing what, when, and how to test.

Spark code that’s backed by a test suite is easier to reuse, maintain, and refactor. It’s a code smell when Spark code isn’t backed by a test suite.

A code smell is a surface indication that usually corresponds to a deeper problem in the system. — Martin Fowler

A lot of Spark coders don’t have a strong programming background, so it’s hard for them to imagine how tests reduce the cost of developing code.

Let’s start by showing the skeptics how they can design code that’s a lot easier to test. Once they’re writing well designed code, we can try to upsell them on the idea of writing some tests 😉

Structuring code with DataFrame transformations

Let’s take a look at some code that doesn’t use a DataFrame transformation.

import org.apache.spark.sql.functions.when
import spark.implicits._ // already in scope in spark-shell

val sourceDF = Seq(
  (2),
  (10)
).toDF("wave_height")

val funDF = sourceDF.withColumn(
  "stoke_level",
  when($"wave_height" > 6, "radical").otherwise("bummer")
)

funDF.show()
+-----------+-----------+
|wave_height|stoke_level|
+-----------+-----------+
|          2|     bummer|
|         10|    radical|
+-----------+-----------+

This code is really difficult to test and can’t be reused because the logic is inlined. We’d need to copy and paste the withColumn call to run the same logic elsewhere 😱

Let’s refactor the code with a DataFrame transformation.

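import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.when

val sourceDF = Seq(
  (2),
  (10)
).toDF("wave_height")

// A custom transformation takes a DataFrame and returns a DataFrame
def withStokeLevel()(df: DataFrame): DataFrame = {
  df.withColumn(
    "stoke_level",
    when($"wave_height" > 6, "radical").otherwise("bummer")
  )
}

val funDF = sourceDF.transform(withStokeLevel())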

The withStokeLevel DataFrame transformation can be easily reused and tested 🎊

It’s usually easy to wrap code in a DataFrame transformation and this is the easiest way to up your Spark game!
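Here’s a rough sketch of what a check for withStokeLevel could look like. It isn’t tied to any test framework: it uses plain assert and a local SparkSession, and the names StokeLevelCheck and stokeLevels are made up for this example. The transformation is written with col() instead of $ so it doesn’t depend on spark.implicits._ being in scope.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object StokeLevelCheck {

  // Same logic as withStokeLevel in the post, using col() instead of $
  def withStokeLevel()(df: DataFrame): DataFrame = {
    df.withColumn(
      "stoke_level",
      when(col("wave_height") > 6, "radical").otherwise("bummer")
    )
  }

  // Hypothetical helper: runs the transformation on a tiny DataFrame and
  // returns plain Scala values that are easy to compare
  def stokeLevels(spark: SparkSession): List[(Int, String)] = {
    import spark.implicits._
    Seq(2, 6, 10)
      .toDF("wave_height")
      .transform(withStokeLevel())
      .collect()
      .map(row => (row.getInt(0), row.getString(1)))
      .toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("stoke-level-check")
      .getOrCreate()
    // 6 is not greater than 6, so the boundary value is a "bummer"
    assert(stokeLevels(spark) == List((2, "bummer"), (6, "bummer"), (10, "radical")))
    spark.stop()
  }
}
```

Because the logic lives in a named function, the check can feed it any DataFrame it likes, including boundary values such as 6.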

Structuring code with user defined functions

Let’s take a look at some code that doesn’t leverage a user defined function.

val sourceDF = Seq(
  ("tulips are pretty"),
  ("apples are yummy")
).toDF("joys")

val happyDF = sourceDF.withColumn(
  "joys_contains_tulips",
  $"joys".contains("tulips")
)

happyDF.show()
+-----------------+--------------------+
|             joys|joys_contains_tulips|
+-----------------+--------------------+
|tulips are pretty|                true|
| apples are yummy|               false|
+-----------------+--------------------+

This code isn’t testable or reusable. Let’s refactor the code with a user defined function.

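import org.apache.spark.sql.functions.udf

val sourceDF = Seq(
  ("tulips are pretty"),
  ("apples are yummy")
).toDF("joys")

// Plain Scala function: testable without Spark
def containsTulips(str: String): Boolean = {
  str.contains("tulips")
}

// Thin UDF wrapper so the function can be applied to a Column
val containsTulipsUDF = udf[Boolean, String](containsTulips)

val happyDF = sourceDF.withColumn(
  "joys_contains_tulips",
  containsTulipsUDF($"joys")
)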

Fantastic — the containsTulips and containsTulipsUDF functions can now be tested and reused!

It’s a lot easier and faster to test regular Scala functions than user defined functions. For example, containsTulips is easier to test than containsTulipsUDF because it can be called directly, without a SparkSession or a DataFrame.

In real world applications, functions like containsTulips will have more complex logic, and we can use standard Scala testing techniques to cover all the edge cases. We can use a single integration test to make sure containsTulipsUDF functions as expected.
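To make that concrete, here’s a minimal sketch of edge case checks for containsTulips using plain assert. The object name ContainsTulipsCheck is made up for this example, and in a real project these checks would likely live in whatever test framework you already use.

```scala
import scala.util.Try

object ContainsTulipsCheck {

  // Same logic as in the post: plain Scala, no Spark required
  def containsTulips(str: String): Boolean = {
    str.contains("tulips")
  }

  def main(args: Array[String]): Unit = {
    assert(containsTulips("tulips are pretty"))
    assert(!containsTulips("apples are yummy"))
    assert(!containsTulips(""))            // empty string
    assert(!containsTulips("Tulips rule")) // the match is case sensitive
    assert(!containsTulips("one tulip"))   // singular form does not match
    // A null input throws a NullPointerException, which is worth knowing
    // before the function is wrapped in a UDF and fed nullable columns
    assert(Try(containsTulips(null)).isFailure)
  }
}
```

These checks run in milliseconds because nothing here starts Spark. The null case is the kind of surprise that’s much cheaper to find here than in a failed job.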

Writing single purpose DataFrame transformations

Methods, like classes, should have a single responsibility. — Sandi Metz

DataFrame transformations should only do one thing, consistent with the Unix philosophy and the single responsibility principle. Functions that do one thing are easier to reuse and maintain.

Let’s look at a DataFrame transformation that does two things.

def doesStuff()(df: DataFrame): DataFrame = {
  df
    .filter($"num" % 2 === 0)
    .withColumn("food", lit("rum ham"))
}

val sourceDF = Seq(
  (2),
  (5),
  (10)
).toDF("num")

val sunnyDF = sourceDF.transform(doesStuff())

sunnyDF.show()
+---+-------+
|num|   food|
+---+-------+
|  2|rum ham|
| 10|rum ham|
+---+-------+

The doesStuff transformation filters out all of the odd numbers from the DataFrame and appends a food column.

Let’s split up doesStuff into two DataFrame transformations to make the code more modular and testable.

def filterOddNumbers()(df: DataFrame): DataFrame = {
  df.filter($"num" % 2 === 0)
}

def withRumHam()(df: DataFrame): DataFrame = {
  df.withColumn("food", lit("rum ham"))
}

val sourceDF = Seq(
  (2),
  (5),
  (10)
).toDF("num")

val sunnyDF = sourceDF
  .transform(filterOddNumbers())
  .transform(withRumHam())

filterOddNumbers and withRumHam can be used independently, so the code is more modular. These DataFrame transformations can also be tested independently — it’s always easier to test single purpose functions.
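As a sketch of what testing them independently might look like, the example below checks each transformation with its own tiny DataFrame. The object SinglePurposeChecks and the helpers evenNums and foods are hypothetical names, and col() is used in place of $ to keep the functions free of implicits.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object SinglePurposeChecks {

  def filterOddNumbers()(df: DataFrame): DataFrame = {
    df.filter(col("num") % 2 === 0)
  }

  def withRumHam()(df: DataFrame): DataFrame = {
    df.withColumn("food", lit("rum ham"))
  }

  // Exercises only the filter
  def evenNums(spark: SparkSession): List[Int] = {
    import spark.implicits._
    Seq(2, 5, 10).toDF("num")
      .transform(filterOddNumbers())
      .collect().map(_.getInt(0)).toList
  }

  // Exercises only the appended column; note there is no num column at all
  def foods(spark: SparkSession): List[String] = {
    import spark.implicits._
    Seq("a", "b").toDF("letter")
      .transform(withRumHam())
      .collect().map(_.getString(1)).toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("single-purpose-checks")
      .getOrCreate()
    assert(evenNums(spark) == List(2, 10))
    assert(foods(spark) == List("rum ham", "rum ham"))
    spark.stop()
  }
}
```

The withRumHam check doesn’t need a num column, which wouldn’t be possible with the combined doesStuff transformation.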

Testing methods that chain multiple DataFrame transformations

Ideally your code will be broken up into a bunch of single purpose DataFrame transformations and user defined functions. When you need to run all the code, you can chain the functions together.

def model()(df: DataFrame): DataFrame = {
  df
    .transform(standardizeFirstName())
    .transform(standardizeLastName())
    .withColumn("age_bucket", ageBucketizer($"age"))
}

DataFrame transformations like model can be really hard to test. You might need to build up very complex DataFrames that satisfy the column requirements of every transformation in the chain, and account for how those requirements interact.

Scala’s mocking libraries aren’t a natural fit for stubbing out DataFrame transformations, so it’s not easy to test this method with stubs or doubles either.

I recommend testing the individual components (e.g. test standardizeFirstName, standardizeLastName, and ageBucketizer) and writing a “does not blow up” test for really complex transformations like model. I’ll write a blog post on advanced Spark testing techniques to cover this approach in more detail.
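Until then, here’s one way a “does not blow up” check might be sketched. The bodies of standardizeFirstName, standardizeLastName, and ageBucketizer below are hypothetical stand-ins (the post doesn’t show the real ones), and ModelSmokeCheck and modelRuns are names invented for this example. The only thing the check verifies is that model runs end to end on a small, valid input.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lower, trim, udf}
import scala.util.Try

object ModelSmokeCheck {

  // Hypothetical stand-ins for the real transformations
  def standardizeFirstName()(df: DataFrame): DataFrame =
    df.withColumn("first_name", lower(trim(col("first_name"))))

  def standardizeLastName()(df: DataFrame): DataFrame =
    df.withColumn("last_name", lower(trim(col("last_name"))))

  // Hypothetical stand-in for the real bucketing UDF
  val ageBucketizer = udf[String, Int]((age: Int) => if (age < 18) "child" else "adult")

  def model()(df: DataFrame): DataFrame = {
    df
      .transform(standardizeFirstName())
      .transform(standardizeLastName())
      .withColumn("age_bucket", ageBucketizer(col("age")))
  }

  // Returns true when the whole chain runs without throwing
  def modelRuns(spark: SparkSession): Boolean = {
    import spark.implicits._
    val df = Seq((" Ada", "LOVELACE ", 36)).toDF("first_name", "last_name", "age")
    // collect() forces Spark to actually execute the plan
    Try(df.transform(model()).collect()).isSuccess
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("model-smoke-check")
      .getOrCreate()
    assert(modelRuns(spark))
    spark.stop()
  }
}
```

A check like this says nothing about correctness; that’s the job of the tests on the individual components. It only catches wiring mistakes such as a missing column or a type mismatch between steps.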

Disclaimers on the code snippets in this post

I intentionally simplified some of the code snippets in this post, but we can do better in our production code.

Let’s look at the filterOddNumbers DataFrame transformation.

def filterOddNumbers()(df: DataFrame): DataFrame = {
  df.filter($"num" % 2 === 0)
}

If we only want the filterOddNumbers transformation to work on DataFrames with a num column, we should add a DataFrame validation, as described in this blog post.

def filterOddNumbers()(df: DataFrame): DataFrame = {
  validatePresenceOfColumns(df, Seq("num"))
  df.filter($"num" % 2 === 0)
}
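If you don’t have a validation helper on hand, the idea can be sketched in a few lines. The version below is a hypothetical stand-in written for this example, not the implementation the linked post describes: it assumes the validation should simply fail fast with a readable message when a required column is missing.

```scala
import org.apache.spark.sql.DataFrame

object ColumnValidationSketch {

  // Pure helper: which of the required columns are absent?
  def missingColumns(actual: Seq[String], required: Seq[String]): Seq[String] =
    required.filterNot(c => actual.contains(c))

  // Hypothetical stand-in for validatePresenceOfColumns; throws an
  // IllegalArgumentException (via require) when a column is missing
  def validatePresenceOfColumns(df: DataFrame, required: Seq[String]): Unit = {
    val missing = missingColumns(df.columns.toSeq, required)
    require(
      missing.isEmpty,
      s"DataFrame is missing required columns: ${missing.mkString(", ")}"
    )
  }
}
```

Keeping the comparison in missingColumns means the interesting logic is a plain Scala function, so it can be tested without a DataFrame.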

It’d probably be better to structure filterOddNumbers as a schema independent transformation, so it’s more flexible.

def filterOddNumbers(colName: String)(df: DataFrame): DataFrame = {
  df.filter(col(colName) % 2 === 0)
}
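A quick sketch of the payoff: the same transformation now works on any integer column, whatever it’s called. The object SchemaIndependentCheck, the helper evens, and the column name player_count are made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SchemaIndependentCheck {

  def filterOddNumbers(colName: String)(df: DataFrame): DataFrame = {
    df.filter(col(colName) % 2 === 0)
  }

  // Hypothetical helper: builds a one-column DataFrame with the given
  // column name and returns the values that survive the filter
  def evens(spark: SparkSession, colName: String): List[Int] = {
    import spark.implicits._
    Seq(1, 4, 7, 8).toDF(colName)
      .transform(filterOddNumbers(colName))
      .collect().map(_.getInt(0)).toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-independent-check")
      .getOrCreate()
    assert(evens(spark, "num") == List(4, 8))
    assert(evens(spark, "player_count") == List(4, 8))
    spark.stop()
  }
}
```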

Conclusion

We can craft well designed Spark code with DataFrame transformations and user defined functions. Well designed code is easier to refactor, maintain, and reuse, which lowers the overall cost of our application.

Well designed code is easily testable and vice versa.

It’s a code smell when programs are difficult to test. Don’t shy away from the challenge of properly testing your code. Lean in, redesign your code so it’s more testable, and write some specs to start improving your code coverage 😉

Versions used in this post

  • Scala: 2.11
  • Spark: 2.2.0