Towards QA II: Testing Spark Apps

Data Reply · May 2, 2018

This post is part II in our series on QA. Previously we discussed the concepts of property-based testing (PBT) in the context of an individual application. In this post we will focus on applying PBT and other techniques to test Spark applications.

We will focus on the most popular testing library for Spark: spark-testing-base.

sscheck is an honourable mention, but in the interest of brevity we will not be covering it here.

Unit testing

spark-testing-base lets us write concise unit tests for Spark applications without the boilerplate needed to set up and tear down a SparkContext. Let’s take a look at some of the operations you can do with it…

When setting up a project don’t forget to:

1. Increase heap and perm gen size with: javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
2. Disable parallel execution: parallelExecution in Test := false
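Putting those settings together, a minimal build.sbt might look like the sketch below. The version numbers are illustrative placeholders (adjust them to match your Spark version), and the fork setting is included because SBT only applies javaOptions to a forked test JVM.

// build.sbt: a minimal sketch, version numbers are illustrative placeholders
name := "my-spark-app"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  // spark-testing-base releases are versioned as <sparkVersion>_<libraryVersion>
  "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.8.0" % "test"
)

// Spark needs more memory than the SBT defaults when running tests
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")

// javaOptions only take effect in a forked JVM
fork in Test := true

// SparkContexts cannot safely be created in parallel within one JVM
parallelExecution in Test := false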

Initialisation:

spark-testing-base handles the setup and teardown of the SparkContext for you by means of the SharedSparkContext trait, which automatically creates a context in local mode and shares it amongst all tests in a suite. Since Akka will not immediately unbind its port on shutdown, the trait also clears the spark.driver.port property for you.

All we need to do is create a normal ScalaTest suite and mix in SharedSparkContext:

import com.holdenkarau.spark.testing.{RDDComparisons, SharedSparkContext}
import org.scalatest.FunSuite

class MyTests extends FunSuite with SharedSparkContext {
  test("test rdd comparison") {
    val expectedRDD = sc.parallelize(List(1, 2, 3))
    val resultRDD = sc.parallelize(List(3, 2, 1))

    // Assert equal without ordering, and not equal with ordering
    assert(RDDComparisons.compare(expectedRDD, resultRDD).isEmpty)
    assert(RDDComparisons.compareWithOrder(expectedRDD, resultRDD).nonEmpty)
  }
}

There are two cases here: comparing with or without ordering. compare returns None when the two RDDs contain the same elements in any order, while compareWithOrder only returns None when the elements also appear in the same order.
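spark-testing-base also integrates with ScalaCheck, so the property-based testing approach from part I carries over directly to Spark. A minimal sketch, assuming ScalaCheck is on the test classpath and using the library’s RDDGenerator (the suite and property names here are just illustrative):

import com.holdenkarau.spark.testing.{RDDGenerator, SharedSparkContext}
import org.scalacheck.Arbitrary
import org.scalacheck.Prop.forAll
import org.scalatest.FunSuite
import org.scalatest.prop.Checkers

class MyPropertyTests extends FunSuite with SharedSparkContext with Checkers {
  test("map should not change the number of elements") {
    // Generate arbitrary RDD[String]s and check the property for each one
    val property =
      forAll(RDDGenerator.genRDD[String](sc)(Arbitrary.arbitrary[String])) {
        rdd => rdd.map(_.length).count() == rdd.count()
      }
    check(property)
  }
}

Because the generator produces many differently sized and shaped RDDs, a single property like this exercises far more inputs than a handful of hand-written fixtures would.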

Originally published at www.datareply.co.uk.
