Unit Testing Apache Spark Applications using Hive Tables
Techniques for creating and managing unit tests of Spark batch applications
At HomeAway, we have many batch applications that use Apache Spark to process data from Hive tables backed by S3 datasets. These applications perform Spark SQL transformations to generate their final output, much like an SQL stored procedure in an RDBMS. We wanted to bake unit tests into the intermediate steps to keep these applications robust and to catch breaking changes in the future. However, we faced some challenges.
Challenges
- Hive tables and S3 datasets are not available on developers’ local machines (or on build servers like Jenkins) to run tests without additional setup.
- The code needs a Spark session to run. We needed a way to get a local Spark session (on developer machines and Jenkins) that does not connect to the production Hive Metastore.
The team evaluated several options and settled on the spark-testing-base library as our best practice for unit testing such applications. Below are some key techniques that helped us unit test our Spark applications.
Toolset
spark-testing-base
spark-testing-base is a library that simplifies the unit testing of Spark applications. It provides utility classes to create out-of-the-box Spark sessions and DataFrame utility methods that can be used in assert statements.
ScalaTest
ScalaTest is a powerful tool that can be used to unit test Scala and Java code. It is similar to JUnit for Java.
Setup
- Add a dependency for spark-testing-base to your pom.xml:
    <dependency>
        <groupId>com.holdenkarau</groupId>
        <artifactId>spark-testing-base_2.11</artifactId>
        <version>${spark.version}_0.10.0</version>
        <scope>test</scope>
    </dependency>
- Add the ScalaTest Maven plugin to your pom.xml:
    <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <version>2.0.0</version>
        <configuration>
            <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
            <junitxml>.</junitxml>
            <filereports>WDF TestSuite.txt</filereports>
        </configuration>
        <executions>
            <execution>
                <id>test</id>
                <goals>
                    <goal>test</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
Key techniques we used
Using the out-of-the-box local Spark session
Mixing in the SharedSparkContext trait makes a locally generated Spark context available to tests without any additional setup; the DataFrameSuiteBase trait builds on it to provide a local Spark session as well.
We refactored our classes to receive this Spark session from the Main class using dependency injection. In the production environment, the session points to the production Hive Metastore; during local development and unit tests, it points to an embedded Metastore inside the JVM.
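Here is a minimal sketch of this pattern. The BookingTransformer class and its query are hypothetical; the test gets its local Spark session from spark-testing-base's DataFrameSuiteBase trait:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.scalatest.FunSuite

    // Hypothetical transformation class: the Spark session is injected, so
    // production code passes a session bound to the real Metastore while
    // tests pass the locally generated one.
    class BookingTransformer(spark: SparkSession) {
      def activeBookings(): DataFrame =
        spark.sql("SELECT booking_id, status FROM bookings.bookings WHERE status = 'active'")
    }

    class BookingTransformerSpec extends FunSuite with DataFrameSuiteBase {
      test("runs against the locally generated Spark session") {
        val transformer = new BookingTransformer(spark) // spark is provided by DataFrameSuiteBase
        // assertions against transformer.activeBookings() would go here
      }
    }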
Test bed (test database) setup
All the databases and tables that the application uses can be defined up front, pointing to a temporary location. A utility class can perform this test bed setup and be called once before the tests run.
Moreover, any tables that are needed for the code to work, but not necessarily needed for unit testing, can be defined up front as empty tables.
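As a rough sketch, such a setup utility might look like the following; the database, table, and column names are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Hypothetical test-bed utility: creates the databases and empty tables the
    // application expects, rooted at a temporary warehouse location.
    object TestBed {
      def setup(spark: SparkSession, warehouseDir: String): Unit = {
        spark.sql(s"CREATE DATABASE IF NOT EXISTS bookings LOCATION '$warehouseDir/bookings.db'")
        // An empty table that the code under test expects, but no test reads.
        spark.sql(
          """CREATE TABLE IF NOT EXISTS bookings.audit_log (
            |  event_id BIGINT,
            |  event_type STRING
            |) STORED AS PARQUET""".stripMargin)
      }
    }

A utility like this can be invoked once from the suite's beforeAll so every test starts from the same schema.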
We’ll discuss defining tables with specific data for unit testing in the next section.
Using CSV files to populate Hive tables
We found CSV files to be a simple way to define the data each table needed for unit testing. Once set up, the CSVs were easy to edit to cover different data scenarios.
Steps (a short sketch follows the list):
- Use StructType to define the table schema. Only define the columns that your program uses; there is no point in defining every column of the Hive table if the app never reads them.
- Store CSV files covering all the data combinations in test/resources.
- Use the Spark CSV reader to create a DataFrame pointing to the CSV files stored in test/resources.
- Use Spark's saveAsTable method to define a Hive table from this DataFrame.
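Putting these steps together, here is a minimal sketch, assuming a Spark session named spark (e.g. from DataFrameSuiteBase) and a hypothetical bookings.csv under test/resources:

    import org.apache.spark.sql.types._

    // Only the columns the application actually reads (hypothetical example).
    val bookingSchema = StructType(Seq(
      StructField("booking_id", LongType),
      StructField("property_id", LongType),
      StructField("amount", DoubleType)
    ))

    // Read the test data stored in test/resources into a DataFrame...
    val bookingsDf = spark.read
      .schema(bookingSchema)
      .option("header", "true")
      .csv(getClass.getResource("/bookings.csv").getPath)

    // ...and register it as a Hive table for the code under test to query.
    bookingsDf.write.mode("overwrite").saveAsTable("bookings.bookings")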
DataFrame assert method
The DataFrameSuiteBase trait provides a method named assertDataFrameEquals that can be used to compare two DataFrames.
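A minimal example, with hypothetical column names:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.FunSuite

    class TransformSpec extends FunSuite with DataFrameSuiteBase {
      test("transformation produces the expected rows") {
        val session = spark
        import session.implicits._
        val expected = Seq((1L, "confirmed"), (2L, "cancelled")).toDF("booking_id", "status")
        val actual   = Seq((1L, "confirmed"), (2L, "cancelled")).toDF("booking_id", "status")
        // Fails with a row-by-row diff if the two DataFrames differ.
        assertDataFrameEquals(expected, actual)
      }
    }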
Using ScalaTest Matchers to assert DataFrame elements
The ScalaTest Matchers trait provides easy-to-read assertions. We used these to check individual elements in the resulting DataFrames.
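For example, a test can pull a single row out of a result DataFrame and make readable assertions on its fields; the data and names below are hypothetical:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.{FunSuite, Matchers}

    class ResultSpec extends FunSuite with DataFrameSuiteBase with Matchers {
      test("booking 1 ends up confirmed") {
        val session = spark
        import session.implicits._
        val resultDf = Seq((1L, "confirmed")).toDF("booking_id", "status") // stand-in for real output
        val row = resultDf.filter($"booking_id" === 1L).head()
        row.getAs[String]("status") shouldBe "confirmed"
      }
    }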
Conclusion
These were some of the techniques we used to unit test our Spark batch applications. spark-testing-base offers other useful features as well, such as generators for test Datasets, DataFrames, and resilient distributed datasets (RDDs). The StreamingSuiteBase trait also looks very promising and easy to use for Spark Streaming applications.
Being new to Scala, it was refreshing to discover all the features ScalaTest provides. We used FunSuite, but there are many other testing styles we can explore and incorporate.
This was an important step for our team toward improving unit test coverage for our Spark batch applications. Testing Scala-based Spark code snippets in notebooks and then porting them into a Scala application improves time to release.
We are excited to continue building more unit testing techniques for other Spark-based applications!