Unit Testing Apache Spark Applications using Hive Tables
Techniques for creating and managing unit tests of Spark batch applications
At HomeAway, we have many batch applications that use Apache Spark to process data from Hive tables backed by S3 datasets. These applications perform Spark SQL transformations to generate their final output, much like an SQL stored procedure in an RDBMS. We wanted to bake unit tests into the intermediate steps to keep these applications robust and to catch breaking changes in the future. However, we faced some challenges.
Challenges
- Hive tables and S3 datasets are not available on developers’ local machines (or on build servers like Jenkins) to run tests without additional setup.
- The code needs a Spark session to run. We needed a way to get a local Spark session (on developer machines and Jenkins) that does not connect to the production Hive Metastore.
The team evaluated several options and settled on the spark-testing-base library as our best practice for unit testing such applications. Below are some key techniques that helped us unit test our Spark applications.
Toolset
spark-testing-base
spark-testing-base is a library that simplifies the unit testing of Spark applications. It provides utility classes to create out-of-the-box Spark sessions and DataFrame utility methods that can be used in assert statements.
ScalaTest
ScalaTest is a powerful tool that can be used to unit test Scala and Java code. It is similar to JUnit for Java.
Setup
- Add a dependency for spark-testing-base to your pom.xml:
    <dependency>
        <groupId>com.holdenkarau</groupId>
        <artifactId>spark-testing-base_2.11</artifactId>
        <version>${spark.version}_0.10.0</version>
        <scope>test</scope>
    </dependency>
- Add the ScalaTest Maven plugin to your pom.xml:
    <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <version>2.0.0</version>
        <configuration>
            <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
            <junitxml>.</junitxml>
            <filereports>WDF TestSuite.txt</filereports>
        </configuration>
        <executions>
            <execution>
                <id>test</id>
                <goals>
                    <goal>test</goal>
                </goals>
            </execution>
        </executions>
    </plugin>
Key techniques we used
Using the out-of-the-box local Spark session
Mixing in the SharedSparkContext trait makes a locally generated Spark context available to tests without any additional setup; the DataFrameSuiteBase trait builds on it to provide a local Spark session as well.
We refactored our classes to receive this Spark session from the Main class using dependency injection. In the production environment, the session points to the production Hive Metastore; during local development and unit tests, it points to an embedded Metastore inside the JVM.
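Here is a minimal sketch of this pattern. The BookingTransformer class and its query are hypothetical; the test gets its local Spark session from spark-testing-base's DataFrameSuiteBase trait:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.scalatest.FunSuite

    // Hypothetical transformation class: the Spark session is injected, so
    // production code passes a session bound to the real Metastore while
    // tests pass the locally generated one.
    class BookingTransformer(spark: SparkSession) {
      def activeBookings(): DataFrame =
        spark.sql("SELECT booking_id, status FROM bookings.bookings WHERE status = 'active'")
    }

    class BookingTransformerSpec extends FunSuite with DataFrameSuiteBase {
      test("runs against the locally generated Spark session") {
        val transformer = new BookingTransformer(spark) // spark is provided by DataFrameSuiteBase
        // assertions against transformer.activeBookings() would go here
      }
    }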
Test bed (test database) setup
All the databases and tables that the application uses can be defined up front, pointing to a temporary location. A utility class can perform this test bed setup and be called once before the tests run.
Moreover, any tables that are needed for the code to work, but not necessarily needed for unit testing, can be defined up front as empty tables.
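As a rough sketch, such a setup utility might look like the following; the database, table, and column names are hypothetical:

    import org.apache.spark.sql.SparkSession

    // Hypothetical test-bed utility: creates the databases and empty tables the
    // application expects, rooted at a temporary warehouse location.
    object TestBed {
      def setup(spark: SparkSession, warehouseDir: String): Unit = {
        spark.sql(s"CREATE DATABASE IF NOT EXISTS bookings LOCATION '$warehouseDir/bookings.db'")
        // An empty table that the code under test expects, but no test reads.
        spark.sql(
          """CREATE TABLE IF NOT EXISTS bookings.audit_log (
            |  event_id BIGINT,
            |  event_type STRING
            |) STORED AS PARQUET""".stripMargin)
      }
    }

A utility like this can be invoked once from the suite's beforeAll so every test starts from the same schema.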
We’ll discuss defining tables with specific data for unit testing in the next section.
Using CSV files to populate Hive tables
We found CSV files to be a simple way to define the data each table needed for unit testing. Once set up, the CSVs were easy to edit to cover different data scenarios.
Steps (a short sketch follows the list):
- Use StructType to define the table schema. Only define the columns that your program uses; there is no point in defining every column of the Hive table if the app never reads them.
- Store CSV files covering all the data combinations in test/resources.
- Use the Spark CSV reader to create a DataFrame pointing to the CSV files stored in test/resources.
- Use Spark's saveAsTable method to define a Hive table from this DataFrame.
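Putting these steps together, here is a minimal sketch, assuming a Spark session named spark (e.g. from DataFrameSuiteBase) and a hypothetical bookings.csv under test/resources:

    import org.apache.spark.sql.types._

    // Only the columns the application actually reads (hypothetical example).
    val bookingSchema = StructType(Seq(
      StructField("booking_id", LongType),
      StructField("property_id", LongType),
      StructField("amount", DoubleType)
    ))

    // Read the test data stored in test/resources into a DataFrame...
    val bookingsDf = spark.read
      .schema(bookingSchema)
      .option("header", "true")
      .csv(getClass.getResource("/bookings.csv").getPath)

    // ...and register it as a Hive table for the code under test to query.
    bookingsDf.write.mode("overwrite").saveAsTable("bookings.bookings")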
DataFrame assert method
The DataFrameSuiteBase trait provides a method named assertDataFrameEquals that can be used to compare two DataFrames.
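A minimal example, with hypothetical column names:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.FunSuite

    class TransformSpec extends FunSuite with DataFrameSuiteBase {
      test("transformation produces the expected rows") {
        val session = spark
        import session.implicits._
        val expected = Seq((1L, "confirmed"), (2L, "cancelled")).toDF("booking_id", "status")
        val actual   = Seq((1L, "confirmed"), (2L, "cancelled")).toDF("booking_id", "status")
        // Fails with a row-by-row diff if the two DataFrames differ.
        assertDataFrameEquals(expected, actual)
      }
    }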
Using ScalaTest Matchers to assert DataFrame elements
The ScalaTest Matchers trait provides easy-to-read assertions. We used these to check individual elements in the resulting DataFrames.
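For example, a test can pull a single row out of a result DataFrame and make readable assertions on its fields; the data and names below are hypothetical:

    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.{FunSuite, Matchers}

    class ResultSpec extends FunSuite with DataFrameSuiteBase with Matchers {
      test("booking 1 ends up confirmed") {
        val session = spark
        import session.implicits._
        val resultDf = Seq((1L, "confirmed")).toDF("booking_id", "status") // stand-in for real output
        val row = resultDf.filter($"booking_id" === 1L).head()
        row.getAs[String]("status") shouldBe "confirmed"
      }
    }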
Conclusion
These were some of the techniques we used to unit test our Spark batch applications. spark-testing-base offers other useful features as well, such as generators for test Datasets, DataFrames, and resilient distributed datasets (RDDs). The StreamingSuiteBase trait also looks very promising and easy to use for Spark Streaming applications.
Being new to Scala, it was refreshing to discover all the features ScalaTest provides. We used FunSuite, but there are many other testing styles we can explore and incorporate.
This was an important step for our team toward improving unit test coverage for our Spark batch applications. Testing Scala-based Spark code snippets in notebooks and then porting them into a Scala application improves time to release.
We are excited to continue building more unit testing techniques for other Spark-based applications!