Spark: Mocking Read, ReadStream, Write and WriteStream

Saurabh Shashank
Walmart Global Tech Blog
3 min read · Sep 29, 2022

Low code coverage in Spark pipelines is often caused by missing unit test cases for sources and sinks. Most of our Spark Scala pipelines had code coverage below 70%; we want it in the range of 85–95%.

Spark supports multiple batch and stream sources and sinks, such as:

  1. CSV
  2. JSON
  3. Parquet
  4. ORC
  5. Text
  6. JDBC/ODBC connections
  7. Kafka

The difficulty of writing test cases for Spark reads and writes brings down overall code coverage, and with it the pipeline’s overall quality score. Because reads and writes are omnipresent in a data pipeline, the corresponding lines of code make up a significant percentage of the total.

Let’s walk through a step-by-step guide to improving the code quality measures of Spark Scala pipelines.

Below is a quick summary of mocking concepts:

Mocking in unit tests lets you exercise the behavior and logic of your code without touching its external dependencies.

A mock is used for behavior verification; it gives full control over an object and its methods.

A stub holds a small, fixed set of data that is returned during unit tests.

A spy is used for partial mocking, where only selected behaviors are mocked and the object keeps its original behavior for everything else.

Let’s get into the Spark specifics.

Maven-based dependencies
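
The exact coordinates depend on your Spark, Scala, and Mockito versions. A minimal sketch of the test dependencies, written here in sbt form for brevity (the artifact names and versions are indicative only), might look like this; note that mocking final classes such as DataFrameWriter, DataStreamReader, and DataStreamWriter needs Mockito’s inline mock maker.

```scala
// build.sbt sketch: versions are indicative; align them with your
// Spark and Scala versions (the original article lists Maven coordinates).
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-sql"      % "3.3.0"    % Provided,
  "org.scalatest"     %% "scalatest"      % "3.2.14"   % Test,
  // mock[T] sugar for ScalaTest
  "org.scalatestplus" %% "mockito-4-6"    % "3.2.14.0" % Test,
  // inline mock maker: required to mock final classes such as DataFrameWriter,
  // DataStreamReader and DataStreamWriter
  "org.mockito"        % "mockito-inline" % "4.6.1"    % Test
)
```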

Let’s use an example of a Spark class where the source is Hive and the sink is JDBC.

DummySource.scala reads a table from Hive, applies a filter, and then writes to a JDBC table.
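
A minimal sketch of what such a class could look like is shown below; the table names, the JDBC URL, and the split into small methods are illustrative assumptions, chosen so that each piece can be exercised in isolation.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Illustrative sketch of the class under test: read from Hive, filter, write to JDBC.
class DummySource(spark: SparkSession) {

  // Source: a Hive table (name is a placeholder)
  def read(): DataFrame =
    spark.read.table("mydb.orders")

  // The logic we actually want to unit test
  def filterActive(df: DataFrame): DataFrame =
    df.filter(df("status") === "ACTIVE")

  // Sink: a JDBC table (URL and table name are placeholders)
  def write(df: DataFrame): Unit =
    df.write
      .mode(SaveMode.Append)
      .jdbc("jdbc:postgresql://dbhost:5432/mydb", "orders_active", new Properties())

  def run(): Unit = write(filterActive(read()))
}
```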

We skip the transform step where other logic is applied, as that can be tested using a stub.

We do not want our unit tests to actually connect to the source or sink, so here we want to mock the read and write behavior of Spark.
Before we look at the solution, it helps to get familiar with Spark’s DataFrameReader and DataFrameWriter.

Since we are going to mock these two classes, it’s important to know their methods and behaviors.
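
For orientation, this is the call chain a batch pipeline typically exercises, and therefore exactly what we will mock (a sketch; the table name is a placeholder):

```scala
import org.apache.spark.sql.{DataFrame, DataFrameReader, DataFrameWriter, Row, SparkSession}

def callChain(spark: SparkSession): Unit = {
  val reader: DataFrameReader      = spark.read             // entry point for batch reads
  val df: DataFrame                = reader.table("db.tbl") // materialises a DataFrame
  val writer: DataFrameWriter[Row] = df.write               // entry point for batch writes
  // terminal calls such as writer.jdbc(url, table, props) return Unit
}
```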

Mocking Read

The objective here is to avoid making a connection to the source while still getting a DataFrame, so that our filter logic can be unit tested. We start by mocking the SparkSession.
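
A sketch, assuming ScalaTest’s MockitoSugar (which provides mock[T]) inside the test body:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatestplus.mockito.MockitoSugar.mock

// Stand-in SparkSession: no real session is created and no metastore is touched.
val mockSpark: SparkSession = mock[SparkSession]
```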

SparkSession.read returns org.apache.spark.sql.DataFrameReader, so next we mock DataFrameReader.
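
Along the same lines:

```scala
import org.apache.spark.sql.DataFrameReader

// The reader that our mocked SparkSession will hand back.
val mockDataFrameReader: DataFrameReader = mock[DataFrameReader]
```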

We have all the required mock objects; it is time to control their behavior. We need mockSpark to return mockDataFrameReader when spark.read is invoked.
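
With Mockito’s when/thenReturn, that looks like:

```scala
import org.mockito.Mockito.when

// Whenever the code under test calls mockSpark.read, return the mocked reader.
when(mockSpark.read).thenReturn(mockDataFrameReader)
```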

One more behavior has to be added: a DataFrame must be returned when spark.read.table is invoked. For that we create a stub object.
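
One way to build such a stub (an assumption here, not necessarily the original stub) is a tiny in-memory DataFrame created with a real local SparkSession that is used only for test data:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Real local session used only to construct test data; the code under test
// still talks to mockSpark.
val testSpark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName("unit-tests")
  .getOrCreate()
import testSpark.implicits._

val stubDf: DataFrame = Seq(
  ("1", "ACTIVE"),
  ("2", "INACTIVE")
).toDF("id", "status")
```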

Now let’s add the required behavior.
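
A sketch, using an argument matcher so that any table name maps to the stub:

```scala
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.when

// spark.read.table(<any name>) now yields the stub instead of hitting Hive.
when(mockDataFrameReader.table(anyString())).thenReturn(stubDf)
```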

Let us complete our test by adding an assertion.
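
Putting the pieces together, a complete test could look like the sketch below; it assumes the DummySource sketch from earlier, and the class, table, and column names are illustrative.

```scala
import org.apache.spark.sql.{DataFrame, DataFrameReader, SparkSession}
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.when
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.scalatestplus.mockito.MockitoSugar

class DummySourceReadSpec extends AnyFlatSpec with Matchers with MockitoSugar {

  // Local session used only to build test data
  private val testSpark = SparkSession.builder()
    .master("local[1]")
    .appName("unit-tests")
    .getOrCreate()
  import testSpark.implicits._

  "DummySource" should "keep only active rows read from Hive" in {
    // Mock
    val mockSpark           = mock[SparkSession]
    val mockDataFrameReader = mock[DataFrameReader]
    val stubDf: DataFrame   = Seq(("1", "ACTIVE"), ("2", "INACTIVE")).toDF("id", "status")

    // Behavior
    when(mockSpark.read).thenReturn(mockDataFrameReader)
    when(mockDataFrameReader.table(anyString())).thenReturn(stubDf)

    // Assertion: the filter logic is exercised without connecting to Hive
    val source = new DummySource(mockSpark)
    val result = source.filterActive(source.read())

    result.count() shouldBe 1L
    result.collect().head.getAs[String]("status") shouldBe "ACTIVE"
  }
}
```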

With that, we have completed a unit test case by mocking the Hive read for Spark.

Mocking Write
We will build on what we have learned so far. Start with the mocks.
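
Continuing with the same setup, a sketch that mocks the DataFrame handed to the write step as well as the DataFrameWriter it returns (DataFrameWriter is a final class, so Mockito’s inline mock maker is required):

```scala
import org.apache.spark.sql.{DataFrameWriter, Dataset, Row}

// Mocked DataFrame so that df.write can be intercepted, plus the mocked writer itself
val mockDf: Dataset[Row]                      = mock[Dataset[Row]]
val mockDataFrameWriter: DataFrameWriter[Row] = mock[DataFrameWriter[Row]]
```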

Now add the behavior.
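
A sketch, keeping the fluent write chain alive and turning the terminal jdbc call into a no-op; it reuses mockSpark and the DummySource sketch from above:

```scala
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.mockito.ArgumentMatchers.{any, anyString}
import org.mockito.Mockito.{doNothing, verify, when}

// df.write returns the mocked writer and mode(...) keeps the chain fluent
when(mockDf.write).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.mode(any[SaveMode])).thenReturn(mockDataFrameWriter)
// the terminal jdbc(...) call becomes a no-op instead of opening a connection
doNothing().when(mockDataFrameWriter).jdbc(anyString(), anyString(), any[Properties])

// Exercise the write path and verify the sink was invoked, without a database
new DummySource(mockSpark).write(mockDf)
verify(mockDataFrameWriter).jdbc(anyString(), anyString(), any[Properties])
```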

doNothing is used for void (Unit-returning) methods such as DataFrameWriter.jdbc.

Kafka Stream Mocking

The streaming source needs a different set of classes to be mocked, namely org.apache.spark.sql.streaming.DataStreamReader and org.apache.spark.sql.streaming.DataStreamWriter.

Here is a sample test case, built using the same steps as before:

  1. Mock
  2. Behavior
  3. Assertion
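
Below is a sketch of such a test. The topic names, Kafka options, and the small read/write helpers under test are illustrative assumptions, and mocking the final DataStreamReader and DataStreamWriter classes again needs the inline mock maker.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.streaming.{DataStreamReader, DataStreamWriter, StreamingQuery}
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.{verify, when}
import org.scalatest.flatspec.AnyFlatSpec
import org.scalatest.matchers.should.Matchers
import org.scalatestplus.mockito.MockitoSugar

class KafkaStreamSpec extends AnyFlatSpec with Matchers with MockitoSugar {

  // Hypothetical code under test: read a Kafka topic, then start a Kafka sink.
  def readTopic(spark: SparkSession, topic: String): DataFrame =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", topic)
      .load()

  def startSink(df: Dataset[Row], topic: String): StreamingQuery =
    df.writeStream
      .format("kafka")
      .option("topic", topic)
      .start()

  "Kafka read and write" should "be unit testable without a broker" in {
    // 1. Mock
    val mockSpark        = mock[SparkSession]
    val mockStreamReader = mock[DataStreamReader]
    val mockStreamWriter = mock[DataStreamWriter[Row]]
    val mockDf           = mock[Dataset[Row]]
    val mockQuery        = mock[StreamingQuery]

    // 2. Behavior: keep every fluent call on the mocked reader/writer chain
    when(mockSpark.readStream).thenReturn(mockStreamReader)
    when(mockStreamReader.format(anyString())).thenReturn(mockStreamReader)
    when(mockStreamReader.option(anyString(), anyString())).thenReturn(mockStreamReader)
    when(mockStreamReader.load()).thenReturn(mockDf)

    when(mockDf.writeStream).thenReturn(mockStreamWriter)
    when(mockStreamWriter.format(anyString())).thenReturn(mockStreamWriter)
    when(mockStreamWriter.option(anyString(), anyString())).thenReturn(mockStreamWriter)
    when(mockStreamWriter.start()).thenReturn(mockQuery)

    // 3. Assertion: no Kafka broker is ever contacted
    readTopic(mockSpark, "orders") shouldBe mockDf
    startSink(mockDf, "orders-out") shouldBe mockQuery
    verify(mockStreamReader).option("subscribe", "orders")
  }
}
```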

Conclusion

Mocking helped us achieve more than 80% code coverage and is recommended for all JVM-based Spark pipelines.

Happy Learning

Saurabh Shashank
Staff Data Engineer, International Data @Walmart Global Tech