Spark: Mocking Read, ReadStream, Write and WriteStream
Low code coverage in Spark pipelines is often caused by missing unit test cases for sources and sinks. Most of our Spark Scala pipelines had code coverage below 70%; we want it in the range of 85–95%.
Spark supports multiple batch and stream sources and sinks, such as:
- CSV
- JSON
- Parquet
- ORC
- Text
- JDBC/ODBC connections
- Kafka
The difficulty of writing test cases for Spark reads and writes brings down overall code coverage, and with it the pipeline's overall quality score. Since reads and writes are present in virtually every data pipeline, the corresponding lines of code make up a significant percentage of the total.
Let's walk through a step-by-step guide to improving the code quality measures of Spark Scala pipelines.
Below is a summary of the mocking concepts used in this article:
- Mocking, in unit testing, lets us test the behavior and logic of the code we wrote without touching external systems
- A mock is used for behavior verification; it gives full control over an object and its methods
- A stub holds a small, fixed set of data that is returned during a unit test
- A spy is used for partial mocking, where only selected behavior is mocked and the rest of the object keeps its original behavior
Let's get into the Spark specifics.
Maven-based dependencies
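Assuming a standard Spark Scala project, the test setup needs ScalaTest and Mockito on top of the Spark dependencies. The snippet below is a minimal sketch; the version properties are placeholders you would set for your own build. One detail worth calling out: DataFrameReader and DataFrameWriter are final classes, so Mockito's inline mock maker is required (it is the default in Mockito 5, or available via mockito-inline in Mockito 4.x).

```xml
<!-- Illustrative test dependencies; versions are placeholders -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.scalatest</groupId>
    <artifactId>scalatest_${scala.binary.version}</artifactId>
    <version>${scalatest.version}</version>
    <scope>test</scope>
  </dependency>
  <!-- DataFrameReader/DataFrameWriter are final, so the inline mock maker
       is needed (default in Mockito 5, mockito-inline for Mockito 4.x) -->
  <dependency>
    <groupId>org.mockito</groupId>
    <artifactId>mockito-core</artifactId>
    <version>${mockito.version}</version>
    <scope>test</scope>
  </dependency>
</dependencies>
```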
Let's use an example of a Spark class where the source is Hive and the sink is JDBC.
DummySource.scala reads a table from Hive, applies a filter, and then writes to a JDBC table.
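Here is a minimal sketch of what such a class could look like. The table names, filter column, and JDBC options are illustrative assumptions, not the exact code from our pipeline:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical pipeline class: reads a Hive table, filters it,
// and writes the filtered rows to a JDBC table.
class DummySource(spark: SparkSession, jdbcUrl: String) {

  // Source: a Hive table
  def read(): DataFrame =
    spark.read.table("employee")

  // The business logic we actually want to cover with unit tests
  def filterActive(df: DataFrame): DataFrame =
    df.filter(col("status") === "active")

  // Sink: a JDBC table
  def write(df: DataFrame): Unit =
    df.write
      .mode(SaveMode.Append)
      .format("jdbc")
      .option("url", jdbcUrl)
      .option("dbtable", "employee_active")
      .save()

  // End-to-end run: read -> filter -> write
  def run(): Unit = write(filterActive(read()))
}
```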
We skip the transform step, where other business logic is applied, as that can be tested using a stub.
We do not need to write unit test cases for the code that connects to the source or sink, so here we want to mock the read and write behavior of Spark.
Before we look at the solution, it is worth getting familiar with Spark's DataFrameReader and DataFrameWriter.
Since we are going to mock these two classes, it is important to know their methods and behaviors.
Mocking Read
The objective here is to avoid making the connection to the source and still get a DataFrame so that our filter logic is unit tested. So we start by mocking the SparkSession
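A sketch of that first mock, using plain Mockito (the variable names are our own):

```scala
import org.apache.spark.sql.SparkSession
import org.mockito.Mockito.mock

// Mock SparkSession so the test never touches a real Hive metastore
val mockSpark = mock(classOf[SparkSession])
```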
SparkSession.read returns org.apache.spark.sql.DataFrameReader, so now we will mock DataFrameReader
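Continuing in the same test class, a sketch of the reader mock (DataFrameReader is final, so the inline mock maker must be on the classpath):

```scala
import org.apache.spark.sql.DataFrameReader
import org.mockito.Mockito.mock

// DataFrameReader is what spark.read returns; we mock it as well
val mockDataFrameReader = mock(classOf[DataFrameReader])
```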
We have all the required mock objects. It is time to control the behaviors. We need mockSpark to return mockDataFrameReader when spark.read is invoked
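Roughly:

```scala
import org.mockito.Mockito.when

// Whenever the code under test calls spark.read, hand back the mocked reader
when(mockSpark.read).thenReturn(mockDataFrameReader)
```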
One more behavior has to be added so that a DataFrame is returned when spark.read.table is invoked; for that, we create a stub DataFrame.
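One way to build the stub is with a small real local SparkSession, used only to construct the DataFrame (the data below is made up):

```scala
import org.apache.spark.sql.SparkSession

// A lightweight local session, used only to build stub data
val localSpark = SparkSession.builder()
  .master("local[1]")
  .appName("dummy-source-test")
  .getOrCreate()

import localSpark.implicits._

// Stub data that the mocked Hive read will return
val stubDf = Seq(
  ("alice", "active"),
  ("bob", "inactive")
).toDF("name", "status")
```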
Now let's add the required behavior:
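Assuming the source table is called employee, as in the DummySource sketch above:

```scala
import org.mockito.Mockito.when

// spark.read.table("employee") now returns the stub instead of hitting Hive
when(mockDataFrameReader.table("employee")).thenReturn(stubDf)
```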
Let us complete our test by adding an assertion.
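A possible assertion, exercising the read and the filter logic together:

```scala
// The class under test only ever sees mocks and stubs: no Hive connection is made
val source = new DummySource(mockSpark, jdbcUrl = "jdbc:unused")
val result = source.filterActive(source.read())

// Only the "active" row from the stub should survive the filter
assert(result.count() == 1)
assert(result.collect().head.getAs[String]("name") == "alice")
```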
With that, we have completed our unit test case by mocking the Hive read for Spark.
Mocking Write
We will build on what we have learned so far. Start with the mocks:
Now add the behavior:
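A sketch of the write-side mocks; note that write lives on the Dataset, so the DataFrame itself is mocked here too, and DataFrameWriter is final just like the reader:

```scala
import org.apache.spark.sql.{DataFrame, DataFrameWriter, Row}
import org.mockito.Mockito.mock

// Mock the DataFrame to be written and the writer it returns
val mockDf = mock(classOf[DataFrame])                         // DataFrame = Dataset[Row]
val mockDataFrameWriter = mock(classOf[DataFrameWriter[Row]])
```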
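Chained calls such as mode, format, and option simply return the same mocked writer, so the fluent chain in DummySource.write keeps working; matching option keys and values with anyString keeps the stub independent of the exact options used:

```scala
import org.apache.spark.sql.SaveMode
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.when

// df.write hands back the mocked writer, and each fluent call returns it again
when(mockDf.write).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.mode(SaveMode.Append)).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.format("jdbc")).thenReturn(mockDataFrameWriter)
when(mockDataFrameWriter.option(anyString(), anyString())).thenReturn(mockDataFrameWriter)
```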
doNothing is used for void methods
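So the last piece of behavior, plus a verification that the sink call actually happened, could look like this:

```scala
import org.mockito.Mockito.{doNothing, verify}

// save() returns Unit, so doNothing replaces thenReturn here
doNothing().when(mockDataFrameWriter).save()

// Exercise the write path and verify that save() was invoked exactly once
new DummySource(mockSpark, jdbcUrl = "jdbc:unused").write(mockDf)
verify(mockDataFrameWriter).save()
```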
Kafka Stream Mocking
The streaming source needs a different set of classes to be mocked, namely org.apache.spark.sql.streaming.DataStreamReader and org.apache.spark.sql.streaming.DataStreamWriter.
Here is a sample test case using the above steps, shown in the sketch after this list:
- Mock
- Behavior
- Assertion
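The sketch below assumes a hypothetical KafkaPipeline(spark) class whose run() method reads from Kafka with readStream and writes back out with writeStream; the class name and option values are illustrative, not from the original pipeline:

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.streaming.{DataStreamReader, DataStreamWriter, StreamingQuery}
import org.mockito.ArgumentMatchers.anyString
import org.mockito.Mockito.{mock, verify, when}

// --- Mock ---
val mockSpark          = mock(classOf[SparkSession])
val mockStreamReader   = mock(classOf[DataStreamReader])
val mockStreamingDf    = mock(classOf[DataFrame])
val mockStreamWriter   = mock(classOf[DataStreamWriter[Row]])
val mockStreamingQuery = mock(classOf[StreamingQuery])

// --- Behavior ---
when(mockSpark.readStream).thenReturn(mockStreamReader)
when(mockStreamReader.format("kafka")).thenReturn(mockStreamReader)
when(mockStreamReader.option(anyString(), anyString())).thenReturn(mockStreamReader)
when(mockStreamReader.load()).thenReturn(mockStreamingDf)

when(mockStreamingDf.writeStream).thenReturn(mockStreamWriter)
when(mockStreamWriter.format(anyString())).thenReturn(mockStreamWriter)
when(mockStreamWriter.option(anyString(), anyString())).thenReturn(mockStreamWriter)
when(mockStreamWriter.outputMode(anyString())).thenReturn(mockStreamWriter)
when(mockStreamWriter.start()).thenReturn(mockStreamingQuery)

// --- Assertion ---
// KafkaPipeline is a hypothetical class under test that wires readStream to writeStream
new KafkaPipeline(mockSpark).run()
verify(mockStreamWriter).start()
```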
Conclusion
Mocking helped us achieve greater than 80% code coverage and is recommended for all JVM-based Spark pipelines.
Happy Learning