How to write Spark unit tests
We all know that testing your code is the best way to ensure the quality of your product. Writing automated tests is a well-understood part of most software engineering projects, yet in data engineering they are often overlooked. A common reason for not writing unit tests for data pipelines is that the pipelines are seen as too complex and challenging to test.
In this article, I will show you an approach to writing Spark unit tests that will enable you to keep the quality of your data pipelines high.
What is unit testing?
Unit testing is the process of testing small, standalone parts of your code in isolation. This contrasts with other types of testing, such as integration and system tests, which exercise larger parts of your application.
Unit tests live alongside your code, with engineers usually writing them in the same programming language. Running these tests on your dev machine allows you to check your work as you write it. The tests can also be part of your CI/CD solution, ensuring the quality of your deployed application.
In this example, we will be using Python and a testing framework called pytest, but the principle is valid across other languages and tools.
Writing the unit test
Now that we know what unit tests are, let’s create a small PySpark project with some unit tests.
Setting up the environment
If you have an existing PySpark project, you can skip this step.
Here, I'll be using poetry as the dependency and package manager. Poetry is an excellent alternative to pip and virtualenv for managing dependencies and virtual environments in Python.
Let’s create a Python module and install our dependencies:
poetry new pyspark-unit-test # Setup your python module
poetry add pyspark # Add pyspark as a dependency
poetry add --dev pytest # Add pytest as a dev dependency
We now have an empty module, and we can start to build our Spark logic and unit tests.
If you have followed these steps, you will have a folder structure that looks like this:
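The layout below is what poetry typically generates; the exact files may vary slightly between poetry versions.

pyspark-unit-test/
├── pyproject.toml
├── README.md
├── pyspark_unit_test/
│   └── __init__.py
└── tests/
    ├── __init__.py
    └── test_pyspark_unit_test.py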
The PySpark job code will live in the pyspark_unit_test directory, and poetry has helpfully created the test file for us: test_pyspark_unit_test.py.
PySpark job
Our example is a simple PySpark job that reads a CSV file, transforms the specified columns to uppercase, and writes out the results.
The key to writing testable jobs is splitting them into small transformation functions that each take a DataFrame as a parameter, perform the transformation, and return a DataFrame.
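A minimal sketch of such a job is shown below; the module name job.py, the function names columns_to_upper and run, and the CSV options are illustrative assumptions rather than exact code.

# pyspark_unit_test/job.py (illustrative module name)
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def columns_to_upper(df: DataFrame, columns: list) -> DataFrame:
    # Pure transformation: takes a DataFrame, uppercases the given columns, returns a DataFrame
    for column in columns:
        df = df.withColumn(column, F.upper(F.col(column)))
    return df


def run(input_path: str, output_path: str, columns: list) -> None:
    # Read the CSV, apply the transformation, and write out the results
    spark = SparkSession.builder.appName("pyspark-unit-test").getOrCreate()
    df = spark.read.csv(input_path, header=True)
    df = columns_to_upper(df, columns)
    df.write.csv(output_path, header=True)

Because columns_to_upper neither reads nor writes any data, it can be exercised in isolation, which is exactly what our unit test will do.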
Writing unit tests
Here is a unit test for the simple transformation shown above.
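The version below is a sketch of what tests/test_pyspark_unit_test.py could contain; it assumes the columns_to_upper function and the illustrative column names from the job sketch above.

# tests/test_pyspark_unit_test.py (illustrative content)
import pytest
from pyspark.sql import SparkSession

from pyspark_unit_test.job import columns_to_upper


@pytest.fixture(scope="session")
def spark():
    # One Spark session shared across every test in the session
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()


def test_columns_to_upper(spark):
    # A small two-row dataset is enough to prove the test case
    test_data = [("alice", "london"), ("bob", "manchester")]
    df = spark.createDataFrame(test_data, ["name", "city"])

    # The results we expect after the transformation
    expected_data = [("ALICE", "LONDON"), ("BOB", "MANCHESTER")]

    # .collect() is an Action, forcing the Transformation to execute
    results = columns_to_upper(df, ["name", "city"]).collect()

    # Compare the results row by row against the expected data
    for row, expected in zip(results, expected_data):
        assert row["name"] == expected[0]
        assert row["city"] == expected[1]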
There are a few interesting things to highlight here:
- pytest.fixture(scope="session"): Using this pytest fixture means we can pass the Spark session into every unit test without creating a new instance each time. Creating a new Spark session is an expensive operation, so reusing the same one allows our tests to run more efficiently.
- Test data: We use a specific two-row example, as a small dataset is all we require to prove the test case. As a result, we can create the DataFrame directly from lists. If you require larger datasets, I suggest keeping a directory of test data files in CSV or JSON format.
- Expected data: To validate the test case, we need the desired results to compare against. Again, we use lists, but you may prefer to use files.
- .collect(): Spark has two types of operations, Transformations and Actions. Transformations are added to the directed acyclic graph (DAG) and don't return results until an Action is invoked. As our function doesn't perform any Actions, we need to call .collect() so that our transformation takes effect.
- Assertion: To ensure our job code is performing as expected, we need to compare the results of our transformation against our desired data. Unfortunately, Spark doesn't provide a DataFrame equality assertion out of the box (newer releases add helpers in pyspark.testing), so we loop through row by row and compare our results to the expected data.
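With the job and the test in place, the whole suite can be run from the project root using the poetry setup above:

poetry run pytest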
In a nutshell
Unit testing is achievable when writing data pipelines with Spark. By separating the data read/write from the transformation logic, you can validate the behaviour of each transformation to ensure that your pipeline will behave as expected.
You can find the complete source code at:
Thanks for reading! Please share your Spark testing experiences in the comments.