Writing PySpark Unit Tests

Justin Davis
Feb 11, 2022 · 4 min read

Unit testing allows developers to ensure that their code base is working as intended at an atomic level. The point of unit testing is not to verify that the code works as a whole, but that each individual function does what it is supposed to do. Here I will lay out a basic framework for writing PySpark unit tests. My workflow includes PySpark, pytest, and chispa.

First, to set the groundwork for how to test different modules inside a project directory, I made a simple stock analysis project with the following layout:

run_tests.sh
stock_analysis
-| filter_stocks.py
-| aggregate_stocks.py
tests
-| conftest.py
-| test_filter_stocks.py
-| test_aggregate_stocks.py
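
Before any of the test files can run, pytest needs a SparkSession that the tests can share. A common pattern is to define it once as a fixture in conftest.py. The sketch below shows one way to do that; the fixture name spark and the app name are my own choices, not something the project mandates.

# tests/conftest.py
# A minimal sketch of a shared SparkSession fixture, assuming pytest and a
# local Spark installation; the fixture name "spark" is illustrative.
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # One SparkSession for the whole test session keeps startup cost low.
    session = (
        SparkSession.builder
        .master("local[*]")
        .appName("stock-analysis-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

Using scope="session" means Spark starts once for the entire test run rather than once per test, which keeps the suite fast.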

filter_stocks.py

I wrote a basic function that filters for stocks priced above $400.

from pyspark.sql import DataFrame, functions as F


def filter_above_400(df: DataFrame):
    return df.filter(F.col("price") > 400)
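
With the fixture in place, a test for this function can build a small input DataFrame, run it through filter_above_400, and compare the result against an expected DataFrame. The sketch below is one way test_filter_stocks.py could look; it assumes chispa's assert_df_equality helper and the spark fixture sketched above, and the ticker column plus the sample prices are purely illustrative (only price appears in the original function).

# tests/test_filter_stocks.py
# A sketch of a test for filter_above_400, assuming the "spark" fixture
# above and chispa's assert_df_equality; column names are illustrative.
from chispa import assert_df_equality

from stock_analysis.filter_stocks import filter_above_400


def test_filter_above_400(spark):
    source_df = spark.createDataFrame(
        [("AAPL", 150.0), ("AMZN", 3100.0), ("TSLA", 900.0)],
        ["ticker", "price"],
    )
    expected_df = spark.createDataFrame(
        [("AMZN", 3100.0), ("TSLA", 900.0)],
        ["ticker", "price"],
    )
    actual_df = filter_above_400(source_df)
    assert_df_equality(actual_df, expected_df)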

aggregate_stocks.py

I wrote two functions inside this file. One function aggregates by stock and returns the all-time maximum price; the other aggregates by stock and returns the all-time minimum price:

from pyspark.sql import DataFrame, functions as F


def get_max_stock_price(df…
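
Based on the description above, both functions group by a stock identifier and aggregate the price column. A minimal sketch of how they could look, assuming a ticker column as the grouping key (only price appears elsewhere in the project):

# stock_analysis/aggregate_stocks.py
# A sketch based on the description above; the "ticker" grouping column
# and the output column names are assumptions.
from pyspark.sql import DataFrame, functions as F


def get_max_stock_price(df: DataFrame) -> DataFrame:
    # All-time maximum price per stock.
    return df.groupBy("ticker").agg(F.max("price").alias("max_price"))


def get_min_stock_price(df: DataFrame) -> DataFrame:
    # All-time minimum price per stock.
    return df.groupBy("ticker").agg(F.min("price").alias("min_price"))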
