Introduction to Creating Unit Tests for PySpark Applications Using unittest and pytest Libraries

Mete Can Akar
Plumbers Of Data Science
7 min read · Oct 22, 2023

TL;DR: Software testing, and in particular, unit testing, is a crucial step in modern Data Engineering. Pytest and unittest are great tools for developing unit tests for PySpark applications. In this article, I provide code examples using both libraries. Also, I discuss the advantages and disadvantages that each of them brings. The choice depends on your needs and previous experience.

A detective preventing/destroying bugs with unit tests (created by DALL-E)

1. Introduction

2. What is Testing in Software Development?

2.1 What are the Test Types?

2.2 Understanding Unit Testing and Its Importance

2.3 What are the Purposes of the unittest and pytest Libraries?

2.4 How Can I Execute These Tests?

2.5 Summary and Comparison

3. Conclusion

1. Introduction

In this article, I will talk about how we can create unit tests for PySpark applications using the Python unittest and pytest libraries. Before jumping into the topic, it makes sense to talk briefly about what testing is in software development, what types of tests exist, what unit testing is, and why it is needed.

2. What is Testing in Software Development?

Testing in software development refers to the process of evaluating and verifying that a software program or application works as intended. It’s a crucial step in the software development life cycle, ensuring that the software meets specified requirements and is free of defects or bugs that could adversely impact its performance, reliability or security.

In this article, I won’t go into the details of the benefits of testing and Test Driven Development (TDD), although I will briefly talk about the importance of unit testing below. As a data engineer/data scientist/ML engineer (i.e., a data person), you should care about the quality of the software you are building.

2.1 What are the Test Types?

There are numerous types of tests in software development, each designed for a specific purpose and stage in the development process. The most commonly implemented test types are [1]:

  • Unit Testing: These are the smallest tests, validating that an individual component (e.g., a function) of the software works as intended. Because they are small and quick to execute, unit tests are expected to run in the Continuous Integration (CI) pipeline, for example after each commit or when a PR is opened, depending on your DevOps workflow.
  • Integration Testing: These tests check whether multiple components work well together. There are typically fewer integration tests than unit tests but more than E2E tests. They also take longer to execute than unit tests but less time than E2E tests. Integration tests can also be included in the CI pipeline (e.g., when a PR is opened into develop and/or master). Since deciding when to run which tests and which branching strategy to use are DevOps-related topics, I won’t dive into them for now.
  • End-to-End (E2E) Testing: These tests check the application flow from beginning to end, making sure that everything works well. In the data engineering world, this is very costly and most of the time done manually (especially at the beginning of development). It is also possible to automate E2E tests, but the execution time will still be much longer than for other types of tests. Be aware that E2E testing means something different in web and mobile development, where the goal is to simulate real user actions using various tools.

Pyramid illustrating the hierarchy of test types by volume and their increasing execution time from base to apex

In the pyramid above, the time to execute a test increases as we move toward the apex, while a larger number of tests is expected closer to the base.

It is important to note that there are other testing methods, such as smoke testing, performance testing, and so on. However, we will focus solely on unit testing today. Otherwise, this post could become a book.

2.2 Understanding Unit Testing and Its Importance

Unit testing is one of the foundational aspects of software development. At its core, a unit test is a piece of code written to test a specific function or module in isolation from other parts of the application. The primary goal is to validate each unit (e.g., a function).

Several reasons underscore the significance of unit testing:

  1. Quality Assurance: Unit tests ensure that individual components of the application work as expected. This minimizes the chances of introducing bugs when making changes or adding new features.
  2. Regression Detection: As software evolves, there’s a risk that changes can inadvertently introduce errors in previously working code. Unit tests act as a safety net, catching regressions before they reach production.
  3. Documentation: Well-written unit tests can serve as documentation. Developers can look at the tests to understand what a particular function is supposed to do and how to use it. This is especially important when existing members leave the project and new members join. Thanks to the unit tests, the handover process will be much smoother.
  4. Facilitate Refactoring: With a solid set of unit tests in place, developers can refactor or restructure code with confidence, knowing that any regressions will be quickly identified.
  5. Development Efficiency: While writing tests might seem like an extra task, they actually speed up the development process in the long run. With tests in place, developers can make changes without fearing that they’ll break existing functionality. Even more importantly, thanks to a high test coverage we can sleep well at night!
  6. Collaborative Development: In team environments, unit tests ensure that changes made by one developer don’t inadvertently break functionality added by another.
  7. Stakeholder Trust: Consistent quality and reliability foster trust with stakeholders, ensuring they can depend on your analyses and pipelines.

In the context of PySpark applications, unit testing is even more critical due to the distributed nature of Spark. Errors might not be visible until the code is run on a cluster with large datasets. Even worse, the code might run without any error or exception and still produce wrong results, which can only be noticed by taking a deep look at the output. By writing unit tests, developers can catch and fix these issues during the development phase, reducing the risk of unreliable results in production.

2.3 What are the Purposes of the unittest and pytest Libraries?

  • unittest is a built-in Python testing library that follows the xUnit style and provides a test framework along with test discovery [2].
  • pytest is a popular third-party testing library for Python that offers a flexible and concise way to write tests [3].

Alright, enough talk. I can hear you saying,

“Talk is cheap. Show me the code.” (Linus Torvalds)

Source: [4]

Therefore, let’s walk through various test cases in order to get familiar with both unittest and pytest libraries. In each example below, you will find the following structure:

  • Implementation
  • Pytest test case/cases
  • unittest test case/cases

1. Basic Summation Using Python (without Spark)

  • This is a very basic unit test example that tests a simple summation operation using plain Python, without Spark.
  • unittest requires a stricter structure, such as a test class inheriting from the unittest.TestCase class.
  • Also, with unittest the convention is to use its assert methods, such as assertEqual, whereas with pytest you can just use Python’s built-in assert statement.
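
To make the comparison concrete, here is a minimal sketch of the same check written in both styles. The function name basic_sum and the test names are hypothetical and chosen only for illustration; they are not necessarily the names used in the repository.

```python
import unittest


def basic_sum(a: int, b: int) -> int:
    """Hypothetical function under test: returns the sum of two numbers."""
    return a + b


# pytest style: a plain function whose name starts with "test_",
# using Python's built-in assert statement.
def test_basic_sum_pytest_style():
    assert basic_sum(2, 3) == 5


# unittest style: a class inheriting from unittest.TestCase,
# using the assertEqual method.
class TestBasicSum(unittest.TestCase):
    def test_basic_sum(self):
        self.assertEqual(basic_sum(2, 3), 5)

    def test_basic_sum_with_negative_number(self):
        self.assertEqual(basic_sum(-1, 1), 0)
```

Note that pytest can also collect and run unittest.TestCase classes, so both styles can live in the same project.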

2. Simple Data Enrichment Transformation using PySpark

  • This example covers 2 test cases.
  • When developing pytest tests, fixtures are used instead of the setUpClass method used in unittest.
  • You can create numerous fixtures based on your needs instead of putting everything inside the setUpClass method.
  • For assertions, I used the chispa library’s assert_df_equality method [5]. You could also use a plain assert on the collected results, but I prefer assert_df_equality as it saves some code and provides clean test result messages.

3. Simple Aggregation Using PySpark

  • This example covers one aggregation test case.

The gists are created based on the following repository: https://github.com/metecanakar/unittesting-for-pyspark-apps. You can take a look at it if you want to dive deeper.

2.4 How Can I Execute These Tests?

  • Running a single file, module or class via CLI:

Pytest:

pytest tests/unittests/test_basic_sum.py

Unittest:

python -m unittest -v tests.unittests.test_basic_sum

or

python -m unittest tests/unittests/test_basic_sum.py

  • Discovery mode via CLI:

Pytest:

(venv) metecanakar@Metes-MacBook-Pro unittesting-for-pyspark % pytest tests                                     
================================
platform darwin -- Python 3.9.6, pytest-7.4.2, pluggy-1.3.0
rootdir: /Users/metecanakar/PycharmProjects/unittesting-for-pyspark
collected 10 items

tests/pytests/test_basic_sum.py .. [ 20%]
tests/pytests/transformations/test_apply_basic_aggregations.py . [ 30%]
tests/pytests/transformations/test_basic_data_enrichments.py .. [ 50%]
tests/unittests/test_basic_sum.py .. [ 70%]
tests/unittests/transformations/test_apply_basic_aggregations.py . [ 80%]
tests/unittests/transformations/test_basic_data_enrichments.py .. [100%]

=================================

Unittest:

(venv) metecanakar@Metes-MacBook-Pro unittesting-for-pyspark % python -m unittest discover -s tests/unittests

------------------------------------
Ran 5 tests in 5.961s

OK

  • Or just run them via your preferred IDE
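
Besides the CLI and the IDE, tests can also be started from Python code. Below is a minimal sketch using unittest's loader and runner with a hypothetical TestExample class. pytest offers a comparable entry point, pytest.main, which accepts a list of the same arguments as the CLI.

```python
import unittest


class TestExample(unittest.TestCase):
    """Hypothetical test case, used only to have something to run."""

    def test_upper(self):
        self.assertEqual("spark".upper(), "SPARK")


def run_tests() -> unittest.TestResult:
    # Load all test methods of the class into a suite and run it verbosely.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExample)
    return unittest.TextTestRunner(verbosity=2).run(suite)
```

Calling run_tests() returns a result object whose wasSuccessful() method tells you whether all tests passed, which can be handy in small scripts or notebooks.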

2.5 Summary and Comparison

I compared and summarized the advantages and disadvantages of the two libraries in the table below.

Pytest and unittest comparison

3. Conclusion

This was an introduction to testing in software development and, in particular, unit testing for PySpark applications. Of course, there are many more details to cover, such as mocking, pytest plugins, and pytest’s conftest.py file, which is used for sharing fixtures across multiple files. Please let me know if you are interested in more advanced topics and I will try to cover them.

To conclude, both libraries are great for developing unit tests for PySpark applications and for Python applications in general. You should choose depending on your needs and your familiarity with both tools. If you are just starting with unit testing in Python, I would personally recommend starting with the unittest library, especially if you already have experience with JUnit, and then moving to pytest once you feel more comfortable.

Resources

[1] https://microsoft.github.io/code-with-engineering-playbook/automated-testing/e2e-testing/testing-comparison/

[2] https://docs.python.org/3/library/unittest.html

[3] https://docs.pytest.org/en/7.3.x/contents.html

[4] https://quotefancy.com/quote/1445782/Linus-Torvalds-Talk-is-cheap-Show-me-the-code

[5] https://github.com/MrPowers/chispa

Senior Data Engineer with DS/ML background. Follow me on https://www.linkedin.com/in/metecanakar/. Opinions are my own and not the views of my employer.