Preventing Data Quality Issues with Unit Testing

Mar 22, 2023

Using Pytest to write robust unit tests to prevent data quality issues

Unit testing is a software testing technique in which individual units of an application are tested in isolation from the rest of the system. It verifies that each unit of code behaves as expected and meets its design specification.

During unit testing, software engineers, data engineers, data scientists, data analysts, or analytics engineers write test cases that exercise a single class or function, using mock data as input and comparing the result with the expected output. The tests are designed to catch errors, bugs, and other issues that could cause the unit to fail, and they are rerun whenever the code changes.

Why is unit testing important for data quality?

Data quality is all about ensuring that data is accurate, consistent, and reliable. Data processing and transformation functions are the building blocks of data pipelines, and unit testing helps to ensure that these functions work correctly and produce accurate results. Here are some reasons why unit testing is important for data quality:

Ensuring accuracy: By testing individual data processing and transformation functions, we can ensure that they are producing the expected results for a given input.
Detecting data anomalies: Unit testing can help detect data anomalies, such as missing or duplicate data, in the early stages of data processing.
Reducing errors: Unit testing can help reduce the number of errors in data processing by identifying and fixing defects in the code.
Enhancing data quality: By ensuring that each data processing and transformation function is working correctly, we can enhance the overall quality of the data.
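To make these points concrete, here is a minimal sketch of a transformation function and a test that guards against duplicate records. The function drop_duplicate_ids and the sample rows are hypothetical and were invented for illustration; they are not part of the examples later in this article.

```python
def drop_duplicate_ids(rows):
    """Keep the first row seen for each "id" value (hypothetical helper)."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique_rows.append(row)
    return unique_rows


def test_drop_duplicate_ids_keeps_first_occurrence():
    # Mock input with one duplicated id
    rows = [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 1, "name": "a-duplicate"},
    ]

    result = drop_duplicate_ids(rows)

    # Each id appears once, and the first occurrence is the one that is kept
    assert [row["id"] for row in result] == [1, 2]
    assert result[0]["name"] == "a"
```

If a later change breaks the de-duplication logic, this test fails before the duplicated rows ever reach a downstream table.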

Pytest

Pytest is a Python testing framework that originated from the PyPy project. It can be used to write various types of software tests, including unit tests, integration tests, end-to-end tests, and functional tests. Its features include parametrized testing, fixtures, and assertion rewriting.

Pytest is not part of the Python standard library. Install it from your terminal with the pip install pytest command.

Best Practices

Before we dive into the examples, I want to share some unit testing and pytest best practices that will help you write better unit tests.

Unit Testing Best Practices

Write test cases for both positive and negative scenarios: Ensure that the test cases cover both expected and unexpected scenarios, including edge cases. As the product evolves, the code base changes as well, so do not forget to maintain the tests for both kinds of scenario along the way.
Test only one unit at a time: Individual units should be tested in isolation from other units so that any issue or error can be traced back to that specific unit. Some units will be more complex than others, and isolating them lets you focus on each one in turn.
Run tests automatically: Automating tests saves time and reduces the chance of human error. Executing unit tests automatically in the CI/CD pipeline makes sure that they are evaluated before the code is deployed to production.
Keep tests small and focused: Each test should focus on a specific aspect of the code. The intention of unit testing is to check one behavior at a time, and the smaller a test is, the more precisely a failure points to its cause.
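As a minimal sketch of the first practice, the tests below cover a positive scenario, the edge cases at the boundaries, and two negative scenarios. The function parse_year and its 1900 to 2023 range are hypothetical and chosen only for illustration; pytest.raises is the pytest context manager for asserting that an exception is raised.

```python
import pytest


def parse_year(value: str) -> int:
    """Convert a year string to an int (hypothetical function for illustration)."""
    year = int(value)  # int() raises ValueError for non-numeric text
    if not 1900 <= year <= 2023:
        raise ValueError(f"year out of range: {year}")
    return year


def test_parse_year_valid_input():
    # Positive scenario: well-formed input inside the allowed range
    assert parse_year("1999") == 1999


def test_parse_year_boundaries():
    # Edge cases: both ends of the allowed range are accepted
    assert parse_year("1900") == 1900
    assert parse_year("2023") == 2023


def test_parse_year_rejects_non_numeric_text():
    # Negative scenario: the test passes only if ValueError is raised
    with pytest.raises(ValueError):
        parse_year("unknown")


def test_parse_year_rejects_out_of_range():
    # Negative scenario: numeric, but outside the allowed range
    with pytest.raises(ValueError):
        parse_year("1850")
```

Each test checks a single behavior, so a failure report names the exact scenario that broke.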

Pytest Best Practices

Split your source code from test code: Keeping source code and test code in separate folders makes the code base cleaner and easier to scale.
Organize your tests into test functions: Pytest uses Python functions to represent test cases. Each function should test a specific behavior or functionality of your code.
Use descriptive function names: Use descriptive names that explain what each test function does. A good approach is to take the name of the source function and add “test_” as a prefix. This helps everyone understand which source code is being tested.
Use test fixtures: Test fixtures are functions that set up the environment for your tests. They can be used to initialize variables, create test data, or set up the test environment. Use fixtures to avoid code duplication and to ensure that tests run consistently.
Use parameterized tests: Pytest allows you to define parameterized tests, which are tests that run multiple times with different inputs. Use parameterized tests to test different input scenarios and to avoid code duplication.
Use assertions: Assertions are statements that check whether a condition is true. Use assertions to check that your code behaves as expected. If an assertion fails, Pytest marks the test as failed and reports the values involved in the comparison.

Pytest Examples

Example 1 — Testing the source function with one test function

In this example, I created a basic function that converts a given string to upper case.

def upper_case(sample_text: str): #sample_text is the input text object
    output = sample_text.upper() #sample_text is converted into upper case
    return output #the new object is returned

In the unit test, I create a mock string to test my function.

from source import source_code #importing source_code from other folder


def test_upper_case():
    test_text = "AbCd123" #mock text object to be used at testing

    test_output = source_code.upper_case(test_text) #applying function

    assert test_output == test_text.upper() #asserting function

I run the test function from my terminal.

pytest test/test.py

The output shows that the test case passed successfully.

To show what a failure looks like, I will change the assertion in the test function.

from source import source_code


def test_upper_case():
    test_text = "AbCd123"

    test_output = source_code.upper_case(test_text)

    assert test_output != test_text.upper() #this assertion will fail

The output shows that the test case failed.

Example 2 — Testing the source function with two test functions

In this example, I reuse the same source code and the test code above, and add a second test function so that the source function is covered by two tests. I will deliberately make the first test fail and the second pass to show the difference in the output.

from source import source_code


def test_upper_case():
    test_text = "AbCd123"

    test_output = source_code.upper_case(test_text)

    assert test_output != test_text.upper() #this assertion will fail


def test_upper_case_string_length():
    test_text = "AbCd123"

    test_output = source_code.upper_case(test_text)

    assert len(test_output) > 0

The output shows that one test case passed and one test case failed.

Example 3 — Parametrize testing with a list of test cases

In the previous examples, I created a single test object to test a single function. If I want to test multiple test objects with a single test function, I can use pytest.mark.parametrize().

I create a test_text list containing two strings, one of them empty, and pass each of them to the test function:

from source import source_code
import pytest

test_text = ["AbCd123", ""]


@pytest.mark.parametrize("sample_test_text", test_text)
def test_upper_case_string_length(sample_test_text):
    test_output = source_code.upper_case(sample_test_text)

    assert len(test_output) > 0

In this example, I also want to run only a single test function from my test file, so I pass its name to pytest in the terminal.

pytest test/test.py::test_upper_case_string_length

The output shows that one test case passed and one failed: the empty string has a length of 0, so the assertion fails for it.
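The length check above only confirms that the output is not empty. As a minimal sketch of a stricter variant, pytest.mark.parametrize can also take several argument names, so each input is paired with its expected output. The upper_case function is copied here from Example 1 so that the block is self-contained, and the extra input strings are my own illustrative choices.

```python
import pytest


def upper_case(sample_text: str) -> str:
    # Local copy of the source function from Example 1
    return sample_text.upper()


@pytest.mark.parametrize(
    "sample_test_text, expected",
    [
        ("AbCd123", "ABCD123"),  # mixed case with digits
        ("already UPPER", "ALREADY UPPER"),  # partly upper case already
        ("", ""),  # edge case: the empty string stays empty
    ],
)
def test_upper_case_expected_output(sample_test_text, expected):
    assert upper_case(sample_test_text) == expected
```

Pytest runs the test function once per tuple and reports each case separately, so a failing input is easy to identify.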

Example 4 — Using fixtures to test multiple test functions on a single test object

In pytest, fixtures are functions that provide a fixed baseline of data or test objects that are used as a basis for testing other code. Fixtures are essentially a way of defining and managing resources that are needed by tests.

Fixtures are defined using the @pytest.fixture decorator. The decorated function provides the setup code that is needed to create the fixture. When a test function requests the fixture as an argument, pytest automatically calls the fixture function and passes its return value to the test function.

In this example, I return the test text from a fixture function and let pytest pass it to multiple test functions.

from source import source_code
import pytest


@pytest.fixture
def test_text():
    return "AbCd123"


def test_upper_case(test_text):
    test_output = source_code.upper_case(test_text)

    assert test_output == test_text.upper()


def test_upper_case_string_length(test_text):
    test_output = source_code.upper_case(test_text)

    assert len(test_output) > 0

The output shows that both test cases passed.
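One consequence of this mechanism is worth a small sketch: with the default function scope, pytest calls the fixture again for every test that requests it, so each test starts from the same baseline. The ratings fixture and its values below are hypothetical and used only for illustration.

```python
import pytest


@pytest.fixture
def ratings():
    # Default scope is "function": this runs once for each requesting test
    return ["R", "PG", "PG-13"]


def test_append_rating(ratings):
    # Mutates the list that was created for this test only
    ratings.append("G")
    assert len(ratings) == 4


def test_ratings_start_with_three_items(ratings):
    # Receives a fresh list, unaffected by the append in the other test
    assert len(ratings) == 3
```

Because the baseline is rebuilt for each test, both tests pass regardless of the order in which they run.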

Example 5 — Pandas data frame testing

In the real world, data professionals from data scientists to data engineers mostly work with Pandas data frames for data preparation tasks such as data cleaning and feature engineering.

For this reason, I will go deeper into data frame test cases. In the examples, I will use the movies.csv file, which contains the columns title, rating, year, and runtime.

Below, I run multiple tests to validate the data frame:

import pytest
import pandas as pd


@pytest.fixture
def sample_data_frame():
    return pd.read_csv("movies.csv") #loading the data set as a fixture


def test_column_names(sample_data_frame): #testing column names matching with the given list
    assert list(sample_data_frame.columns) == ["title", "rating", "year", "runtime"]


def test_unique_ratings(sample_data_frame): #testing rating types matching with the given list
    assert list(sample_data_frame.rating.unique()) == [
        "R",
        "PG",
        "PG-13",
        "GP",
        "Not Rated",
        "NC-17",
        "G",
    ]


def test_year_minimum_year(sample_data_frame): #testing minimum value for the year column
    assert sample_data_frame.year.min() >= 1900


def test_year_maximum_year(sample_data_frame): #testing maximum value for the year column
    assert sample_data_frame.year.max() <= 2023


def test_missing_values(sample_data_frame): #testing null values
    assert sample_data_frame.isna().sum().sum() == 0
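The tests above depend on movies.csv being present in the working directory, and the rating test depends on the order in which the ratings appear in the file. As a hedged sketch of an alternative, the fixture below builds a small data frame in memory and the checks are order-independent. The rows and the build_sample_movies helper are invented for illustration and are not taken from movies.csv; the allowed ratings and the year range are the ones used in the tests above.

```python
import pandas as pd
import pytest

ALLOWED_RATINGS = {"R", "PG", "PG-13", "GP", "Not Rated", "NC-17", "G"}


def build_sample_movies() -> pd.DataFrame:
    # Hypothetical rows for illustration; not taken from movies.csv
    return pd.DataFrame(
        {
            "title": ["Movie A", "Movie B", "Movie C"],
            "rating": ["R", "PG", "PG-13"],
            "year": [1994, 2001, 2019],
            "runtime": [142, 95, 120],
        }
    )


@pytest.fixture
def sample_data_frame() -> pd.DataFrame:
    return build_sample_movies()


def test_ratings_are_allowed(sample_data_frame):
    # A set comparison ignores the order in which ratings appear
    observed = set(sample_data_frame["rating"].unique())
    assert observed.issubset(ALLOWED_RATINGS)


def test_year_range(sample_data_frame):
    # between() is inclusive on both ends by default
    assert sample_data_frame["year"].between(1900, 2023).all()


def test_no_duplicate_titles(sample_data_frame):
    assert not sample_data_frame["title"].duplicated().any()
```

An in-memory fixture suits tests of transformation logic, while tests that validate the real file still need to read it as in the example above.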

Conclusion

Unit testing is the first line of defense against data quality issues. If we can’t ensure that our programs work as expected in the first place, no data quality, data observability, or data reliability product will solve our problems. For this reason, every data professional who uses Python should test their code before deploying it to production.

Thanks a lot for reading 🙏

If you liked the article, check out my other articles.

If you want to get in touch, you can find me on LinkedIn and Mentoring Club!