Testing practices for data science applications using Python

Data Science at Microsoft · Sep 28, 2022

By Jane Saldanha and Nile Wilson


Increasingly, data scientists are working with cross-disciplinary teams to implement models that enhance applications. Our strengths are typically in the exploratory and developmental phases of projects, such as data exploration, data preprocessing, and model training. However, as we move downstream into ML engineering, it is crucial for us to adopt software development best practices like version control and unit testing to ensure code functionality and integration.

The aim of this article is to make data scientists aware of, and comfortable with, testing practices that can be followed for machine learning applications when moving from development to production. We walk through the basic concepts of testing, show how to implement them using either of two Python packages (pytest and unittest), and then work through an example of testing a regression model.

Basic concepts of testing

Software testing is a way of validating whether software performs as a customer or product consumer would expect. There are various levels of software testing, each of which we cover in this section. To help solidify the concepts, let's supplement the explanations by tying them back to a non-code system that many of us are familiar with: a bicycle.

Unit testing is the first level of testing, in which independent modules and functions are tested to check whether they operate as expected. Testing parts independently is essential because it reduces the cost and time of diagnosing bugs in later stages of testing. For a bicycle, checks such as ensuring a tire maintains proper pressure or that the bell chimes could be considered unit tests.

Integration testing is the stage at which individual functions are combined and tested together to check whether they interact with each other and produce the correct result. For a bicycle, this could be turning the pedals to ensure the wheels spin.

System testing involves testing the entire system or product. For a bicycle, this could be turning the handlebar while the pedals are moving to ensure the bicycle turns.

Acceptance testing is done to evaluate whether the system’s operation aligns with the business requirements. In the bicycle example, there may be various types of user needs. Both a road bike and a mountain bike may pass system testing but have different user requirements.

Test coverage is a metric used to determine the percentage of the application code that the test cases are covering, primarily relying on unit tests. Coverage determines whether the product testing is effective, and, in turn, if the resulting product is of good quality. Test coverage helps identify bugs in the applications at an early stage by highlighting areas of the codebase that are not covered by the test cases and helps optimize testing by removing redundant cases or test cases that are no longer applicable. In terms of the bicycle example, this could be an inspection checklist showing every component that does not have some sort of quality or safety check already in place.

In this article, we focus primarily on the fundamental levels of testing, namely unit testing and integration testing, so that data scientists feel empowered to develop tests for code they may write as part of a larger engineering solution.

Testing frameworks for Python

As mentioned above, there are two main testing frameworks for Python: unittest, which is part of the standard library, and pytest, a widely used third-party package. Let's explore these two libraries.

unittest

Python uses the unittest library, part of the standard library, as its default testing framework. It was originally inspired by Java's JUnit library. Test cases created using this library go into a class that inherits from unittest.TestCase, and the framework supports aggregating tests into collections for test automation.

The commands used for executing test cases written using the unittest framework are as follows:

  1. For command line options:
    python -m unittest -h
  2. Running tests from various modules using CLI:
    python -m unittest test_module_1 test_module_2
  3. Executing tests from one test class:
    python -m unittest test_module.TestClass1
  4. Executing individual test functions:
    python -m unittest test_module.TestClass.test_method

Pytest

Pytest is a lighter-weight framework that requires less boilerplate code thanks to its rich feature set. It can be used for a wide range of testing, such as API testing, user interface testing, database testing, and more. With the pytest-cov plugin, pytest provides coverage reports in the terminal that highlight which statements across the included files are covered by the test cases, which are not, and the overall coverage percentage. It also reports the status of the test cases for the entire project. The report can also be generated as HTML, XML, or annotated source code.
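For example, with the pytest-cov plugin installed, a coverage report can be requested directly from the command line (the src path here is illustrative):

    pytest --cov=src --cov-report=term-missing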

The commands used for executing tests using the pytest framework are as follows:

  1. Executing a test file:
    pytest filename.py
  2. Executing a particular test case of a file:
    pytest filename.py::testcasename

Example

Consider the following function that needs to be tested.
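Based on the test cases discussed below, the function under test is a simple square() function; a minimal version might look like this:

    # calc.py (hypothetical module name used throughout these sketches)
    def square(num):
        """Return the square of a number."""
        return num * num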

The corresponding test case for the function in unittest is as follows:
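A sketch of that unittest test case, assuming square() lives in the hypothetical calc module:

    import unittest

    from calc import square  # hypothetical module containing square()


    class TestSquare(unittest.TestCase):
        def test_square(self):
            # The assertion method checks the condition and raises AssertionError on failure
            self.assertEqual(square(3), 9)


    if __name__ == "__main__":
        unittest.main()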

As you can see in the above snippet, writing test cases with the unittest library requires a test class that inherits from unittest.TestCase. Assertions are made with the framework's assertion methods (such as assertEqual), each of which tests a particular condition; if an assertion fails, it results in an AssertionError.

Now, let’s look at the test case for the same function square() using pytest:

The snippet above is more concise and readable than the test case written using the unittest framework. Assertion is done using the assert keyword together with comparison operators (==, !=, <=) or membership operators (in, not in), depending on the condition to be tested.

One similarity between pytest and unittest is the nomenclature of test cases: They both must start with test_ or end with _test to be recognized as test cases at execution time.

Mocking: An important concept in testing

Mocking is the process of replacing a dependency with a stand-in object that has certain expectations. These expectations could involve returning the result of a mathematical computation, validating a submethod, or even connecting to an external environment (such as Azure commands, for example). When writing unit test cases, it's important to make sure they are independent of other functions. Hence, it is crucial to mock all the dependencies so that the test verifies only that the function being tested is working properly.

Consider the following function that has a dependency on the square function:
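A hypothetical function with such a dependency, reused in the sketches that follow, might be:

    from calc import square  # hypothetical module containing square()


    def sum_of_squares(numbers):
        """Return the sum of the squares of the given numbers (illustrative only)."""
        return sum(square(n) for n in numbers)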

Now let’s explore how to write unit test cases using both the frameworks. The unittest framework provides a mock library, which is imported using the code from unittest import mock.

Let’s explore how the test case would look when written using pytest framework.

The library pytest-mock provides a mocker fixture that is used to patch the dependent functions. Patching means replacing the dependent function with another function that mimics the functionality or returns the value we choose. We will address what fixtures are in the next section.

Fixtures

Fixtures are functions that run before each test function to which they are applied. They are passed as parameters to the test case and are used to feed data to it. The data passed could be a URL, a dataframe, or a database connection, among others. A function can be declared as a fixture using the @pytest.fixture decorator. Consider the function:
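A hypothetical function of this kind, consistent with the new_add dependency described next (the module and function names here are assumptions):

    import pandas as pd

    from features import new_add  # hypothetical module; new_add() adds a new column


    def prepare_data(df: pd.DataFrame) -> pd.DataFrame:
        """Add a derived column via new_add() and flag the dataframe as processed."""
        df = new_add(df)        # dependency we will mock in the test
        df["processed"] = True  # illustrative follow-up step
        return df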

In order to test the above function, we have to mock the new_add function, which is responsible for adding the new column. We can use a fixture to supply the input dataframe and have the mock add the new column in place of the real function. The fixture and the test case are given below.
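A sketch of what that fixture and test case might look like, under the assumptions above:

    import pandas as pd
    import pytest

    from features import prepare_data  # hypothetical module from the sketch above


    @pytest.fixture
    def sample_df():
        # Small dataframe fed to the test case
        return pd.DataFrame({"a": [1, 2], "b": [3, 4]})


    def test_prepare_data(mocker, sample_df):
        # Mock new_add so the test does not depend on its real implementation;
        # the mock adds the new column in its place
        mocker.patch(
            "features.new_add",
            side_effect=lambda df: df.assign(total=df["a"] + df["b"]),
        )
        result = prepare_data(sample_df)
        assert "total" in result.columns
        assert "processed" in result.columns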

Full example of data science test code

To demonstrate data science code testing, let's consider a simple regression problem in which we predict the Cost of Living Index from various parameters such as Rank, City, Rent Index, Cost of Living Plus Rent Index, Groceries Index, Restaurant Price Index, and Local Purchasing Power Index (see the Jazz4299/PYTEST repository on GitHub).

The regression problem is executed through the process.py file in the src folder. It first calls the data ingestion module, which expects a path and returns an error if the path provided is incorrect or does not exist. Then the plot module is called, which generates plots for exploratory analysis and preprocesses the data (including removing null values, splitting the city column into city and country columns, and encoding categorical values). The last module, modelbuilding, is responsible for splitting the dataset, training the model, predicting the output, and creating visualizations of model performance.

Let’s look into the testing of this data science problem.

Unit testing using pytest

The dataloading.py file has one function to test, which covers two scenarios — one in which the path provided is correct and another in which an incorrect path is passed in.

Let’s look at the scenario where the path provided is correct:

Every test case begins with test_ or ends with _test for pytest to recognize it as a test case. The test case calls load_data(), which is the function being tested, and checks whether the object returned is a dataframe. When the path provided is incorrect, the function raises an exception, which can be handled using pytest.raises() as shown in the snippet below.
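A sketch of the invalid-path test; the paths listed are purely illustrative:

    import pytest

    from src.dataloading import load_data  # as in the previous sketch


    @pytest.mark.parametrize("bad_path", ["missing_file.csv", "wrong/folder/data.csv"])
    def test_load_data_invalid_path(bad_path):
        # An incorrect path should raise an exception rather than return a dataframe
        with pytest.raises(Exception):
            load_data(bad_path)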

We can see a decorator on top of the test case, pytest.mark.parametrize, which helps in declaring multiple sets of arguments and fixtures for the test function.

The plots.py file has two functions that store a histogram and a heatmap. These functions are called in the __init__ function, which means that to test one of the functions, the other one must be mocked.

Let’s explore the test case for hist_observation(). As shown below, the function heatmap is mocked using the mocker fixture. Then, the test case checks whether the file histogram.png is generated in the location in which it was intended.

The class ModelBuilding calls split, train, and predict functions in the __init__ function, and instantiates a Plot object. To test the above functions, the Plot object must be mocked. This can be achieved using the mocker.patch.object() function. The test case for test_split() is given below:
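A sketch of that test; the ModelBuilding constructor and the split() return signature are assumptions:

    import pandas as pd

    from src import modelbuilding  # module layout assumed from the repo


    def test_split(mocker):
        df = pd.DataFrame(
            {"Rent Index": [30.0, 45.0, 25.0, 50.0],
             "Cost of Living Index": [60.0, 75.0, 55.0, 80.0]}
        )
        # mocker.patch.object replaces the Plot class inside the modelbuilding
        # module, so instantiating ModelBuilding does not produce any figures
        mocker.patch.object(modelbuilding, "Plot")
        model = modelbuilding.ModelBuilding(df)  # assumed constructor signature
        X_train, X_test, y_train, y_test = model.split(df)  # assumed return values
        # All rows should be accounted for between the train and test splits
        assert len(X_train) + len(X_test) == len(df)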

Integration tests

When it comes to integration tests for data science code, how the actual test is triggered depends on the structure of your codebase and the solution in production. Recall that unit tests check that the individual functions work as expected, while integration tests ensure that functions interact with one another as expected while the whole solution executes.

Integration tests, like unit tests, should be run before pushing code to your production environment. With CI/CD, this could look like automatically triggering integration tests every time a pull request is created or updated, and blocking the merge of a feature branch into the environment (dev, QA, or prod) until all tests pass.

So, what does an integration test look like? An integration test essentially runs your whole solution on a small subset of your data (such as a toy dataset) so that you can observe its behavior and ensure everything works as expected; using a subset rather than the full dataset is a best practice because it keeps the test fast. In an MLOps scenario, the integration test may take the form of a yml file that specifies the main orchestrator Python file to run and its inputs. In the code snippet below, we see that our main Python file (CostOfLiving/src/process.py in the sample repo) runs the entire solution and takes in a few different input arguments, including which csv file to read in as the input data source.
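For illustration, the argument parsing in such an orchestrator file might look roughly like the following sketch (the argument names here are assumptions, not the actual process.py code):

    import argparse

    # Rough sketch of how an orchestrator script might accept its inputs
    parser = argparse.ArgumentParser(description="Run the Cost of Living regression solution")
    parser.add_argument("--data", type=str, help="Path to the input csv file")
    parser.add_argument("--output-dir", type=str, default="outputs", help="Where to write plots and metrics")
    args = parser.parse_args()

    # During an integration test, --data would point at a small toy csv instead of the full dataset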

Instead of feeding in the full dataset csv file, we would specify a smaller toy dataset csv file in the yml file; during CI/CD, that yml file would then be used to execute the code with the toy dataset as the integration test. While we do not cover pipeline construction and setting up an MLOps repository in this article, we note that this can be done on a variety of platforms including, but not limited to, GitHub, GitLab, and Azure DevOps. Some examples applicable to solutions using Azure services may be found in the Microsoft Azure Machine Learning documentation.

Conclusion

In this article, we have explored the basics of testing data science code and provided a real-world example that puts the principles into practice using the Python testing frameworks unittest and pytest. We hope that with these explanations and examples, which cover various scenarios that may arise during data science code development, you feel empowered to develop your own tests for data science applications.
