Test-Driven Development in Data Science

Mrigank Shekhar
MiQ Tech and Analytics
Nov 23, 2020

Much of a data scientist’s work is exploratory in nature. A good data scientist must know how to develop good hypotheses, reject those which aren’t working, and work with those which look good. It’s a very iterative process.

The consequence is that it’s counterproductive to burden the data scientist with too many rules and process restrictions. Jeremy Howard (the co-founder of fast.ai, among other things), for example, suggests that strictly following the PEP 8 convention may not be optimal for data scientists.

But there is danger in going too far in this direction. Any science relies on method and process for the accumulation of knowledge. As the Big Data, Analytics and Data Science industry is maturing, there is a need for more professionalism in the process. The need for agility shouldn’t compromise the quality of the code that is being written.

This blog will focus on one important practice that the Data Science industry can pick up from Software Engineering: the framework of Test-Driven Development, or TDD. However, because of certain fundamental differences between the two fields, some modifications must be made.

This article has the following outline:

  1. What is TDD?
  2. TDD for Data Science
  3. Use of TDD in Data Science

What is TDD?

A developer must follow certain steps during the development cycle: having a clear understanding of what the end result is supposed to be, coding the feature, ensuring that the feature is bug-free and edge cases are handled, and so on. TDD is a framework that defines best practices for this cycle.

In TDD, the basic approach is:

  • The developer must understand the feature and its requirements thoroughly before starting to code.
  • Now, instead of starting with the feature code immediately, they should write a test first. It must be ensured that the test fails (a test that passes even before the code is written would be meaningless).
  • Only now is the code written, so that the test passes. The developer must not write more code than is necessary to pass the test.

Then the next test is written, further code is added to pass it, and so on. Keep in mind that the code at this stage can be inelegant; the only important thing is to write tests and add code that passes them.

The growing codebase must be regularly cleaned up by the developer by refactoring the code. By running the tests and ensuring the refactored code also passes the tests, the developer can be confident of what they’re doing.
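As a minimal sketch of this cycle, consider a hypothetical function `ctr` that computes click-through rate (the function name and steps here are illustrative, not from the original article):

```python
# Step 1 (red): write the test first. At this point ctr() doesn't
# exist yet, so the test fails.
def test_ctr():
    assert ctr(clicks=5, impressions=100) == 0.05

# Step 2 (green): write just enough code to make the test pass.
# Step 3 (refactor): clean the code up while keeping the test green,
# e.g. guarding against division by zero.
def ctr(clicks, impressions):
    if impressions == 0:
        return 0.0
    return clicks / impressions
```

Running `pytest` after each step confirms that the test fails first, then passes, and keeps passing through the refactor.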

There are several benefits to this approach:

  • Flexibility: as the codebase grows and multiple people work on multiple features, test cases ensure that the old code continues to work as new code is added. Developer A doesn’t have to worry that their code breaks code developed by developer B, as A can run the tests to check that everything still works. As a result, the codebase remains flexible.
  • Documentation: because unit tests are isolated in nature, they explain in a simple manner what the code is intended to do. So if A faces issues with a feature developed by B, it’s convenient to look at the test cases, their input and output data, and the edge cases being handled.
  • Minimal debugging: because the code is verified throughout the process, very few bugs remain after the feature has been developed.
  • Better code design: breaking the feature into small, isolated pieces of code organically leads to a better and simpler implementation.
  • Edge cases: writing tests forces the developer to think of cases which may cause failures and need to be handled.

TDD for Data Science

Much of a data scientist’s work involves dealing with uncertainty. Unlike software development, a data scientist’s work involves experimentation with data without a clear idea of what the end result is going to look like.

Let’s take an example. TDD works best where a piece of code gives a predictable output for a given input. For example, in digital advertising there are certain base metrics (such as impressions, clicks, conversions, etc.). From these base metrics we derive certain KPIs (such as cost per click, conversion ratio, etc.). A unit test for code that calculates the KPIs from the base metrics is ideally suited to TDD.

A data scientist, however, often works with building multiple models and checking how well they predict for a given set of inputs. Because there is no fixed outcome that we can expect, a test cannot be written beforehand.

This doesn’t mean that TDD isn’t applicable to Data Science, though. There are several reasons why, despite this issue, TDD is needed in Data Science:

  • Data processing: the preprocessing steps in data science get all the benefits that TDD provides. For example, we can check that steps such as filtering, removing columns/rows with a large number of missing values, and feature engineering are working.
  • Working in teams: a data scientist is often not the only one working on a modeling project, so structuring code and writing simple asserts, unit tests, etc., helps the team.
  • Building a pipeline: in most companies, a data scientist is not only involved in modeling the data; they may also help build features that use the results of modeling or analysis.
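As an illustration of the first point, here is a sketch of a testable preprocessing step. The function `drop_sparse_columns` is hypothetical, and the table is represented as a plain dict of columns for simplicity:

```python
# Hypothetical preprocessing step on a dict-of-columns table:
# drop columns where more than `threshold` fraction of values are None.
def drop_sparse_columns(table, threshold=0.5):
    return {
        name: values
        for name, values in table.items()
        if values.count(None) / len(values) <= threshold
    }

def test_drop_sparse_columns():
    table = {
        "clicks": [1, 2, 3, 4],             # fully populated -> kept
        "revenue": [None, None, None, 10],  # 75% missing -> dropped
    }
    assert list(drop_sparse_columns(table)) == ["clicks"]
```

The same pattern applies to filtering rows, encoding features, and other deterministic preprocessing steps.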

Use of TDD in Data Science

Python has a few popular unit testing frameworks, such as unittest, doctest, and pytest. In recent years, pytest has become very popular, and it has some distinct advantages for data scientists: simple asserts, very little boilerplate code, strong support for test fixtures, etc. Let’s look at pytest.

There are a few common tools that a data scientist should have at hand; these generally suffice for a data scientist’s testing needs.

Fixtures

In unit tests, fixtures are functions that run before the test and feed data to it. For example, say several tests take a dataframe df as input. We can create a fixture that either constructs df or reads it from somewhere, and df can then be used as an input to the different tests.
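A minimal sketch of such a fixture with pytest and pandas (the column names here are illustrative):

```python
import pandas as pd
import pytest

@pytest.fixture
def df():
    # Shared test data: any test that takes `df` as an argument
    # will have pytest call this fixture to supply it.
    return pd.DataFrame({"clicks": [10, 20], "impressions": [1000, 800]})

def test_has_expected_columns(df):
    assert set(df.columns) == {"clicks", "impressions"}

def test_no_negative_impressions(df):
    assert (df["impressions"] >= 0).all()
```

If the data instead had to be read from a file or database, only the fixture would change; the tests themselves stay the same.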

Mocking

A unit test is an isolated piece of code; that is, it shouldn’t depend on other functions to run. To be more specific, say function fun2 calls function fun1. Issues in fun1 shouldn’t cause fun2’s tests to fail. So, instead of calling fun1 directly in fun2’s test, we “mock” it.

In short, mocking means creating functions that mimic the behavior of actual functions. So, in the example above, we test fun2 using a mocked fun1. A mocked fun1 returns a specific output for specific inputs, nothing else; it doesn’t run fun1’s actual code to generate the output. Hence, we are testing only fun2, assuming fun1 behaves as expected.
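A sketch of this using Python’s standard-library `unittest.mock.patch` (the bodies of `fun1` and `fun2` are illustrative):

```python
from unittest.mock import patch

# Hypothetical functions: fun2 depends on fun1.
def fun1(x):
    # Imagine this does something expensive or fragile,
    # e.g. a database read or a model prediction.
    raise RuntimeError("should not run during fun2's test")

def fun2(x):
    return fun1(x) + 1

def test_fun2_with_mocked_fun1():
    # Replace fun1 with a mock that simply returns a fixed value,
    # so fun2 is tested in isolation.
    with patch(f"{__name__}.fun1", return_value=10):
        assert fun2(5) == 11
```

Even though the real fun1 would raise an error, fun2’s test passes, because only fun2’s own logic is under test.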

Parametrize

This allows us to test multiple scenarios through one test function. For example, say a function can take one of several KPIs as an input. Through pytest’s parametrize marker, we can pass the different KPIs to the test function and check whether the function works for all of them.
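A sketch with `@pytest.mark.parametrize` (the KPI names and the `is_supported_kpi` helper are illustrative):

```python
import pytest

KPIS = {"cpc", "cpm", "conversion_ratio"}  # hypothetical supported KPIs

def is_supported_kpi(kpi):
    return kpi.lower() in KPIS

# One test function, run once per parameter value.
@pytest.mark.parametrize("kpi", ["CPC", "cpm", "Conversion_Ratio"])
def test_is_supported_kpi(kpi):
    assert is_supported_kpi(kpi)
```

pytest reports each parameter as a separate test case, so a failure for one KPI pinpoints exactly which scenario broke.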

Coverage

Coverage measures how much of the code is exercised by the unit tests. If the tests pass through every line of the code, then 100% of the code is covered. But say there is an if-else statement: while the test goes through the ‘if’ branch, the test’s data is such that the ‘else’ branch is never executed. In that case, the code under the ‘else’ branch wouldn’t count toward the coverage.
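A sketch of exactly this situation, with a hypothetical `label_ctr` function:

```python
# Hypothetical function with two branches.
def label_ctr(ctr):
    if ctr >= 0.05:
        return "high"
    else:
        return "low"

# This test exercises only the 'if' branch, so a coverage tool
# (e.g. pytest with the pytest-cov plugin, run as `pytest --cov`)
# would report the 'else' branch as uncovered.
def test_label_ctr_high():
    assert label_ctr(0.10) == "high"
```

Adding a second test with a low CTR value would bring the function to full branch coverage.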

Endnotes

Data scientists who have done most of their coding in interactive environments like Jupyter notebooks often pick up bad habits that become an issue when working in a corporate setup.

One good habit is to use standard functions for common processes like data cleaning, instead of each data scientist in a team writing their own pieces of code. Another good habit is to keep verifying that the code is bug-free, via assert statements or unit tests; otherwise even proof-of-concept code becomes difficult to debug.
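As a sketch of the assert-statement habit, a lightweight sanity check that works even inside a notebook (the `clean_metrics` function and its row format are illustrative):

```python
# Hypothetical cleaning step: drop rows with missing clicks, then
# sanity-check the result with a plain assert.
def clean_metrics(rows):
    cleaned = [r for r in rows if r.get("clicks") is not None]
    assert all(r["clicks"] >= 0 for r in cleaned), "negative click counts"
    return cleaned

rows = [{"clicks": 3}, {"clicks": None}, {"clicks": 7}]
print(clean_metrics(rows))  # -> [{'clicks': 3}, {'clicks': 7}]
```

Such asserts catch bad data early, and they migrate naturally into proper unit tests when the code moves out of the notebook.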

In summary, following TDD to the letter would be both impossible for a data scientist (because you cannot write a test for a model’s accuracy in advance) and wasteful (as much of the code will be discarded if it fails to deliver results). A suitably modified form of TDD, however, is very valuable for data scientists.
