Unit Test Your Data Pipeline

Introduction:

Aug 17, 2020

One common mistake that data scientists, especially beginners, make is not writing unit tests. Some argue that unit testing does not apply to their work because a model has no correct answer that can be known ahead of time and tested against. However, most data science projects start with data transformation. Even if you cannot test model output, you should at least test that the inputs are correct. A few simple, well-chosen tests will save you far more time than they take to write, especially on large projects or with big data.

Benefits of Unit Testing

Detect bugs earlier: Running big data projects is time-consuming. You don’t want to discover an unexpected output after a 3-hour run when you could easily have avoided it.

Easier to update code: You will no longer be afraid of changing your code, because you know what to expect and can easily tell what broke when something breaks.

Push you toward better-structured code: When you keep in mind that you are going to test your code in isolated pieces, you will write cleaner code and prefer DAGs over linearly chained functions. (Use d6tflow to build data science workflows easily.)

Give you confidence in the outputs: Bad data leads to bad decisions. Running unit tests gives you confidence in data quality. You know your code outputs what you want it to output.

Pytest

To improve testing efficiency, use Pytest. If you are looking for tutorials on Pytest, I would recommend Dane Hillard’s post Effective Python Testing With Pytest. In his post you will find out how to utilize basic and advanced Pytest features.
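To give a feel for what a Pytest test looks like, here is a minimal sketch. The `add_total` transformation, the column names, and the `orders` fixture are all made up for illustration; they are not from Hillard's post. Saved in a file named something like `test_pipeline.py`, it would be picked up when you run `pytest` from the project folder.

```python
import pandas as pd
import pytest


def add_total(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation step: add a row-wise total column."""
    out = df.copy()
    out["total"] = out["price"] * out["quantity"]
    return out


def make_orders() -> pd.DataFrame:
    """Small hand-written input so the expected output is easy to verify."""
    return pd.DataFrame({"price": [2.0, 3.5], "quantity": [3, 2]})


@pytest.fixture
def orders() -> pd.DataFrame:
    # pytest passes this DataFrame to any test that has an `orders` argument
    return make_orders()


def test_add_total(orders):
    result = add_total(orders)
    assert list(result["total"]) == [6.0, 7.0]
    # the input frame should not be modified in place
    assert "total" not in orders.columns
```

Keeping the sample data tiny and hand-written makes it obvious what the right answer is, which is the whole point of a unit test.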

Unit Testing for Data Science

What you check with unit tests depends on your project, but some tests are common to most data science solutions.

1. Missing values

# catch missing values
assert df['column'].isna().sum() == 0
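When several columns must be complete, it helps if the failing test tells you which column is the problem. The sketch below is one way to do that; the helper name `columns_with_missing` and the sample columns are assumptions for illustration.

```python
import pandas as pd


def columns_with_missing(df: pd.DataFrame, required: list) -> dict:
    """Return {column: count of missing values} for required columns that have any."""
    counts = df[required].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


def test_no_missing_values():
    # made-up sample data; in a real test this would be your pipeline output
    df = pd.DataFrame({"id": [1, 2, 3], "price": [9.5, 3.0, 4.2]})
    missing = columns_with_missing(df, ["id", "price"])
    # the message shows up in the pytest report when the assertion fails
    assert missing == {}, f"missing values found: {missing}"
```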

2. Duplicates

# check there is no duplicate
assert len(df['id'].unique())==df.shape[0]
assert df.groupby(['date','id']).size().max()==1
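The assertions above tell you that duplicates exist, but not where. A possible variation, sketched here with a hypothetical helper `find_duplicates` and made-up data, uses pandas' `DataFrame.duplicated` so the failure message lists the offending rows.

```python
import pandas as pd


def find_duplicates(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return every row whose key combination appears more than once."""
    # keep=False marks all members of a duplicated group, not just the repeats
    return df[df.duplicated(subset=keys, keep=False)]


def test_date_id_is_unique():
    df = pd.DataFrame(
        {
            "date": ["2020-08-01", "2020-08-01", "2020-08-02"],
            "id": [1, 2, 1],
            "value": [10, 20, 30],
        }
    )
    dupes = find_duplicates(df, ["date", "id"])
    assert dupes.empty, f"duplicate keys:\n{dupes}"
```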

3. Shapes

# have data for all ids?
assert df['id'].unique().shape[0] == len(ids)

# function returns have shapes as expected
assert all(some_function(df).shape == df.shape for df in dfs)
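Here is a small runnable sketch of both shape checks. The `missing_ids` helper and the `demean` transformation are hypothetical stand-ins; comparing sets instead of counts has the advantage of naming the ids that have no data.

```python
import pandas as pd


def missing_ids(df: pd.DataFrame, expected_ids) -> set:
    """Return the expected ids that never appear in df['id']."""
    return set(expected_ids) - set(df["id"])


def demean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation that should keep the input's shape."""
    return df - df.mean()


def test_all_ids_present():
    df = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})
    absent = missing_ids(df, expected_ids=[1, 2, 3])
    assert not absent, f"ids with no data: {sorted(absent)}"


def test_demean_preserves_shape():
    dfs = [pd.DataFrame({"a": [1.0, 2.0]}), pd.DataFrame({"a": [1.0], "b": [3.0]})]
    assert all(demean(df).shape == df.shape for df in dfs)
```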

4. Value Ranges

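# percentages within each date should sum to 1 (allowing for floating-point error)
assert ((df.groupby('date')['percentage'].sum() - 1).abs() < 1e-9).all()
# no single percentage exceeds 1
assert (df['percentage'] <= 1).all()
# no name has a budget above 1000
assert (df.groupby('name')['budget'].max() <= 1000).all()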
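Two things are easy to trip over with range checks: a comparison on a grouped result gives one True/False per group, so it has to be reduced with `.all()`, and floating-point sums may not equal 1 exactly. A minimal sketch, assuming a hypothetical helper `sums_to_one` and made-up data, uses `numpy.isclose` for the comparison.

```python
import numpy as np
import pandas as pd


def sums_to_one(df: pd.DataFrame, group_col: str, value_col: str) -> bool:
    """True if value_col sums to 1 within every group, up to floating-point tolerance."""
    totals = df.groupby(group_col)[value_col].sum()
    return bool(np.isclose(totals, 1.0).all())


def test_percentages_sum_to_one_per_date():
    df = pd.DataFrame(
        {
            "date": ["2020-08-01"] * 3 + ["2020-08-02"] * 2,
            "percentage": [0.1, 0.2, 0.7, 0.5, 0.5],
        }
    )
    assert sums_to_one(df, "date", "percentage")
    # every individual share must also be a valid fraction
    assert df["percentage"].between(0, 1).all()
```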

5. Join Quality

d6tjoin has checks for join quality.

assert d6tjoin.Prejoin([df1,df2],['date','id']).is_all_matched()
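If you would rather not add a dependency, a similar check can be sketched with plain pandas. This is not how d6tjoin works internally; it is an alternative that relies on the `indicator=True` option of `DataFrame.merge`. The helper name `unmatched_keys` and the sample frames are assumptions for illustration.

```python
import pandas as pd


def unmatched_keys(left: pd.DataFrame, right: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return key combinations that appear in only one of the two frames."""
    merged = left[keys].drop_duplicates().merge(
        right[keys].drop_duplicates(), on=keys, how="outer", indicator=True
    )
    # the _merge column is 'both', 'left_only' or 'right_only'
    return merged[merged["_merge"] != "both"]


def test_join_keys_all_match():
    df1 = pd.DataFrame({"date": ["2020-08-01", "2020-08-02"], "id": [1, 1], "x": [5, 6]})
    df2 = pd.DataFrame({"date": ["2020-08-01", "2020-08-02"], "id": [1, 1], "y": [7, 8]})
    unmatched = unmatched_keys(df1, df2, ["date", "id"])
    assert unmatched.empty, f"keys without a match:\n{unmatched}"
```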

6. Preprocess Functions

assert preprocess_function("name\t10019\n")==["name",10019]
assert preprocess_missing_name("10019\n") is None
assert preprocess_text("Do you Realize those are genetically modified food?" ) == ["you","realize","gene","modify","food"]
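Preprocessing functions usually need several input cases, and Pytest's `parametrize` marker keeps those in one table. The `parse_line` function below is a hypothetical implementation written only to make the sketch runnable; your own preprocessing functions will differ.

```python
import pytest


def parse_line(line: str):
    """Hypothetical parser: turn a 'name<TAB>id' line into [name, int(id)].

    Returns None when the line does not have both fields.
    """
    fields = line.strip().split("\t")
    if len(fields) != 2 or not fields[0]:
        return None
    name, raw_id = fields
    return [name, int(raw_id)]


@pytest.mark.parametrize(
    "line, expected",
    [
        ("name\t10019\n", ["name", 10019]),  # well-formed line
        ("10019\n", None),                   # name field missing
        ("", None),                          # empty line
    ],
)
def test_parse_line(line, expected):
    assert parse_line(line) == expected
```

Each tuple runs as its own test, so one bad case does not hide the others in the report.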

What Makes pytest So Useful?

If you’ve written unit tests for your Python code before, then you may have used Python’s built-in unittest module. unittest provides a solid base on which to build your test suite, but it has a few shortcomings.

A number of third-party testing frameworks attempt to address some of the issues with unittest, and pytest has proven to be one of the most popular. pytest is a feature-rich, plugin-based ecosystem for testing your Python code.

If you haven’t had the pleasure of using pytest yet, then you’re in for a treat! Its philosophy and features will make your testing experience more productive and enjoyable. With pytest, common tasks require less code and advanced tasks can be achieved through a variety of time-saving commands and plugins. It will even run your existing tests out of the box, including those written with unittest.
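To make the "less code" point concrete, here is the same check written both ways. The `normalize_name` function is a made-up cleaning step used only for this comparison. Pytest will collect and run both the `unittest.TestCase` class and the plain function.

```python
import unittest


def normalize_name(name: str) -> str:
    """Hypothetical cleaning step: collapse whitespace and title-case a name."""
    return " ".join(name.split()).title()


# unittest style: a class, a method, and a dedicated assertion method
class TestNormalizeNameUnittest(unittest.TestCase):
    def test_strips_and_titles(self):
        self.assertEqual(normalize_name("  ada   lovelace "), "Ada Lovelace")


# pytest style: a plain function and a plain assert
def test_normalize_name_pytest_style():
    assert normalize_name("  ada   lovelace ") == "Ada Lovelace"
```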

As with most frameworks, some development patterns that make sense when you first start using pytest can start causing pain as your test suite grows. Hillard’s tutorial will help you understand some of the tools pytest provides to keep your testing efficient and effective even as it scales.

I hope this helps you in your own learning.

No matter what books, blogs, courses, or videos one learns from, when it comes to implementation everything might look “out of syllabus”.

Best way to learn is by doing!
Best way to learn is by teaching what you have learned!

Never give up!

See you on LinkedIn!

Oscar Rojo Martín - Studying Data Science at Universidad de Deusto - San Sebastián, Basque Country, Spain | LinkedIn

References:

* https://www.kdnuggets.com/2020/08/unit-test-data-pipeline-thank-yourself-later.html 
* Coauthored with Haijing Li, Data Analyst in Financial Services, MS Business Analytics@Columbia University.