Breaking down Data Science problems with Test Driven Development

Using tests to break down and analyse our existing proofs of concept.

Christopher Sidebottom
Sainsbury’s Data & Analytics
7 min read · Jan 8, 2020


Waffle ready for breaking down — from Warsaw

I recently spoke at PyData Warsaw about how Test Driven Development (TDD) is often considered tedious and difficult to integrate into a Data Science workflow. This post is a follow-up to a previous blog post that describes an approach I use regularly within the Sainsbury’s Applied Data Science and Machine Learning team, an approach founded on my experience at one of the UK’s largest video websites: BBC iPlayer.

Data Scientists approach code differently from engineers, with more prototyping and data analysis involved. In large organisations, Test Driven Development is a powerful tool for showing the purpose of Data Science work. Approaching code differently also means approaching testing differently, and introducing testing into Data Science starts with lowering the barrier to entry. That means embracing and building upon the notebooks we have already built.

Pokémon and Sainsbury’s Data Science aren’t so different

In this post, I want to give a practical guide to breaking down a Data Science notebook. Uncle Bob, one of the leading advocates for TDD, has a good introduction to the cycles involved. With that in hand, let’s use a notebook everyone can access: “Predict result for Pokémon battles” by Alexssandro Oliveira. This is a great example of a typical piece of work being transferred from a proof of concept into a production pipeline, because:

  • It’s neatly refactored into functions, so we can infer a fair amount from reading it. This is the final step for every proof of concept at Sainsbury’s.
  • Sometimes the business is not ready for the model. We leave models for an unknown length of time until we are ready to integrate them into a production system.
  • Data in a proof of concept can be inadequate for production, meaning we need to wait for other teams to resolve that dependency.

It’s important at this point to draw parallels between the typical software prototype and a notebook. Both are for exploratory work and proving out ideas. Production code needs to work for the business to operate, and running prototype code in production is a risk to business operations. Prototypes are proving grounds which do not need automated tests, or to work beyond a small subset of a problem. The same is true for notebooks: you can write a lot of code in a notebook to prove an idea before rewriting it as a production pipeline.

Breaking down the prototype within the notebook

Start of a pipeline test

Breaking down this entire notebook in one go is difficult, and the test shown above is where I stopped. Working from the outermost layer of tests is important, as it creates tests that break down the business logic. As we work inwards, we focus on naturally smaller pieces of code that make sense for us to test. Now that there is a partial outer test for the notebook, let’s look at the apply_pipeline function in the notebook. Working from the outside in lets us break down the code in a natural way, rather than trying to design it all up front.
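
Since the original test is embedded as a gist, here is a minimal sketch of the shape such an outer test can take. The run_notebook entry point, the module name and the fixture paths are my assumptions rather than the notebook’s actual names, and the expected CSV would be a snapshot saved from a run of the original notebook.

import pandas as pd
from pandas.testing import assert_frame_equal

from notebook_pipeline import run_notebook  # hypothetical entry point wrapping the whole notebook


def test_notebook_end_to_end():
    # Raw inputs plus a snapshot of the notebook's final output, saved while the notebook still ran.
    combats = pd.read_csv("tests/fixtures/combats.csv")
    pokemon = pd.read_csv("tests/fixtures/pokemon.csv")
    expected_predictions = pd.read_csv("tests/fixtures/expected_predictions.csv")

    predictions = run_notebook(combats, pokemon)

    assert_frame_equal(predictions, expected_predictions)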

First test of the feature pipeline

This test is now focussed on the feature transformations. I ran combats.head() to grab a small sample of the data, then used to_csv to quickly generate test data. This replicates the notebook habit of checking your data as you go along. In this case, TDD is being used to ensure we break down the notebook, and to ensure it works as it did previously.
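
As a rough sketch, assuming apply_pipeline takes the combats and pokemon data frames and returns the transformed features (the signature and fixture paths are my guesses, not the notebook’s exact API):

import pandas as pd
from pandas.testing import assert_frame_equal

from features import apply_pipeline  # hypothetical module holding the notebook's function


def test_apply_pipeline_matches_notebook_sample():
    # Small sample written out of the notebook with combats.head() and to_csv
    combats_sample = pd.read_csv("tests/fixtures/combats_sample.csv")
    pokemon = pd.read_csv("tests/fixtures/pokemon.csv")
    # The notebook's output on the same sample, also saved with to_csv
    expected = pd.read_csv("tests/fixtures/expected_features.csv")

    actual = apply_pipeline(combats_sample, pokemon)

    assert_frame_equal(actual, expected)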

Breaking down the first section of the apply_pipeline function

Starting from a function I haven’t seen before is sometimes difficult. In this case, it’s safer to assume a sensible order of operations and break it down line by line. This highlights how to start writing tests that cover the broken-down sections.

Test for the new columns

From this test, you can naturally see the small chunk of work being done and what it does.
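
A sketch of what such a test can look like, assuming the broken-out chunk is wrapped in a function; add_stat_columns is a placeholder name of mine, and combats_with_columns.csv is the expected-output fixture shown next.

import pandas as pd
from pandas.testing import assert_frame_equal

from features import add_stat_columns  # placeholder name for the first broken-out chunk


def test_add_stat_columns():
    combats_sample = pd.read_csv("tests/fixtures/combats_sample.csv")
    pokemon = pd.read_csv("tests/fixtures/pokemon.csv")
    expected = pd.read_csv("tests/fixtures/combats_with_columns.csv")

    actual = add_stat_columns(combats_sample, pokemon)

    assert_frame_equal(actual, expected)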

Breakdown of the combats_with_columns.csv

And these small fixtures are easy to interpret, just as they would be in a notebook. You can now write many tests exercising this function with different data; the permutations are limited only by your imagination. But don’t worry if you cannot catch them all: nobody can imagine all possible data.

Continuing to write small tests

Now this pattern repeats, with different tests. In the tests above, two new approaches are taken. Preprocessed data is repurposed as the input data for the add_type_columns function, and a large subset of the data is passed to add_win_columns to ensure it is representative. This shows how important understanding the context is when writing any software, and especially when writing tests using this approach. The tests become the holder of the intimate knowledge that Data Scientists have about their data.
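
A sketch of those two tests, reusing the function names from the notebook breakdown but guessing at their arguments and fixture files:

import pandas as pd
from pandas.testing import assert_frame_equal

from features import add_type_columns, add_win_columns  # hypothetical module for the broken-out functions


def test_add_type_columns():
    # The expected output of the previous step is repurposed as the input here.
    combats_with_columns = pd.read_csv("tests/fixtures/combats_with_columns.csv")
    expected = pd.read_csv("tests/fixtures/combats_with_type_columns.csv")

    assert_frame_equal(add_type_columns(combats_with_columns), expected)


def test_add_win_columns():
    # A larger subset is used here so the win statistics stay representative.
    combats_large_sample = pd.read_csv("tests/fixtures/combats_large_sample.csv")
    expected = pd.read_csv("tests/fixtures/combats_with_win_columns.csv")

    assert_frame_equal(add_win_columns(combats_large_sample), expected)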

Revisiting the notebook with more understanding

Second attempt at breaking down the first section of the apply_pipeline function

As you write these tests, the underlying structure of the function becomes clearer. This is where tests become less tedious for me in my day-to-day work, because I don’t have to over-analyse large chunks of work. The natural question I found myself asking was: “why do I need this preprocessed data frame so much?” This led to the second breakdown shown above. Now we can see that the preprocessing can be broken up into more natural pieces.

add_speed_column with preprocessing built-in
add_speed_column with a bit of refactoring

Taking the two pieces of code and putting them together creates small functions which each do their work in isolation.
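
The original code is embedded as gists, so here is an illustrative before-and-after. It assumes the speed column is derived from the two combatants’ Speed stats and uses column names from the public Pokémon battles dataset; the exact logic and names in the notebook may differ.

import pandas as pd


# Before: the function repeats the merge against the raw pokemon table that
# every other feature function also needs.
def add_speed_column_with_preprocessing(combats: pd.DataFrame, pokemon: pd.DataFrame) -> pd.DataFrame:
    merged = combats.merge(
        pokemon[["#", "Speed"]], left_on="First_pokemon", right_on="#"
    ).merge(
        pokemon[["#", "Speed"]],
        left_on="Second_pokemon",
        right_on="#",
        suffixes=("_first", "_second"),
    )
    merged["Speed_diff"] = merged["Speed_first"] - merged["Speed_second"]
    return merged


# After: the merge lives in a single preprocessing step, and the feature
# function only does its one small piece of work in isolation.
def add_speed_column(preprocessed: pd.DataFrame) -> pd.DataFrame:
    result = preprocessed.copy()
    result["Speed_diff"] = result["Speed_first"] - result["Speed_second"]
    return result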

Relatively clean feature transformation

Putting these together leads to a much leaner feature transformation function which is easier to understand. Once the feature transformation test passes, we are confident our production features work like our proof of concept. This is a great example of why I love TDD: it gives me an outer loop that verifies my work behaves exactly as we originally built it.
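
Roughly, the leaner transformation ends up in this shape, reusing the placeholder function names from the sketches above (they are my stand-ins, not the notebook’s exact API; after the refactoring each feature function takes the preprocessed frame):

import pandas as pd

from features import (  # hypothetical module collecting the broken-out functions
    preprocess,
    add_stat_columns,
    add_type_columns,
    add_win_columns,
    add_speed_column,
)


def apply_pipeline(combats: pd.DataFrame, pokemon: pd.DataFrame) -> pd.DataFrame:
    # One preprocessing step, then each small feature function does a single job.
    features = preprocess(combats, pokemon)
    features = add_stat_columns(features)
    features = add_type_columns(features)
    features = add_win_columns(features)
    features = add_speed_column(features)
    return features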

This shows the rapid approach to rolling out these data-oriented tests. Additional tests should of course be added for different data inputs. Traditional testing would start with the most basic case: the empty data frame. This is a great example of a test you should have in production to ensure your code is robust to error cases, and other error cases can be encapsulated in further tests of this function. This is also helpful for fixing production bugs, as you can start from a failing test. Gamification of bug fixing!
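
For example, a minimal empty-data-frame test might look like this, assuming the column names of the public combats data and the apply_pipeline sketch above:

import pandas as pd

from features import apply_pipeline  # hypothetical module from the sketches above


def test_apply_pipeline_handles_empty_input():
    # Column names assumed from the public Pokémon combats dataset.
    empty_combats = pd.DataFrame(columns=["First_pokemon", "Second_pokemon", "Winner"])
    pokemon = pd.read_csv("tests/fixtures/pokemon.csv")

    result = apply_pipeline(empty_combats, pokemon)

    # An empty input should yield an empty, well-formed output rather than an exception.
    assert result.empty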

Testing the whole notebook - including the model

Testing the model

Finally, it is important to test the model. I did this using the ping pong game I wrote about previously. This test is another data snapshot of the output from the notebook, which also ensures we have fixed the random seeds. We can now prove we can achieve exactly the same predictions as in the notebook. Accuracy is also tied to the notebook, which states 96% accuracy. With the assertion of >=0.95, the accuracy can vary but will always fall within our acceptable limit. If a new version of a library impacts the model, you can see whether it changes the accuracy and compare the predictions it produces against the snapshot.
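
A sketch of that model test, assuming the model code is wrapped in a train_and_predict helper that fixes its random seeds internally; the helper name, its return values and the fixture paths are my assumptions:

import pandas as pd
from pandas.testing import assert_frame_equal

from model import train_and_predict  # hypothetical wrapper around the notebook's model code


def test_model_matches_notebook():
    features = pd.read_csv("tests/fixtures/expected_features.csv")
    expected_predictions = pd.read_csv("tests/fixtures/expected_predictions.csv")

    # Seeds are fixed inside train_and_predict, so the prediction snapshot is reproducible.
    predictions, accuracy = train_and_predict(features)

    assert_frame_equal(predictions, expected_predictions)
    # The notebook reports roughly 96% accuracy; allow it to vary within our acceptable limit.
    assert accuracy >= 0.95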

In a world where explainability and understanding of models are so important, TDD ensures we can test every aspect of them easily. By taking a faster approach, we can quickly deliver Data Science into production as the business requirements change, without compromising on stability and robustness. Here at Sainsbury’s we have so much data, and so many options for what to do with it, that we need to respond quickly to new data becoming a reality.


Christopher Sidebottom
Sainsbury’s Data & Analytics

Currently building ML things at Arm, but content is all my own. Interested in coffee and building things that make coffee.