Nobody wants wrong estimates in a business-critical analysis, or to introduce errors and unexpected behavior in production. So how do we avoid that when we have limited time and resources and want the most bang for our buck?

In this article we discuss the benefits of writing tests on the distributional properties of your results first.

Other types of testing on your data science pipelines are relevant too, such as input validation or checking how transformations behave on known data sets.
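To illustrate the "known data sets" idea, here is a minimal sketch: `normalize_revenue` is a hypothetical transformation, and the input is small enough to check the expected output by hand.

```python
import pandas as pd

def normalize_revenue(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: express revenue as a share of the total."""
    out = df.copy()
    out["revenue_share"] = out["revenue"] / out["revenue"].sum()
    return out

# A small, hand-checkable input makes the expected output obvious.
known = pd.DataFrame({"revenue": [10.0, 30.0, 60.0]})
result = normalize_revenue(known)

assert result["revenue_share"].tolist() == [0.1, 0.3, 0.6]
assert abs(result["revenue_share"].sum() - 1.0) < 1e-9
```

Because the input is known, any change in the transformation's behavior shows up as a failing assertion rather than a silently wrong number downstream.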

But the easiest tests to write, and the ones that catch the most errors, are distributional tests…


TL;DR: PyCharm’s probably better

Image by Gerd Altmann from Pixabay

Jupyter notebooks have three particularly strong benefits:

  • They’re great for showcasing your work. You can see both the code and the results. The notebooks on Kaggle are a particularly good example of this.
  • It’s easy to use other people’s work as a starting point. You can run it cell by cell to get a better understanding of what the code does.
  • They’re very easy to host server-side, which is useful for security. A lot of data is sensitive and should be protected, and one step toward that is ensuring no data is stored on local machines. …


As data scientists and consultants, we discuss a lot of ideas involving data science. Ideas of the type:

Data -> Data Science -> Valuable results

Let’s look at some questions that can help you evaluate whether this is a good idea. Not all of these questions need to have a positive answer, as there is no perfect project. It’s about managing risk vs. potential, considering how your work will enable better ideas in the future, and making good bets. These are approximately the questions I try to ask when planning large projects or evaluating new initiatives. …


ICE is a framework for selecting feature work based on three key dimensions, helping teams make better decisions about which feature work to prioritise:

  • Impact: How much will this change our business (if it works)?
  • Confidence: How confident are we that this will work?
  • Ease: How much work is it to implement?

Rate each feature on a scale from 1 to 10, then multiply to get the score:

ICE = (Impact) * (Confidence) * (Ease)

The feature with the highest score wins. For more information, see for example here or here.
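The scoring above is simple enough to sketch in a few lines. The feature names and ratings below are hypothetical, purely to show the mechanics:

```python
# Hypothetical feature candidates, each rated 1-10 on the three ICE dimensions.
features = {
    "churn model":      {"impact": 8, "confidence": 4, "ease": 5},
    "data validation":  {"impact": 5, "confidence": 9, "ease": 8},
    "realtime scoring": {"impact": 9, "confidence": 3, "ease": 2},
}

def ice_score(f: dict) -> int:
    """ICE = Impact * Confidence * Ease."""
    return f["impact"] * f["confidence"] * f["ease"]

# Rank candidates from highest to lowest ICE score.
ranked = sorted(features, key=lambda name: ice_score(features[name]), reverse=True)

# data validation: 5*9*8 = 360; churn model: 8*4*5 = 160; realtime scoring: 9*3*2 = 54
assert ranked[0] == "data validation"
```

Note how multiplying rather than averaging punishes low scores on any single dimension: a high-impact feature with very low confidence and ease ends up last.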

This scheme might seem simple, but that is also one of the upsides. Estimation is really hard, so trying…


Summary: Because most data science tasks have uncertain outcomes, it is easy to create a lot of work in progress that doesn’t go anywhere. Using Kanban enables us to keep track of that work in progress. We also find it to be a very lightweight project-management tool, with minimal ceremony around our estimation work.

We are assuming you are already looking at an agile process methodology. As our main background is consulting, we often work on early-phase, immature problems where the level of confidence is naturally low and there are many open solution spaces to explore.

Some…

Steffen Sjursen

Data scientist
