The when and how of Jupyter notebooks in Data Science projects

Jupyter notebooks are much appreciated by Data Scientists, and sometimes overused. When and how should you use notebooks instead of regular code? Here are my thoughts after years of experience.

Francesco Calcavecchia
8 min read · Dec 19, 2021

Raise a hand if you are a Data Scientist and:
- you re-run a notebook and get different results
- you could not re-run a notebook because of some error
- you went back to a Jupyter notebook with important results, and found that actually there was a mistake in the code
- you developed a successful Proof of Concept (POC) using Jupyter notebooks, but bringing it to production was a terrible struggle
- you had to make a small modification to the algorithm built in your Jupyter notebook, and it took way more pain and time than expected
- you had a hard time reading through a Jupyter notebook that used some custom function/classes

Either you are a spider, or I bet you did not have enough hands to raise, right?

“Say it again…?”

If you have read this far, it is probably because you sense that there is something you could improve in your use of Jupyter notebooks. Perhaps you use them in inappropriate situations, or you should use them differently, or a combination of the two. Well, you are in luck: this article is about exactly this topic. Unfortunately, this article is not the ultimate truth, only my organized thoughts after years of experience. But I am confident that what you read will trigger some thinking on your side that will help you on this matter. If you share your thoughts in the comments, I and the other readers will greatly appreciate it!

For the sake of brevity, in the rest of the article I will simply say “notebook” instead of “Jupyter notebook”.

The good

Results, fast!

That’s the common saying, and it is definitely true: if you want to do some exploration/prototyping and get results as fast as possible, then go for notebooks. Have you ever wondered why that is?

My opinion is that you are faster because you are in a permanent debugging mode. Whatever you write can be checked immediately. If something goes wrong, there is no need to re-run all the code from scratch: just the last cell. If it is useful, you can visualize your results or run extra inspections and delete them later on. In short, you can code incrementally: one iteration at a time.

Notice that by coding incrementally, you are also typically writing code in a serial way, meaning that a piece of code in a notebook depends only on the code above it. As said: typically, not necessarily. But this is actually what leads to the next good side of notebooks.

Code as an illustrated novel

Clean Code should read like well-written prose. But often it leads to the same feeling you may have when reading philosophy: “Ok, so what?”, “Where should I start reading?” (which file? which class?), “How is this piece of code supposed to be used, and why is it useful?”.
Code per se does not come with examples and graphs of results, even though, of course, you can include them through tests and examples.

Notebooks come to the rescue: they should read like well-written novels. Easy to follow, concrete, and even with pictures. You can read notebooks from top to bottom, no questions asked. They include the results you are interested in. Readability at its best, even for dummies.

The bad

No code abstractions

Sometimes, while working on a notebook, you will want/need to add code abstractions, such as functions or even classes. Typical examples for data scientists: clean_data(), evaluate_model(), and plot_results().

Assuming that you know your stuff (you avoid using global variables in these functions, and you define them in tested .py files that you simply import in the notebook), there is still one major problem. Try to read the following code snippet and see if you can spot it.
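
Something along these lines (a sketch built from the functions mentioned above, plus the placeholders load_data() and train_model()):

    data = load_data("dataset.csv")
    data = clean_data(data)
    model = train_model(data)
    results = evaluate_model(model, data)
    plot_results(results)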

What is the problem? The problem is that such a code snippet could apply to pretty much anything! When I read the notebook I would like to know: how are you cleaning the data? Which model are you using, and with which hyper-parameters? How do you split the training and validation sets? I could go on, but I think you get my point. Remember: a notebook should read like a well-written novel, from top to bottom. Here, I need to interrupt my reading to look up these function definitions.

Notice that you don’t have this problem when you are using Python packages like pandas or tensorflow. The problem occurs specifically with functions defined ad hoc for the notebook you are working on. If you think about it, it makes sense: packages are built to work in a variety of contexts, so they abstract the context away. Therefore, all the context-specific information has to be provided explicitly, and it remains visible in the notebook.

If you agree with my point of view, you will conclude with me that you have two options if you want to keep the high-class readability of notebooks:

  • notebook code should consist of simple scripts. The keywords def and class are banned.
  • abstractions (functions and classes) should be imported from general-purpose Python packages.
    .py files next to the notebook won’t do, sorry.

And this is what is intrinsically bad about notebooks: if you want to benefit from their good sides, you should avoid introducing ad-hoc abstractions.
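
For contrast, here is how such a workflow might look when written flat, importing only from general-purpose packages (a sketch assuming pandas and scikit-learn, with placeholder file and column names):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # every context-specific choice is plainly visible in the notebook
    df = pd.read_csv("dataset.csv").dropna()
    X, y = df.drop(columns="target"), df["target"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=200, max_depth=5)
    model.fit(X_train, y_train)
    accuracy_score(y_val, model.predict(X_val))

Now the reader sees at a glance how the data is cleaned, which model is used, and how the split is made, without leaving the notebook.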

The ugly

When I started writing this article, I had many drawbacks in mind. But then I did my research and found out that there are a lot of solutions to address them. Still, I consider them ugly, as they require extra effort that makes you lose, to a certain degree, the speed advantage of notebooks.

Versioning and code reviews

Notebook files contain a lot of metadata, which makes it very difficult to compare changes in, say, a git merge request. A simple solution is Jupytext, which keeps your .ipynb file in sync with a plain .py file. In this way it is easy to keep track of code changes, but not of the cells’ output, and sometimes that is important too: for example, if your notebook is an analysis report for the business, the cells’ output is actually more important than the code itself. Imagine the situation where you re-run the same notebook, just with a new, more up-to-date input file, and want to look at the differences… Here nbdiff-web comes to the rescue, but it requires some extra effort.
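
To give an idea of the workflow (the file names are placeholders):

    $ jupytext --set-formats ipynb,py:percent analysis.ipynb   # pair the notebook with a plain .py file
    $ jupytext --sync analysis.ipynb                           # propagate edits between the two
    $ nbdiff-web old_analysis.ipynb new_analysis.ipynb         # rich web diff, including cell outputs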

So here there is a big distinction to make: is your notebook a stand-alone data analysis (explorative notebook), or is it meant to be a script integrated into a production pipeline (script notebook)? In the first case you had better use nbdiff-web. In the second case, go for Jupytext or, even better, Ploomber, which integrates it and adds many other useful features.

Tests

As already discussed, usually you don’t write tests in notebooks, but rather debug along the way. However, if later on you make changes, or use objects in a way that was not intended at the beginning, you may end up in trouble. Automated tests would come to the rescue here. Can you have them? Ploomber provides a solution, which eventually amounts to automated integration tests. What about unit tests? Options are doctest and testbook. Not a perfect solution, you say? Well, I tend to agree, especially because I am an advocate of Test-Driven Development, and these solutions are not handy enough for me. On this last point, I recommend this blog post from Robert C. Martin.
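
For instance, testbook lets you exercise objects defined in a notebook from a plain pytest file (a minimal sketch, assuming the notebook defines a clean_data function; the notebook path is a placeholder):

    from testbook import testbook

    @testbook("analysis.ipynb", execute=True)  # executes the whole notebook first
    def test_clean_data_drops_missing_values(tb):
        clean_data = tb.ref("clean_data")  # reference to the function living in the notebook kernel
        assert clean_data([1, None, 2]) == [1, 2]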

Notice, however, that if you agree with me in “The bad” section, you will not introduce functions/classes, and therefore unit tests won’t really be needed. The integration tests of Ploomber are all you will ever need.

Reproducibility

You take a notebook from someone else, re-run it with supposedly the same input, and obtain different results. Breathe slowly and calm down. How can you avoid ending up here? requirements.txt, .python-version, dvc, and a Dockerfile are your friends (in case you would like me to explain this in more detail, leave me a comment). But there is one issue left: the hidden state. If a notebook is not the result of “restart the kernel and re-run all cells” without any further change, you cannot be sure that the notebook is reproducible. The problem is that re-running some cells can often take hours.
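
To make the environment part concrete (a sketch; versions and file names are placeholders):

    # requirements.txt
    pandas==1.3.5
    scikit-learn==1.0.2

    # .python-version
    3.9.9

And the “restart the kernel and re-run all cells” check can be run headlessly from the command line:

    $ jupyter nbconvert --to notebook --execute --inplace analysis.ipynb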

Here we again need to distinguish between explorative notebooks and script notebooks. In the second case I believe that Ploomber’s approach is to be recommended: automate re-running from scratch with a small data set. In the first case what matters most are the cell outputs, so you want to use the full data set. Here a good strategy is probably to use a small sample while developing and, once you are satisfied with the result, re-run everything from scratch with the production data, commit, and push.
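
In practice, the switch can be a single flag at the top of the notebook (a sketch; file name and sample size are placeholders):

    import pandas as pd

    DEV_MODE = True  # set to False before the final "restart and run all"

    df = pd.read_csv("production_data.csv")
    if DEV_MODE:
        df = df.sample(n=1_000, random_state=42)  # small, reproducible sample for fast iteration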

If you want to learn more tricks on how to work with notebooks, I definitely recommend this article from the Ploomber team (you can find more on their blog) and Matt Wright’s blog.

My takeaways

Notebooks open up unbeatable development speed and readability. Unfortunately, this comes at a price: you should resist the temptation to introduce code abstractions (functions or classes) within a notebook. The code inside notebooks should be flat, like in a script.

We have also discussed why code abstractions should be introduced in proper general-purpose packages. This will

  • make notebooks more readable, because all the problem-specific information will be plainly visible in the notebook (e.g. model name, manually tuned hyper-parameters, etc.)
  • allow the complex portion of the code to be open for refactoring, thanks to proper automated unit tests (hopefully written using Test-Driven Development).

Along the way we have identified two possible use cases for notebooks:

  • explorative notebooks: a one-shot experiment, where what matters most is the output of the cells, as it will guide informed decisions. In this case we had better
    - use nbdiff-web to compare different versions during code reviews
    - if possible, develop the notebook with a small data set or toy problem. Before committing the notebook to git, restart the kernel and run all cells with the full data set or problem.
  • script notebooks: notebooks which are actually intended as scripts with integrated visual capabilities, to be part of a (production?) pipeline. In this case we should use
    - Ploomber, to deal with versioning, testing, and reproducibility, plus more.

In all cases, having requirements.txt, .python-version, dvc-pushed data, and (when necessary) a Dockerfile is a must to ensure reproducibility.

Final remark

My conclusions are the result of taking the importance of readability to its extreme consequences.
Even if you agree, please don’t be orthodox: pragmatism is an important skill!

Practicality beats purity [from the Zen of Python].

“Amazing, even WE can read this notebook!”


Francesco Calcavecchia

Physicist, ML Engineer, Agile adept. I’d rather have a taste of everything than specialize. Eager to learn, unlearn, try out, share, help.