Power is Nothing Without Control
Don’t break up with Jupyter Notebooks. Just use Kedro too!
This article is suitable for anyone who has found themselves seduced by the ease of working with Jupyter Notebooks. Although it’s aimed at readers who are relatively new to data science, it applies equally to more experienced data scientists and engineers who are considering how to improve their daily workflow.
The convenience of a Jupyter Notebook combined with Kedro’s software best-practice
I’m going to run through the reasons why we fall in and out of love with Jupyter Notebooks, and describe how the open source Kedro framework can help you solve some of the problems that cause headaches and heartaches for so many data professionals. And there’s no breakup involved! You can still use Notebooks alongside Kedro if you can’t let them go entirely.
If you are unfamiliar with Notebooks, or want to learn a little more about them, here’s a brief historical digression, summarised from a brilliant article called 10 reasons why data scientists love Jupyter Notebooks. If you already know all you need (or want) of their history, just jump to the next section, “Why use a Jupyter Notebook?”.
Back in 2011, IPython Notebook was released as a web-based interface to an interactive Python console. It allowed Python developers to manipulate code, text, data plots, and more, on a local web page. The IPython Notebook rapidly became so popular that it was expanded to embrace other programming languages such as R and Julia.
In 2014, the Jupyter project was announced to steward IPython Notebooks for multiple languages under a new name, while IPython continued as a Python shell.
Wikipedia explains the project’s choice of name:
“Project Jupyter’s name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R, and also a homage to Galileo’s notebooks recording the discovery of the moons of Jupiter.”
The Project Jupyter website describes itself as “a non-profit, open-source project… evolved to support interactive data science and scientific computing across all programming languages”. The project has thrived since its inception and, in 2017, received the ACM Software System Award.
It’s fair to say that Jupyter Notebooks have become an essential part of a data scientist’s toolkit.
Why use a Jupyter Notebook?
❤️ It’s understandable why we love Notebooks ❤️
When you start out in data science or data engineering, there’s so much to learn. There are plenty of resources to help you out, including online courses. Most offer example projects, and these are most likely to be in Notebooks because of their simplicity for beginners. They make it particularly easy to pick up code, step through it, plot data, and experiment. It’s almost always simpler to get started in a Notebook than to set up an IDE and work with scripts: you simply take a Notebook and, as long as you have run pip install jupyter, you can get going. Each time you run a Notebook cell, you immediately see the results of the code. What’s not to like?
There’s no denying that, however experienced you may be, Notebooks are useful for experimentation and visualisation when you’re starting out on any project. The honeymoon wears off as you scale up to production or need to collaborate with others. Even then, there are ways of working with Notebooks: for example, the Netflix data team have described how to scale up your Notebook usage. But, for many, the adage that “Power is nothing without control” springs to mind.
Why not use a Jupyter Notebook?
💔 It’s also understandable why we fall out of love with Notebooks 💔
Across the data science web, and even on this blog alone, there are a number of articles that detail the reasons why Jupyter Notebooks become problematic on larger projects. As soon as you reach a point where you need to use software best practice to control your complexity, it becomes clear that Notebooks don’t support those ideals.
As an ex-C programmer, I recognise these as pitfalls of a powerful yet liberal development environment. For example, as your code grows, it becomes difficult to manage within a single Notebook: the volume of code gets tangled. And sometimes a single Notebook isn’t sufficient anyway. Suppose you want to run a number of models together and compare the results. You’ll need a separate Notebook for each. How do you set them up with identical configuration, which you can change consistently as your experiment progresses?
Say you want to test with a new set of data, use a different method of processing your existing data, or input different parameters to your algorithm. You can change your Notebook, but tracking those changes, and rolling them back if you introduce an error, is not straightforward. The Notebook format doesn’t lend itself well to source control, and debugging can be tricky.
Another issue that hits data scientists in teams is environment reproducibility. Suppose you want to collaborate with a colleague and share your Notebook. What if they are using a different version of a particular package, such as sklearn, whose behaviour has changed over time? How do you specify to your colleague what environment to set up so they can run the Notebook just as you do on your machine? How do you automate that setup? Going further: you want to deploy the code to a number of people in the team as part of a demo. How do you manage configuration and updates? Don’t even think about how to make a release and productionise your Notebook!
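One lightweight habit that helps with reproducibility, whatever tools you use, is to pin package versions in a requirements file that travels alongside the Notebook. A minimal sketch (the package versions below are purely illustrative):

```text
# requirements.txt, shared with the Notebook
jupyter==1.0.0
scikit-learn==0.23.2
pandas==1.1.3
```

A colleague can then recreate your environment with pip install -r requirements.txt before opening the Notebook, rather than guessing which versions you had installed.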
Best practice: How to combine a Jupyter Notebook with good engineering
Instead of throwing out your Notebooks completely, why not consider using them for experimentation (e.g. for exploratory data analysis) while structuring your project within a framework?
I previously introduced Kedro in a post on this blog back in 2019, although we have created some simpler ways of learning about Kedro since then over on readthedocs.io.
Kedro was created by QuantumBlack, an advanced analytics firm, and it was open-sourced in 2019 to extend its use to third-party data professionals. It is a development workflow tool that applies software engineering best practices to your data science code so it is reproducible, modular and well-documented. If you use Kedro you can worry less about learning how to write production-ready code (Kedro does the heavy lifting for you) and you’ll standardise the way your team collaborates, allowing you all to work more efficiently.
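To give a flavour of how Kedro applies those practices: configuration lives in YAML files rather than being hard-coded in cells, so every Notebook or node in the project reads the same datasets and parameters. A minimal sketch (the dataset name, file path and parameter are invented for illustration, and the exact dataset type string depends on your Kedro version):

```yaml
# conf/base/catalog.yml: declares where data lives, by name
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv

# conf/base/parameters.yml: experiment parameters, tracked in one place
test_ratio: 0.2
```

Changing the data source or a parameter then means editing one file, not hunting through every Notebook in the project.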
Although there is a learning curve to Kedro because you’ll be working with Python scripts and a command line interface, it begins to pay back rapidly as you reap the benefits of organised code and data. A good time to pick it up is as you start a new project, but you can also transform an existing Notebook to a Kedro project, as a recent video from DataEngineerOne illustrates.
Once you have set up a Kedro project, you can use a Jupyter Notebook to develop your code if you prefer that way of working, and then transform it into a Kedro node: a Python function that the Kedro framework calls as it manages the pipeline. (The terminology is explained further in the Hello World example in the documentation, which you’ll work through as part of setting up your Kedro installation.) There are some neat extensions that allow you to work in a Notebook while developing your Kedro project, and there’s a video showing two tricks to make Jupyter Notebooks more useful with Kedro.
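To make the idea of a node concrete, here is a minimal sketch in plain Python. The function, dataset and parameter names are invented for illustration, not taken from any real project; the framework wiring is shown in comments so the sketch runs without Kedro installed:

```python
# A Kedro "node" is just a plain Python function whose inputs and
# outputs you declare to the framework.

def split_data(rows, test_ratio):
    """Split a list of records into train and test partitions."""
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

# With Kedro installed, the same function is wired into a pipeline,
# and the framework resolves the named inputs from the Data Catalog
# and configuration files:
#
#   from kedro.pipeline import Pipeline, node
#
#   pipeline = Pipeline([
#       node(split_data,
#            inputs=["raw_rows", "params:test_ratio"],
#            outputs=["train_rows", "test_rows"]),
#   ])

# The function itself stays testable outside Kedro:
train, test = split_data(list(range(10)), test_ratio=0.5)
print(len(train), len(test))  # 5 5
```

Because the node is an ordinary function, you can prototype it in a Notebook cell, unit-test it in isolation, and let Kedro handle loading its inputs and saving its outputs.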
In summary
The path of least resistance may be appealing, but a disciplined approach to development pays off in the long term.
While Notebooks help you get up and running fast, your project may run into problems when you try to scale it because of their lack of support for versioning, reproducibility, and modularity.
Kedro is a framework to structure your code that draws on the experiences of a host of data scientists across a range of industrial projects. It is opinionated to keep you aligned with software best practices, but it also allows you the option of using a Jupyter Notebook for experimentation, and eases the transfer of code from the Notebook into the Kedro project.
Don’t just take my word for it, though. Check out the reasons teams are using it and, if you want to find out more, head over to kedro.readthedocs.io or check out the growing list of articles, podcasts and talks about Kedro!
With grateful thanks to colleagues at QuantumBlack Labs for valuable insights as I wrote this piece, and to the entire Kedro team for their willingness to share what they know 💚