Jupyter Notebook is the Cancer of ML Engineering

Or Hiltch
Published in Skyline AI
Oct 16, 2019

In recent years, as data science has been democratized by the ML open-source community and libraries like SciKit-Learn, TensorFlow, PyTorch and Pandas, Jupyter Notebooks have become the de-facto standard for data science research and development. In fact, Jupyter Notebooks themselves have played an important role in that democratization, making data science more accessible by removing barriers to entry, as we will soon see.

Like the Python language itself (the most common Jupyter kernel, and a very loose language in many respects), Jupyter notebooks are an ideal environment for hacking on data science.

Combining all of the benefits of Jupyter Notebooks, the main value that emerges is that notebooks are an easy way of crafting a data story. For this reason, it's easy to treat notebook code with a "this is just for research" attitude, a behavior encouraged by the lack of tooling and standards around ML engineering and Jupyter Notebooks (for instance, it's extremely rare to see a data scientist running a notebook inside a Docker container just to make sure the external libraries used are dockerized and not coming from their own machine).

One Man’s Heaven is Another Man’s Hell

The problems begin when this story needs to interact with a production application. The fun and easy platform used by the data scientist needs to be integrated into a production-grade data pipeline. That's where nearly all of the benefits of Jupyter become drawbacks, turning the life of the ML engineer into a living hell. If the company does not have an ML engineer, almost every single line of code written in Jupyter can become a bad case of tech debt, slowing the company's time to market and plaguing it with hard-to-debug issues in both the research and the production phase.

By the time you notice this, your entire data science team may be deep into notebooks, with hundreds of them being run sporadically to create some of the data you use to make production decisions, ignoring pretty much every important lesson learned in production engineering and DevOps over the past 20 years.

What is Jupyter Notebook and Why It Has Been So Successful

There are many reasons for Jupyter being so successful. Most of them come down to removing barriers: making things like documentation, data visualization and caching a lot easier, especially for people who don't come from a hardcore software engineering background.

Let’s explore some of them.

Inline Printing of the Output in the EDA Process

Notebooks allow the data scientist to view the results of their code in-line, sometimes without any dependency on other parts of their code. Unlike working in a standard IDE like VSCode or PyCharm, in Jupyter every cell (an executable unit of code) can be run at any time to draw its output right below the code.

This is extremely useful for the Exploratory Data Analysis phase of any data science research process, where the data scientist needs to play around with the data: fetching it, changing its structure, slicing and dicing it, creating aggregations and, in general, learning about correlations and probability distributions to develop intuitions about the nature of the data that needs to be modeled.

For example, this is how you would easily present results in-line in Jupyter. The first cell (labeled In [12]) prints a JSON output (Out[12]), and the second one renders a heatmap based on colored circles to help the data scientist understand the geographical distribution of average Medicare payments in the US.

Want to dissect your data into bins and plot a histogram? (A histogram represents the distribution of the data by forming bins along its range and then drawing bars to show the number of observations that fall in each bin.) One line of code gets you the output right below your code, along with some text describing what you are doing:
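
As a minimal sketch of what that cell might look like (the DataFrame here is a synthetic stand-in for the real payments data, and the column name is hypothetical):

import numpy as np
import pandas as pd

# A synthetic stand-in for the real payments data, only to illustrate the point
df = pd.DataFrame({"average_medicare_payment": np.random.lognormal(9, 0.5, 10_000)})

# One line in its own cell: bin the values and draw the histogram right below the code
df["average_medicare_payment"].hist(bins=50)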

Built-in Cell-Level Caching

Another huge benefit (we will later understand why this could become a great pitfall) of using Jupyter is that Jupyter automatically, behind the scenes, maintains the state of execution of each individual cell.

Caching is hard, especially if your software engineering skills aren’t that top notch. If you are a mathematician trying to get some data to model stuff, the last thing you want is to implement your own caching schemes.

Jupyter solves this magically by caching the result of every cell that is run. For instance, say you have a huge notebook containing hundreds of cells of code, and somewhere near the beginning (earlier in the notebook's "execution flow") there is some heavy operation, like training an ML model or downloading gigabytes of data from a remote server. With Jupyter, you have zero concerns about caching the results of those operations.

Normally, if you were writing plain Python code, you would implement a simple caching scheme or use a library (for instance, by writing the trained model to a storage bucket, or by saving the results of the DB query to a local file).
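
As a minimal sketch of what that might look like in plain Python, assuming a scikit-learn model and a hypothetical local cache path (in practice this could just as well be a storage bucket):

import os

import joblib
from sklearn.linear_model import LinearRegression

MODEL_PATH = "model.joblib"  # hypothetical local cache for the trained model

def get_model(X, y):
    """Train the model once, then reuse the cached result on subsequent runs."""
    if os.path.exists(MODEL_PATH):
        return joblib.load(MODEL_PATH)
    model = LinearRegression().fit(X, y)
    joblib.dump(model, MODEL_PATH)
    return model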

In Jupyter, once you have run a cell, you can simply avoid re-running it and enjoy global-scope caching of its last execution until you decide to clear the cache. That is, if the cell had a line of code like this:

df = client.query(merge('SELECT * FROM MyTable')).to_dataframe()

The `df` variable will hold the data assigned to it until the cache is explicitly cleared or the cell is re-run. Potentially, hundreds of cells below, you could call df to get the value, even if the last time you ran the cell that populates it was months ago.

ML Engineering and Production-Grade Data Pipelines

We've seen that Jupyter notebooks are great for telling data stories. But, unless we are doing pure research, the research is a means to an end: getting valuable insights from our data stories and ML models.

A production-grade ML pipeline needs to be composed of debuggable, reproducible, easy-to-deploy, high-performance code components that can be orchestrated and scheduled. By default, that is everything research code in Jupyter isn't.

Why Is Jupyter’s Automatic Caching So Dangerous

Even if you ignore the fact that cell caching encourages polluting the global scope with cached variables (see the assignment to df above), Jupyter notebook's default caching behavior forces the developer to keep the entire state machine of the notebook in their head.

Which cell is cached with what? Out of 400 cells, can cell #273 run even if cell #50 did not? What if it did run, but read different data? Are cell #200's results immutable? That is, can we re-run it and get the same results, or will a re-run fail or return different results?

There is no easy way of answering any of those questions while developing on a Jupyter Notebook.

You have to rely on your human memory, which isn’t designed to remember state machines by heart, to know which cell could run with/without other cells, which cell is re-runnable, etc. This just doesn’t work.

Basically, what most data scientists do when they are unsure about the state machine is trigger a complete re-run of the entire notebook, which is a waste of time and resources.

State Machines aren’t meant for humans to remember by heart

But worse: what happens when you want to take the research code to production? Either the notebook developers themselves or ML engineering needs to reorganize the research code into production-grade components (remember: debuggable, reproducible, etc.). The different bits of the research code (the parts that fetch data, train the model, validate the results, and so on) need to be orchestrated and run in some kind of DAG form (see Apache Airflow).
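
As a rough sketch, the refactored research steps might end up looking something like the following Airflow DAG (the DAG id, task names and callables are hypothetical placeholders for the notebook's code):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Hypothetical stand-ins for the refactored notebook code
def fetch_data():
    pass

def train_model():
    pass

def validate_results():
    pass

with DAG(dag_id="research_pipeline",
         start_date=datetime(2019, 10, 1),
         schedule_interval="@daily") as dag:
    fetch = PythonOperator(task_id="fetch_data", python_callable=fetch_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    validate = PythonOperator(task_id="validate_results", python_callable=validate_results)

    # Unlike notebook cells, execution order and dependencies are explicit here
    fetch >> train >> validate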

But when you have a "data story" laid out across hundreds of cells, and pretty much the only thing you can rely on (and even this is up to the developer who wrote the notebook) is that the state machine advances in a straight line, "forward" and "chronologically", as you scroll down the notebook... you are in deep trouble.

Local Execution

Jupyter notebook's user-friendly CLI encourages local execution. It's pretty much certain that 99% of data scientists run Jupyter locally, although some of the cloud providers have made early attempts to move notebooks to the cloud (Google Cloud Datalab and AWS SageMaker). Those solutions are still premature and suffer from plenty of their own problems.

There are many issues with running notebooks locally that complicate things for the ML engineer who has to take the notebook's code to production. First of all, the dependencies: the notebook may import libraries that are only installed on the data scientist's computer.

If, by any chance, the data scientist is using a different version of a library than production does (anything from NumPy to TensorFlow), the actual mathematical calculations may end up being different in production than they were in research. This is one of the biggest nightmares of every engineering operation and one of the main reasons for the success of container technologies like Docker, which generally aren't used when working with notebooks, because, well, they are "just for research".

Further, notebook results produced by a local execution aren't reproducible. The notebook does not go through a build process and does not have to comply with CI standards enforced by a build server. It may very well be that the results would be different when run on another team member's machine, just because of some caching differences between the machines.

Of course, the performance of a local machine isn't ideal either, but there are ways around that (for instance, one can connect to external clusters and offload the execution of heavy calculations there).

Standard Libraries and Docs Lowering Coding Standards

There are a lot of practices around research that use great libraries from serious companies like Google to remove friction at the expense of writing good code.

A great example of that is the Google Cloud BigQuery cell magic. Cell magics in Jupyter are code libraries that extend the functionality of Jupyter Notebook cells to do things like automatically save their results to a data structure.

Consider the following example from the official Google Cloud docs, Visualizing BigQuery data in a Jupyter notebook:

%%bigquery total_births
SELECT
source_year AS year,
COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 15

Looks nice, right? Simply hard-code an SQL query in a cell and have it spit the resultset out to a pandas DataFrame called total_births. While it's clear why this is easy to work with, it is actually a very bad coding practice for anything other than pure research.

If your notebook uses this cell magic all around, you cannot dynamically create queries using generic functions. So if you have two complex queries that need to be identical except for a small part, you will find yourself duplicating them.

Without cell magic, the average programmer would write a generic function to return the query with the small change:

def get_births_by(period):
    # build the same query for any period column (e.g. 'year' or 'month')
    return """
        SELECT
          source_{period} AS {period},
          COUNT(is_male) AS birth_count
        FROM `bigquery-public-data.samples.natality`
        GROUP BY {period}
        ORDER BY {period} DESC
        LIMIT 15""".format(period=period)

by_month = get_births_by('month')
by_year = get_births_by('year')

To be able to enjoy this cell magic, most data scientists would create duplicate cells, one for year:

%%bigquery total_births
SELECT
source_year AS year,
COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 15

And one for month:

%%bigquery total_births
SELECT
source_month AS month,
COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY month
ORDER BY month DESC
LIMIT 15

This type of code duplication is obviously very bad, and an ML engineer integrating this sort of research code into a production-grade application would most likely have to refactor the entire notebook into generic functions that avoid duplication, throwing away most of the original code.

The Illusion of Stability

With Jupyter's built-in caching, and the fact that the results are saved after the notebook is executed, it's easy to be fooled into a false sense of stability.

Surely, if I see the results in front of my eyes underneath one of the cells, then re-running the same cell without changing any of my code should produce the same results. Right? Wrong.

If the cell produces a result based on data in a DB, that data has changed, and there is nothing in the code to pin the code to a particular snapshot of the data, the exact same cell may return a different result. And once the cell has been re-run, it's very hard to know what the previous result was, because Jupyter Notebook does nothing to maintain a history of results, and source control solutions like git typically do not store caches either.

Like many examples in this write-up, this isn't an issue unique to Jupyter, but it is a good example of how reproducibility of previous results isn't possible without proper tooling to enforce it, and since Jupyter doesn't supply this tooling, a data scientist without great software engineering skills will likely suffer from it.
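
One common workaround, which Jupyter itself does nothing to encourage, is to pin the research to an explicit, dated snapshot of the data instead of querying the live table every time. A minimal sketch, where the path, query and client are placeholders:

import os

import pandas as pd

SNAPSHOT_PATH = "data/my_table_2019-10-16.parquet"  # hypothetical dated snapshot

def load_snapshot(client, query):
    """Create the snapshot once, then always read the same frozen copy of the data."""
    if os.path.exists(SNAPSHOT_PATH):
        return pd.read_parquet(SNAPSHOT_PATH)
    df = client.query(query).to_dataframe()
    df.to_parquet(SNAPSHOT_PATH)
    return df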

Solutions

So, can we somehow enjoy both worlds? Having Jupyter-style flexibility and caching on one hand, and state-of-the-art coding practices on the other?

It seems the answer is yes, or soon will be: the two most popular IDEs today for all things data science, PyCharm and VSCode, both support Jupyter natively (although the support is still far from perfect).

VSCode Jupyter Integration

Using PyCharm's and VSCode's native support for Jupyter, you can easily integrate source control and leverage productivity features like IntelliSense, real-time error checking (for syntax errors or broken pylint conventions), Git integration, multi-file management and more, giving data scientists a proper way to experiment and work with data efficiently.
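
For example, VSCode's integration lets you keep the code in a plain .py file under regular source control and still run it cell by cell, using `# %%` markers as cell delimiters (PyCharm's scientific mode supports a similar workflow). A small sketch, with a hypothetical file and column name:

# %% Load the data (the CSV path is a stand-in for the real source)
import pandas as pd

df = pd.read_csv("payments.csv")

# %% Explore it interactively; the output is rendered next to the editor
df.describe()

# %% Plot a distribution
df["average_medicare_payment"].hist(bins=50)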

Or Hiltch
Founder, Skyline AI (acquired by JLL). Founder, StreamRail (acquired by ironSource, part of Unity)