From notebook hell to container heaven

Kris Peeters · Published in Data Minded · Nov 9, 2021 · 8 min read

This article is the first chapter of a three-part tutorial.

Here’s a bitter pill to swallow in modern Machine Learning. Companies are investing heavily in ML. But most real-life ML applications, while looking shiny (no pun intended) in the frontend, are often little more than a bunch of notebooks duct-taped together: hard to maintain and hard to extend. This slows development teams down and stifles innovation.

We’ve all been there!

💔 Notebooks are often a love / hate affair

There has been much debate online about the pros and cons of notebooks. You can find the most hilarious talk against notebooks here:

https://www.youtube.com/watch?v=7jiPeIFXb6U

But then Jeremy Howard, the creator of Fast.ai, steps in to fight for notebooks here:

https://www.youtube.com/watch?v=9Q6sLbz37gk

In the end, notebooks are just tools. There is no one-size-fits-all, and any tool is typically really good for one purpose and really bad for another.

Where notebooks typically shine…

Here is where we love to use notebooks in our day-to-day work:

  • 💨 Quick iterations and tight feedback loop: Nothing better than just banging out some random Python code in a notebook to test out an idea.
  • 📈 Data Exploration + Early Visualisation: A picture says a thousand words. A notebook combined with tools like Seaborn gives you quick and powerful insights into your data.
  • 🤖 Building Machine Learning models: And yes, even model building is nice in a notebook. You want to try out different ideas and see what works. As long as it fits in a notebook, why would you spin up complex distributed compute jobs?

Where notebooks typically lack

Here are things that are typically a struggle in notebooks:

  • 🤐 Secrets management: Production doesn’t run on your laptop, so you can’t hard-code credentials (or local data paths) the way you do while exploring.
  • 🧱 Modularisation: Build reusable code. And no, importing one notebook in another doesn’t count. ;-)
  • ✅ Testability: I have yet to encounter my first unit test in a notebook.
  • 🔁 Reproducibility: Python version, dependencies, hard-coded paths, environment-specific configuration, order of execution of cells, hidden state, … For a data scientist, reproducibility should always be high on the agenda.
  • 📜 Versioning and Collaboration: Notebooks are JSON files, which are notoriously difficult to diff and review.

When discussing these shortcomings, I often hear the challenge: “Well, but it is actually possible to do [ testing | clean code | modules | versioning | … ] in a notebook.” True, but the fact that it’s not impossible doesn’t mean it’s easy, or pleasant.

An example: Predicting who will survive the Titanic

There is a well-known Kaggle competition where you have to predict who will survive the Titanic based on their age, sex, class, and other characteristics.

The notebook version

One solution can be found here: https://github.com/donnemartin/data-science-ipython-notebooks/blob/master/kaggle/titanic.ipynb. For the purpose of this blog, we won’t zoom in on the actual algorithm used, but we will have a look at the solution notebook, stripped down for simplicity.
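To give a flavour of that kind of code, here is a sketch in the same spirit (illustrative, not the exact cells from the linked notebook):

```python
# A sketch of typical notebook-style code: everything inline,
# hard-coded local paths, no functions, no tests.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("/Users/researcher/Downloads/train.csv")  # hard-coded path
test = pd.read_csv("/Users/researcher/Downloads/test.csv")

# Quick-and-dirty feature preparation
for df in (train, test):
    df["Age"] = df["Age"].fillna(train["Age"].median())
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

features = ["Pclass", "Sex", "Age"]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train[features], train["Survived"])

test["Survived"] = model.predict(test[features])
test[["PassengerId", "Survived"]].to_csv(
    "/Users/researcher/Downloads/predictions.csv", index=False
)
```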

This works great on the researcher’s machine. Which flaws can you detect if you were to industrialise this code? Here’s what we bumped into:

  • Hard-coded local paths that only exist on one machine
  • No pinned dependencies, so the results depend on whatever happens to be installed
  • One long, linear script: no functions or modules to reuse
  • No tests
  • Hidden state: the output depends on the order in which the cells were run

For sure, more can be said about this code. But for the sake of the example, I think we have enough to work with.

Containerizing our use case

Data products benefit from using software development best practices. See our Data Engineering Manifesto webinar for more info. To be deemed production-ready, our code should be:

  • 📜 Versioned
  • 🔁 Reproducible
  • 🧱 Well-organised
  • ✅ Tested

What we aim to achieve:

  • Run the Titanic survival prediction as a container
  • Pass in the input files, the secrets and the output path
  • Get the survival predictions in the output path

Let’s work towards that now 💪. We won’t dive into the specifics of every little step, but you can recreate our solution from this GitHub repository: https://github.com/datamindedbe/webinar-containers

Version control and scaffolding

First, we need to get the code out of the notebook and into a proper Python package. We create a setup.py for that, with all the right details. This might seem intimidating at first, but you get used to it. The point of making a package is to have something that you can reuse wherever you want: on your local machine, in a container, in production, yes, even in a notebook.
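As an illustration, a minimal setup.py could look something like this (the package name, entry point and dependencies are assumptions, not the exact contents of the repo):

```python
# setup.py — minimal packaging sketch; names and dependencies are illustrative
from setuptools import find_packages, setup

setup(
    name="titanic-survival-prediction",
    version="0.1.0",
    description="Predict who survives the Titanic",
    packages=find_packages(),
    install_requires=[
        "pandas",
        "scikit-learn",
    ],
    entry_points={
        "console_scripts": [
            # hypothetical entry point into the package
            "titanic-survival-prediction=titanic.predict:main",
        ],
    },
    python_requires=">=3.9",
)
```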

Next up, we lock our versions:

requirements.txt

Here, we make explicit which versions of which libraries we need. If you also want to make sure that all the transitive dependencies of those libraries are pinned, you can use the pip freeze command. This guarantees that one day, one week, one month or one year later, we can still run the exact same code in the exact same way and get the exact same output. In other words, this is a big step towards reproducible code.
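For illustration, a pinned requirements.txt for a project like this could look as follows (the packages and version numbers are examples, not the exact contents of the repo):

```text
pandas==1.3.4
numpy==1.21.4
scikit-learn==1.0.1
seaborn==0.11.2
```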

We also restructure the code so that the hard-coded paths and the secrets are abstracted away, we create single-purpose methods where applicable, and we add unit tests:

Unit tests
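To give an idea of what such a test looks like, here is a sketch using pytest (the titanic.preprocessing module and the impute_missing_ages function are hypothetical names, not the actual ones from the repo):

```python
# test_preprocessing.py — a sketch of a unit test for the refactored code;
# titanic.preprocessing.impute_missing_ages is a hypothetical function
import pandas as pd

from titanic.preprocessing import impute_missing_ages


def test_impute_missing_ages_fills_all_gaps():
    df = pd.DataFrame({"Age": [22.0, None, 35.0, None]})

    result = impute_missing_ages(df)

    assert result["Age"].isna().sum() == 0
```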

For all the changes, it’s easiest to just browse through the code repository linked above. We’ve come a long way: our code is now versioned, tested, well-organised and reproducible. Now it’s time to prep the code for deployment.

Enter the containers

A brief primer on containers. We definitely don’t aim to be exhaustive here; there are plenty of tutorials available. Which problems do containers solve? In essence, a container packages your code together with its dependencies and its configuration, so it runs the same way everywhere.

Containers are becoming the de-facto standard for industrialising any kind of workload, and Docker is the most well-known implementation of containers. What does this look like in practice?

In short, you write a Dockerfile, from which you build a Docker image. Each instruction in your Dockerfile adds a new layer to the image. Then you can run that image as a container anywhere (laptop, cloud, server, …).

Our Dockerfile looks like this:
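A sketch of what such a Dockerfile can look like, following the steps described below (the environment variables and the entry point are assumptions, not the exact ones from the repo):

```dockerfile
# Start from an official Python base image
FROM python:3.9

# Set the work directory and some environment variables
WORKDIR /app
ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# Upgrade pip
RUN pip install --upgrade pip

# Install the pinned dependencies first (cached as a separate layer)
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the source code and install the package
COPY . .
RUN pip install .

# The command that runs when the container starts (hypothetical entry point)
ENTRYPOINT ["titanic-survival-prediction"]
```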

Again, this might seem intimidating but it is pretty straightforward:

  1. You start from a base image, python:3.9
  2. You set your work directory and a bunch of environment variables
  3. You install and upgrade pip
  4. You install the dependencies listed in requirements.txt
  5. You copy your source code and install it
  6. You set an entry point, the command that will be executed when running the image

All logical, right? So, let’s now build the image from this Dockerfile, with the following “docker build” command:
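From the root of the repository, that looks something like this:

```bash
docker build -t titanic-survival-prediction:v1 .
```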

You will see the layers being built in your terminal. The “-t titanic-survival-prediction:v1” part is a tag, used to label your image for later use.

Now it’s time to run the container on your laptop with the “docker run” command:
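A sketch of what that command can look like (the environment variables, argument names and S3 paths are illustrative, not the exact ones from the repo):

```bash
docker run \
  -e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  -e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  titanic-survival-prediction:v1 \
  --train-path s3://my-bucket/titanic/train.csv \
  --test-path s3://my-bucket/titanic/test.csv \
  --output-path s3://my-bucket/titanic/predictions.csv
```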

Here we reference the tag again. You can see that we pass in all the parameters and all the secrets, so the code can run regardless of where your data lives and without any hard-coded passwords.

Wrap-up: What just happened?

In order to turn a notebook into a production-ready app, we:

  • 📜 versioned our code using git
  • 🔁 pinned our dependencies in requirements.txt
  • 🧱 organised our code to define clear, single-purpose functions, make those functions testable, push the side effects to the outer layers of the application, and expose its configuration
  • ✅ wrote unit tests
  • 📦 declared the whole environment and setup in a Dockerfile

🚀 We ended up with a portable image that we can run anywhere as a container.

As mentioned above, you can check out the source code here: https://github.com/datamindedbe/webinar-containers. Note that this example is simplified for the purpose of clarity. In reality, it’s not always as straightforward.

Next steps?

This blog is actually a written version of a webinar we did a few weeks ago. You can rewatch that webinar here: https://www.dataminded.be/rewatch-our-webinars. This was part 1 of a three-part webinar series.

Now that your code is in a container, our next webinar will dive into how to deploy these containers in the cloud. The third webinar will be about orchestrating many containers to run data products at scale. You can subscribe here: https://www.dataminded.be/webinars

Accelerate your journey with Conveyor

If you like what you see, you should definitely check out our product Conveyor: https://www.dataminded.com/conveyor, which makes it super easy to containerise, deploy and orchestrate all your workloads at scale. As mentioned at the start, we do believe notebooks have a place in the data product lifecycle, and that’s why, launching in Q4 2021, we offer support for notebooks as well.

Many thanks to Pierre Borckmans for preparing the code and doing all the work to make the webinar a success. Life is a pleasure if you get to work with great engineers like Pierre.
