
DS in the Real World

How to refactor a Jupyter notebook

Improve your codebase and become more productive with these techniques

5 min read · May 4, 2020


(TL;DR — this picture summarises this whole article)

In the ML world, code can get messy, quickly.

What starts as an awesome ML model easily becomes a big blob of code that’s hard to understand. Modifying code becomes painful and error-prone, and it becomes increasingly difficult for ML practitioners to evolve their ML solutions to satisfy new {business requirements, feature engineering strategies, data}.

In this article, I will share with you my process of refactoring a Jupyter notebook, and show you how I take it from an unmaintainable state to a {readable, tested, maintainable} state. Once our codebase is comprehensively tested and is easily understandable, it’ll be much easier to extend and evolve our ML solution.

Note: If you’re not convinced of the need to refactor your Jupyter notebooks, check out this article 👻

1. Six preparatory steps that enable refactoring

Refactoring is changing code to make it easier to understand and modify, without changing its observable behaviour. (Paraphrased from Refactoring — Martin Fowler)

Code is not always refactorable.

For a long time, I’ve stared at my Jupyter notebooks and thought about how I could make them better. I know I should be writing tests and refactoring my code. But there are some boulders preventing me from taking my first step: (i) fear that I might break something, (ii) fear that I might delete code that someone else needs, and (iii) the cumbersome mechanics of refactoring in Jupyter notebooks (e.g. try renaming a variable).

After running this refactoring workshop twice, I’ve come to discover six preparatory steps that enable and accelerate refactoring:

1. Run notebook from start to end and ensure everything works.
  • This will save you the needless pain of having to figure out whether we broke something while refactoring or whether the code was already broken.
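  • One way to do this from the command line, assuming your notebook is called mynotebook.ipynb, is: jupyter nbconvert --to notebook --execute mynotebook.ipynb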

2. Make a copy of the original notebook.

  • This is a surprisingly important step. It will free us up from any emotional attachment (“mmm i don’t know, someone might need this chart”) and allow us to ruthlessly clean up any code that’s not essential.
  • By separating these concerns (namely, core data transformations and presentation), we can have two separate things that handle each concern well, instead of one gigantic notebook that simultaneously does both in a disastrous fashion.

3. Convert Jupyter notebook into a plain Python file.

  • This gives you all the benefits of using an IDE (e.g. autocomplete, intellisense, inline documentation, formatting, auto-renaming, linting, keyboard shortcuts, etc.), which makes you more efficient in your refactoring.
  • Command: jupyter nbconvert --to script mynotebook.ipynb

4. Remove print statements, e.g. print(...), df.head(), df.plot(...)

  • This removes noise and visual clutter and makes the next step exponentially easier.

5. Read notebook and list code smells.

  • The list of code smells that you identify becomes the to-do list for your refactoring (see example). It also saves you from having to constantly think about what to refactor next.

6. Define refactoring boundary and add a characterisation test.

  • A characterisation test treats your program as a black box and characterises its behaviour (e.g. my notebook outputs a model that has an accuracy score of 68%) and asserts on that characteristic (i.e. the test fails if we run our code and get a model with an accuracy score less than 68%).
  • This is arguably 👏 the 👏 most 👏 important 👏 step!
  • Having a characterisation test will give you fast feedback because it can be run continuously as you refactor. If you accidentally introduce breaking changes, it will tell you within seconds.
  • Without the characterisation test, you would have to manually restart and rerun the entire Jupyter notebook after every change, which is cumbersome, highly disruptive to your flow, and so 2019.
  • This step is a little hard to explain in written words, so I’ve recorded this demo to show you how to define a refactoring boundary and write a characterisation test to enable your refactoring.
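
To make this concrete, here is a minimal sketch of what such a characterisation test could look like, assuming the notebook’s training pipeline has been wrapped in a hypothetical train_model() function (living in a hypothetical train.py) that returns the trained model and its test-set accuracy, and using the 68% accuracy from the example above as the characteristic to assert on:

# test_characterisation.py: a minimal sketch; train_model() and train.py are
# hypothetical stand-ins for whatever entry point runs your notebook's pipeline
import unittest

from train import train_model


class TestCharacterisation(unittest.TestCase):
    def test_model_accuracy_has_not_regressed(self):
        _, accuracy = train_model()
        # 0.68 is the accuracy observed before refactoring started;
        # the test fails if a refactoring step makes the model worse
        self.assertGreaterEqual(accuracy, 0.68)


if __name__ == "__main__":
    unittest.main()

Because this test treats the whole pipeline as a black box, it keeps passing (or failing) no matter how the internals are reorganised, which is exactly what you want while refactoring.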

2. The refactoring cycle (yay!)


In the refactoring cycle, we incrementally and iteratively improve our code by (i) adding unit tests and (ii) abstracting complex implementation details into modular and readable functions.

The refactoring cycle is as follows:

1. Identify a block of code that can be extracted into a pure function (i.e. a function that returns the exact same output for a given input, no matter when or where it’s run).

2. Write a unit test (see demo, and the sketch after these steps).

  • Run the unit tests in watch mode: e.g. nosetests --with-watch --rednose
  • Write a unit test for the code block

3. Make the test pass.

  • Define a new function, and place it in a new Python module/file (or an existing one if there’s a suitable one in your code)
  • Move existing implementation from notebook into that function
  • Make the failing test pass

4. In the notebook, replace the original code block with a call to the newly defined function.

5. Ensure characterisation tests are still passing.

6. Commit your changes to git.
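
As an illustration, here is a minimal sketch of one pass through this cycle, assuming the notebook contains an inline cell that derives a 'Deck' column from a 'Cabin' column in a Titanic-style dataframe; the column names, the add_deck_feature() function, and the processing module are all hypothetical:

# processing.py: the extracted pure function (steps 1 and 3)
import pandas as pd


def add_deck_feature(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a 'Deck' column from the first letter of 'Cabin'.

    Pure: returns a new dataframe and never mutates its input.
    """
    df = df.copy()
    df["Deck"] = df["Cabin"].str[0].fillna("Unknown")
    return df


# test_processing.py: the unit test for that code block (step 2)
import pandas as pd

from processing import add_deck_feature


def test_add_deck_feature_uses_first_letter_of_cabin():
    df = pd.DataFrame({"Cabin": ["C85", None, "E46"]})

    result = add_deck_feature(df)

    assert list(result["Deck"]) == ["C", "Unknown", "E"]

In the notebook itself, the original cell then shrinks to a single line, df = add_deck_feature(df) (step 4), and the characterisation test from the previous section confirms that the model’s behaviour is unchanged (step 5).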

Now, if you feel that the code that you just extracted could be improved some more, you can refactor it further with the safety net of your new unit test.

Otherwise, you can repeat the refactoring cycle on another code block in the Jupyter notebook, and repeat until the codebase is comprehensively tested and refactored into a readable and maintainable state.

3. What did we get for all that trouble?

In a nutshell, you can go from a messy, hard-to-read notebook to a tested and modular codebase. 🎉🎉🎉

The combination of automated tests and refactoring can go a long way to help us in:

  1. Shortening feedback cycles. With automated tests, you now know immediately when errors and bugs are introduced. If the test coverage is comprehensive, it also gives you the confidence that everything is good to go, and saves you the time of manually testing the entire Jupyter notebook.
  2. Reducing waste. With tests and modular functions, you can reduce the effort spent on (i) reading implementation details that are irrelevant to your task, (ii) holding the entire Jupyter notebook in your head even though you just wanted to change one simple thing, (iii) fixing errors that you accidentally introduced yesterday, and (iv) [insert the thing that made you stare at your notebook for hours on end].
  3. Increasing flow. All of this means you can focus on the task at hand (e.g. integrating a new set of features) and deliver value, instead of on the tedious, wasteful tasks mentioned above 😎

Thank you for reading this far! I hope this has been helpful for you 🚀🚀🚀

To see a hands-on example of how I refactor a Jupyter notebook, check out this demo!

This is part of a tutorial series, Coding Habits for Data Scientists, which aims to help data scientists become more productive by learning good programming habits.

Written by David
Author of Effective ML Teams (O'Reilly)