Why Jupyter Is Not My Ideal Notebook
From notebook prototyping to production the right way
Read the original article on Sicara’s blog here.
Jupyter Notebook has been reported as the preferred prototyping tool for data scientists. This post presents a fast path from exploratory data analysis (EDA) to API, without Jupyter.
Jupyter's main features are:
- inline code execution
- easy idea structuring
- nice display of pictures and dataframes
This overall flexibility has made it a preferred tool compared to the more rustic IPython command line. However, it should not be forgotten that it is nothing more than a REPL in which you can navigate efficiently through the history. Thus it is not a production tool.
Indeed, many machine learning developers have experienced the deep pain of refactoring a deep learning notebook into a real algorithm in production (see also Reddit or Stack Overflow).
Keeping a lean mindset, we should strive to reduce waste as much as possible.
Introduction
At Sicara, we build machine-learning-based products for our customers:
- machine learning: the customer comes with a business need and we have to deliver a satisfying algorithm as fast as possible;
- we build products: we need to develop in a production-ready mindset. Algorithms are deployed in the cloud, served and updated with APIs, etc.
First of all, you definitely need a versioning tool, which is a pain with Jupyter (see also Reddit, Reddit again, Quora). You need it not only for your code, but also for your experiments: you must be able to reproduce any result obtained so far with 100% confidence. How often do data scientists come up with results they cannot reproduce?
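One reason notebooks are awkward to version is that an .ipynb file is a JSON document that stores cell outputs and execution counters next to the code. The sketch below illustrates this under stated assumptions: it mimics the nbformat 4 cell layout by hand, the helper names (make_notebook, strip_outputs, serialize) are hypothetical, and the cell content is made up. It shows two runs of the same code producing different files until the outputs are stripped.

```python
import copy
import json


def make_notebook(execution_count, output_text):
    """Build a minimal nbformat-4-style notebook with a single code cell."""
    return {
        "nbformat": 4,
        "nbformat_minor": 4,
        "metadata": {},
        "cells": [
            {
                "cell_type": "code",
                "metadata": {},
                "source": ["df.describe()"],
                "execution_count": execution_count,
                "outputs": [
                    {
                        "output_type": "stream",
                        "name": "stdout",
                        "text": [output_text],
                    }
                ],
            }
        ],
    }


def strip_outputs(notebook):
    """Return a copy with outputs and execution counters cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def serialize(notebook):
    """Serialize deterministically, roughly as the file would be saved."""
    return json.dumps(notebook, indent=1, sort_keys=True)


# The same code cell, executed in two different sessions.
first_run = make_notebook(execution_count=1, output_text="mean 0.50\n")
second_run = make_notebook(execution_count=7, output_text="mean 0.50\n")

# The code is unchanged, yet the saved files differ.
files_differ = serialize(first_run) != serialize(second_run)

# After stripping outputs and counters, both versions are identical.
same_code = serialize(strip_outputs(first_run)) == serialize(strip_outputs(second_run))

print(files_differ, same_code)
```

A plain .py file avoids the problem altogether, since it contains only code and a diff therefore only ever reflects a code change.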
Furthermore, when using notebooks, people often tend to mix three kinds of usage:
- development: defining methods and tools to actually do something;
- debugging/applying: running the piece of code with real data to see what is going on;
- visualization: presenting the results in a clean and reproducible output.
In order to reduce waste, these steps should be clearly defined and separated, so that any one of them can be changed without touching the others. I have come to the conclusion that:
- to produce high-quality, tested code, it is better to use a first-class IDE;
- to debug code, there are visual debugging tools;
- to write reports, I am more comfortable with an expressive markup language (Markdown, reST, LaTeX).
Fortunately, a well-configured IDE can do all of these things. For instance, if you come from the R community, you most likely use RStudio, which allows you to do so:
- native code completion, auto-fix, etc.
- direct visual debugging
- Rmarkdown/knitr/Sweave to generate dynamic and beautiful reports.
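The separation between development, application, and reporting described above can be sketched in plain Python. This is only an illustration, not the workflow used at Sicara: the function names (summarize, run_experiment, render_markdown) and the sample data are hypothetical, and in a real project each part would live in its own module.

```python
import statistics


# Development: a reusable, testable tool with no I/O or display logic.
def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


# Applying: run the tool on concrete data (made-up sample for illustration).
def run_experiment():
    """Apply summarize to a small hard-coded sample."""
    sample = [2, 4, 6, 8]
    return summarize(sample)


# Visualization: render results as a Markdown table, independently of how
# the results were computed.
def render_markdown(results):
    """Format a results dictionary as a Markdown table."""
    lines = ["| metric | value |", "| --- | --- |"]
    for name, value in results.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines)


report = render_markdown(run_experiment())
print(report)
```

With this layout, the report format can change without touching summarize, and summarize can be unit-tested without any data file or display code.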
Develop production-ready code
As soon as you want to run an experiment, i.e. write a method that does something to your data, you should think about its usage, its edge cases, etc. Write it in a separate file, document it, and unit-test it. By doing so, you make sure that:
- your method actually does what you want;
- your code can be safely used somewhere else in your project.
Because you have to organize your tools, this also makes you think about the structure of your pipeline, the things you need, what you are likely to change, etc.
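As a minimal sketch of what "a separate file, documented and unit-tested" can look like, consider a small imputation helper. The function fill_missing_with_median and its tests are hypothetical examples, not taken from the original article. In a project, the function and the tests would sit in two files; they are shown together here so the block is self-contained.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values.

    Raises ValueError when every entry is missing, because no median
    can be computed in that case.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


# Tests written as plain functions, which a runner such as pytest can collect.
def test_replaces_missing_values():
    assert fill_missing_with_median([1, None, 3]) == [1, 2, 3]


def test_leaves_complete_data_untouched():
    assert fill_missing_with_median([4, 5]) == [4, 5]


def test_edge_case_all_missing():
    try:
        fill_missing_with_median([None, None])
    except ValueError:
        return
    raise AssertionError("expected ValueError when all values are missing")
```

The third test is the kind of edge case worth deciding on before the method is reused elsewhere: here the choice is to fail loudly rather than return the input unchanged.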
…