How to Easily Transition from Developing with Jupyter Notebooks to Python Scripts
Like most data scientists out there, I started my journey in the data science world processing and visualizing data using Jupyter notebooks. They are easy, intuitive and flexible for the exploratory data analysis (EDA) phase, as we get results fast and are able to understand the data on the spot.
We naturally continue to engineer features and train our models in notebooks, since our initial code is already there and it is a convenient environment for us. Notebooks are great for the EDA phase, for tutorials and for reports. There are also tools that can deploy notebooks straight to production. However, when the code needs to be deployed as Python scripts, starting in notebooks makes the path to production longer. In addition, notebooks do not give us all the benefits of developing in an IDE like PyCharm, which can save a lot of development time and prevent bugs.
So which path should you choose?
This article is relevant for:
- New and experienced data scientists
- Data scientists who deploy models using Python scripts
Why I chose to stop developing with notebooks
- It is hard to debug code in Jupyter notebooks
- Git does not track code changes in notebooks well
- Deploying to production takes a lot of time: you need to transfer your notebook into scripts and modules, create readable functions, add docstrings, tests and more
- Sharing a constantly changing notebook with a team member, and keeping it updated, can be a tedious process
- Notebooks tend to contain a lot of redundant code, because of the need to display the data, look at graphs, and print the data head, shapes or columns
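To make the cleanup work in the list above concrete, here is a minimal sketch of what a single notebook cell might look like once it is moved into a module. The function name fill_missing_with_mean and its behavior are hypothetical, chosen only for illustration; the test uses the test_ naming convention that pytest collects.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the non-missing entries.

    Args:
        values: list of numbers, where None marks a missing value.

    Returns:
        A new list of floats with every None replaced by the mean.

    Raises:
        ValueError: if every entry is missing.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute: all values are missing")
    fill = mean(present)
    return [float(fill) if v is None else float(v) for v in values]


def test_fill_missing_with_mean():
    """Hypothetical unit test for the function above."""
    assert fill_missing_with_mean([1, None, 3]) == [1.0, 2.0, 3.0]
```

Compared with the notebook version, the display and print calls are gone and the logic has a name, a docstring and a test, which is most of what the transfer to production asks for.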
Tips for developing using scripts
A few years ago I decided to take a brave step and begin working on new data projects using Python scripts right from the start of the project. This way I avoided the friction of transferring my code from Jupyter notebooks to Python scripts. With this methodology, my IDE is PyCharm Professional, and most of the features can also be found in the Community edition. My code is divided into scripts and modules, and the project is synced with git right from the start. The following tips will equip you with the right tools to successfully begin developing with the PyCharm IDE.
Must-know keyboard shortcut
To develop easily and fluently with datasets, you need your data to be loaded once to save time, and you need to repeatedly execute each line or code chunk that you experiment with in real time. This behavior, which is essential in notebooks, is also available for Python scripts in the PyCharm IDE. Therefore, the first and most important thing to know about PyCharm is that you can execute your script line by line, or several lines at a time, just like you do in notebooks, using the keyboard shortcut ALT + SHIFT + E (or OPTION + SHIFT + E on Mac).
The selected code will be executed in the Python Console. Familiarize yourself with this shortcut. You will find that it is pretty similar to executing a notebook line by line or cell by cell, but with a more convenient development environment.
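One way to lay out a script for this workflow is sketched below: a loading chunk that you execute once, followed by an experiment chunk that you select and re-run as often as you like. The column names and values are made up for the example, and the inline CSV only stands in for a real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# --- Chunk 1: load the data (select and execute once per console session) ---
# An inline CSV keeps the sketch self-contained; a real script would
# typically pass a file path to pd.read_csv instead.
RAW_CSV = io.StringIO(
    "age,income,region\n"
    "34,52000,north\n"
    "29,,south\n"
    "41,61000,north\n"
)
data = pd.read_csv(RAW_CSV)

# --- Chunk 2: experiment (select and re-execute as often as needed) ---
# `data` stays in the console's memory, so only these lines run again.
income_by_region = data.groupby("region")["income"].mean()
print(income_by_region)
```

Because the console session keeps `data` alive, editing and re-running only the second chunk gives the same fast feedback loop as re-running a single notebook cell.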
Execution with Python Console
The Python Console consists of two windows:
- The left window is the console itself, showing your line by line execution and its outputs. You can also type ad hoc commands in this console, such as viewing the column names of a specific pandas dataframe, or any other information you want to inspect regarding your data and variables.
- The right window is the Variables tab. There you can see all the variables of your current session. The most helpful information is the size and type of each variable. Also, if you want to inspect a dataframe or array, click on the View as DataFrame option to the right of the variable name and you will see its content in the SciView window, under the Data tab.
Pro tip: at the bottom of the Data tab you will find a command row containing the name of your variable. You can type different expressions there to see different views of the data in real time, without assigning the result to another variable name. For example, type data.isnull().sum() to see the number of missing values in each column.
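The same kinds of inspection can be typed straight into the console. The snippet below is a small sketch on a made-up dataframe (the column names and values are assumptions for the example), showing the column names, shape and types mentioned above together with the data.isnull().sum() expression from the tip.

```python
import pandas as pd

# Hypothetical data with a few missing values.
data = pd.DataFrame(
    {
        "age": [34, 29, None, 41],
        "income": [52000.0, None, None, 61000.0],
        "region": ["north", "south", "north", "south"],
    }
)

print(list(data.columns))  # column names
print(data.shape)          # (number of rows, number of columns)
print(data.dtypes)         # type of each column

# Count of missing values per column, as in the Data tab tip.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

On this data, the count is 1 for age, 2 for income and 0 for region.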
Graphs visualization
This is where you really need the Professional edition.
The Community edition shows the graphs you execute in the script, using the IPython package, by opening a new window for each plot, which over time tends to freeze your PyCharm session. The Professional edition, on the other hand, handles this quite well using the Plots tab in SciView.
Git versioning
Committing your changes to git is good practice, and in some work environments you need to do it daily. The more often you commit, the more likely you are to be able to return to a previous version of the code. However, when developing with notebooks, git is not good at separating actual code changes from changes to the output or metadata of a code cell. This issue does not exist when developing with Python scripts.
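To see why this happens, remember that a notebook file is JSON that stores each cell's source next to its outputs and execution counter. The sketch below builds two simplified notebook-like documents with identical source code (the layout is a trimmed approximation of the notebook format, and the helper names are hypothetical) and lists the lines that a line-based diff would report as changed.

```python
import difflib
import json


def make_notebook(execution_count, output_text):
    """Build a minimal notebook-like dict with a single code cell."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": ["print(df.shape)"],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def changed_lines(before, after):
    """Return the added and removed lines of a line-based diff of the JSON."""
    a = json.dumps(before, indent=1, sort_keys=True).splitlines()
    b = json.dumps(after, indent=1, sort_keys=True).splitlines()
    return [
        line
        for line in difflib.unified_diff(a, b, lineterm="", n=0)
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Same source code, but the cell was re-run later on different data.
first_run = make_notebook(execution_count=1, output_text="(100, 5)\n")
second_run = make_notebook(execution_count=7, output_text="(120, 5)\n")

for line in changed_lines(first_run, second_run):
    print(line)
```

Every reported line belongs to the execution counter or the stored output, and none to the code itself. A plain .py file with the same code would show no difference at all.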
PyCharm has a friendly UI for git once you download the git plugin. The project changes are tracked and you can see them under the Git tab, in a window named Local Changes.
Pro tip: in some PyCharm versions the Local Changes window is not visible, and you need to change the settings in order to see it: Settings -> Version Control -> Commit -> Use non-modal commit interface.
Debugging
Debugging with PyCharm is much faster and more intuitive. Place a breakpoint on the line you want to investigate and execute your code in Debug mode. Then you can navigate between the functions under the Frames pane, look at the content of the variables under the Variables pane, execute the next line, and even execute new commands in the debug Console in order to identify the error.
Pro tip: when you don’t know where the error originates, execute the code in Debug mode and it should stop and show you the line that caused the error. If this feature is not working for you, you should configure it using this solution.
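If you want something small to practice on, the sketch below is a toy script with a deliberate failure (the function names and input values are made up). Outside the debugger, the standard traceback module exposes the same chain of calls that the Frames pane lets you click through, ending at the line that raised.

```python
import traceback


def parse_age(raw):
    """Convert a raw string to an int; fails on non-numeric input."""
    return int(raw)


def average_age(raw_ages):
    """Average a list of raw age strings."""
    ages = []
    for raw in raw_ages:
        # A breakpoint on the next line lets you step into parse_age
        # and watch `raw` in the Variables pane.
        ages.append(parse_age(raw))
    return sum(ages) / len(ages)


try:
    average_age(["34", "29", "n/a"])
except ValueError as error:
    # Each entry corresponds to one frame, from the outermost call
    # down to the function in which the exception was raised.
    frames = traceback.extract_tb(error.__traceback__)
    call_chain = [frame.name for frame in frames]
    print(call_chain)
    print(f"error raised in {frames[-1].name}: {error}")
```

Running the same script in Debug mode, with the exception left uncaught, is a quick way to check that the debugger stops on the int(raw) line as described in the tip.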
The trade-off
It is hard for data scientists to make the shift from developing in notebooks to developing in scripts with an IDE. It takes practice and it takes time. However, if you feel that your coding skills need improvement and you want faster delivery to production, you should start developing with scripts.
To sum up
Python notebooks can be helpful in the first steps of your research, but if you keep using them further into development they add overhead in debugging, git versioning, sharing and deployment, because in the end you will have to transfer your code to Python scripts. Developing with scripts in a dedicated IDE right from the start removes that overhead. This way, data scientists get closer to working like their ML engineer teammates and developers in general.
Further reading
- Git repo: the code used in this post and live demos
- Add a virtual environment in PyCharm: https://www.jetbrains.com/help/pycharm/creating-virtual-environment.html#env-requirements
- Connect a remote interpreter to your local PyCharm: https://www.jetbrains.com/help/pycharm/configuring-remote-interpreters-via-ssh.html
or with a Kubernetes-based solution: https://medium.com/bigpanda-engineering/okteto-as-a-sagemaker-alternative-6b97ac82b83d