Performance Optimization of Jupyter Notebook: A tutorial on how to optimize the performance of Jupyter Notebook, including how to use profiling tools like line_profiler and cProfile, and recommended best practices.

Mohammed · Published in Javarevisited · 5 min read · Jul 21, 2023


Introduction

Jupyter Notebook is a great tool for data analysis, machine learning, and interactive coding. However, as the size of your code and data grows, Jupyter Notebook performance can become an issue.

This tutorial will discuss some techniques to optimize and improve the performance of Jupyter Notebooks.

Note

If you are looking to quickly set up and explore the AI/ML & Python Jupyter Notebook Kit, Techlatest.net provides an out-of-the-box setup for it on AWS, Azure, and GCP. Follow the links below for a step-by-step guide to setting up the kit on the cloud platform of your choice.

For AI/ML KIT: AWS, GCP & Azure.

Why choose the Techlatest.net VM, AI/ML Kit & Python Jupyter Notebook?

  • In-browser editing of code
  • Ability to run and execute code in various programming languages
  • Supports rich media outputs like images, videos, charts, etc.
  • Supports connecting to external data sources
  • Supports collaborative editing by multiple users
  • Simple interface to create and manage notebooks
  • Ability to save and share notebooks

Use profiling tools

Profiling tools can help identify bottlenecks and slow parts of your code.

  • line_profiler: This profiler times your code line by line. You can decorate functions with @profile and run them under kernprof to time them (a short sketch follows this list).
  • cProfile: This is Python's built-in function-level profiler, covered in the next section.
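For example, when line_profiler is used from the command line, the @profile workflow looks roughly like the sketch below; the script and function names are just placeholders for illustration.

# my_script.py
@profile            # injected by kernprof at run time; no import is needed
def slow_function():
    total = 0
    for i in range(10_000):
        total += i * i
    return total

if __name__ == "__main__":
    slow_function()

Running kernprof -l -v my_script.py executes the script under the profiler and prints the line-by-line timings. Inside a notebook, the %lprun magic shown later in this tutorial is usually more convenient.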

cProfile

cProfile is a built-in Python profiler that traces every function call in your program. It provides detailed information about how frequently each function was called and its average execution time. Since it is part of the standard library, it does not need to be installed explicitly. However, it is not well suited to profiling long-running or live workloads, because it traps every single function call and generates a large amount of statistics by default.

import cProfile

cProfile.run('your_function()')

This will generate a profile report showing time spent in each function.

For Example:

import cProfile

def sum_():
    total_sum = 0
    # sum of numbers from 0 to 10000
    for i in range(0, 10001):
        total_sum += i
    return total_sum

cProfile.run('sum_()')

Output

4 function calls in 0.002 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
   ...

As you can see from the output, the cProfile module provides a lot of information about the function’s performance.

  • ncalls = Number of times the function was called
  • tottime = Total time spent in the function itself (excluding calls to sub-functions)
  • percall = tottime divided by ncalls
  • cumtime = Cumulative time spent in this function and all sub-functions
  • percall = cumtime divided by ncalls
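
For larger programs the default report can be overwhelming, so it helps to sort and trim it. One way to do that, using the standard pstats module (the file name here is arbitrary), is:

import cProfile
import pstats

# Write the raw stats to a file instead of printing them
cProfile.run('sum_()', filename='sum_.prof')

# Load the stats, sort by cumulative time, and show only the top 10 entries
stats = pstats.Stats('sum_.prof')
stats.sort_stats('cumulative').print_stats(10)

Alternatively, cProfile.run('sum_()', sort='cumtime') sorts the printed report directly.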

Line Profiler

Line Profiler is a powerful Python module that performs line-by-line profiling of your code. Sometimes, the hotspot in your code may be a single line and it is not easy to locate it from the source code directly. Line Profiler is valuable in identifying how much time is taken by each line to execute and which sections need the most attention for optimization. However, it does not come with the standard Python library and needs to be installed using the following command:

!pip install line_profiler

from line_profiler import LineProfiler

def sum_arrays():
    # creating large arrays
    arr1 = [3] * (5 ** 10)
    arr2 = [4] * (3 ** 11)
    return arr1 + arr2

lp = LineProfiler()
lp.add_function(sum_arrays)
lp.run('sum_arrays()')
lp.print_stats()

Output

Timer unit: 1e-07 s

Total time: 0.0562143 s
File: e:\KDnuggets\Python_Profilers\lineprofiler.py
Function: sum_arrays at line 2

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
...

  • Line # = Line number in the source file
  • Hits = Number of times the line was executed
  • Time = Total time spent executing the line
  • Per Hit = Average time per execution
  • % Time = Percentage of the function's total time spent on the line
  • Line Contents = The actual source code of the line
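
Inside a Jupyter Notebook, line_profiler also provides an IPython extension, so you can get the same line-by-line report without creating a LineProfiler object by hand:

%load_ext line_profiler

%lprun -f sum_arrays sum_arrays()

The -f flag tells %lprun which function to profile while the statement after it is executed.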

Recommended best practices

Optimize slow cells

  • Vectorize operations: Use NumPy vector operations instead of Python for loops where possible (see the sketch after this list).
  • Use Cython or Numba: These can compile Python functions to optimized machine code.
  • Parallelize computations: Use multiprocessing or joblib to parallelize CPU-bound tasks.
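
As a rough illustration of the first two points (this assumes NumPy and the third-party Numba package are installed):

import numpy as np
from numba import njit

# Vectorized: NumPy performs the sum in compiled code instead of a Python loop
data = np.arange(10_000_001)
total = data.sum()

# Numba: @njit compiles this Python loop to machine code on its first call
@njit
def loop_sum(n):
    s = 0
    for i in range(n + 1):
        s += i
    return s

total_jit = loop_sum(10_000_000)

Both approaches typically run far faster than the equivalent pure-Python loop, though the exact speed-up depends on the workload.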

Cache data

  • Cache the results of expensive loading or preprocessing functions, for example with functools.lru_cache or joblib.Memory, so they are not recomputed on every run (see the sketch after this list).
  • Persist intermediate DataFrames to disk (for example with Pandas to_parquet() or to_pickle()) instead of re-reading raw files or re-querying the database each time.
  • Check memory usage with DataFrame.memory_usage(deep=True) or DataFrame.info(memory_usage="deep") to spot oversized objects.
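
A minimal sketch of disk-backed caching with joblib (the load_dataset function and the data.csv path are made-up examples):

import pandas as pd
from joblib import Memory

# Results are stored under ./cache; repeated calls with the same argument are read from disk
memory = Memory(location="./cache", verbose=0)

@memory.cache
def load_dataset(path):
    # hypothetical expensive load/clean step
    df = pd.read_csv(path)
    return df.dropna()

df = load_dataset("data.csv")   # slow the first time, fast on subsequent runs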

Optimize imports

  • Import libraries at the top of the Notebook to avoid repeated imports.
  • Use %load_ext to load extensions once at the top (see the example cell below).
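
A typical first cell might look like this (the extensions shown are just examples and must be installed separately):

# First cell of the notebook: imports and extensions in one place
import numpy as np
import pandas as pd

%load_ext line_profiler     # enables %lprun
%load_ext memory_profiler   # enables %memit and %mprun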

Other tips

  • Use the %%time magic to time cells (see the examples after this list).
  • Use dask for out-of-core and parallel computing.
  • Use the %%prun magic to profile a whole cell with cProfile.
  • Restart the kernel periodically to free memory and clear accumulated state.
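
The timing and profiling magics go on the first line of a cell, so the two examples below belong in separate cells (sum_() is the function defined earlier):

%%time
# report wall-clock and CPU time for the whole cell
result = sum_()

%%prun
# profile every function call made while the cell runs
result = sum_()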

Conclusion

In conclusion, optimizing the performance of Jupyter Notebook is crucial for smooth and efficient data analysis, machine learning, and interactive coding. We discussed several techniques and best practices to achieve this goal.

Profiling Tools

Utilizing profiling tools like `line_profiler` and `cProfile` helps identify bottlenecks and slow parts of the code. These tools provide detailed information about function execution times, helping us focus on areas that need optimization.

Vectorization and Compilation

To optimize slow cells, we should leverage NumPy vector operations instead of Python loops, use Cython or Numba to compile Python functions into optimized machine code, and parallelize CPU-bound tasks using multiprocessing or joblib.

Data Caching

Caching the results of expensive loading and preprocessing steps, for example with functools.lru_cache or joblib.Memory, and persisting intermediate results to disk can significantly speed up data loading, especially when dealing with large datasets.

Optimize Imports

Importing libraries at the top of the notebook and using `%load_ext` to import extensions once at the beginning prevent unnecessary repeated imports, improving notebook performance.

Memory Usage

Regularly check memory usage, for example with DataFrame.memory_usage(deep=True) or the memory_profiler extension, to identify potential memory leaks or high memory-consuming cells.

Other Tips

Utilize Jupyter Notebook magics like `%%time` to time cells, use Dask for out-of-core and parallel computing, and use the `%%prun` magic to profile individual cells. Additionally, restarting the kernel periodically helps free memory and clear accumulated state.

By following these best practices and utilizing profiling tools, we can optimize the performance of Jupyter Notebooks, enabling us to work efficiently with larger datasets and complex code. Always remember that continuous monitoring and optimization are essential as the code and data evolve, ensuring a smooth and productive data analysis experience.
