Productionizing Jupyter Notebooks with Versatile Data Kit (VDK)

How to use VDK to turn your Jupyter notebooks into scalable and reliable data pipelines

Mr. Ånand
Versatile Data Kit
7 min read · Nov 9, 2023



Introduction

In today's fast-paced digital landscape, there is high demand for streamlined and efficient data management tools and services. With the exponential growth of data-driven development and decision-making in organizations, robust solutions are needed to optimize data pipelines. Jupyter Notebooks have emerged as one of the most popular choices for data exploration and analysis due to their interactive and user-friendly interface. However, as the scale and complexity of data operations grow, the need to move Jupyter Notebooks into production environments easily becomes more pressing.

Enter the Versatile Data Kit (VDK), a framework that simplifies data ingestion and data processing. It is a toolset that lets you run data jobs and provides a comprehensive solution for productionizing Jupyter Notebooks. With its powerful features and capabilities, VDK is a game-changing tool that enables organizations to integrate Jupyter Notebooks into complex data pipelines while ensuring scalability, reproducibility, and enhanced workflow efficiency.

In this blog, we look at productionizing Jupyter Notebooks with VDK, exploring the potential of this open-source toolkit to change the way organizations manage and process their data.

Understanding the Challenges with Jupyter Notebooks

Despite their popularity in data analysis and exploration, Jupyter Notebooks have several inherent limitations when used in production environments. These limitations pose significant hurdles to seamless integration into complex data pipelines. Understanding them is important for companies aiming to use Jupyter Notebooks within their production workflows.

Some of the biggest challenges include:

  • Scalability and Performance: Jupyter Notebooks can struggle to handle large-scale data processing efficiently. When dealing with large datasets or sophisticated computations, the lack of optimized memory management and the sequential nature of execution can cause performance bottlenecks, impeding the seamless expansion of data pipelines.
  • Version control and Collaboration: Version control for Jupyter Notebooks in a collaborative environment can be difficult. Merging changes, managing revisions, and ensuring consistency across versions of a notebook can become time-consuming, potentially resulting in version conflicts and data inconsistencies, especially when several team members work at the same time. Because a notebook is stored as a JSON file, it also carries a lot of irrelevant information, as you can see in the image below.
JSON file without VDK
  • Reproducibility and Environment Management: During the deployment of Jupyter Notebooks in production, there is a significant challenge to ensure the reproducibility of results. Variations in the runtime environment, dependencies, and external libraries can all have an impact on the consistency of results, making it difficult to accurately replicate analyses or experiments, especially when switching across computing environments.
Environment Management
  • Security and Access Control: If proper security measures and access controls are not implemented in a production environment, Jupyter Notebooks can present security risks. Allowing unrestricted access to notebooks or exposing sensitive data might jeopardize data integrity and confidentiality, potentially leading to security breaches and unauthorized data tampering.

A few more challenges exist, such as irrelevant code, poor modularization, and the lack of proper testing methods due to missing libraries and tools.

Solving Problems with VDK

VDK plays an important role in facilitating the seamless integration of Jupyter Notebooks into production pipelines. It is a robust and comprehensive framework designed to simplify the complex process of data ingestion, transformation, and deployment within production pipelines using Python and SQL. Its capabilities for version control, environment management, and scalable data processing enable organizations to overcome the inherent challenges associated with deploying Jupyter Notebooks in production environments. Check the VDK GitHub repo to learn more.

We have discussed some of the challenges with Jupyter Notebooks above; now it's time to see how we can solve them with VDK.

Non-Linear Execution and Hidden State Risks

Notebooks support non-linear code execution, which can create hidden dependencies when cells run out of order. This increases risk when moving to a production environment. The image below illustrates the problem.

To see the problem, focus on the first two objectives: retrieving the data and cleaning it. To retrieve the data, we first import pandas and then load the dataset. After execution, we can inspect the data. Next, we clean it by removing some testuser entries. As you can see, while retrieving and cleaning the data we don't know what exactly was executed; it is hidden. In a production environment, we want to be certain about what will run, with no hidden dependencies or state. Here is the VDK solution: we use the VDK cell tag (see the top right of the image). Tagged cells are numbered, and on deployment they execute strictly in that order: 1, 2, 3. We can then be sure there are no hidden dependencies or states, and only what we see is executed.
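The idea behind ordered, tagged execution can be sketched in plain Python. This is an illustrative model, not VDK's actual API: steps run in a fixed order against explicit shared state, so nothing hidden can leak between them.

```python
# Illustrative model (not VDK's actual API): run "cells" as an
# ordered list of steps over explicit shared state, the way VDK
# executes tagged notebook cells as numbered steps 1, 2, 3.

def retrieve_data(state):
    # Step 1: load raw records (inlined here instead of pandas).
    state["users"] = [
        {"name": "alice", "score": 9},
        {"name": "testuser1", "score": 5},
        {"name": "bob", "score": 3},
    ]

def clean_data(state):
    # Step 2: drop internal test accounts.
    state["users"] = [
        u for u in state["users"] if not u["name"].startswith("testuser")
    ]

steps = [retrieve_data, clean_data]  # fixed order, nothing hidden

state = {}
for step in steps:  # always executes 1, 2, ... on every run
    step(state)

print([u["name"] for u in state["users"]])  # ['alice', 'bob']
```

Because the step list is the single source of truth for execution order, re-running the pipeline always reproduces the same result, which is what the VDK cell tag guarantees for tagged notebook cells.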

Irrelevant Code

Notebooks often accumulate irrelevant code, such as unused statements or unrelated snippets. Snippets and exploratory algorithms that are useful during the experimental stages of development can become problematic in production.

Let’s look at the VDK solution. We classify the data by assigning scores to predefined categories for clarity, as you can see in the image below. After executing the code in cell 9, we get a new column containing the user types.
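As a rough sketch of what such a classification step might look like (the thresholds below are the standard NPS cut-offs, and `classify_nps` is a hypothetical helper; the article's actual checks live in the notebook and helper.py):

```python
# Hypothetical sketch of the classification step. The thresholds are
# the standard NPS cut-offs (0-6 Detractor, 7-8 Passive, 9-10 Promoter).

def classify_nps(score: int) -> str:
    if score <= 6:
        return "Detractor"
    if score <= 8:
        return "Passive"
    return "Promoter"

scores = [9, 3, 10, 6]
user_types = [classify_nps(s) for s in scores]
print(user_types)  # ['Promoter', 'Detractor', 'Promoter', 'Detractor']
```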

To check whether anything unexpected was added to the data, we define methods in a helper.py file. After the visualization, we are sure that only Detractors and Promoters are present and the classification is correct. Now we ingest the organized data using VDK's job_input:

    # sending the cleaned data for ingestion into the "nps_data" table
    job_input.send_tabular_data_for_ingestion(
        df.itertuples(index=False),
        destination_table="nps_data",
        column_names=df.columns.tolist(),
    )

Looking at the whole code, we know for sure that df and visualize_data(df) will not be executed in production. Some code is relevant for development but irrelevant for deployment, so we can't always simply remove it. VDK helps here by maintaining production performance without requiring us to delete that development-only code.

Testing

When we work with Jupyter Notebooks, there are no proper tools and methods for testing them. It is not easy to verify the code you have written, and workarounds are not viable when working in a big team. VDK provides an end-to-end testing solution via the run command, as you can see in the image below.

Running this command executes the VDK cells of the notebook above just as a production environment would. If something fails, it prints the error message and details of where the failure originated, making it easier for us to fix things.
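The behaviour can be illustrated with a small, self-contained Python sketch (not VDK code): an ordered runner that reports which step failed and why, similar in spirit to what the run command prints for a failing notebook job.

```python
# Illustrative only (not VDK code): an ordered runner that reports
# which step failed and why.

def load(state):
    state["rows"] = [1, 2, 3]

def transform(state):
    # Deliberate bug so the runner has something to report.
    state["rows"] = [r / 0 for r in state["rows"]]

def run_steps(steps):
    state = {}
    for number, step in enumerate(steps, start=1):
        try:
            step(state)
        except Exception as exc:
            return f"step {number} ({step.__name__}) failed: {exc}"
    return "all steps succeeded"

print(run_steps([load, transform]))
# step 2 (transform) failed: division by zero
```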

We can also use the testing feature from the create deployment method by checking the "Run data job before deployment" box. It behaves the same way: in case of any failure, it stops the deployment and shows us the error and the exact trace.

Version Control

Version control with notebooks can often become complex due to their JSON-based format, which contains excessive noise. VDK offers a solution through two key features. See the version control after using VDK in the image below.

Firstly, it implements Noise Reduction, a mechanism that cleanses the notebook’s JSON by eliminating non-essential elements, including execution counts and outputs, known to contribute to the “noise” in the data. This helps streamline the version control process by focusing solely on the relevant content. Secondly, VDK ensures Seamless Integration with Git, facilitating the seamless committing of code to Git during deployment in a cleaner state, free from unnecessary metadata. By simplifying version control and minimizing clutter, VDK enables a more efficient and streamlined workflow for managing notebook-based projects.
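The noise-reduction idea can be illustrated with plain Python on a minimal notebook-style JSON structure (illustrative only; VDK performs this cleanup for you during deployment):

```python
import json

# Illustrative sketch of "noise reduction": strip outputs and
# execution counts from a notebook-style JSON structure so version
# control only sees the source.

notebook = {
    "nbformat": 4,
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "source": ["df.head()"],
            "outputs": [{"output_type": "execute_result", "data": {}}],
        },
    ],
}

for cell in notebook["cells"]:
    if cell.get("cell_type") == "code":
        cell["outputs"] = []            # drop rendered outputs
        cell["execution_count"] = None  # drop run counters

print(json.dumps(notebook, indent=1))
```

With outputs and counters gone, two notebooks with the same source produce identical JSON, so diffs and merges show only real code changes.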

If you want to understand what we are doing in a step-wise manner with Jupyter Notebook UI, check this guide.

Conclusion

To sum up, the application of VDK in the productionization of Jupyter Notebooks has numerous benefits. By using the Noise Reduction feature, VDK significantly simplifies the version control process by eliminating unnecessary components that could hinder effective data management. Furthermore, VDK’s seamless integration with Git ensures a tidier and more organized environment for code deployment and collaboration, reducing clutter and simplifying the workflow. Adopting VDK can greatly enhance data pipeline management, enabling users to streamline their operations and improve overall workflow efficiency. It is a valuable resource for those aiming to fully utilize their data-driven projects.

Additional Resources

💡Check Versatile Data Kit GitHub Repo: https://github.com/vmware/versatile-data-kit

💡Check Youtube Video Tutorial: https://youtu.be/U6M6UzsoiqY?si=iBPD6NH4bdKUZGx4

💡Check the Getting Started guide of VDK to learn more: https://github.com/vmware/versatile-data-kit/wiki/Getting-Started

💡Check the VDK in Jupyter Notebook UI guide: "Deploy Data Job through the Jupyter UI" on the vmware/versatile-data-kit GitHub wiki

💡Go through VDK user guides: https://github.com/vmware/versatile-data-kit/wiki#arrow_right-user-guide
