How to Record Activity in JupyterLab and Amazon SageMaker Studio

And why InfinStor’s MLflow Kernel is the best solution for recording data science activity

Adhitya Vadivel
InfinStor
4 min read · May 18, 2022


A system like MLflow is necessary for keeping track of data science experiments. InfinStor’s MLflow Kernel takes this tracking capability to the next level by integrating it with JupyterLab and Amazon SageMaker Studio.

So how and why is MLflow Kernel the leading solution for recording data science activity today?

Data Science and Lab Science

Although it is a software engineering discipline, data science is quite similar to traditional lab science. Consider a chemistry lab in which a chemist performs a series of experiments, tweaking a few parameters each time, such as the temperature or the pressure.

They may also change the catalysts. Every parameter of the experiment and its corresponding result has to be recorded, along with any other observation the scientist makes. Traditionally, this is done in a lab notebook.

Photo by Fabio Ballasina on Unsplash

Data science is also a highly iterative process. A data scientist may perform an experiment a hundred times, and at every step they tweak something: the code, the parameters, or the structure of the model. They then run the experiment again and record the results.

Every piece of recorded information is knowledge that the data scientist or their organization wants to retain, whether for intellectual property reasons or simply to compare the many experimental models produced along the way.

MLflow Benefits

MLflow provides these capabilities by tracking each experiment run and recording it in the ML infrastructure. This helps data scientists stay organized: when the code and the parameters are constantly being tweaked, it is easy to lose track of what happened in which run.
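
To make that concrete, here is a minimal sketch of plain MLflow tracking; the experiment name, parameter values, and artifact file are hypothetical, and each run simply records the parameters that were tweaked and the results they produced.

```python
import mlflow

mlflow.set_experiment("churn-model")  # hypothetical experiment name

with mlflow.start_run(run_name="baseline"):
    # Parameters tweaked for this iteration
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 200)

    # ... train and evaluate the model here ...

    # Results of this iteration
    mlflow.log_metric("accuracy", 0.93)

    # Any local file (a plot, a report) can be attached as an artifact,
    # assuming the file exists on disk
    mlflow.log_artifact("confusion_matrix.png")
```

The MLflow UI can then list and compare these runs side by side.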

MLflow also has a model registry where users can track every version of a managed model and its lifecycle. When a data scientist trains a new model, the old version is archived while the new model moves to staging and then to production. Each of these steps needs to be recorded.
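
A rough sketch of that registry workflow, assuming a tracking server with the model registry enabled; the registry name "churn-model" and the training code are hypothetical:

```python
import mlflow
from mlflow.tracking import MlflowClient
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Log a trained model so there is something to register.
with mlflow.start_run() as run:
    X, y = load_iris(return_X_y=True)
    mlflow.sklearn.log_model(LogisticRegression(max_iter=200).fit(X, y), "model")

client = MlflowClient()

# Register the logged model under a registry name; each call creates a new version.
mv = mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-model")

# Promote the new version to Staging (and, later, to Production).
client.transition_model_version_stage("churn-model", mv.version, stage="Staging")

# Archive the previous version, if one exists, so only the new lineage moves forward.
previous = int(mv.version) - 1
if previous >= 1:
    client.transition_model_version_stage("churn-model", str(previous), stage="Archived")
```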

MLflow is open source. It has a thriving open source community and is one of the fastest-growing open source projects. When it comes to an enterprise setting, however, there is a need for more large-scale capabilities.

In a large enterprise, hundreds or even thousands of data scientists may be running experiments concurrently against the same MLflow infrastructure, and if that system goes down, productivity goes down with it. It is crucial that, after a disaster such as a data center going offline, the data can be recovered and the service redeployed promptly in another region.

These capabilities are not available in open source MLflow, but they are part of InfinStor’s enterprise MLflow service, where users can also manage the computation of ML pipelines using the InfinStor parallel processing engine.

MLflow Kernel Overview

The MLflow Kernel ensures that every version of a cell’s code is recorded. Users can refer back to the cell output of a previous run and reuse the data and graphs produced in older runs.

MLflow Kernel is implemented as a simple wrapper around an IPython kernel. It automatically starts MLflow runs, tracks cell executions, logs the underlying cell code, and captures any output produced as artifacts of the run.

When the MLflow Kernel is instantiated, it creates a parent run in MLflow. All cell executions are captured as child runs of that parent run, and every output of a cell is recorded as part of its child run.
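
The kernel’s Git repository is the reference for the real implementation, but a minimal sketch of that parent/child structure, expressed with MLflow’s nested runs, looks roughly like this (the experiment name and cell contents are placeholders):

```python
import mlflow

mlflow.set_experiment("notebook-session")  # hypothetical experiment name

# Parent run: one per kernel session.
with mlflow.start_run(run_name="kernel-session"):
    cells = ["x = 1", "print(x + 1)"]  # stand-ins for notebook cells
    for n, cell_code in enumerate(cells, start=1):
        # Child run: one per executed cell, nested under the session run.
        with mlflow.start_run(run_name=f"cell-{n}", nested=True):
            mlflow.log_text(cell_code, f"cell_{n}.py")  # the cell's source code
            exec(cell_code)                             # stand-in for real cell execution
```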

Consequently, the entire kernel session is recorded in one place in the MLflow infrastructure. The kernel also integrates seamlessly with MLflow autologging.
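
Autologging in plain MLflow looks like the snippet below: a framework call such as scikit-learn’s fit() records hyperparameters, metrics, and the trained model without any explicit log_* calls, and the article’s point is that this works alongside the kernel’s own run tracking. The dataset and model here are only illustrative.

```python
import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

mlflow.autolog()  # enable autologging for supported frameworks

# fit() is autologged: hyperparameters, training metrics, and the model itself
# are recorded in an MLflow run without any explicit logging calls.
X, y = load_iris(return_X_y=True)
RandomForestClassifier(n_estimators=50).fit(X, y)
```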

MLflow Kernel is an open source project with an Apache license.

The Git repository includes all the instructions for deployment and integration with InfinStor MLflow, JupyterLab, and Amazon SageMaker Studio.

To get started with MLflow Kernel, users need a free InfinStor account. On the InfinStor home page, users can sign up after providing an AWS S3 location for storing artifacts and the credentials for that location.

InfinStor Free Sign Up Process

MLflow Kernel automatically captures and records anything produced as the output of a cell, including non-image binary files, graphs, and audio files. Normally, when a cell is executed, the artifacts it generates can be logged in MLflow, but anything that is only rendered on the screen is lost.

The MLflow Kernel guarantees that anything rendered on the screen is not lost; it remains available to refer back to later.
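
For comparison, in plain MLflow a rendered plot has to be saved by hand before it disappears; the kernel’s value is that this step happens automatically for every cell output. A small sketch of the manual equivalent, with made-up figure data:

```python
import matplotlib.pyplot as plt
import mlflow

# A plot that would normally just render on screen and disappear.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [0.72, 0.85, 0.93])
ax.set_title("accuracy per iteration")

# Plain MLflow: explicitly save the figure as a run artifact.
with mlflow.start_run():
    mlflow.log_figure(fig, "accuracy.png")
```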

Conclusion

With the help of its parallel processing engine, InfinStor unifies data lifecycle management, compute management, and MLOps to improve the productivity of data scientists.

InfinStor is the leading AI software solution for unstructured data.

For more information on MLflow Kernel, visit us at infinstor.com and follow us on LinkedIn and Twitter.

The content of this article was discussed in InfinStor CTO Jitendra Pandey’s presentation, Automatically Record Data Science Activity with MLflow Kernel.
