Multicloud MLflow

Jitendra Pandey
Apr 12 · 4 min read

MLflow is rapidly becoming the most popular open source technology for managing the lifecycle of ML projects. It has the right abstractions and tools to track and manage your ML projects as well as your models. An MLflow deployment, together with cloud storage and compute infrastructure, forms an ML service platform that helps enterprises fulfill their AI aspirations. In this article I will talk about some of the essential aspects of an ML platform that unlocks the power of MLflow.

Multicloud Computing Platform

AWS, Azure, Google, Oracle and a few others are investing billions of dollars in their public cloud infrastructure. These investments are going into creative pricing models as well as diverse hardware choices. As the demand for GPUs continues to grow, we can expect rapid growth in the choices available to data scientists when picking cloud instances. The computing platform must allow data scientists to run their experiments on the hardware of their choice across clouds. An experiment may consist of multiple stages of processing, and the platform should be able to schedule different stages in different clouds to optimize costs. An MLflow service integrated with such a computing engine must be able to track your experiments across different clouds.
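
As a rough illustration of scheduling each stage on the cheapest suitable hardware, the sketch below picks, for every stage, the lowest-cost instance that meets its GPU and memory needs. Everything in it is hypothetical: the `Instance` and `Stage` types, the cloud and instance names, and especially the prices, which are placeholder numbers and not real quotes from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    cloud: str           # hypothetical cloud label
    name: str            # hypothetical instance type name
    gpus: int
    memory_gb: int
    usd_per_hour: float  # placeholder price, not a real quote


@dataclass(frozen=True)
class Stage:
    name: str
    min_gpus: int
    min_memory_gb: int
    est_hours: float


def cheapest_instance(stage, catalog):
    """Return the lowest-cost instance that satisfies the stage's needs."""
    candidates = [
        inst for inst in catalog
        if inst.gpus >= stage.min_gpus and inst.memory_gb >= stage.min_memory_gb
    ]
    if not candidates:
        raise ValueError(f"no instance fits stage {stage.name!r}")
    return min(candidates, key=lambda inst: inst.usd_per_hour * stage.est_hours)


def plan(stages, catalog):
    """Map each stage name to the instance chosen for it."""
    return {s.name: cheapest_instance(s, catalog) for s in stages}


# Made-up catalog and pipeline, for illustration only.
CATALOG = [
    Instance("cloud-a", "cpu-large", gpus=0, memory_gb=64, usd_per_hour=0.50),
    Instance("cloud-a", "gpu-small", gpus=1, memory_gb=64, usd_per_hour=3.00),
    Instance("cloud-b", "cpu-large", gpus=0, memory_gb=64, usd_per_hour=0.40),
    Instance("cloud-b", "gpu-small", gpus=1, memory_gb=32, usd_per_hour=2.50),
]

STAGES = [
    Stage("preprocess", min_gpus=0, min_memory_gb=32, est_hours=2.0),
    Stage("train", min_gpus=1, min_memory_gb=64, est_hours=5.0),
]

for stage_name, inst in plan(STAGES, CATALOG).items():
    print(stage_name, "->", inst.cloud, inst.name)
```

With this made-up catalog the two stages land on different clouds. A real scheduler would also have to weigh the cost of moving data between clouds, which this sketch ignores.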

Capture Data Snapshots with an ML Run

MLflow tracking provides the history of runs and all the parameters associated with them. It is important that a snapshot of the code as well as the data is tracked along with each run. The code snapshot is often captured as a git hash, which could be logged as an artifact or, even better, recorded automatically by the platform. Data snapshots get tricky unless the storage platform provides snapshots. For cloud-based object stores, a service like InfinStor becomes essential: it automatically captures snapshots and provides the exact snapshot of the data that was used to train a model. Every stage of processing produces data that is fed into the next stage. This intermediate data must also be correctly tracked along with the MLflow run information.
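
A minimal sketch of tying a run to its code and data might look like the following. It does not use InfinStor's snapshot service, whose API is not described here; a SHA-256 content hash of each input file stands in as a simple data fingerprint. The tag names (`code.git_commit`, `data.0.sha256`, and so on) and the helper functions are my own assumptions, and the git hash is logged as a tag, which is only one of several reasonable places to record it.

```python
import hashlib
import subprocess


def current_git_commit():
    """Return the current commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def file_fingerprint(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_tags(data_paths):
    """Build the (hypothetical) tags that tie a run to its code and data."""
    tags = {"code.git_commit": current_git_commit() or "unknown"}
    for i, path in enumerate(data_paths):
        tags[f"data.{i}.path"] = str(path)
        tags[f"data.{i}.sha256"] = file_fingerprint(path)
    return tags


def log_snapshot(data_paths):
    """Start an MLflow run and attach the snapshot tags to it."""
    import mlflow  # imported lazily so the helpers above need only the stdlib

    with mlflow.start_run():
        mlflow.set_tags(snapshot_tags(data_paths))
```

Calling `log_snapshot(["train.csv"])` before training would record which commit and which exact file contents produced the run. Hashing whole files does not scale to large object stores, which is where storage-level snapshots become the better tool.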

Repeatability & Cached Runs

MLflow enables granular tracking of each stage of processing. The advantage of such tracking is the ability to repeat an experiment and reproduce its results. In a multi-stage pipeline, the ability to re-execute from an intermediate stage enables much faster iterations and saves resources. The service should let you execute a stage using the output of older runs of an earlier stage. This capability lets you take advantage of the history of runs that MLflow records. It also gives you an incentive to decompose the ML training or serving program into multiple stages and track each stage in a separate run. It is crucial that this notion of repeatability is built into the platform, which should let one compose multi-stage ML programs and track the stages separately under the umbrella of the same experiment.

Cached output of a prior stage can be reused.
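
One way to picture cached runs is to key each stage execution by its name, parameters, and input fingerprint, and to skip the computation when that key has been seen before. The sketch below is an assumption-laden illustration, not a description of how any particular service does it: `cache_key`, `run_stage`, and the `cache_key` tag are hypothetical names, and the in-memory dictionary stands in for whatever store holds prior outputs. `find_prior_run` shows how the same key could be looked up among earlier MLflow runs, assuming those runs were tagged with it.

```python
import hashlib
import json


def cache_key(stage, params, input_fingerprint):
    """Stable key for one stage execution: same inputs give the same key."""
    payload = json.dumps(
        {"stage": stage, "params": params, "input": input_fingerprint},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_stage(stage, params, input_fingerprint, compute, cache):
    """Return (output, reused); `compute` runs only on a cache miss."""
    key = cache_key(stage, params, input_fingerprint)
    if key in cache:
        return cache[key], True
    output = compute(params)
    cache[key] = output
    return output, False


def find_prior_run(key):
    """Return the run ID of an earlier MLflow run tagged with this key, if any.

    Assumes earlier runs carry a hypothetical `cache_key` tag and that the
    active experiment is the one to search.
    """
    import mlflow  # lazy import: the helpers above need only the stdlib

    runs = mlflow.search_runs(
        filter_string=f"tags.cache_key = '{key}'", max_results=1
    )
    return None if runs.empty else runs.iloc[0]["run_id"]
```

Calling `run_stage` twice with the same stage name, parameters, and input fingerprint executes `compute` once; the second call returns the stored output with `reused` set to `True`.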

Organization of experiments and runs

MLflow allows you to organize experiments and runs in a hierarchy for effective tracking. An experiment has an experiment ID, and each of its runs has its own run ID. However, an experiment may consist of multiple stages of processing, so it might be better to assign a parent ID to each run of the experiment and have each stage of processing refer to this parent run. These experiments may also be run periodically in a continuous learning environment, in which case we want to organize the runs under each launch of the experiment. This organization is essential for tracking the experiment runs effectively. The service must provide easy ways of organizing experiments and their runs, and it must allow multistage execution of an experiment in which each stage is tracked in a separate run.
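
With MLflow's own tracking API, the parent and stage structure can be expressed with nested runs: `mlflow.start_run(nested=True)` opens a child of the currently active run. The sketch below wraps that in a hypothetical `run_pipeline` helper; the helper, the `stage` tag, and the `tracking` parameter (there only so a stand-in can replace `mlflow` in tests) are my assumptions.

```python
def run_pipeline(launch_name, stages, tracking=None):
    """Run stages in order under one parent run, one child run per stage.

    `stages` is a list of (name, function) pairs; each function takes the
    previous stage's output and returns its own. `tracking` defaults to the
    `mlflow` module.
    """
    if tracking is None:
        import mlflow as tracking

    output = None
    # One parent run per launch of the experiment.
    with tracking.start_run(run_name=launch_name):
        for name, stage_fn in stages:
            # nested=True makes this run a child of the active parent run.
            with tracking.start_run(run_name=name, nested=True):
                output = stage_fn(output)
                tracking.set_tag("stage", name)
    return output
```

A periodic job could call `run_pipeline` once per launch, for example with a date in `launch_name`, so that every launch shows up as one parent run with its stages grouped beneath it.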

Right Access Control

An experiment run leaves a rich trail of data for scientists to analyze. This trail is useful for debugging and for keeping the history of runs that produced a model. Data scientists may want to share this data with other team members or across the organization. To understand a model it is often important to understand how it was created, and therefore model sharing necessitates sharing the MLflow run information as well. This sharing requires appropriate security policies and access controls to be in place. Data scientists must be able to carefully control what is visible outside their team and which privileges are exposed. These security policies must also account for the visibility of data sources as well as the intermediate data produced at different stages of processing.
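
To make the requirement concrete, here is a toy model of a per-experiment sharing policy. It is purely illustrative and not the access-control model of MLflow or of any particular service: the privilege levels, the `ExperimentPolicy` class, and the team names in the usage note are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical privilege levels, ordered from least to most powerful.
LEVELS = {"none": 0, "read": 1, "edit": 2, "manage": 3}


@dataclass
class ExperimentPolicy:
    owner_team: str
    # Privilege granted to each other team; unlisted teams get "none".
    grants: dict = field(default_factory=dict)
    # Whether sharing the runs also exposes the intermediate stage data.
    share_intermediate_data: bool = False

    def level_for(self, team):
        """Privilege level a team holds on this experiment."""
        if team == self.owner_team:
            return "manage"
        return self.grants.get(team, "none")

    def allows(self, team, needed):
        """True if the team's level is at least the level needed."""
        return LEVELS[self.level_for(team)] >= LEVELS[needed]

    def can_read_intermediate_data(self, team):
        """Intermediate data is visible to others only if explicitly shared."""
        if team == self.owner_team:
            return True
        return self.share_intermediate_data and self.allows(team, "read")
```

For example, `ExperimentPolicy(owner_team="vision", grants={"search": "read"})` lets the search team view the runs but neither edit them nor see the intermediate data until `share_intermediate_data` is switched on.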

InfinStor Features a True Multicloud MLflow Service

InfinStor’s MLflow service is a true multicloud service. InfinStor’s ML platform includes a compute engine, storage snapshot technology and security infrastructure to enable the capabilities desired in an MLflow deployment, as listed above. InfinStor gives you the flexibility to use the standalone MLflow service, or to use it along with the compute platform to enable multicloud scheduling and serving as well.