Exploring InfinStor MLflow
What does it take to build an enterprise MLflow?
How does enterprise MLflow compare to open-source MLflow? Why is InfinStor the leading solution for enterprise MLflow?
As MLflow is critical to the success of an AI-driven enterprise, an enterprise-grade MLflow service is a superior alternative to a locally run MLflow installation.
InfinStor MLflow is a complete MLflow experience with enterprise enhancements that turn open-source MLflow into an enterprise-grade service. Let’s see why this is the case.
Limitations of Open-source MLflow
The open-source MLflow library is a fantastic interface and introduction to MLflow. However, it lacks several capabilities an enterprise needs, such as built-in authentication and authorization, scalability, high availability, and disaster recovery.
By default, if a user installs the open-source MLflow library and starts a run, MLflow records it in a directory on the local machine where the experiment runs. Although this is useful for experimentation, it does not give the user corporate-wide access, and that experiment or run will not be accessible from any machine other than the one originally used.
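A minimal sketch of that difference, using the standard MLflow Python API (the remote tracking URL below is a placeholder, not InfinStor's actual endpoint):

```python
import mlflow

# Default behavior: no tracking URI configured, so MLflow writes run
# data to ./mlruns on the local machine -- invisible to everyone else.
with mlflow.start_run():
    mlflow.log_param("lr", 0.01)
    mlflow.log_metric("accuracy", 0.93)

# Pointing the client at a shared tracking service instead makes the
# same runs visible corporate-wide (the URL is a placeholder).
mlflow.set_tracking_uri("https://mlflow.example.com")
with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.93)
```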
Therefore, open-source MLflow is simply a start. An enterprise MLflow service will prove essential in order to make machine learning a corporate-wide effort.
Architecture of InfinStor MLflow
The architecture of the InfinStor MLflow service illustrates what it takes to build an enterprise MLflow.
In our architecture, Amazon DynamoDB stores all MLflow tracking and model registry information. The artifacts themselves live in Amazon Simple Storage Service (S3) or other cloud object stores, but information about experiments, models, and authorization is stored in DynamoDB. We have designed the DynamoDB schema so that the service scales to extraordinarily high loads.
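InfinStor's actual schema is not public; the sketch below is a hypothetical single-table design that shows why careful keying matters for scale. The table name, key names, and item layout are illustrative assumptions, not InfinStor's schema.

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical single-table layout: the partition key groups every item
# belonging to an experiment, and the sort key distinguishes runs,
# metrics, and tags. Names are illustrative, not InfinStor's schema.
table = boto3.resource("dynamodb").Table("mlflow-tracking")

table.put_item(Item={
    "pk": "EXPERIMENT#42",   # partition key: one experiment
    "sk": "RUN#a1b2c3",      # sort key: one run within it
    "status": "FINISHED",
})

# All runs of an experiment come back from a single indexed Query, with
# no table scan -- the property that lets the design scale.
runs = table.query(
    KeyConditionExpression=Key("pk").eq("EXPERIMENT#42")
    & Key("sk").begins_with("RUN#")
)["Items"]
```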
All of the functionality of our MLflow service is implemented as REST APIs backed by AWS Lambda functions, chosen for their scalability and availability. Because Lambda is serverless, there is no need to keep servers running 24/7.
Contrast this with an MLflow system built on MySQL, which must be sized for the largest possible load. If a team of 100 data scientists running experiments needs the biggest MySQL instance, the company ends up running that instance 24/7, including off-hours. This wastes resources unnecessarily.
The Lambdas are fronted by Amazon API Gateway, which improves availability. The user interface consists of static HTML and JavaScript served from an S3 bucket, with Amazon CloudFront improving the system’s global availability.
When a user enters the system through a browser, the browser loads the static HTML and JavaScript files, which in turn call the REST APIs; InfinStor’s authentication and authorization cover all of these paths. Direct REST API calls, such as a CLI MLflow run that invokes the REST API directly, go through the API gateway without needing the UI.
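As a sketch, a direct authenticated call against the gateway might look like the following. The hostname and bearer token are placeholders; the runs/search endpoint is MLflow's standard REST API.

```python
import requests

# Placeholder gateway hostname and bearer token.
BASE = "https://mlflow-api.example.com"
TOKEN = "eyJ..."  # ID token from the enterprise's auth system

# MLflow's standard REST endpoint for searching runs in an experiment.
resp = requests.post(
    f"{BASE}/api/2.0/mlflow/runs/search",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"experiment_ids": ["42"], "max_results": 10},
)
resp.raise_for_status()
for run in resp.json().get("runs", []):
    print(run["info"]["run_id"], run["info"]["status"])
```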
Our system has been tested with over 1,500 concurrent users performing all kinds of data science activities that push the service to its limit with requests. It scales exceptionally well on very few resources thanks to our DynamoDB schema and Lambdas, and CloudFront is a CDN with world-class availability.
Authentication in InfinStor MLflow
The open-source MLflow library includes no authentication of any kind. InfinStor adds authentication by placing an AWS Cognito authentication system in front of the service. Cognito can federate with an enterprise’s native authentication system, whether that is Active Directory, Azure Active Directory, Google OAuth, Auth0, or a SAML 2.0 system; InfinStor integrates seamlessly with all of them. Our service authenticates the web user interface, the REST APIs, the Python SDK and all other language SDKs, and service-to-service calls.
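As an illustration, here is how a client might obtain a token from a Cognito user pool and hand it to the MLflow Python SDK. The client ID, region, URL, and credentials are placeholders, and InfinStor's exact login flow may differ.

```python
import os
import boto3
import mlflow

# Authenticate against a Cognito user pool (all values are placeholders).
cognito = boto3.client("cognito-idp", region_name="us-east-1")
auth = cognito.initiate_auth(
    ClientId="YOUR_APP_CLIENT_ID",  # placeholder
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "alice@example.com", "PASSWORD": "..."},
)
token = auth["AuthenticationResult"]["IdToken"]

# MLflow's HTTP client sends MLFLOW_TRACKING_TOKEN as an
# "Authorization: Bearer" header on every tracking and registry call.
os.environ["MLFLOW_TRACKING_URI"] = "https://mlflow-api.example.com"
os.environ["MLFLOW_TRACKING_TOKEN"] = token

print(mlflow.search_experiments())  # authenticated request
```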
In machine learning, it is common to run work on systems in the cloud. Our solution for this is ICE, the InfinStor Compute Engine. ICE can distribute a user’s work across multiple nodes and run it ten times faster than a single node, with no need to build custom Kubeflow workflows.
Even a company that already has its own cloud machines or scripts for this purpose still needs service-to-service authentication. ICE handles that authentication, so data scientists can run long training sessions. Whether an experiment runs for 10 hours or a whole day, ICE takes care of all the details of keeping it authenticated.
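ICE automates this, but the underlying mechanism can be sketched with Cognito's refresh-token flow; the client ID below is a placeholder, and this is an assumption about how such a flow could look rather than ICE's internals.

```python
import boto3

# Long training sessions outlive short-lived ID tokens, so a job must
# periodically exchange its long-lived refresh token for a fresh one.
cognito = boto3.client("cognito-idp", region_name="us-east-1")

def fresh_id_token(refresh_token: str) -> str:
    auth = cognito.initiate_auth(
        ClientId="YOUR_APP_CLIENT_ID",  # placeholder
        AuthFlow="REFRESH_TOKEN_AUTH",
        AuthParameters={"REFRESH_TOKEN": refresh_token},
    )
    return auth["AuthenticationResult"]["IdToken"]
```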
Authorization in InfinStor MLflow
InfinStor MLflow protects two resources: the experiment and the model. Experiment space can be private: each data scientist gets their own private namespace for experiments and shares them with nobody, which some of our customers find useful. Data scientists can still share their work by generating a model and registering it with the model registry, and this is where authorization comes into play.
Experiment space can also be shared across the members of a group, which is easy to enable in InfinStor. Multiple data scientists working on one project can share an experiment while keeping their own runs. Data scientists can share runs with each other and with their whole enterprise by publishing models to the model registry.
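With the standard MLflow Python API, this pattern looks like the following; the tracking URL, experiment name, and model name are illustrative.

```python
import mlflow

mlflow.set_tracking_uri("https://mlflow-api.example.com")  # placeholder

# One shared experiment per project; each data scientist's work lands
# as a separate, named run inside it.
mlflow.set_experiment("fraud-detection")  # illustrative name

with mlflow.start_run(run_name="alice-xgboost-baseline"):
    mlflow.log_metric("auc", 0.91)

# Publishing the resulting model to the registry shares it
# enterprise-wide (run ID and model name are illustrative).
# mlflow.register_model("runs:/<run_id>/model", "fraud-detector")
```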
For our authorization system we use RBAC, with three defined roles: reader, editor, and manager. The roles are configurable, and you can set up different capabilities for each role, down to the granularity of individual REST calls.
For example, suppose we want readers to be able to set tags. Normally, a user cannot set tags on a model or a model version without the editor role, but these settings can be changed and customized per requirement.
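InfinStor's actual configuration format is not shown here; the sketch below is a hypothetical role-to-permission mapping, granular to individual REST calls, with the reader role additionally granted tag-setting. The endpoint paths are MLflow's, except where marked hypothetical.

```python
# Hypothetical role configuration; the format is illustrative.
ROLE_PERMISSIONS = {
    "reader": [
        "api/2.0/mlflow/registered-models/get",
        "api/2.0/mlflow/model-versions/get",
        # Customization: grant readers tag-setting, which would
        # normally require the editor role.
        "api/2.0/mlflow/registered-models/set-tag",
        "api/2.0/mlflow/model-versions/set-tag",
    ],
    "editor": [
        # Everything readers have, plus create/update calls such as:
        "api/2.0/mlflow/runs/create",
        "api/2.0/mlflow/model-versions/create",
    ],
    "manager": [
        # Everything editors have, plus permission management.
        "api/2.0/mlflow/permissions/update",  # hypothetical endpoint
    ],
}
```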
When a user creates an experiment, they get the manager role and their group members get editor roles, so that group members can create and add runs in that experiment for collaboration. When a user creates a model, they get the manager role but group members only get reader roles. These are the defaults, and they are configurable.
High Availability in InfinStor MLflow
It is critical for a service that is deployed to all the data scientists in a given enterprise to be up and running at all times, especially when permissions, authorization, and authentication need to be managed.
For example, suppose a group of data scientists comes into work one morning and runs their experiments, only to see the experiments throw Python exceptions because the MLflow service is down. The result is a dramatic loss of productivity.
Those in the DevOps world and in enterprise applications know that critical services like a highly available MLflow service are essential for data scientists to perform their day-to-day work. So how does InfinStor address high availability in MLflow?
High availability means dealing with failures within a single region. The data is stored in DynamoDB, which by default spans three availability zones: every entry a user writes to a DynamoDB table is copied to three different availability zones, and failover is handled automatically. If one of the AZs or data centers goes down, one of the other two takes over and service continues uninterrupted.
The REST API Lambdas are also deployed across three availability zones, so the failure of one or more zones does not interrupt service; the MLflow service remains continuously available. The static JavaScript and HTML files are fronted by a global CloudFront distribution that likewise stays available no matter what.
InfinStor handles all of the complexity of high availability. It is completely automated, and users never have to do anything administratively to deal with availability zone failures.
Disaster Recovery in InfinStor MLflow
Disaster recovery is addressed in our system but there is a manual step that users need to perform in order to make this service available in a second region.
In this architecture, if a complete region goes down, the static HTML and JavaScript files remain available because CloudFront is a globally distributed CDN. If DynamoDB global tables are used, which we highly recommend for users who want disaster recovery, the table is already replicated in the backup region and is available with minimal loss. The last few transactions in the log may not have been transferred yet, but that is a negligible loss given the bandwidth Amazon provides between regions.
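With boto3, adding a replica region to an existing table is a single call; the table name and regions below are examples.

```python
import boto3

# Adding a replica region to an existing table makes it a DynamoDB
# global table. Table name and regions are examples.
ddb = boto3.client("dynamodb", region_name="us-east-1")
ddb.update_table(
    TableName="mlflow-tracking",
    ReplicaUpdates=[{"Create": {"RegionName": "us-west-2"}}],
)
```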
InfinStor supplies the Lambdas themselves as CloudFormation templates. Using those in a disaster recovery process is quick, and in no time the Lambdas are up and running, pointing to the backup database.
Finally, users need to make DNS entries that point to the new API gateway. Once the new DNS entries take effect, users have a functional setup close to how it was just before the region went down, with the service available again within a short period of time.
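That manual DNS step might look like the following Route 53 call; the hosted zone ID, domain name, and API Gateway target are placeholders.

```python
import boto3

# Repoint the service's DNS name at the backup region's API gateway.
# Hosted zone ID, domain name, and target hostname are placeholders.
r53 = boto3.client("route53")
r53.change_resource_record_sets(
    HostedZoneId="Z0000000000000",
    ChangeBatch={
        "Comment": "Fail MLflow API over to the backup region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "mlflow-api.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [
                    {"Value": "abc123.execute-api.us-west-2.amazonaws.com"}
                ],
            },
        }],
    },
)
```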
Conclusion
InfinStor MLflow offers a variety of capabilities not found in open-source MLflow. It is essential for the success of modern AI-driven enterprises.
InfinStor MLflow provides security and scalability in an enterprise-grade MLflow service.
For more information on MLflow capabilities and InfinStor’s MLflow service, visit us at infinstor.com and follow us on LinkedIn and Twitter.
The content of this article was discussed in InfinStor CEO Jagane Sundar’s presentation, MLflow: An Essential Service for the Modern AI-Driven Enterprise.