The Importance of MLflow Experiment Tracking

Published in InfinStor, Sep 30, 2021

MLflow has a capability called experiment tracking which is essential for data science and building AI applications. Simply put, it allows users to log and query their machine learning experiments.

InfinStor’s enterprise-grade MLflow service includes the experiment tracking component of open-source MLflow, delivered as a highly scalable cloud-native implementation. Let’s explore the concepts of MLflow tracking.

Comparing Data Science

There seems to be much confusion in today’s industry about the overlap between data science and computer science, specifically between building AI applications or models and traditional software engineering.

While both fields indeed use computers and digital processing, the methods in data science are more statistical in nature and are similar to those in experimental laboratories, where chemistry or physics experiments are performed and new discoveries are made.

*The data science aspect of AI is actually more similar to the wet or dry lab experiments that people perform in traditional science than it is to software engineering.*

Version Control

Version control is an important aspect of software engineering, for purposes such as keeping track of bugs. One popular tool for versioning is Git, which excels at tracking differences between software versions.

However, source control systems like Git are not well suited to tracking data science progress, experiments, and runs. Here, too, data science and software engineering differ.

In fact, data science shares yet another similarity with experimental laboratories when it comes to tracking. In the latter’s case, tools such as lab notebooks allow scientists to record details and steps of their experiments and procedures.

For example, the FDA will require drug companies to supply lab notebooks for all the experiments conducted. And for companies inventing new types of materials, battery technologies, or nanotechnologies, the lab notebook is a critical component of every experiment performed.

*Likewise, in data science, the AI applications that track experiments are quite similar to lab notebooks that track lab experiments.*

This is where MLflow tracking comes in. MLflow is a fantastic tool for tracking experiments and it gives you all the facilities that you need to keep a complete record of your own experiment progress.

MLflow Tracking in Action

Consider an XGBoost program that trains a model, together with the two lines of code shown below. Line 1 is added at the beginning of the experiment, and line 2 is added before the work begins.

1. mlflow.xgboost.autolog()
2. mlflow.start_run()

The run is then logged under the XGBoost experiment, along with its parameters and metrics. Any recorded artifacts, such as stored models, are also logged.

Experiment tracking is very important. For data scientists, the ability to track work and stay organized is critical. Data science often involves experiments in unknown territory, such as new types of algorithms and new parameters applied to those algorithms.

It is an iterative process in which the experiments are tweaked constantly. Recording each run automatically in the background lets data scientists go back and review the experiments they have done. Tracking is critical to staying organized as a data scientist.

Collective Knowledge

With experiment tracking, enterprises can build collective knowledge. An enterprise-wide MLflow as a service, such as InfinStor, gives data scientists the ability to look at work that other users have done, subject to permissions and authorization.

Once a user is authorized to view an experiment run, they can go back and look at experiments that others have run on the same data set.

Another benefit for enterprises is that any work done by data scientists who are no longer with the company remains recorded for continuity.

InfinStor efficiently shuts down virtual machines when it detects that they are idle.

There are often regulatory requirements as well. These can come from organizations such as the FDA or, for financial applications, FINRA. Sometimes enterprises may need to prove that they did not deliberately incorporate bias into their models.

A full-blown tracking system is needed to make sure that provenance can be established. This is absolutely critical for the modern AI-driven enterprise.

Patent Protection

Finally, experiment tracking is critical to patent protection for inventions created through data science.

Intellectual property protection has changed, and the age of AI-driven models has arrived. Patents on AI models are not like software patents or patents on other new inventions.

AI models are built using iterative experiments, and in order to prove ownership of the IP, a record of the experiments performed to build the new inventions is necessary. This is another reason why a modern AI-driven enterprise must track all the experimental work done by its data scientists.

Conclusion

It is clear that MLflow experiment tracking is a critical component of AI applications. And it is important to note the differences between tracking systems in data science, software engineering, and laboratories.

InfinStor MLflow provides security and scalability in an enterprise-grade MLflow service.

InfinStor MLflow offers a zero-configuration hosted MLflow, implemented using Lambdas and DynamoDB. It has the full open-source MLflow capabilities, including tracking, projects, and serving.

For more information on MLflow capabilities and InfinStor’s MLflow service, visit us at infinstor.com and follow us on LinkedIn and Twitter.

The content of this article was discussed in InfinStor CEO Jagane Sundar’s presentation, MLflow: An Essential Service for the Modern AI-Driven Enterprise.