Why I Love MLflow

And You Should Too

John Aven
Hashmap, an NTT DATA Company
6 min read · Nov 5, 2020


Whether you are a data scientist, an ML engineer, a data analytics manager, a compliance officer, or the CEO (or anyone else whose job depends on it, or eventually will), you need to know what MLflow is and why you should be using it.

When a data scientist builds a model, they go through a lot of experimentation; some even intentionally apply the scientific method. But unlike most scientists (or rather, exactly like most scientists), their record-keeping habits are generally shameful. I can tell you this from experience. In any computational science, these bad habits cause problems, especially in industries that must adhere to strict compliance mandates, where they lead to legal exposure, lack of evidence, no management insight, loss of results, and much more.

But let's face it: with today's modern computing solutions, there is no reason to work this way in a computationally focused scientific field (and most scientific fields are computationally focused these days). We have tools, lots of tools, and IMHO (maybe not so much on the H), MLflow is the best tool out there in this space. But why do I love it?

1) Combined Model Management & Experiment Tracking

This first one is a boon; it is the core function that MLflow fills. With small, simple changes to their code, data scientists no longer need to manually track the parameterization of each run in a spreadsheet, by hand in a notebook, or through considerably more arcane approaches.

All of these functions in a manageable and trackable application

MLflow organizes results into experiments, with support for nested runs. With this capability, it is no longer necessary to use the haphazard, error-prone methods of yesteryear: each variant is automatically tagged with a timestamp and a unique ID. While you can manually override this behavior, I would caution against doing so.
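A minimal sketch of what those small code changes look like; the experiment name, parameter values, and metric below are made up for illustration:

```python
import mlflow

# Hypothetical experiment name; MLflow creates it if it doesn't exist.
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("model_type", "random_forest")

    # Nested runs group related variants under the parent run;
    # each gets its own timestamp and unique run ID automatically.
    for n_estimators in (50, 100, 200):
        with mlflow.start_run(run_name=f"n_estimators={n_estimators}", nested=True):
            mlflow.log_param("n_estimators", n_estimators)
            # You would train and evaluate here; this metric value is a stand-in.
            mlflow.log_metric("accuracy", 0.90)
```

Every run above lands in the tracking UI under the same experiment, with its parameters and metrics queryable. No spreadsheet required.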

2) Artifact Storage

Artifacts are things that need to be saved. These can be portions of data with specific needs (labels, predicted results, etc.), images generated as visualizations of the data, and, in some corner cases, models (although models can generally be handled as above, there are cases where this is the better option). Since each experiment is managed independently within MLflow, there is zero need to use some cockamamie naming scheme like

  • _new, _old, _new_old, _old_new
  • _1, _2, _2_1, _3
  • etc.

schemes through which it is impossible to track the history of these solution pieces over time. In addition, we now have a convenient storage mechanism to easily pass data through a machine learning pipeline without explicitly linking the various steps.

What we (as scientists) have been doing for way too long.
Here an experiment has stored all of its stages, and in the modeling stages, the versions of the model trained therein are logged. Naming conventions should be applied to uniquely tag each model, or it should be stored in a run-unique directory; either way, it will be tracked to the experiment with which the model is associated. The same applies to data assets stored as artifacts.
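A rough sketch of how artifact logging replaces those naming schemes; the file paths here are hypothetical:

```python
import mlflow
import matplotlib.pyplot as plt

with mlflow.start_run():
    # Generate and save a visualization, then attach it to this run.
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1])
    fig.savefig("roc_curve.png")
    mlflow.log_artifact("roc_curve.png")

    # Whole directories (e.g. a labeled data split) can be logged too,
    # under a named subfolder of the run's artifact store.
    mlflow.log_artifacts("data/processed", artifact_path="data")
```

Each artifact lives under the run's unique ID, so the "roc_curve.png" from last week and the one from today never collide.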

3) Model Deployment

Deploying models has always been a headache. Sometimes you can use a pricey framework; other times you need an inexpensive alternative. You could build your own solution to manage this, which is often not a great option since it is yet another tool to maintain, but that isn't always escapable. A starting place, or a layer to wrap in a facade, always helps.

MLflow's serialized models can be served from MLflow through its standard REST API or retrieved through its programmatic API. In many situations the REST API is great and can be used as-is, at least internally. For more advanced situations, the programmatic API is excellent and provides a lot of flexibility.
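A quick sketch of both routes; the run ID and feature columns are placeholders:

```python
import mlflow.pyfunc
import pandas as pd

# REST route (run from a shell, not Python):
#   mlflow models serve -m "runs:/<run_id>/model" -p 5000
# then POST JSON records to http://localhost:5000/invocations.

# Programmatic route: load the same serialized model in-process.
model = mlflow.pyfunc.load_model("runs:/<run_id>/model")
predictions = model.predict(pd.DataFrame({"feature_a": [1.0], "feature_b": [2.0]}))
```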

4) Multi-platform & Framework Integration

Now, buying into a tool that doesn't support, and has no plan to support, the other tools and platforms you work with is a hard sell, especially when you can't easily extend it yourself. Fortunately, MLflow supports numerous platforms and will continue to support more. For me, as a consultant and a technologist in general, not only do I need to provide solutions that fit a client's needs, but I have a personal need to keep learning the newest things out there. That requires tools that integrate easily. Some of the supported integrations are:

Languages

  • Python
  • Java
  • R
  • REST (language-independent integration)

Platforms/Tooling

  • Azure ML
  • AWS SageMaker
  • Scikit-Learn
  • TensorFlow
  • PyTorch
  • fast.ai
  • Spark & Databricks (some extras in Databricks)
  • spaCy
  • XGBoost
  • And many more…

While this covers the core integrations, it is not everything that's present. When you need an integration that doesn't exist yet, another cool capability, the plugin API, lets you build it yourself. This comes in handy whenever you need to extend MLflow beyond what comes out of the box. Whether you are a large enterprise or a start-up just getting going, you will ALWAYS find something for which you need custom functionality; a sketch of how a plugin registers itself follows below.
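A plugin is just a Python package that registers itself through setuptools entry points. In this minimal sketch, the package, module, and class names are invented; the "mlflow.tracking_store" entry-point group is the real hook MLflow looks for:

```python
# setup.py for a hypothetical MLflow plugin package.
from setuptools import setup, find_packages

setup(
    name="my-mlflow-plugin",
    packages=find_packages(),
    install_requires=["mlflow"],
    entry_points={
        # Registers a custom tracking store for URIs like "mystore://...".
        "mlflow.tracking_store": "mystore=my_plugin.store:MyTrackingStore",
    },
)
```

Once installed, pointing the tracking URI at the "mystore" scheme routes all logging through the custom store.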

Anything Else?

So, what's left? Most obvious is data: versioning of data and storage of said data. While data can be stored as artifacts, that storage medium should be limited to smaller datasets. Larger datasets can be linked as assets in this way, but the integrations here are restrictive from a data-warehousing perspective (think of extending this for your needs with the plugin API).

MLflow also provides an abstraction for a deployable, reusable component: a project. A project can be packaged as a dockerized solution or with Anaconda, a Python package distribution commonly used in machine learning circles.
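Running a project programmatically is a one-liner; the repository URL, entry point, and parameter below are hypothetical:

```python
import mlflow

# Execute a project (a directory or Git repo containing an MLproject file).
submitted = mlflow.projects.run(
    uri="https://github.com/example/my-ml-project",
    entry_point="main",
    parameters={"alpha": 0.5},
)
print(submitted.run_id)  # the tracked run created for this execution
```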

How Hashmap Can Help

The next step is deciding whether MLflow should be part of your organization's data analytics solution. Hashmap can help you here. Our machine learning and MLOps experts are here to help you on your journey, to bring you and your organization to the next level. Let us help you get ahead of your competition and become truly efficient in your data analytics.

If you’d like assistance along the way, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, data science, MLOps, and various other technology consulting services.

John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
