ML Ops development: Moving from a tool-based to a concept-centered approach.

Paul Couturier
Published in OVRSEA
Apr 6, 2023 · 7 min read
Picture from Dan Cristian Pădureț on Unsplash

TL;DR: The strong need for dedicated technologies in ML Ops is often the source of misunderstanding about what ML Ops is really about. While the ML stack only represents the field of possibilities, holistic concepts such as monitoring or alerting act as fundamental bricks that drive the development of an entire ML project plan. These concepts bring the right questions without constraining your vision to the ML Ops tools you are considering. By envisioning our ML projects from a concept & need perspective, we are able to deliver great ROI even without state-of-the-art tools.

1. ML concept approach

ML Ops is one of those disciplines where everything changes from one company to another depending on the stack. Focused entirely on putting the model in production and monitoring it afterward, ML Ops practice is fundamentally backed by the hardware and tools that ensure the model's survival. This strong need for dedicated technologies can often lead to a complete misunderstanding of what ML Ops is really about. The job is not to use state-of-the-art tools to handle the model in production, but to deploy and monitor it adequately to retrieve the best ROI from it. The predominance given to tools in ML Ops thus undermines the ROI estimation process of such a model. Some companies tend to use the trendiest technologies without considering the concrete needs of the project.

But thinking simply in terms of needs also seems a bit reductive when it comes to project development. Indeed, the need, while being at the root of the project, is never the right ally when it comes to the project roadmap, and this is where ML concepts come in.

What I mean by ML concepts are ideas that are holistic, not project-dependent, and that drive your ML development plan. They are fundamental bricks that guide you from R&D to production. Contrary to specific technologies, concepts are always relevant no matter the project, and they trigger the right questioning. These questions will guide your development while the project's fundamental needs will provide the answers.

The ML concept is your question and your need will provide the answer.

ML concepts include monitoring, alerting, accuracy metrics, user adoption, and health checks, among others! I tried to give a quick and partial overview in the following schema.

Non-exhaustive schema of some ML concepts and the questions they trigger.

They often come from (bad) experiences or best practices. So, at each postmortem, try to really define which concept you missed before starting the project and what impact that had afterward.

For example, we could imagine a conversation between a Data Scientist (DS) and a Product Manager (PM) such as:

  • DS: How do you want to monitor the ML model?
  • PM: Hmm, I hadn't thought of that… Well, for this model I would need to have ROI information pushed to me every month, as this ML model tends to generate quarterly-based revenue.

And here you will notice that nothing is about tech. We don't talk about notebooks, MLflow, or a BI dashboard with relevant metrics. We don't mention them because the right solution will probably not be among them. We don't think in terms of solutions. On the other hand, the monitoring concept in itself enables the Data Scientist to ask the right question to their PM and avoid a pitfall such as model drift.

As the PM is often closer to the business team, he will be able to specify what he needs. And then, and only then, based on the PM's answer, the Data Scientist will be able to look for the right tool to implement the monitoring.
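To make this concrete, here is a minimal sketch of what that answer could translate into once the tool question is finally raised: a small monthly job that aggregates the model's estimated ROI and pushes it to the PM. The table, column names, database, and Slack webhook are purely illustrative assumptions, not an actual implementation.

```python
# Illustrative sketch only: a monthly job that pushes an ROI summary to the PM.
# The table, columns, database, and webhook URL are hypothetical assumptions.
import os
import sqlite3

import requests


def monthly_roi_summary(db_path: str = "warehouse.db") -> str:
    """Aggregate last month's predictions and their estimated ROI."""
    with sqlite3.connect(db_path) as conn:
        n_predictions, roi_eur = conn.execute(
            """
            SELECT COUNT(*), COALESCE(SUM(estimated_savings_eur), 0)
            FROM model_predictions
            WHERE created_at >= date('now', 'start of month', '-1 month')
              AND created_at <  date('now', 'start of month')
            """
        ).fetchone()
    return f"Last month: {n_predictions} predictions, estimated ROI ≈ {roi_eur:.0f} EUR"


def push_to_pm(message: str) -> None:
    """Post the summary to a Slack channel via an incoming webhook (run monthly, e.g. with cron)."""
    requests.post(os.environ["SLACK_WEBHOOK_URL"], json={"text": message}, timeout=10)


if __name__ == "__main__":
    push_to_pm(monthly_roi_summary())
```

The point is not the code itself: whether this ends up as a cron script, a BI alert, or a managed pipeline only matters once the need has been stated.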

Conceptual approach in ML development

2. Ramp-up on technologies following your need

The tech stack represents the field of possibilities. And obviously, you want to know what you can actually do, compared to what you want to do given your needs. In addition, the tech stack is also valuable in helping you discover new concepts. Indeed, specific tools respond to specialized needs that you might not have anticipated.

So yes, I won't say that you should not use dedicated ML Ops tools at all. However, what pushes you to use them is key. Are you really using a tool based on experience and need, and do you know why it represents value for your business?

In all teams and businesses, you have to ramp up. You start small with a data team of one or two members, and hopefully you will end up with many people and sub-teams specialized in certain ML fields.

By following concepts, you will arrive at the right level of tech maturity needed to solve your case. And probably, as your company grows bigger and bigger, more maturity will be needed. However, there is no added value in targeting the state of the art when you don't need it. Indeed, if your level of technology is too high for your maturity or need, it will certainly:

  • cost you more money
  • require more maintenance

So don't be afraid of not being perfect. Yes, ML Ops is really different from what you learn at university. You don't always have the resources you need or develop in an ideal dev environment. While the data science part is often done in a very controlled environment, ML in production can be dirty and messy. You have to handle the computing resources, the alerting channel, the monitoring possibilities, the feature store… It seems that you will face many issues, and this is where it is tempting to use one state-of-the-art tool per issue and decide that this will "solve" your problem. But it will not! Solving is not about answering perfectly, but about enabling ROI. So, really focus on what matters and mature the technology in phase with your needs.

3. Use case: Health check alerting with Metabase

Let me give you a quick use case where choosing concept over tool was quite decisive in one project at Ovrsea.

We had just put into production a model that was giving our operational staff weekly recommendations. This model was very sensitive, and yet it was only a first draft version. We were working with a product manager to figure out how we should monitor this system.

While checking our needs, we found that the most important success factor for this project was the ability to provide dynamic recommendations, which was the key to maximizing return on investment (ROI). Indeed, the recommendations were only there to challenge the ops team, so we didn't care so much about the actual figure given as a recommendation, but about the fact that the recommendation itself was updated every week.

So we had to make sure that this ML model was working well and was able to push new recommendations every week.

We could have thought of a conventional alerting and monitoring system such as Grafana & Prometheus, Neptune.AI, or Amazon SageMaker alerting for the endpoint. However, we don't use these technologies much for now, and our need for this first version of our health check alerting was very simple. Moreover, this model is based on multiple bricks: a feature store, an endpoint model, and additional input to handle from users. Its architectural complexity is quite high, and adopting state-of-the-art tech would have taken ages. At the same time, we knew that while the complexity was high due to the interfaces between multiple bricks, the bricks themselves were very simple, so errors would be easy to identify if we had a global alert.

That's why we decided to set up an alert in Metabase, which is our BI tool. This alert would warn us if a certain number of recommendations were not predicted and pushed each week. In addition to this alert, we created a very simple dashboard with key BI metrics.

Alert system on Metabase: the goal line represents the minimum number of updated recommendations required per week for the alert not to be triggered. Here we know that our system should push at least 2,400 recommendations per week.

If an alert is raised, it works as a health check: both the PM and the DS receive an email and a Slack message. Then we only need to dig a bit among the different bricks in production to identify the error. In the meantime, this alert also guarantees the project's impact, as the dynamic recommendation update is the main key to unlocking ROI.
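Metabase handles this check natively (a saved question with a goal line, plus an email/Slack alert when the goal is not reached), so no code was needed on our side. Purely for illustration, here is a minimal Python sketch of the logic the alert encodes, assuming a hypothetical recommendations table and the 2,400-per-week threshold shown above.

```python
# Illustrative sketch of the logic behind the Metabase alert: count the
# recommendations updated over the last week and warn when the count falls
# below the weekly minimum. Table and column names are hypothetical.
import sqlite3

WEEKLY_MINIMUM = 2_400  # the "goal line": recommendations expected per week


def updated_recommendations_last_week(conn: sqlite3.Connection) -> int:
    """Count recommendations whose last update happened within the past 7 days."""
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM recommendations
        WHERE updated_at >= date('now', '-7 days')
        """
    ).fetchone()
    return count


def weekly_health_check(conn: sqlite3.Connection) -> bool:
    """Return True when enough fresh recommendations were pushed this week."""
    count = updated_recommendations_last_week(conn)
    if count < WEEKLY_MINIMUM:
        # In our setup, Metabase itself sends the email and Slack message.
        print(f"ALERT: only {count} recommendations updated this week (< {WEEKLY_MINIMUM})")
        return False
    return True
```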

I agree, this is neither a fancy solution nor an ML Ops-specific tool, as we are using our BI tool to monitor an ML model. We also might not have the degree of observability we could obtain with other systems. Yet it was:

  • very easy to put in place
  • enough to cover our fundamental need regarding health-check monitoring and ROI estimation.

We followed the concept and best practices of monitoring, which led us to a modest solution that perfectly fits our requirements.

Conclusion

I think data teams would gain from considering the problem with pen and paper and trying to focus on what really matters through concepts and needs. Once they have that, ML tools will obviously be needed to fulfill the task, but they should not polarize the debate.

However, while I am very much in favor of being pragmatic and ROI-focused on projects, I think that having a healthy routine of tech discovery outside projects is key to staying up to date and knowing what your field of possibilities is. This mostly comes down to having a healthy pedagogy culture in your data team. So focus on ROI in your projects, but also mature outside of them and ramp up with your needs!

Thank you for reading and feel free to reach out! We will be more than happy to discuss and debate ML Ops best practices with other companies!
