How MLOps helps keep Machine Learning solutions relevant during challenging times

Nikos Volakis
Datasparq Technology
5 min read · Dec 17, 2020

Now that the end of the COVID-19 pandemic is in sight, we can safely say that this period of uncertainty and volatility has brought new challenges for organisations and teams managing data science projects and, more specifically, predictive models.

Having delivered many Machine Learning solutions for our clients, we understand how tricky it is to manage and monitor a model. The current turbulent times highlight this issue even further: production models are degrading at an accelerated rate, heavily affecting the business-critical processes that rely on their output.

An organisation’s ability to manage model risk and monitor its models is no longer a “nice to have” exercise for the data science and IT teams overseeing these models. It has become a vital, collective process that also includes the business stakeholders who rely on the models.

MLOps

Model lifecycle management is commonly known as MLOps, an extension of the well-known DevOps practices used in traditional software development. MLOps includes frameworks, technologies and best practices that provide a scalable and well-governed way to quickly deploy ML solutions and manage their lifecycle in production.

Let’s look at the areas an organisation needs to focus on to achieve robust MLOps:

  • Deployment: This is where the data science (DS) and IT teams converge.
  • Monitoring: Here we make sure we have configured all the monitoring needed to assess model performance and quality over time.
  • ML lifecycle: Here we make sure we have tools and techniques in place that allow us to manage the models during their lifetime.
  • Governance: Here we establish the framework for setting rules and controls for ML models in production.

Investing in MLOps makes organisations, and their ML solutions in production, more resilient to volatile external events such as rapid changes in the market landscape, regulatory changes, and unforeseen shocks like the COVID-19 pandemic. The productivity gains for a data science team are also significant: Scientists and Engineers can focus on tasks that add value to the model rather than firefighting a system held together with duct tape. This is because MLOps seeks to establish patterns and automate many of the manual processes involved in the DS solution lifecycle, which minimises the risks associated with managing models in production.

‘Let your DS teams focus on what matters most’

Data Drift

One of the most common issues to tackle with MLOps is data drift. Data drift occurs when training data and production data diverge over time and, as a consequence, the model loses predictive power. It is crucial to detect this pattern and correct it before it disrupts the production model’s performance.

During the peak of the pandemic we observed extreme situations. For example, one of our clients uses a machine learning solution in production to predict their customers’ propensity to pay any outstanding sums of money that they owe. The client’s cash collection process has a significant cost associated with it, so it was vital for our client to correctly identify those customers with a low propensity to pay so that they could avoid the long and costly cash collection activity, saving money with no impact on revenue.

But here is the catch: a model in production that was making them money in March 2020 could lose them money in April 2020 by misclassifying customers. The data drift was significant. In the next section we will see how we can mitigate the risk.

Data Drift and unmanaged models can “kill” your ML solutions and cause serious damage to your organisation

Monitor and Retrain

Robust monitoring, continuous testing and the ability to quickly retrain and iterate on different versions of the model in production help us mitigate this risk.

DataSparQ has its own testing and monitoring suite, called Xu, to address data drift and other issues. Among other checks, Xu implements a Kolmogorov–Smirnov test, which gives a statistical measure of the distance between the distributions of two datasets. This distance is measured for each new version of the data and displayed in the Xu dashboard, where drift over time can be seen visually.
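
To make the idea of a distance between two distributions concrete, here is a minimal sketch of the two-sample Kolmogorov–Smirnov statistic in plain Python. It is not Xu's implementation, which we are not showing here; the function name `ks_statistic` and the sample values are made up for illustration. The statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples, so it runs from 0 (the samples look identical) to 1 (they do not overlap at all).

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref = sorted(reference)
    cur = sorted(current)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        max_gap = max(max_gap, abs(cdf_ref - cdf_cur))
    return max_gap


# Hypothetical values of one feature: the training snapshot and two production batches.
training = [10, 12, 11, 13, 12, 11, 10, 12]
stable_batch = [11, 12, 10, 13, 12, 11, 12, 10]
shifted_batch = [18, 20, 19, 21, 20, 19, 18, 20]

print(ks_statistic(training, stable_batch))   # 0.0
print(ks_statistic(training, shifted_batch))  # 1.0
```

In practice the statistic would be computed per feature and per data version, which is what makes it possible to plot drift over time.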

When data drift is detected we can mitigate it by retraining our models on the newer data, or by re-engineering the solution’s features to reflect recent changes before retraining. We then end up with a newer version of the model that can be used for as long as necessary before reverting to the previous version, if and when the external environment normalises.
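
The retrain-then-revert pattern relies on keeping every model version available. The sketch below shows one way this could look, under assumptions of our own: `ModelRegistry` and `handle_drift` are hypothetical names, models are represented by plain strings, and the drift score and threshold are arbitrary numbers. It is an illustration of the idea rather than a description of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Keeps every trained version so that an earlier one can be reinstated."""
    versions: list = field(default_factory=list)
    active: int = -1  # index of the deployed version; -1 means none yet

    def register(self, model):
        """Store a new version and make it the active one."""
        self.versions.append(model)
        self.active = len(self.versions) - 1
        return self.active

    def activate(self, version):
        """Switch the active version, for example to revert to an older model."""
        if not 0 <= version < len(self.versions):
            raise ValueError(f"unknown model version: {version}")
        self.active = version

    def current(self):
        if self.active < 0:
            raise LookupError("no model has been registered yet")
        return self.versions[self.active]


def handle_drift(registry, drift_score, threshold, retrain, recent_data):
    """Retrain on recent data and promote the result if drift exceeds the threshold."""
    if drift_score > threshold:
        return registry.register(retrain(recent_data))
    return registry.active


registry = ModelRegistry()
baseline = registry.register("model trained on pre-shift data")

# Drift above the threshold triggers a retrain on the newer data.
retrained = handle_drift(
    registry,
    drift_score=0.45,
    threshold=0.2,
    retrain=lambda rows: f"model retrained on {len(rows)} recent rows",
    recent_data=[1, 2, 3],
)
print(registry.current())  # model retrained on 3 recent rows

# Once conditions normalise, the earlier version can be reinstated.
registry.activate(baseline)
print(registry.current())  # model trained on pre-shift data
```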

Apart from data drift and model degradation, organisations also have to tackle all the traditional service-related issues that can disrupt ML pipelines, such as run-time errors, data errors, system outages and scalability issues. This is where MLOps blends with traditional DevOps. As in DevOps, MLOps toolsets have to alert stakeholders about important model behaviours or drift. With DataSparQ’s Xu testing and monitoring suite, developers can set thresholds for the acceptable level of drift and set up automated alerts about the system’s stability to notify the team of any anomalies.
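
As a rough illustration of threshold-based alerting, the sketch below compares per-feature drift scores against an acceptable level and builds a message for each breach. The function name `check_drift`, the feature names and the scores are all hypothetical, and this is not the Xu API; in a real setup the messages would be routed to whatever channel the team already uses for DevOps alerts.

```python
def check_drift(scores, threshold):
    """Return one alert message per feature whose drift score exceeds the threshold."""
    alerts = []
    for feature, score in sorted(scores.items()):
        if score > threshold:
            alerts.append(
                f"drift alert: {feature} score {score:.2f} exceeds {threshold:.2f}"
            )
    return alerts


# Hypothetical drift scores for the latest data version.
latest_scores = {"balance": 0.35, "days_overdue": 0.05}

for message in check_drift(latest_scores, threshold=0.2):
    print(message)  # drift alert: balance score 0.35 exceeds 0.20
```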

Processes involved in the lifetime of an ML solution in production

Here are some of the processes and actions teams need to make sure they have in place when deploying ML solutions in production:

Final Thoughts

Running ML applications effectively means more than crunching numbers. It is crucial to plan for production-level solutions, so that your IT teams know how to approach this new capability and your data science teams can do what they do best without worrying about how their models behave “in the wild”. Planning ahead for MLOps ensures not only that your organisation stays ahead of the ML curve, but also that adoption is smooth and impactful and that your solutions can react and adapt quickly to change.
