Monitoring Machine Learning Models in Production

Aditya Kumar · Published in tech-that-works · Feb 23, 2021 · 5 min read

After deploying many ML models to production, it became evident that there should be an easy and efficient way to monitor them once they are live. This blog post focuses on monitoring classification models in production.

Recently, I was working on a text classification problem that classifies text into one of ~50 categories. Once the model is built and tested, it needs to be deployed as a Flask API along with other models. Some text classification models are already deployed as APIs that use Python Flask to serve incoming requests, with Gunicorn as the WSGI server; they run on Kubernetes clusters, and the trained models are stored in S3. So the current architecture looks something like this, and a newly trained model needs to be deployed into this kind of setup.

Classification Model APIs
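To make the setup concrete, a stripped-down version of one such API is sketched below. It is only illustrative: the bucket name, object key, pickle-based loading, and the sklearn-style predict call are my assumptions, not the actual code behind the deployed services.

# Minimal sketch of one classification API in this setup (illustrative only).
# Assumptions: boto3 for S3 access, a pickled sklearn-style model, and
# Gunicorn pointing at `app`, e.g. `gunicorn -w 4 app:app`.
import pickle

import boto3
from flask import Flask, jsonify, request

app = Flask(__name__)

# Download the trained model from S3 once at startup (bucket/key are placeholders).
s3 = boto3.client("s3")
s3.download_file("models-bucket", "text-classifier/model.pkl", "/tmp/model.pkl")
with open("/tmp/model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/classify", methods=["POST"])
def classify():
    text = request.get_json()["text"]
    category = model.predict([text])[0]  # one of the ~50 categories
    return jsonify({"category": category})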

To prepare the training data, we started with keywords related to each category to tag the data, generated new keywords from the existing categories, manually tagged data, and so on. We handled cases where a text might fall into any of several closely related categories, and also used already deployed classifiers to tag data for the categories they were trained on. Since the training data is in the range of a few million entries, random manual checks on the labels were done to ensure that the entries tagged using the above methods are good enough.

Then a model is trained on the training data and tested on a held-out set to measure the classifier's performance; if the performance on the test data is acceptable, the model is deployed in production as an API. Now that the model is deployed, why is there a need to monitor it?

Need for monitoring

You have already put the classification model in production, and it performed very well on your test dataset. But,

  • How do you know that the model is performing well on the new data?
  • How do you know it is time to retrain the model?
  • How do you know the effect of data drift and concept drift on the model?

One possible way to measure the classifier's performance is to take all incoming classification requests for a certain duration, manually label the data, and compare the labels with the model's predictions, repeating the exercise periodically to gauge the model's performance over time. This is essentially the same thing we did while training the model, and it requires a lot of manual effort and ongoing attention, which becomes very cumbersome when you have many classification models running in production.

There is a need for something simple, quick, and yet very intuitive that gives an idea of how the models are performing in production. Also, in the current setup, how the models are deployed and used by downstream applications is pretty stable, so I didn't want to invest too much time in evaluating the available ML tools that provide this functionality out of the box.

Prediction Distribution of Model

After some research, having a dashboard that displays a plot of the prediction distribution over incoming requests seemed very intuitive to me, and it also answers some questions, like:

  • How is the model performing against each category?
  • Does the prediction distribution follow a similar pattern to the training data?
  • Is the model biased towards any category, i.e., is it predicting some class very often?
  • Is the model failing to predict any category?
  • Is there a need to retrain the model?
Distribution of predicted classes
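The tracking behind this plot can stay very lightweight. Here is one possible sketch of it; the JSON file used as the store and the record_prediction / load_prediction_distribution helpers are my placeholders, and with multiple Gunicorn workers a shared store like Redis or a database is the safer choice.

# Sketch: count predicted categories so the monitoring page can plot them.
import json
from collections import Counter
from pathlib import Path

COUNTS_FILE = Path("/tmp/prediction_counts.json")  # placeholder store

def record_prediction(category: str) -> None:
    """Increment the counter for a predicted category."""
    counts = Counter(json.loads(COUNTS_FILE.read_text())) if COUNTS_FILE.exists() else Counter()
    counts[category] += 1
    COUNTS_FILE.write_text(json.dumps(counts))

def load_prediction_distribution() -> dict:
    """Return {category: count} for plotting the distribution above."""
    return json.loads(COUNTS_FILE.read_text()) if COUNTS_FILE.exists() else {}

Calling record_prediction(category) just before returning from the classification endpoint is enough to build the distribution shown above.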

What else can be tracked?

The new model classifies text into one of 52 categories and uses BERT-base-cased, so to deploy it in production and staging we had to increase the resources significantly compared to the previous models so that it can run smoothly on CPU.

Generally, when it comes to deployment, there are two environments, Staging/UAT and Prod, and there is a significant difference between them in terms of the resources allocated to the application, like memory and CPU time. The idea is to allocate more resources to the application in production so that it can serve its purpose without any issues. In our case too, the number of workers running in production is 2X that of the staging environment, so the resources needed are also doubled. Therefore, we want to know: do we really need the increased resources in production?
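For reference, in our setup the difference between the environments largely comes down to the worker count in the Gunicorn config, something like the sketch below (the environment variable name and the numbers are illustrative, not the actual config).

# gunicorn.conf.py (sketch): staging runs with WEB_CONCURRENCY=2,
# production with WEB_CONCURRENCY=4; values here are illustrative.
import os

bind = "0.0.0.0:8000"
workers = int(os.environ.get("WEB_CONCURRENCY", 2))
threads = 2
timeout = 120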

That's why we want to track the number of API calls which will eventually answer a few questions like

  • Is there any need to increase or decrease the resources in production?
  • Can the whole system be deployed as batch inference if the number of API calls is low?
  • Are previously deployed models still being used by downstream applications, and if so, how frequently?
Total API Calls
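Counting the calls can be done with a small Flask before_request hook, sketched below. The per-day key, the JSON file store, and the count_api_call name are my placeholders; again, with multiple Gunicorn workers a shared store is the safer choice.

# Sketch: count API calls per endpoint per day for the monitoring page.
# Assumes `app` is the Flask application defined earlier.
import json
from datetime import date
from pathlib import Path

from flask import request

CALLS_FILE = Path("/tmp/api_calls.json")  # placeholder store

@app.before_request
def count_api_call():
    calls = json.loads(CALLS_FILE.read_text()) if CALLS_FILE.exists() else {}
    key = f"{date.today().isoformat()}:{request.path}"  # e.g. "2021-02-23:/classify"
    calls[key] = calls.get(key, 0) + 1
    CALLS_FILE.write_text(json.dumps(calls))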

There might be more metrics we could measure, like the responsiveness of the APIs. But here, the main focus was to know how the model was performing in predicting the categories and to keep the effort of tracking these metrics very simple.

Here is the sample code for generating the divs for the bar plots.

Generate HTML Divs for Bar plot
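The original gist is not reproduced here, but a minimal sketch of what generate_div can look like with Plotly is below. It assumes predict_dist is a dict mapping a plot title to a {label: count} series for each of the three tracked metrics; that structure is my assumption, not necessarily how the original code is organized.

# Sketch of generate_div: turn tracked counts into embeddable Plotly HTML divs.
import plotly.graph_objects as go
from plotly.offline import plot

def generate_div(predict_dist: dict) -> list:
    divs = []
    # Assumed structure, e.g. {"Distribution of predicted classes": {...},
    # "Daily API calls": {...}, "Total API calls": {...}}
    for title, counts in predict_dist.items():
        fig = go.Figure(go.Bar(x=list(counts.keys()), y=list(counts.values())))
        fig.update_layout(title=title)
        # output_type="div" returns an HTML <div> string instead of writing a file
        divs.append(plot(fig, output_type="div", include_plotlyjs=False))
    return divs

Whichever way the divs are built, the monitoring route below only needs them returned as a list.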

The HTML divs generated by the above code can then be passed to an HTML template (monitor.html) rendered from the Flask route:

@app.route("/monitoring", methods=['GET'])
def monitor():
divs = generate_div(predict_dist)
return render_template('monitor.html', div1=Markup(divs[0]), div2=Markup(divs[1]), div3=Markup(divs[2]))
