Retraining is the Only Constant, or, The [Machine] Learning is Never Done

Mark Eric Hanson
Published in T-Mobile Tech · May 13, 2020
“Staircase” by Colin Tsoi is licensed under CC BY-ND 2.0

In previous posts, Jacqueline and Heather from my team described how we were able to go from exploratory analysis and modeling in R to production inference services in R using Docker containers. The net result: less rework and less opportunity for error than our other alternatives, and shorter cycle times to boot! Data Scientists and DevOps are both happy, and that’s a huge win.

We’re applying ML at scale to support ongoing, mission-critical activities. So what’s missing from what we’ve described so far? End users and change over time.

Our business changes, our customers change, and our end users’ needs change. The predictions of our production-deployed R services (and the functionality based on those predictions) have to change right along with them or the products we provide will quickly be discarded. This first became critical for us with a model we had to retrain at least daily. As we have described our patterns so far, the models themselves are built into the container (as rds files, h5 files, or other R-friendly mechanisms), requiring redeployment for a model change. This is somewhere between inefficient and untenable for a model that changes daily.

First, a word from our lawyers!

The code in this blog may need to be modified for your own environment. We will make our best effort to update but we make no guarantees on that. It goes without saying, but you are responsible for respecting the data you use and ensuring best practices for cybersecurity. In general, before using in production, please have your internal IT department (if they exist) review your code and your data. They know your environment better than we do, and we are not in a position to provide general support. In legalese, “Use is AS IS and without warranty.” Your IT department (if they exist) should be aware of the open source licenses for some of these tools. Note that different departments have different policies about open source.

Motivation

First, a little context to provide the motivation. One of our products for our Customer Care organization automatically surfaces internal wiki articles (among other information and resources) based on the content of a conversation, the state of the customer, and the usage history of other agents in similar conversations. We use a variety of ML models operating in real time on the conversation to select what content to present.

One of those models uses conversation text and other data to predict which internal wiki articles will be useful to an agent at any point in a conversation.

A customer care conversation and tool view with recommended articles on the lower right

As this is a wiki, both the content and usage change over time. Just as important — net-new content is introduced before applicable cases arise (for example content about new products, services, and offers). New content, changed content, and changes in usage patterns can each give rise to situations where today’s useful content predictions will be irrelevant (or at least sub-optimal) tomorrow.

If this were a simple database query, we’d change the data in the database and move on. It is not, however; it is a model trained on a combination of article text, customer conversations, and agent usage information. Some form of recurring retraining is required. In the case of the wiki model, we aligned to the potential publication of net-new information. That is generally a daily process so a daily retraining is where we started.

Automated Retraining and Production Deployment

Ideally, we want to train the model offline, then deploy it such that the running R services can pick up the new models on the fly. That is exactly what we’ve done.

There are more than a few concerns that might arise. These include:

  • In what form do we store the models?
  • Where do we store the model info so that it is both cost-efficient to keep and readily accessible to our R services?
  • How do the R services know a new model is available?
  • How do consumers of the model output know what version of the model a given result was based on?
  • How do we automate the retraining in a robust and reliable way?
  • What is the impact on performance of reading the new model on the fly?

The form of the models

Keeping with our preference for simple approaches (or at least as simple as the problem allows), we kept the form of the model unchanged. That is, if the Data Science team built an R service that relies on model data stored as rds files, we use rds files. If it relies on h5 files, we use h5 files.

Where to keep the models?

Many of our services run on AWS. As such, at least two options with which the team is familiar leap to mind: DynamoDB and S3. We went with S3 for cost and simplicity. DynamoDB is a key part of our ecosystem, but nothing about our use case here called for more than simple storage.
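To sketch what the publish side of that can look like (the function, bucket, and key names below are illustrative placeholders, not our exact code), the retraining job can write the refreshed artifact straight to that bucket with aws.s3:

library(aws.s3)

# Illustrative sketch only: after retraining, publish the refreshed model
# artifact to the S3 bucket the inference services read from. The bucket
# and key names are placeholders.
publish_model <- function(model_info, bucket_name, model_key) {
  s3saveRDS(
    x = model_info,
    object = paste(model_key, 'model_info.rds', sep = '/'),
    bucket = bucket_name
  )
}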

How do the running services know there is a new model?

Again, keeping it simple, we have the R services reload the then-current model info from S3 when servicing a REST call if it has been longer than a configured interval since the last model load. Currently, the models are reloaded on receipt of the first REST call at least 3600 seconds after the model was last loaded.
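A minimal sketch of that check, assuming the same config values used elsewhere in this post (the function name and the global-state handling here are illustrative, not our exact code):

library(aws.s3)

# Illustrative sketch only: reload the model info from S3 if more than the
# configured interval has passed since the last load; otherwise keep serving
# with the model already in memory.
last_load_time <- Sys.time()

maybe_reload_model <- function(reload_interval_secs = 3600) {
  elapsed <- difftime(Sys.time(), last_load_time, units = "secs")
  if (as.numeric(elapsed) > reload_interval_secs) {
    model_info <<- s3readRDS(
      object = paste(config$model_key, 'model_info.rds', sep = '/'),
      bucket = config$bucket_name
    )
    last_load_time <<- Sys.time()
  }
}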

How do model consumers know the version of the model used for any given result?

Ok, so perhaps it wasn’t just simple storage we needed. S3 allows us to turn on object versioning. With versioning enabled, S3 assigns a unique version id to our model when we store it. When we retrieve the model, we also retrieve the version (the x-amz-version-id property). The R services include that value as metadata along with the inference result itself.

library(aws.s3)

...

# Read the current model info from the versioned S3 bucket.
model_info <- s3readRDS(
  object = paste(
    config$model_key,
    'model_info.rds',
    sep = '/'
  ),
  bucket = config$bucket_name
)

# Fetch the object metadata so we can record which version we just loaded.
response <- head_object(
  bucket = config$bucket_name,
  object = paste(
    config$model_key,
    'model_info.rds',
    sep = '/'
  )
)

version <- attr(response, "x-amz-version-id")
version_timestamp <- attr(response, "last-modified")
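Those two values then travel with every prediction. In a simplified and hypothetical sketch of a scoring function, score_with_model stands in for whatever inference the service actually performs:

# Illustrative sketch only: return the model version metadata alongside the
# inference result so consumers know which model produced it.
predict_articles <- function(conversation_features) {
  prediction <- score_with_model(model_info, conversation_features)  # hypothetical scoring call
  list(
    result = prediction,
    model_version = version,
    model_version_timestamp = version_timestamp
  )
}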

How do we automate the retraining in a robust and reliable way?

In our environment, the periodic retraining of these models is not optional. It cannot rely on an engineer invoking a command or a cron job on a single EC2 instance. Enter the Kubernetes CronJob!

We use K8s CronJobs to run the model retraining and let K8s handle providing a robust implementation of the periodic invocation.

What is the impact on performance of reading the new model on the fly?

Finally, you might say, “That’s all great, but can you do it and still meet your latency requirements for model inference REST calls?” Good question!

One model we have externalized this way has regular performance that looks something like this across peak traffic periods (excluding model reloads). Note that I’ve taken these metric snapshots during the COVID-19 crisis, and core operating hours for the systems involved are 11:00 to 02:00 UTC.

95% of all queries returning in under 250 milliseconds is well within acceptable range for the service in question. That would be true of the p99 as well except for that spike near 11:30. More on that spike in a moment.

On to the impact of dynamically loading new model data!

First, of course, we try not to reload models during core operating hours. Our K8s CronJob lets us reload at 08:00 UTC, when the services are least used (and when almost all of our team has packed it in for the day). What does the performance look like then?

While core operating hours are limited, there are still queries being serviced even when we do our reloading, so there is some data to look at. A few things to note here.

  • The y-axis is logarithmic (base 2) to show detail.
  • p75 stays within normal bounds (less than 250ms).
  • p95 stays mostly but not entirely below the 350 ms mark as well.
  • p99 does show spikes over the course of a few hours. Let’s look at that a bit more.

To be sure, sustained response times greater than half a second would be a problem (never mind 4 seconds!). However, these aren’t sustained. 75% of cases stay within the normal bounds. Even the 95% line stays close for most of the period. It is worth noting that traffic levels are quite low, as well, so a single slow response constitutes a greater fraction of total traffic than it would at peak. At 08:10, there are about 20 calls per minute versus the 2000 calls per minute during core hours.

What drives those spikes in response time? Reloading the model, of course.

Why do the spikes stretch from approximately 08:00 all the way to that spike at 11:30 we noticed earlier? Reloading happens when two conditions are met. First, it has been at least an hour since the model was last loaded. Second, the service instance in question receives a REST call to process. In the worst case, an instance loads the previous model moments before it is replaced and then waits the longest possible time before being routed a REST call to handle. As traffic is between 1% and 25% of peak until 11:30, that can be quite a wait.

Where do we go from here?

Even if this were a perfect solution (it isn’t), the world changes and it would not remain perfect long. So what are some of the backlogged items for us in this area?

  1. We would like to have a better model versioning solution that ties in with our requirements management, source control, and artifact repository systems. The S3 object version number works, but the lack of metadata and the isolation from the rest of our environment pose a real challenge. To name just one: tracking down the training data for a given model version means working backwards through logs and reconstituting the source data as it was at the time. That isn’t always possible. This is one of several reasons we’re looking at MLflow, Kubeflow, and others. More on that another time.
  2. For the wiki content model mentioned in this article, a daily retrain is good enough (so far). In other cases, we would benefit from greater flexibility. Some problems have more volatile behavior, potentially requiring more frequent retraining. Some training is also more expensive than others, so simply increasing the frequency of periodic retraining can be too costly. We’re working on triggering retraining based on metrics we collect and on automated A/B testing of model versions, in addition to the periodic approach described above.

Wrapping Up

With the successful production operationalization of inference services in R, we faced the need to update the underlying models with minimal disturbance to production. In particular, our recommendations of internal wiki content depend on timely incorporation of usage information as well as incremental and net-new content. To address this, we externalized the models themselves to S3 and changed the inference services to periodically reload them. We then put the training and model publication processes into reliable Kubernetes CronJobs and pushed “go!”. There are performance impacts when model reloads take place, but they are short-lived and can be timed to coincide with low-usage periods for the services in question.

Enough about us…

Have you faced similar situations? How have you solved them? I would love to hear about it! Let us know in the comments.
