What Matters for Machine Learning in Production

(This post was originally published at http://www.clari.com/blog/what-matters-for-machine-learning-in-production)

A few weeks ago, I attended and presented at the Data Science Summit, an annual conference organized by Turi. One of my favorite presentations was Professor Carlos Guestrin’s talk, which covered insights from Turi’s interactions with data scientists from assorted domains, including recommendations, fraud detection, click prediction, forecasting, lead scoring, and churn prediction. Carlos listed four must-haves for machine learning in production, which I think are spot-on and echo my own team’s experience in practice.

1. Maximize resources. Reuse features.

Many data scientists and machine learning engineers spend over 70% of their time on things other than modeling. Thanks to the popularity of various machine learning libraries like GraphLab, Spark, scikit-learn, and TensorFlow, once you transform related data into the right format, modeling itself is the easiest part of standard machine learning problems. In comparison, data scientists spend tons of time on data wrangling, of which feature engineering is one of the most important tasks.

As the term suggests, this is engineering work: neither science nor any other discipline gives much instruction on how features should be defined. In practice, feature engineering requires domain knowledge and experience, and it is largely a trial-and-error process. You define certain features and test whether they bring any additional metric gain to the machine learning application. If they do, you’re lucky!

Yet Carlos observed that machine learning applications can be categorized by data vertical, which makes feature engineering generalizable. For example, tf-idf should be the de facto standard when dealing with text categorization. Time series and images likewise have their own established ways of generating features. It is therefore possible to abstract the typical feature generation process so data scientists can simply reuse those features and be more productive.
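To make the idea of a reusable featurizer concrete, here is a minimal sketch of tf-idf in plain Python. The class name `TfidfFeaturizer`, the whitespace tokenizer, and the smoothed idf formula are my own assumptions for illustration (smoothing variants differ between libraries); in practice you would likely reach for a library implementation instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines would do more."""
    return text.lower().split()


class TfidfFeaturizer:
    """Hypothetical reusable featurizer: fit once on a corpus, reuse across models."""

    def fit(self, documents):
        tokenized = [tokenize(doc) for doc in documents]
        n_docs = len(tokenized)
        # Document frequency: number of documents that contain each term.
        doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
        # Smoothed inverse document frequency (one common variant, assumed here).
        self.idf_ = {
            term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }
        return self

    def transform(self, document):
        """Return a sparse {term: weight} vector; unseen terms are dropped."""
        tokens = tokenize(document)
        counts = Counter(t for t in tokens if t in self.idf_)
        total = len(tokens) or 1
        return {term: (count / total) * self.idf_[term] for term, count in counts.items()}


# Toy corpus, made up for this example.
corpus = [
    "the deal closed last quarter",
    "the deal slipped to next quarter",
    "the customer renewed the contract",
]
featurizer = TfidfFeaturizer().fit(corpus)
features = featurizer.transform("the deal closed")
```

In this toy corpus "the" appears in every document, so it receives the lowest weight, while the rarer "closed" receives the highest. The same fitted featurizer can then feed any downstream text model.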

2. Never stop learning.

Machine learning models should adapt to real-time updates and feedback. For many recommendation or ranking problems, a complicated machine learning model is typically built and refreshed only once in a while. With a deep learning model, for instance, training can take hours or even days, so you can’t always afford to update the model as frequently as you want. On the other hand, converting a machine learning model to update incrementally is easier said than done, and how hard it is depends on the model.

Carlos suggested an alternative solution: keep the complicated machine learning model to generate a pool of candidates. Then, another lightweight model can be trained in real time to re-rank the candidates in the pool. This can be interpreted as decomposing the pattern into a longer-term trend and short-term fluctuations, with the complicated ML model capturing the general trend and the lightweight model capturing the fluctuations.
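Here is a rough sketch of that two-stage pattern, under my own assumptions rather than anything shown in the talk. `heavy_model_candidates` is a hard-coded stand-in for the expensive model, and `OnlineReranker` is a hypothetical lightweight model that learns one per-item adjustment from click feedback using single-step gradient updates on log loss.

```python
import math


def heavy_model_candidates(user_id, pool_size=3):
    """Stand-in for the expensive model that is retrained only occasionally.

    Returns (item, base_score) pairs; the values are hard-coded for illustration.
    """
    catalog = [("item_a", 0.9), ("item_b", 0.7), ("item_c", 0.6), ("item_d", 0.2)]
    return catalog[:pool_size]


class OnlineReranker:
    """Lightweight per-item logistic adjustment, updated one event at a time."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.bias = {}  # item -> learned short-term adjustment

    def _probability(self, item, base_score):
        logit = base_score + self.bias.get(item, 0.0)
        return 1.0 / (1.0 + math.exp(-logit))

    def update(self, item, base_score, clicked):
        """One stochastic-gradient step on log loss for a single feedback event."""
        error = (1.0 if clicked else 0.0) - self._probability(item, base_score)
        self.bias[item] = self.bias.get(item, 0.0) + self.learning_rate * error

    def rerank(self, candidates):
        return sorted(candidates, key=lambda pair: self._probability(*pair), reverse=True)


candidates = heavy_model_candidates("user_42")
reranker = OnlineReranker()
# Simulated real-time feedback: item_c is suddenly popular, item_a is ignored.
for _ in range(20):
    reranker.update("item_c", 0.6, clicked=True)
    reranker.update("item_a", 0.9, clicked=False)

ranking = [item for item, _ in reranker.rerank(candidates)]
```

After the simulated feedback, item_c moves to the top of the ranking and item_a drops to the bottom, even though the heavy model was never retrained. Note that item_d never made it into the pool, so no amount of feedback can surface it.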

Though I doubt whether such a decomposition always works in practice, it’s quite a smart idea. Whether it works depends on whether the fad appears as a candidate in the pool. If the candidate pool misses the fad, the re-ranker has no way to bring it back. But in real-world machine learning applications, we solve problems under constraints, and compromise is not uncommon.

3. When scaling matters.

Scaling has been a major topic in the machine learning research community. It’s how GraphLab and Spark started in the first place: enabling machine learning applications to process increasing volumes of data. Yet too much effort has been spent on scaling up training, which is just one step in the machine learning pipeline. In production, model training tends to run offline and can afford to take longer if the model is not updated frequently. The real pain point is end-to-end development and deployment time.

A related question is how data science teams collaborate with other engineering and product teams. In my experience, the data science team tends to follow a different pace from other teams. As a SaaS company, we push releases every two weeks. But for machine learning applications, even two weeks can seem too long if a new model turns out to be better. A common practice is to set up the data science or machine learning application as a microservice, and to update and deploy the model as frequently as needed without changing the REST APIs exposed to other components in the production system.
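A minimal sketch of that separation, leaving out the HTTP layer entirely: the class `ModelService`, the `/score` endpoint mentioned in the comments, and the two toy models are all hypothetical. The point is only that `predict` keeps the same request and response shape while `deploy` swaps the model underneath.

```python
import threading


class ModelService:
    """Keeps the request/response contract fixed while models are swapped."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def deploy(self, model, version):
        """Swap in a new model without changing the predict() contract."""
        with self._lock:
            self._model = model
            self._version = version

    def predict(self, payload):
        """Body of a handler for a hypothetical POST /score endpoint."""
        with self._lock:
            model, version = self._model, self._version
        return {"score": model(payload["features"]), "model_version": version}


def model_v1(features):
    """Toy baseline model: always returns the same score."""
    return 0.5


def model_v2(features):
    """Toy replacement model: the score grows with the number of features."""
    return min(1.0, 0.1 * len(features))


service = ModelService(model_v1, "v1")
before = service.predict({"features": [1, 2, 3]})
service.deploy(model_v2, "v2")
after = service.predict({"features": [1, 2, 3]})
```

Callers see the same keys in `before` and `after`; only the score and the reported model version change, so the data science team can redeploy on its own schedule.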

The programming language used for data science in production is another issue. For most data scientists, Python and R are the top choices. Other teams may prefer a different language, like Java. Should data scientists use Python to write production code? Of course, with a microservice setup, the implementation language is independent of the rest of the system. But it feels like a waste to rewrite the many utility functions and drivers that other teams have already developed. More likely, the machine learning application is embedded in the production system, so any finding or model update requires a double implementation, first in Python and then in Java, which is also a waste of resources. If you have a good solution to this problem, I’d be happy to chat.

4. Explain yourself.

“How can I trust your model?” We encounter this question frequently when we demo our machine learning applications to customers. While a black box model that works like magic may look cool, people still question its applicability. For many applications, users do not trust commonly used ML metrics like accuracy, AUC, or NDCG. They need to know how a prediction is computed. That’s why our application shows both the score that predicts how likely a deal is to close and the top factors contributing to that score.
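For a linear model, surfacing a score together with its top factors is straightforward, because each feature contributes weight times value to the logit. The sketch below assumes a logistic model; the weights and feature names are invented for illustration and are not the ones our product uses.

```python
import math


def score_with_factors(weights, bias, features, top_k=3):
    """Return a probability plus the features that moved it the most.

    For a linear (logistic) model, each feature's contribution to the
    logit is weight * value, so ranking by absolute contribution is enough.
    """
    contributions = {
        name: weights.get(name, 0.0) * value for name, value in features.items()
    }
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    top = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, top[:top_k]


# Hypothetical weights and feature names, for illustration only.
weights = {"days_since_last_activity": -0.05, "num_meetings": 0.4, "deal_size_log": -0.1}
deal = {"days_since_last_activity": 30, "num_meetings": 5, "deal_size_log": 4.0}
probability, top_factors = score_with_factors(weights, bias=0.0, features=deal, top_k=2)
```

Here the number of meetings pushes the score up the most and the long gap since the last activity pulls it down, which is the kind of explanation a user can check against their own knowledge of the deal.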

But not every machine learning model can be so easily interpreted. Take deep learning trained on images as an example. It is not straightforward to figure out why a cat is recognized in an image, given the millions of neurons in the model. This sets up what I found to be the most interesting part of Carlos’s talk. His team developed an approach that interprets a prediction by showing the top relevant features, without needing to know what is going on inside the machine learning model. He explains it further in his paper on the subject. Isn’t that crazy?!
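To give a flavor of what "model-agnostic" means, here is a deliberately simple occlusion sketch. It is not the method from Carlos's paper; it is a much cruder technique that replaces one feature at a time with a baseline value and measures how far the prediction moves. The function `black_box`, its feature names, and the all-zero baseline are assumptions made up for this example.

```python
def explain_by_occlusion(predict, instance, baseline, top_k=3):
    """Rank features by how much the prediction changes when each one is
    replaced with its baseline value. Only calls predict(); never inspects it."""
    original = predict(instance)
    effects = {}
    for name in instance:
        perturbed = dict(instance)
        perturbed[name] = baseline[name]
        effects[name] = original - predict(perturbed)
    ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


def black_box(x):
    """Stand-in for an opaque model; the explainer treats it as a plain function."""
    return 0.2 + 0.6 * x["has_whiskers"] * x["has_pointy_ears"] + 0.1 * x["is_outdoors"]


instance = {"has_whiskers": 1, "has_pointy_ears": 1, "is_outdoors": 1}
baseline = {"has_whiskers": 0, "has_pointy_ears": 0, "is_outdoors": 0}
explanation = explain_by_occlusion(black_box, instance, baseline)
```

Whiskers and pointy ears come out as the most relevant features and the outdoor flag as the least. One-at-a-time occlusion like this can mislead when features interact in more complicated ways, which is part of why more careful approaches exist.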

The four factors discussed translate, respectively, into Transferable features, Up-to-date predictions, Resource-aware scaling, and Interpretable models. Abbreviated: TURI :-)

At Clari, our data science team deals with machine learning in production every day. Today, we are trying to address two main problems: opportunity scoring and forecasting. We spend a lot of effort bringing in signals beyond CRM and identifying features that may help predict deal outcomes. When showing the probability of close for a particular deal, we also surface the top contributing factors. But machine learning in production is not easy. Many thorny challenges lie ahead, including scalability, consistency, and robustness to outliers. If you would like to learn more about Clari’s machine learning technology or have an idea for addressing those challenges, please drop us a note.