From Data to AI with the Machine Learning Canvas (Part IV)

Key aspects of ML systems to specify before implementation: constraints on the predictive engine, and how to evaluate systems before deployment.

Louis Dorard
Own Machine Learning
14 min read · Oct 23, 2018


Presenting the Machine Learning Canvas at Big Data Spain

When we’re starting out on a new Machine Learning project, there’s a temptation to grab a dataset, apply various algorithms, compute generic performance metrics, then push those metrics up in all sorts of ways (data pre-processing, hyper-parameter tuning, model ensembling), even when the metrics have no meaningful interpretation in the application domain. We’re so focused on these things that we end up losing track of the big picture…

What are the gains of using our ML system, for its end-user? Do they outweigh the costs to (build and) run it?

An ML system makes decisions from predictions, in a way that creates value for an end-user. In Part III of this series on the Machine Learning Canvas, I showed how to specify an ML task (once an end-user has been identified and a value proposition has been expressed) and how decisions should be made from predictions. In preparation for implementing the “predictive engine” of an ML system, we should identify its operating constraints, so that we choose the right technological solutions to build it and can anticipate the costs of running it. We should also explain how to evaluate the whole ML system before deciding to deploy it (or an updated version of it) to production; we want to build confidence that it won’t “break” things in production, and that it will create real gains.

This article is part of a series — check out Part I, Part II and Part III if you haven’t already. In particular, Part III contains a list of the 4 use cases that will be mentioned here (real-estate deals, churn prevention, priority inbox, fake reviews detection) and of their associated value propositions. Have a look at my talk at PAPIs Latam 2018 for an overview of the Machine Learning Canvas (MLC).

Making Predictions

Technical constraints on predictions made to support decisions: volume, frequency, time, etc.

Predictions. Delivered exactly when you need them.

Frequency and volume

After filling out the Decisions box (see Part III), we know when predictions should be made, but not necessarily how many of them or at what frequency. For churn prevention, there would be one prediction per customer per month, so it would be useful to give an idea of how many customers there are. For priority inbox, we would want to estimate the number of users and the average frequency of incoming emails. For the real-estate deals case, how many new properties can we expect on the market every week? For fake reviews detection, how often do we get new reviews?

In all these examples, the number of predictions to make is the same as the number of decisions. However, that may not always be the case. For instance, if you were building a marketing campaign personalization system, you would make as many predictions per customer as there are personalization options (so you can determine which one is best).

As you develop a better understanding of which predictions will be made and when, I recommend revisiting the Features box (in the LEARN part of the canvas). It’s important to make sure that, at prediction time, you’ll be able to use all of the features you listed. Consider reimplementing priority inbox in Gmail, for instance. You may be tempted to access the isStarred property, which is available in the Gmail API and tells whether an email was starred or not. You could extract a dataset that includes this variable, build and evaluate a model, and get very high accuracy. However, isStarred would not be available when predicting the importance of a new email, because the end-user would not yet have had the chance to interact with that email… The best way to avoid these issues is to ask yourself, for each feature, how to extract its value from the data sources at prediction time.
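As a quick illustration, here is a minimal sketch (with a hypothetical feature list for the priority inbox example) of how you could flag features that won’t be computable at prediction time, so they get excluded before any model is trained:

```python
# Hypothetical feature list for the priority inbox example.
TRAINING_COLUMNS = ["sender_domain", "subject_length", "reply_rate_to_sender", "isStarred"]

# For each feature, record whether its value can be computed from the data
# sources available at the moment a new email arrives (assumed mapping).
AVAILABLE_AT_PREDICTION_TIME = {
    "sender_domain": True,
    "subject_length": True,
    "reply_rate_to_sender": True,
    "isStarred": False,  # only known after the end-user has interacted with the email
}

usable_features = [c for c in TRAINING_COLUMNS if AVAILABLE_AT_PREDICTION_TIME.get(c, False)]
leaky_features = [c for c in TRAINING_COLUMNS if c not in usable_features]
print("Train on:", usable_features)
print("Exclude (leakage):", leaky_features)
```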

You could also revisit Data Collection, and reflect on differences between the inputs that will be collected and added to the training data, and the inputs on which you will be making predictions. The best scenario is when the former are representative of the latter, but that may not always be the case…

Time

The time available for each prediction is informed by the domain of application and by what’s acceptable for the end-user. This includes the time for the model to make a prediction, plus the time to extract feature values. Oftentimes, it is actually this featurization that takes the longest, as it involves accessing various databases and running feature extraction methods.
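To make the distinction concrete, here is a small sketch (with placeholder functions standing in for the real featurization and model) that times the two stages separately; in a real system you would compare both numbers against the time budget listed in this box:

```python
import time

def extract_features(raw_input):
    # Placeholder featurization: in practice this would query databases and
    # compute aggregates, and often dominates the total latency.
    time.sleep(0.05)  # simulate an I/O-bound feature extraction step
    return [len(str(raw_input)), 0.0, 1.0]

def predict(features):
    # Placeholder model call; real model inference is often much faster
    # than the featurization step.
    return sum(features) > 1.0

start = time.perf_counter()
features = extract_features({"customer_id": 42})
featurization_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
prediction = predict(features)
prediction_ms = (time.perf_counter() - start) * 1000

print(f"featurization: {featurization_ms:.1f} ms, prediction: {prediction_ms:.1f} ms")
```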

In the customer retention case, we would need to “refresh” the feature-representations of customers every time new predictions are made on them (i.e. every month, in my example); this would involve recomputing variables such as how many times the customer logged in, average time spent using the product, etc. In the priority inbox case, we would need to recompute “social features” such as time taken to reply to the sender’s previous emails.

Other constraints

In addition to prediction volume, it is useful to think about any additional constraints on making predictions, and to consider where predictions might be made. For instance, we might need our system to be robust to loss of internet connectivity, in which case we’ll want to make predictions directly on the end-user’s device.

If you’re integrating an ML system into an IoT application, predictions will be made either on the edge or in the cloud. Maybe the devices won’t be powerful enough to run predictions in time, and your constraints will clearly point towards the cloud; or maybe the prediction volume will be too big and will point towards making predictions on-device (assuming all the input features are available locally). If you control the hardware that runs your ML system, you would also want to choose that hardware with these technical constraints in mind. Could predictions be made in batches? (If so, you may want to equip your device with a GPU; otherwise you would stick to a CPU, or use an FPGA if time is critical.) But if you don’t control the hardware, you may have to test your system on a variety of hardware (to be specified).

Monitoring in production

It is best practice to deploy predictive models into production as APIs. This makes it easier to integrate them into intelligent systems. The constraints on making predictions would translate into constraints for the API that powers predictions. When monitoring this API, you will be able to check that constraints remain satisfied in production, and also that the actual volume of predictions matches your expectations (if not, this could indicate issues such as undeclared consumers, which can end up creating hidden feedback loops — see the paper on technical debt in ML systems).
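As a sketch of what this could look like, here is a minimal prediction API (using Flask, with a placeholder model) that logs latency and caller information, so that prediction volume, timing constraints and potential undeclared consumers can be monitored:

```python
import logging
import time

from flask import Flask, jsonify, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

def predict_importance(features):
    # Placeholder for the real model; returns a dummy score.
    return 0.5

@app.route("/predict", methods=["POST"])
def predict():
    start = time.perf_counter()
    features = request.get_json()
    score = predict_importance(features)
    latency_ms = (time.perf_counter() - start) * 1000
    # Log latency and caller so volume, timing and undeclared consumers
    # can be tracked by the monitoring system.
    logging.info("prediction latency_ms=%.1f caller=%s", latency_ms, request.remote_addr)
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(port=5000)
```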

Offline Evaluation

Methods and metrics to evaluate the system before deployment.

Evaluating an ML system entails evaluating its model(s), and this evaluation should be performed every time a new model is created. Evaluating the model’s accuracy isn’t uninteresting, but what we really want to do is pre-evaluate the future impact of decisions, so we can build trust that the system is ready for deployment. I like to think of the offline evaluation as a simulation, where we’re trying to answer “how well would we do if we deployed this system on these test cases?”.

Test cases

For the simulation to be trustworthy, it should run on test cases that are representative of the cases the system will encounter when deployed. Similarly, we should train the system on data that is representative of what it will be tested on. But we shouldn’t let the system know too much about what it will be tested on…

How should we split the available data into a training set and a test set? A common mistake is to split randomly. In his talk at PAPIs Europe 2018 on the limits of decision making with AI, Martin Goodson gives the example of a news classification system that finds a rule specific to DailyMail articles; if the test set also contains DailyMail articles, the system will do well on them, which will be reflected positively in the evaluation. However, the system should find rules that generalize to all news sites, since it may be used on sites that were not in the training data. If the train-test split is random, your evaluation results could be overly optimistic. It’s important to apply knowledge of the domain and of the problem in order to segregate data into training and test sets.

Slide from Martin Goodson’s talk at PAPIs on the limits of decision making with AI

If inputs are time-bound, as is the case with customer snapshots, a random split could leave you with inputs in your training set that come after the inputs in your test set. This would make the evaluation very difficult to interpret…

It is good practice to choose the test set in a way that makes it easy to present meaningful results, and to interpret them in the domain where the system will be used. One way to do this is to use the most recent data as test, so we can answer “how well would we have done if we had deployed this system X days/weeks/months ago?”. This is what I chose to do in all the example canvases.
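Here is a minimal sketch of such a time-based split (using pandas on a made-up dataframe of customer snapshots), holding out the most recent X days as the test set:

```python
import pandas as pd

# Made-up dataframe of customer snapshots with a date column.
df = pd.DataFrame({
    "snapshot_date": pd.date_range("2018-01-01", periods=300, freq="D"),
    "feature_1": range(300),
    "churned": [i % 7 == 0 for i in range(300)],
})

# Hold out the most recent X days as the test set, so the evaluation answers
# "how well would we have done if we had deployed this system X days ago?"
X_DAYS = 30
cutoff = df["snapshot_date"].max() - pd.Timedelta(days=X_DAYS)
train = df[df["snapshot_date"] <= cutoff]
test = df[df["snapshot_date"] > cutoff]
print(len(train), "training rows,", len(test), "test rows")
```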

Note that there is some ambiguity in the fake review detection canvas: it’s not clear whether this system is built for one specific website where reviews are being left, or if it is meant to be generic and applicable to multiple websites of a certain type (e.g. hotel reviews on TripAdvisor, Booking, etc.). In the latter case, we would need to segregate websites in addition to splitting time-wise.

Test set size

Another thing to consider regarding the test set is its size. In Machine Learning Yearning, Andrew Ng says two things on this…

  • “The old heuristic of a 70–30 train-test split does not apply for problems where you have a lot of data; the dev and test sets can be much less than 30% of the data”
  • “Your test set should be big enough to give you a confident estimate of the final performance of your system.”

Another way to phrase this is that your test set should be big enough for your results to have meaning in the domain of application. If you plan model updates every X days, then you should test on at least X days’ worth of data. But it could be more — for instance, if you’re modeling behavior that has a yearly pattern, you’ll want your test set to be over a period of 1 year (or more).

When evaluating your system, you also want to make sure that you can make predictions/decisions sufficiently fast (which would be determined by test set size and time constraints listed in Making predictions).

Metrics and performance constraints

Performance of your system can be measured in a number of ways, but Andrew Ng recommends the following: “Choose a single-number evaluation metric for your team to optimize. If there are multiple goals that you care about, consider combining them into a single formula or defining satisfying and optimizing metrics.” Also, if we are to decide on whether to deploy to production, it’s probably best to focus on a domain-specific metric that quantifies the impact of our system.

So many performance metrics… Which one to choose?! (Screenshot of Dataiku’s Data Science Studio)

As I said above, I like to think of an evaluation as a simulation. A good way to measure the performance of the system is to compute, when running the simulation, the gain (or reduction in cost) of using the system compared to not using it (or to using something else). It’s a much better story to tell than reporting somewhat abstract accuracy metrics (e.g. the F1-score of a classifier). You might need to make certain assumptions for that; for instance, in the customer retention example, you should state your assumed success rate for your targeted retention efforts.

The first step in finding your domain-specific metric is to interpret the meaning of errors. Let’s consider binary classification to start with. There are two types of errors: False Positives and False Negatives. In churn prediction, a False Positive is a customer who you thought would churn, but eventually didn’t. A False Negative is a customer who you didn’t think would churn, but eventually did.

The second step is to assign cost values to all possible errors. In our example, the cost of an FP would be the cost of targeting a customer, and the cost of an FN would be 0 (we lose this customer, but we would have lost them anyway without our system). Similarly, you can come up with gains (i.e. negative costs) for True Positives and True Negatives. The gain for a True Positive can be estimated as the revenue brought by the customer, multiplied by the assumed success rate of our retention campaign, minus the cost of targeting the customer. The gain for a True Negative is 0. These cost/gain values are fixed for the problem at hand, and could be written in the canvas. When running the simulation, we would just count the number of FPs, FNs, TPs and TNs, multiply each count by the associated cost value, and sum everything.

Illustration of a “confusion matrix” stolen from Human-Centered Machine Learning (great read, highly recommended!)
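To make this concrete, here is a minimal sketch of such a cost/gain-based evaluation for the churn example; the monetary values are assumptions made up for illustration, not figures from any canvas:

```python
# Assumed values, to be replaced by your own business figures.
MONTHLY_REVENUE = 50.0   # € of revenue per retained customer
SUCCESS_RATE = 0.20      # assumed success rate of the retention campaign
TARGETING_COST = 2.0     # € spent per targeted customer

# Gain (negative cost) associated with each outcome of the confusion matrix.
GAIN = {
    "TP": MONTHLY_REVENUE * SUCCESS_RATE - TARGETING_COST,  # expected retained revenue minus targeting cost
    "FP": -TARGETING_COST,  # we targeted a customer who would not have churned
    "FN": 0.0,              # lost customer we would have lost anyway
    "TN": 0.0,              # no action taken, no cost incurred
}

def total_gain(counts):
    """counts: dict with the number of TP, FP, FN and TN from the simulation."""
    return sum(GAIN[outcome] * counts[outcome] for outcome in GAIN)

# Example: hypothetical counts obtained by running the simulation on the test set.
print(total_gain({"TP": 120, "FP": 300, "FN": 40, "TN": 5000}))
```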

Remember I said it’s easy to come up with situations where a perfect model would be useless? Imagine that our model predicts churn with perfect accuracy, hence we choose to target all customers predicted to churn. Imagine that the monthly revenue is 10€ per customer, that the success rate of our retention campaign is 20%, and that the average cost of targeting a customer is 2€. In that case, the gain for a True Negative is still 0, and the gain for a True Positive is 10€ × 20% − 2€ = 0€! (To fix this we should either bring the targeting cost down, or improve the success rate of our campaign; the latter might be achieved with personalization of the campaign, or by finding which customers would be the most receptive.)

Let’s move on to a regression case. The cost of an error could be a function of the magnitude of the error, but also of the input on which the error was made. In the real-estate deals example, the cost/gain function is based on the way predictions are used in our investment strategy, and aims at estimating how much money we lose or make: if we decide to invest in what is predicted to be a good deal (asking price < prediction), but it turns out to be a bad deal (sale price < asking price), we incur a cost of (asking price − sale price).
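Here is a sketch of what that cost function could look like in code; the gain side of the rule (how much we make on a good deal) is my own simplifying assumption, stated in the comments:

```python
def deal_cost(asking_price, predicted_price, sale_price):
    """Money lost (positive) or made (negative cost) on one property."""
    if asking_price >= predicted_price:
        return 0.0  # not predicted to be a good deal, so we don't invest
    if sale_price < asking_price:
        return asking_price - sale_price  # bad deal: we overpaid
    # Assumption for illustration: on a good deal we gain the difference
    # between sale price and asking price (expressed here as a negative cost).
    return -(sale_price - asking_price)

# Hypothetical (asking, predicted, sale) prices for a few test-set properties.
deals = [(200_000, 220_000, 190_000), (150_000, 180_000, 175_000)]
costs = [deal_cost(a, p, s) for a, p, s in deals]
print("total cost:", sum(costs))
```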

We would want to minimize the total cost in our simulation, but we would also want to make sure that no single cost is bigger than our bankroll — otherwise we would go bankrupt! In addition to a performance metric to optimize, it is common to have to satisfy performance constraints such as maximum amount of error, or maximum number/proportion of errors of a certain kind that we can tolerate. In the priority inbox example, we may want to make sure the number of important emails that weren’t detected as such (False Negatives) is less than 1 every X days (based on the degradation of user experience that we can tolerate).
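For example, the priority inbox constraint could be checked at the end of the simulation with something as simple as the following (all numbers are made up):

```python
# Tolerate at most one missed important email (False Negative) every X days.
X_DAYS = 7                # assumed tolerance: 1 missed important email per week
TEST_PERIOD_DAYS = 90     # number of days covered by the test set
false_negatives = 10      # missed important emails observed in the simulation

max_allowed = TEST_PERIOD_DAYS / X_DAYS
print("constraint satisfied:", false_negatives <= max_allowed)
```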

One last, obvious constraint on the performance of the system is that it should be higher than that of the baseline. This means that before you learn any models from data, you should implement the offline evaluation and the baseline. Actually, if your baseline is a heuristic that satisfies all your performance constraints, and that you can show is valuable in the context of the offline evaluation/simulation, you should probably deploy that baseline before doing any ML (if you haven’t done so already)!

Validation set and parameter tuning

Until now, we’ve discussed measuring the performance of our system on a test set. That dataset should only be used to decide whether everything is OK to deploy to production. You should not use the test set to tune any parameters of the model-building process, nor of the decision process. The example canvases are kept fairly simple and omit to mention that our available data should consist of not 2 but 3 datasets: a test set, a training set, and a validation set, which is used for parameter tuning and model selection.

When considering different approaches to model building (neural networks and ensembles of trees, for example), we’ll run each of them on the training set, and we’ll apply them to the validation set in order to “validate” our choice of approach. The validation set can be used for all the choices to be made when building the system: parameters of the feature extraction, data pre-processing, or model-building procedures (also called “hyper-parameters”), or parameters of the decision system (such as the number of customers K to target, or the thresholds used for automatic decisions).
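Here is a minimal sketch of that workflow, on synthetic data and with a single hyper-parameter (tree depth) tuned on the validation set; accuracy stands in for whatever domain-specific metric you settled on above:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, time-ordered data (oldest rows first).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Oldest 70% for training, next 15% for validation, most recent 15% for the test set.
X_train, y_train = X[:700], y[:700]
X_val, y_val = X[700:850], y[700:850]
X_test, y_test = X[850:], y[850:]

# Tune a hyper-parameter on the validation set only.
best_depth, best_score = None, -1.0
for depth in [2, 4, 8, 16]:
    model = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_depth, best_score = depth, score

# The test set is touched only once, to decide whether to deploy.
final_model = DecisionTreeClassifier(max_depth=best_depth).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))
```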

If you’re wondering why we can’t just use the test set for that, the reason is that when we try different choices and compare their performance on a given dataset (the validation set), we are using that dataset to learn which choice is best for the problem at hand, and anything learned should be tested on new data (i.e. the test set).

The validation set can consist of two subsets: an “eyeball validation set” and a “blackbox validation set”, as described by Andrew Ng in ML Yearning (in Andrew’s terminology, a validation set is called a dev set). The eyeball validation set is the one that contains the cases to be examined as part of manual error analysis. It would typically contain a randomly selected subset of the whole validation set, but also edge cases or pre-defined cases where you want to manually review the models’ behavior (i.e. the errors, predictions, and explanations behind predictions) and make sure that it is satisfactory. You may not want to list all these cases in the canvas, but you might want to give insights into how the validation subsets are created, and which and how much data they contain.
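As a rough sketch, the two subsets could be built like this (the ids and edge cases are of course hypothetical):

```python
import random

random.seed(0)
validation_ids = list(range(1000))   # hypothetical ids of validation examples
edge_case_ids = [3, 57, 998]         # pre-defined cases we always want to inspect manually

# Eyeball set: the pre-defined cases plus a random sample of the rest.
random_sample = random.sample([i for i in validation_ids if i not in edge_case_ids], 100)
eyeball_set = set(edge_case_ids) | set(random_sample)
blackbox_set = [i for i in validation_ids if i not in eyeball_set]
print(len(eyeball_set), "eyeball cases,", len(blackbox_set), "blackbox cases")
```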

As I said before with the test set, you also want to run your baseline against the validation set. You should also consider simple models (e.g. rule-based) when doing model selection, so you can quantify the real performance gain that an increase in model complexity offers (keep in mind that more complex models are typically more costly to maintain).

If you haven’t done it yet… Download the Machine Learning Canvas now!

Stay tuned for my next articles: coming up with a value proposition that easily translates into an ML task, and monitoring real-world impact…
