A Model That’s Too Good to Be True - How to Deal with Label Leakage

by Kevin Moore, Lead Data Scientist, Salesforce

Machine learning algorithms will learn patterns that are present in the data you show them, so be careful what you show them.

When you train a machine learning model, you’re implicitly telling the algorithm that the data you’re feeding it is trustworthy. You’re telling it, “Here are some examples of successes, here are some examples of failures. Extrapolate patterns from these so that we can predict the outcome of new records.” This can sometimes lead to surprising and unhelpful results, where the algorithm picks up on data that is filled in after the outcome is known.

As an example, imagine you’re a realtor and want to have predictions on whether a house will sell in a given timeframe. You’ve diligently collected the relevant data in a custom Salesforce object House__c, and have many past examples of houses that did and did not sell within the timeframe of interest — let’s say 3 months. Based on what you’ve read in Trailhead, this sounds like a great candidate to apply machine learning. A simplified version of the House__c object may look like this:
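To make the setup concrete, here is a minimal, hypothetical sketch of a few such records in Python. The field names (Asking_Price__c, Close_Date__c, and so on) are assumptions for illustration; your actual object will have its own schema.

```python
import pandas as pd

# Hypothetical House__c records. Field names are illustrative assumptions.
house_df = pd.DataFrame([
    # Sold within 3 months: the post-sale fields are filled in.
    {"Square_Feet__c": 1800, "Asking_Price__c": 450000,
     "Initial_Posting_Date__c": "2023-03-01", "Close_Date__c": "2023-04-15",
     "Final_Sale_Price__c": 455000, "Sold_Within_3_Months__c": True},
    # Did not sell within 3 months: the post-sale fields stay empty.
    {"Square_Feet__c": 2400, "Asking_Price__c": 720000,
     "Initial_Posting_Date__c": "2023-01-10", "Close_Date__c": None,
     "Final_Sale_Price__c": None, "Sold_Within_3_Months__c": False},
    # Newly listed: the record you want a prediction for, so there is no label
    # and no post-sale information yet.
    {"Square_Feet__c": 2100, "Asking_Price__c": 600000,
     "Initial_Posting_Date__c": "2023-06-01", "Close_Date__c": None,
     "Final_Sale_Price__c": None, "Sold_Within_3_Months__c": None},
])
print(house_df)
```

Notice that fields like Close_Date__c and Final_Sale_Price__c are only filled in once the outcome is known; that asymmetry is exactly what causes trouble below.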

The houses you want to make predictions for would be houses that are on the market (preferably close to when they are posted), but not yet sold. This means certain pieces of information in the object will not be available, such as the final sale price, the closing date, closing costs, etc. You would instead expect your prediction to be based on information available before the house is sold, such as size/location data of the house in question, the asking price, and other similar quantities.

A human building a model by hand would look at this data and exclude the fields that aren't available before the house is sold, so that the model only depends on fields that exist when predictions are needed. Machine learning algorithms, however, have no such wisdom of their own; they will simply do what you tell them (for more on how machine learning works, check out this post). If you ask a model to use all the fields in your object, the algorithms will happily crunch all the data and find strong predictors of the outcome, regardless of whether it makes business sense to use those fields. For example, a model may consist of the single rule that whenever the difference between "initial posting" and "close date" is less than 3 months, the label is True! Moreover, the model will think it has done a great job, since all its predictions on the existing labeled data turned out to be correct, even on the unseen holdout set it didn't train on!
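To make this failure mode concrete, here is a hedged sketch using synthetic data and scikit-learn (not Einstein Prediction Builder's actual pipeline). The leaky "days to close" column is only populated once a sale has happened, and a simple tree latches onto it:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Legitimate pre-sale features (pure noise here, just to fill out the example).
asking_price = rng.normal(500000, 100000, n)
square_feet = rng.normal(2000, 400, n)

# Outcome: did the house sell within 90 days?
sold = rng.random(n) < 0.5

# Leaky feature: days between initial posting and close date. It only exists
# once the sale has closed; for unsold houses it is missing (encoded as -1).
days_to_close = np.where(sold, rng.integers(10, 90, n), -1)

X = np.column_stack([asking_price, square_feet, days_to_close])
X_train, X_test, y_train, y_test = train_test_split(X, sold, random_state=0)

leaky_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("Holdout accuracy with the leaky field:", leaky_model.score(X_test, y_test))   # ~1.0

clean_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train[:, :2], y_train)
print("Holdout accuracy without it:", clean_model.score(X_test[:, :2], y_test))      # ~0.5
```

The holdout set doesn't protect you here, because the leak is baked into the holdout data as well.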

Unfortunately, this model would not be useful for making predictions in the context the realtor cares about. All its predictions on unsold houses would be negative, because the close date is never filled out on unsold houses! We call this problem "Hindsight Bias" or "Label Leakage"; see this Trailhead for more examples.

The machine learning pipelines powering Einstein Prediction Builder will do their best to remove fields that look like leakers, but these methods are not perfect. Hence, you need to be vigilant in inspecting your models for potential label leakage.

The main question you should ask yourself to determine if a field should be included in your prediction is:

  • Do the values of this field look similar to the values on the records for which I want to make predictions?

This should filter out common leakage sources where a field is modified after the label is known. An example would be leaving in a "closed reason" field that can only be filled out when the outcome is negative. The records you'd want a real prediction on would never have this field filled in, so it doesn't make sense to include it in your model.
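If you export your records to a DataFrame, one hedged way to spot this pattern is to compare each field's fill rate across label values; the object and field names below are hypothetical:

```python
import pandas as pd

def fill_rate_by_label(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Fraction of records with each field filled in, split by label value."""
    return df.drop(columns=[label]).notna().groupby(df[label]).mean()

# Example: a Closed_Reason__c column that is only filled in when the outcome is
# negative will show a fill rate near 1.0 for False and near 0.0 for True.
# print(fill_rate_by_label(records_df, "Converted__c"))
```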

It can also help filter out fields whose usage has drifted over time. Perhaps you used to have a process where you would try to predict the outcome by hand, or use a prediction from some other source. Unless you are very careful about leaving these fields unchanged once the label is determined, including them can cause label leakage. Or perhaps you just have a field (IdInUnusedExternalSystem) that was used in the past but isn't used anymore. It's better to leave that field out, since it won't be filled in on any of the new records you want to make predictions on.

How to Diagnose Label Leakage in Your Model

The first thing to check is the model scorecard (see this post for more information on the Einstein Prediction Builder Scorecard). If the model quality is listed as “Too High”, then that may mean there was label leakage. It could be that the model was just able to do an excellent job at predicting what you asked, but such models are often too good to be true, so you should be especially wary of high model quality. Even if your model quality is modest, it is still essential to check your top predictors and see if they make sense for your use case.

In an extreme example, where there is a leaky field that didn’t get removed automatically, you could see it contributing much more than other fields of your object.

If you see a single feature jump out as much more impactful than everything else in the model, inspect your data and check whether that makes sense for your use case. A single dominant feature can be a sign of label leakage, but it is not always a problem: your data and use case may have one field that is legitimately much more important than everything else. Either way, it's worth a double-check.
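If you want to reproduce this kind of check outside the Scorecard, a rough sketch might look like the following, using scikit-learn feature importances as a stand-in for (not the same quantity as) the impact values shown on the Scorecard:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def dominant_feature_report(X: pd.DataFrame, y, ratio: float = 3.0) -> pd.Series:
    """Rank features by importance and flag one that towers over the rest."""
    # Assumes X contains numeric, non-missing feature columns.
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    importances = importances.sort_values(ascending=False)
    top, runner_up = importances.iloc[0], importances.iloc[1]
    if runner_up > 0 and top / runner_up > ratio:
        print(f"'{importances.index[0]}' dominates the model; check it for leakage.")
    return importances

# e.g. dominant_feature_report(houses[["Asking_Price__c", "Square_Feet__c"]],
#                              houses["Sold_Within_3_Months__c"])
```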

Another way to diagnose things after a model is built is to look at the predictors' detail page on the Scorecard, which shows a table of the top features (ranked by impact) along with each feature's name, impact, correlation, and weight.

The main things to pay attention to are the impact (this is the weight scaled to be between 0 and 1), and correlation. Are there any features with large correlations or large impacts that should not be there?

Einstein Prediction Builder will automatically remove features above a certain correlation threshold because they are typically proxies for the label. However, it can still leave in features that you don't want in your model. Check the correlations and ask whether any of the high-correlation features belong in your model. Are any of these features known before the label is known? Are they modified at all after the label is determined? If they are modified after the label is known, create a new prediction that excludes them, since the model may have learned from information that is unavailable when you want to make predictions.
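A manual version of this check, run against an exported DataFrame, could look like the sketch below; the 0.8 threshold is an arbitrary assumption, not the threshold Einstein Prediction Builder uses:

```python
import pandas as pd

def high_correlation_features(df: pd.DataFrame, label: str, threshold: float = 0.8) -> pd.Series:
    """Numeric features whose absolute correlation with the label exceeds the threshold."""
    labeled = df[df[label].notna()]
    numeric = labeled.select_dtypes("number")
    corr = numeric.corrwith(labeled[label].astype(float)).abs()
    corr = corr.drop(label, errors="ignore")
    return corr[corr > threshold].sort_values(ascending=False)

# Any feature this returns deserves the questions above: is it known before the
# label exists, and is it ever modified after the label is determined?
```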

There are also automatic tests that check whether selected fields look similar between the training data (labeled data that passes the custom training filter) and the scoring data (everything else). If a field looks radically different between the training and scoring data sets, then that indicates the field is not useful in the prediction because the model will learn patterns from the training data that are not present in the data on which it will make predictions. For example, this would catch and remove a field that is always filled in for training data (like Close_Date__c in the house price example from the beginning) but is never filled in on the unlabeled records on which you want to make predictions.
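You can approximate this training-versus-scoring comparison yourself. Here is a minimal sketch, assuming you can pull both sets of records into DataFrames; the column name in the comment is hypothetical:

```python
import pandas as pd

def fill_rate_gap(training: pd.DataFrame, scoring: pd.DataFrame) -> pd.DataFrame:
    """Compare how often each field is filled in on training vs. scoring records."""
    scoring = scoring.reindex(columns=training.columns)
    report = pd.DataFrame({
        "training_fill_rate": training.notna().mean(),
        "scoring_fill_rate": scoring.notna().mean(),
    })
    report["gap"] = (report["training_fill_rate"] - report["scoring_fill_rate"]).abs()
    # A field like Close_Date__c that is always filled in on training rows but
    # never on scoring rows will show a gap near 1.0.
    return report.sort_values("gap", ascending=False)
```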

Putting This Into Practice

Here are a few examples that are similar to real-world cases we have run into when diagnosing models.

In the first example, the label is the "Converted" field, making it a binary classification problem. The "Status" field encodes more detailed information about why a record converted or why it didn't. If you trained a model to predict "Converted" and included "Status", you would introduce label leakage. This is a more subtle case than the ones shown before, because some choices do not leak information and could exist on the unlabeled records you want to make predictions on (e.g., "Waiting"), while other choices clearly give away the label: the "Too expensive" choice always goes with a negative outcome, and the "Converted — 12mo subscription" choice always goes with a positive outcome. The field as a whole is a leaky field, even though not every choice is. Applying the test of comparing what the values look like on labeled vs. unlabeled records, you'd see a very different set of choices on each, and conclude that this field is not a good one to include in your model.
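One hedged way to see this in your own data, assuming hypothetical Status__c and Converted__c fields on an exported DataFrame, is to look at the label rate within each picklist value:

```python
import pandas as pd

def choice_leakage_report(df: pd.DataFrame, picklist: str, label: str) -> pd.DataFrame:
    """For each picklist value, how often the label is positive and how many records carry it."""
    labeled = df[df[label].notna()].copy()
    labeled[label] = labeled[label].astype(float)  # True/False -> 1.0/0.0
    report = labeled.groupby(picklist)[label].agg(positive_rate="mean", records="count")
    # Values pinned at 0.0 or 1.0 (like "Too expensive" or a "Converted" choice)
    # effectively encode the label, so the field as a whole leaks.
    return report.sort_values("positive_rate")

# e.g. print(choice_leakage_report(leads_df, "Status__c", "Converted__c"))
```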

To give one more example, let's say you want to predict whether a customer will make a late payment on an invoice, captured as a binary "Late" field. Assume this field is filled out either after the due date has passed without payment (in which case it's late) or when payment is received. There is also a "Days late" field that defaults to 0 and records how many days late the invoice payment is. A value of 0 means either the due date hasn't arrived and there's been no payment yet (so not late yet), or the payment was received on the due date. Negative values correspond to early payments, and positive values correspond to late payments. Including a field like "Days late" will also introduce label leakage, because its value often depends on the label itself. Applying the test of comparing what the values look like on labeled vs. unlabeled records, you can see that this field typically looks different between the two, which means it is not a good field to include in the model.
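A quick, hedged sanity check along the same lines (the invoice DataFrame and field names here are hypothetical) is to see whether a simple rule on the candidate field already reproduces the label on the records that have one:

```python
import pandas as pd

def agreement_with_label(df: pd.DataFrame, rule, label: str) -> float:
    """Fraction of labeled records where a rule on a candidate field matches the label."""
    labeled = df[df[label].notna()]
    return float((rule(labeled) == labeled[label].astype(bool)).mean())

# e.g. agreement_with_label(invoices_df, lambda d: d["Days_Late__c"] > 0, "Late__c")
# A result near 1.0 means the candidate field is effectively the label in disguise.
```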

Summary

Before training a model, carefully inspect your data for fields that leak information about the label, since that information will not be available at prediction time. For each field you include, ask yourself, "Do the values of this field look similar to the values on records I want to make predictions on?" If the answer is no, don't include that field.
