Einstein Prediction Builder: Which fields should I include or exclude from my model?

by Christopher Rupley, Lead Einstein Data Scientist, Salesforce

When configuring a new prediction, one of the steps involved is choosing which fields from your data you would like to include when building a predictive model. Since all of the predictive power of a model comes from what data we choose to show it, selecting the right fields is important for getting good predictions.

Consider an example of predicting which of your sales Opportunities are most likely to be won. An example of how the set of Opportunities could look is below:

The field we want to predict in this example is IsWon and the other fields are possible candidates to include as inputs into the predictive model.

What to Include

In short, include as much as you can. You may already have ideas about which fields would be useful for making predictions. For our Opportunities example, maybe you know that they are more likely to be won when the Amount is not too high, when they come from a certain LeadSource, or when the LastActivityDate is recent enough that the Opportunity has not gone stale. You should certainly include those fields. However, there could also be predictive power in fields that you might not expect. The Opportunities from certain ContactIds might convert better, and a lot of information could potentially be gained from the Description field even though it is just free text.

The point is, there can be many tiny signals in your data that can help indicate what the final outcome may be. You may not always notice them yourself or even be aware of them, but a predictive model can leverage them to make your predictions as good as possible. Generally, the more data you give to it, the better it can be.

What to Exclude

With that said, there are still certain kinds of fields that you should probably not include in your model. While generally more data is better, there are some exceptions for ethical, legal, and prediction quality reasons.

Ethical Concerns

If you are using the predictions of a model to make any kind of business decision, you are also indirectly using the information that went into that model in your decision. Using certain types of data for decision-making can raise ethical concerns for a variety of reasons, and whether it does depends both on what is in the data and on the problem you are applying it to. For example, it would make a lot of sense to include a customer’s gender when you are trying to decide which items of clothing to recommend, but you would probably not want to use it when trying to predict what salary to suggest in a job offer. A quick check is to insert the data field you are using and the problem you are solving into the following statement:

I am using <field x> to help me with <problem y>

Applying this test to the examples above gives:

I am using a customer’s gender to help me make the best clothing recommendation possible. ✅

I am using a customer’s gender to help me decide what starting salary to offer. ❌

If you are not comfortable making that statement, the field should not be included in your model. For more details on ethical use of data and bias, see this post.

Legal Concerns

There can also be situations where the law prohibits using certain information when making decisions. If a field contains information on a person’s race, religion, gender, or nationality, you should not use it as an input for something like hiring decisions in jurisdictions, such as the United States, where doing so is not allowed.

You can apply the same test used for evaluating potential Ethical Concerns here as well, and ask yourself whether there could be any legal restrictions on including certain fields in your decision-making process. If your business involves decisions on employment, lending, healthcare, or any other similarly regulated area, it is worth reviewing the list of fields you are using.

Fields with “Hindsight Bias”

There are certain situations where including a field in your predictions can actually make them worse. We say that these fields show a “hindsight bias”. These are fields whose contents are filled in or updated on a record some time after the final value of the prediction field is determined. An example would be filling in the sale “Value” of an Opportunity at the time it is won. The Value field would appear to be a very good predictor of winning an Opportunity, since whenever it is present, the Opportunity has been won. However, we cannot actually use Value as a predictor in practice, since it is never available before the Opportunity is won (that is, it only looks like a good predictor “in hindsight”). Some other general examples of this type of issue include:

  • Fields that are only filled at time of “conversion” or after, such as in the “Value” example above.
  • Formula fields that depend on the thing you are trying to predict. For example, you may have a field that you use to identify a follow-up after an Opportunity is won, whose formula starts with IF IsWon AND .... This field should not be included.
  • If the field you are trying to predict is a formula field, any fields that appear in that formula should not be used. Suppose that instead of predicting IsWon, you are predicting another field, ExpectedValue which is equal to the formula (Value * LikelihoodToWin). In this case, you should exclude both the Value and LikelihoodToWin fields.
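
The two formula-related cases above can be checked mechanically if you have the formula text for each field. The following is a minimal Python sketch, not part of Einstein Prediction Builder: the function names, the `formulas` dictionary, the FollowUpNeeded field, and its formula are all hypothetical, and the token matching is a simple heuristic that does not follow chains of formulas that reference other formulas.

```python
import re


def referenced_fields(formula, known_fields):
    """Return the known field names that appear as whole identifiers in a formula."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", formula))
    return sorted(tokens & set(known_fields))


def fields_to_exclude(target, formulas, known_fields):
    """Suggest fields to exclude for a given prediction target (hypothetical helper)."""
    excluded = set()
    # Case 1: another formula field mentions the target in its formula.
    for field, formula in formulas.items():
        if field != target and target in referenced_fields(formula, known_fields):
            excluded.add(field)
    # Case 2: the target is itself a formula, so its inputs give the answer away.
    if target in formulas:
        inputs = referenced_fields(formulas[target], known_fields)
        excluded.update(f for f in inputs if f != target)
    return sorted(excluded)


# Illustrative field list and formulas; the FollowUpNeeded formula is made up.
known = ["IsWon", "Amount", "Value", "LikelihoodToWin", "ExpectedValue", "FollowUpNeeded"]
formulas = {
    "FollowUpNeeded": "IF(IsWon, 1, 0)",
    "ExpectedValue": "Value * LikelihoodToWin",
}

print(fields_to_exclude("IsWon", formulas, known))
print(fields_to_exclude("ExpectedValue", formulas, known))
```

Treat the result as a list of candidates to review rather than a final answer, since a formula can also depend on the target indirectly.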

If you have any fields that fit these criteria, they should probably not be included in making your predictions.
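
One rough way to spot fields that are only filled in at or after “conversion” is to compare how often each field is populated on won versus lost records. The sketch below assumes you have exported your records as a list of Python dictionaries; the function name, the `min_gap` threshold, and the sample records are illustrative assumptions, not anything provided by Einstein Prediction Builder.

```python
def hindsight_candidates(records, target, min_gap=0.9):
    """Flag fields that are filled far more often when the target is true.

    A large gap in fill rate suggests the field is populated at or after
    the outcome, as with the "Value" example. min_gap is an arbitrary
    illustrative threshold.
    """
    positives = [r for r in records if r.get(target)]
    negatives = [r for r in records if not r.get(target)]
    if not positives or not negatives:
        return []  # Cannot compare without both outcomes present.

    def fill_rate(rows, field):
        filled = sum(1 for r in rows if r.get(field) not in (None, ""))
        return filled / len(rows)

    fields = {key for r in records for key in r} - {target}
    flagged = []
    for field in sorted(fields):
        gap = fill_rate(positives, field) - fill_rate(negatives, field)
        if gap >= min_gap:
            flagged.append(field)
    return flagged


# Made-up sample records: Value is only present on won Opportunities.
sample = [
    {"IsWon": True, "Amount": 100, "Value": 90},
    {"IsWon": True, "Amount": 200, "Value": 180},
    {"IsWon": False, "Amount": 150, "Value": None},
    {"IsWon": False, "Amount": 300, "Value": None},
]

print(hindsight_candidates(sample, "IsWon"))
```

A flagged field is not automatically a problem, since some fields are legitimately more complete on successful records, but it is worth checking when and how the field gets filled in.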

Using Feedback from the Scorecard

You may also find additional fields to exclude from your model by looking at your model scorecard after the first time you produce a prediction. You can look at the Predictors and Details tabs to see how fields have influenced the predictions and look for a few different indicators.

If you see a field with a correlation that is much higher than you would expect, especially if that one field has a far higher Impact than everything else, you may want to consider excluding it from your predictions.

The example scorecard above shows a field that should be removed. The combination of a Prediction Quality that is “Too High” (99) and a single Top Predictor whose Impact is much higher than the others (Value) is a good indicator that the field should be considered for removal.
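
If you copy the numbers from the scorecard by hand, the same rule of thumb can be written down as a small check. This is only a sketch in Python under stated assumptions: the function, the `quality_threshold` and `dominance_ratio` defaults, and the impact values are all made up for illustration and do not come from the scorecard or any Salesforce API.

```python
def suspicious_top_predictor(impacts, prediction_quality,
                             quality_threshold=95, dominance_ratio=3.0):
    """Return the top predictor if it looks like a hindsight field, else None.

    Both conditions from the text must hold: a very high prediction quality
    and one predictor whose impact dwarfs the runner-up. The thresholds are
    arbitrary illustrative choices.
    """
    if prediction_quality < quality_threshold or len(impacts) < 2:
        return None
    ranked = sorted(impacts.items(), key=lambda item: item[1], reverse=True)
    top_name, top_impact = ranked[0]
    second_impact = ranked[1][1]
    if top_impact <= 0:
        return None
    if second_impact <= 0 or top_impact >= dominance_ratio * second_impact:
        return top_name
    return None


# Made-up impact values, transcribed by hand for the sake of the example.
example_impacts = {"Value": 0.80, "Amount": 0.10, "LeadSource": 0.05}
print(suspicious_top_predictor(example_impacts, prediction_quality=99))
```

A result other than None is a prompt to ask whether that field is really available before the outcome is known, not proof that it must be removed.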

You can learn more about the Einstein Prediction Builder scorecard here.
