Einstein Prediction Builder: Complication with Dates
by Michael Weil, Senior Data Scientist at Salesforce
Customer data in Salesforce has a rich variety of field types ranging from string, double, picklist to date, datetime, and time. Einstein Prediction Builder takes into account these different types to prepare the data for modeling (for more information on modeling and some of the other terms used in this post, check out this blog).
In the case of date fields, it will leverage information about the day, day of the week, month, and year, the number of days between two dates, and more.
Here are some common pitfalls with date fields that users should be aware of:
1. Leakage with Date
Date fields are not immune to problems of leakage. A more in-depth post on data leakage is coming soon. The idea is that a machine learning model learns from information that is present in the data during training but that will be missing on new records at prediction time, because it only becomes available once the true label is revealed.
Consider an example of an Opportunity object where the goal is to predict whether the opportunity will be won or not.
In the example above, Customer B and Customer C do not have the CloseDate field filled in yet because they are still in an intermediate stage (Negotiation/Review, Prospecting). By default, the label IsWon is No for these two customers. One might think that the label should be empty during intermediate stages, but there may be records in which customers are “lost” early during sales development and admins do not update the Stage field.
If we train a model on this data and keep the CloseDate field, it will “remember” the association between a missing CloseDate and IsWon being No. On new data, the opportunities are not closed yet; therefore, the date field is missing. As a consequence, the model would predict No.
The problem boils down to dates that are only known once the label has been determined. Other fields with the same issue as CloseDate include DaysToDate and DaysInStageX.
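One way to spot this kind of leakage before training is to compare the label across records where the date is missing and where it is present. The following is a minimal sketch on hypothetical records (the customers, dates, and the helper `win_rate_by_missingness` are illustrative assumptions, not part of Einstein Prediction Builder):

```python
# Hypothetical opportunity records; field names mirror the example in the text.
opportunities = [
    {"Name": "Customer A", "CloseDate": "2019-11-02", "IsWon": True},
    {"Name": "Customer B", "CloseDate": None, "IsWon": False},
    {"Name": "Customer C", "CloseDate": None, "IsWon": False},
    {"Name": "Customer D", "CloseDate": "2019-12-15", "IsWon": False},
]


def win_rate_by_missingness(records, field, label="IsWon"):
    """Return the share of positive labels for records with and without `field`."""
    groups = {"missing": [], "present": []}
    for record in records:
        key = "missing" if record[field] is None else "present"
        groups[key].append(record[label])
    # True counts as 1 and False as 0, so sum/len is the positive rate.
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}


rates = win_rate_by_missingness(opportunities, "CloseDate")
print(rates)
```

If the positive rate is exactly zero (or exactly one) whenever the field is missing, the missingness itself is acting as a stand-in for the label, which is a strong hint to exclude the field.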
Other sources of leakage are automated processes applied to the dates. Let’s take the example of Lead data where we are missing information on some lost leads: leads lost during a month have by default the first day of the month as the open date. If a lost lead had been opened in December 2019, the open date would be by default 12/1/2019. In that case, the model will learn an association between the open day being the 1st and the record being lost. The model will then be heavily biased on new records that are opened on the first of the month, tending to predict them as lost.
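A quick sanity check for such default dates is to look at how often each outcome falls on the first of the month. This is only a sketch on made-up lead records; the outcome names and the helper `first_of_month_share` are assumptions for illustration:

```python
from collections import Counter
from datetime import date

# Hypothetical (open date, outcome) pairs.
leads = [
    (date(2019, 12, 1), "Lost"),
    (date(2019, 12, 1), "Lost"),
    (date(2019, 12, 17), "Converted"),
    (date(2019, 11, 1), "Lost"),
    (date(2019, 11, 9), "Lost"),
    (date(2019, 11, 23), "Converted"),
]


def first_of_month_share(records):
    """For each outcome, return the fraction of records opened on day 1."""
    totals, on_first = Counter(), Counter()
    for open_date, outcome in records:
        totals[outcome] += 1
        if open_date.day == 1:
            on_first[outcome] += 1
    return {outcome: on_first[outcome] / totals[outcome] for outcome in totals}


shares = first_of_month_share(leads)
print(shares)
```

A share for one outcome that is far above what you would expect by chance (roughly one day in thirty) suggests the date was filled in by a process rather than by a user.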
Admins should also be cautious when trying to predict a formula field. For example, if the label is a formula field that depends on a date field:
LABEL = If(DATE >= 08/03/2019) TRUE Else FALSE
The field DATE determines LABEL, therefore it should be hidden from the model.
The same goes when including a date that is a formula field using the label:
DATE = If(LABEL == TRUE) 12/31/2019 Else Null
This DATE will be useless in training since LABEL will be missing for new records to predict.
As you can see, dates are a significant source of leakage, and most of the time, it makes sense to exclude them when selecting which fields from your data to include in your model. For more guidance on which fields to include and exclude for your prediction, check out this blog.
2. Dates disguised as Strings
Admins can create many custom fields of different types. But sometimes the Salesforce field types are misused. For example, admins might be tempted to create a custom field of type string even though it contains dates.
Einstein can leverage interesting information from date fields such as the day of the week, the month of the year, etc. But in that case, as this CustomField is of type string, it can’t be inferred as a Date; therefore, we are losing this information. Be aware of this when choosing your field types!
The use of the string type usually comes from the fact that the dates are not all in the same format. In the example below, some dates are not in the MM/DD/YYYY format. Besides making Einstein Prediction Builder’s life easier, using a Date type will bring consistency to your data as an added benefit!
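If you need to clean up such a field before converting it to a Date type, a sketch along the following lines can help. The list of formats in `KNOWN_FORMATS` is an assumption and should be replaced with the formats that actually occur in your data; note that the order matters, since a value like 08/03/2019 is read here as MM/DD/YYYY:

```python
from datetime import datetime

# Assumed formats, tried in order; extend to whatever appears in your data.
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")


def parse_date(text):
    """Return a date for the first matching format, or None if nothing matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


raw_values = ["12/31/2019", "2019-08-03", "03.08.2019", "next week"]
parsed = [parse_date(value) for value in raw_values]

for raw, value in zip(raw_values, parsed):
    # Once parsed, date parts such as the weekday or month become available.
    print(raw, "->", value, value.strftime("%A") if value else "unparseable")
```

Values that come back as None are worth reviewing by hand rather than guessing at.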
3. The Case of System Fields
In addition to custom fields, Salesforce contains generic fields called System Fields. Those are fields that are updated during API operations such as record creation, record updates, etc. Some of these System Fields are dates: CreatedDate, LastModifiedDate, SystemModstamp. In general, when training the model, these fields are automatically filtered out as those dates are irrelevant for building a prediction. But there might still be a risk.
Let’s take the example of an admin trying to predict a Sales Cycle Length using this formula:
Sales_Cycle_Length__c = CloseDate__c - CreatedDate
This formula is probably not what the admin wanted, as the system field CreatedDate indicates when the API created the record, not necessarily when the user did. For instance, if the data has been uploaded once in bulk, the value of CreatedDate corresponds to the date of this bulk upload.
You should consider removing fields that are (or are related to) System Fields. Also, as a best practice, you should specify your own created date as a custom field: CreatedDate__c
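To make the difference concrete, here is a small sketch with a single hypothetical record that was opened in January but bulk-uploaded in June. The dates and the helper `cycle_length_days` are assumptions for illustration only:

```python
from datetime import date

# Hypothetical record: opened by the user in January, bulk-uploaded in June.
record = {
    "CreatedDate": date(2019, 6, 1),      # system field: when the API created the row
    "CreatedDate__c": date(2019, 1, 15),  # custom field: when the user opened it
    "CloseDate__c": date(2019, 3, 16),
}


def cycle_length_days(rec, start_field):
    """Days between the chosen start date and the close date."""
    return (rec["CloseDate__c"] - rec[start_field]).days


system_based = cycle_length_days(record, "CreatedDate")
custom_based = cycle_length_days(record, "CreatedDate__c")
print(system_based, custom_based)
```

With the system field the cycle length comes out negative, because the record was uploaded after the deal had already closed; the custom field gives the length the admin presumably intended.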
Another word of caution regarding system fields: fields are not reevaluated in real time.
For instance, let’s say you have a formula field with Now + X number of days; for example, you define your training set for a membership renewal scenario as: CreatedDate > Now + 90 days. “Now” is not updated automatically every day but only once a month, at the time of training. At that point it is substituted with the actual date, and the records that meet the training filter at that time are used for training.
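The effect of a frozen “Now” can be sketched as follows. This is not how Einstein Prediction Builder is implemented; it only illustrates, with hypothetical records and a hypothetical `training_filter` helper, that the same filter selects different records depending on the date substituted for “Now”:

```python
from datetime import date, timedelta


def training_filter(records, now, days=90):
    """Keep records whose date is later than `now + days`; `now` is fixed by the caller."""
    cutoff = now + timedelta(days=days)
    return [r for r in records if r["date"] > cutoff]


# Hypothetical records with the date field used in the filter.
records = [
    {"id": 1, "date": date(2020, 4, 15)},
    {"id": 2, "date": date(2020, 5, 20)},
]

trained_on = date(2020, 1, 1)  # "Now" as substituted at training time
today = date(2020, 1, 25)      # a later day, before the next training run

at_training = [r["id"] for r in training_filter(records, trained_on)]
if_reevaluated = [r["id"] for r in training_filter(records, today)]
print(at_training, if_reevaluated)
```

Between two training runs, the model keeps reflecting the first selection, even though a daily reevaluation would already have dropped record 1.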
4. Mixing historical data
For some use cases, a wide range of historical data might be available throughout the years, and it might be better to segment data accordingly to avoid some mix-up, especially if the business processes, what a specific field is used for, or the way data is collected have changed over time.
There is also the odd case where the same instance evolves over time. For example, if an admin wants to predict who is likely to become part of a frequent flyer program, it could be that some customers have fallen in and out of status over time, so there is a chance of encountering multiple instances of the same customer:
In this case, there are records of Customer A in both 2020 and 2018. In 2018, this customer was a frequent flyer; in 2020, she is not anymore. This indicates that the data has a time component, with records changing over time. The cadence is not necessarily yearly; the period can be months, days, or seconds.
In that sort of problem, it would be desirable to select the data accordingly. Potential ways to address this scenario include training on 2019 data in order to predict 2020, picking the most recent record for a given customer, or setting it up in such a way that a customer is considered a Frequent Flyer (“Yes Label”) if she/he has ever been a Frequent Flyer.
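Two of these options, keeping the most recent record per customer and labeling a customer “Yes” if she or he has ever been a Frequent Flyer, can be sketched as below. The records and the helper names are hypothetical:

```python
# Hypothetical yearly snapshots of the same customers.
records = [
    {"customer": "A", "year": 2018, "frequent_flyer": True},
    {"customer": "A", "year": 2020, "frequent_flyer": False},
    {"customer": "B", "year": 2019, "frequent_flyer": False},
    {"customer": "B", "year": 2020, "frequent_flyer": True},
]


def latest_per_customer(rows):
    """Keep only the most recent record for each customer."""
    latest = {}
    for row in rows:
        current = latest.get(row["customer"])
        if current is None or row["year"] > current["year"]:
            latest[row["customer"]] = row
    return latest


def ever_frequent_flyer(rows):
    """Label a customer True if any of their records is a frequent flyer."""
    ever = {}
    for row in rows:
        ever[row["customer"]] = ever.get(row["customer"], False) or row["frequent_flyer"]
    return ever


latest = latest_per_customer(records)
ever = ever_frequent_flyer(records)
print({c: r["frequent_flyer"] for c, r in latest.items()})
print(ever)
```

The two strategies give Customer A different labels, so the choice should follow from what the prediction is meant to answer.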
5. Time Series
As seen above, admins sometimes want to solve specific problems where dates and times play a huge part. When records are ordered by time, using models to predict future values is called time series forecasting. The data is indexed by a date field and is usually equally spaced in time (minutes, days, months, ...).
Examples of such predictions include predicting sales price, weather temperature, number of bookings, and case volume.
A time series is generally composed of a systematic pattern and some random noise. The pattern can in turn be decomposed into:
- Trend — a component that changes over time and does not repeat.
- Seasonality — a component that repeats periodically.
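As an illustration of this decomposition (not of anything Einstein Prediction Builder does), the sketch below builds a synthetic series from a linear trend plus a repeating pattern and recovers both with a centered moving average. The series, the period of 3, and the helper names are assumptions; the moving average as written only handles odd window sizes:

```python
def moving_average(values, window):
    """Centered moving average (odd window); edges without a full window are None."""
    half = window // 2
    out = [None] * len(values)
    for i in range(half, len(values) - half):
        out[i] = sum(values[i - half:i + half + 1]) / window
    return out


def seasonal_component(values, trend, period):
    """Average the detrended values at each position within the period."""
    buckets = [[] for _ in range(period)]
    for i, (value, level) in enumerate(zip(values, trend)):
        if level is not None:
            buckets[i % period].append(value - level)
    return [sum(b) / len(b) for b in buckets]


period = 3
pattern = [3, 0, -3]  # synthetic seasonal pattern, sums to zero over one period
series = [10 + 2 * t + pattern[t % period] for t in range(9)]

trend = moving_average(series, period)
seasonality = seasonal_component(series, trend, period)
print(trend)
print(seasonality)
```

Because the window spans exactly one period, the seasonal pattern averages out and the moving average follows the trend; subtracting it leaves the repeating component. Real data would also contain noise, so the recovered components would only be approximate.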
Time series forecasting has its own variety of techniques, such as (seasonal) ARIMA models or Deep Learning.
Einstein Prediction Builder does not currently support those methods.
If you think your prediction might be a time series problem, please consider another tool for forecasting, such as Einstein Analytics Time Series.