Tips and Tricks: Classifier models with an element of time

Ben Houghton
Published in Data & Waffles
Oct 4, 2019

A significant proportion of the questions that my team (and I’m sure many other teams) are asked to work on go along the following lines:

“Can we use x, y and z variables to predict whether this particular event is going to happen in the next k units of time?” Here, the event may be a customer taking out a product, a customer having a life event or a company going into decline, and the units of time could be minutes, weeks, months, etc.

Searching the internet seems to offer little advice on how to handle these challenges, or on what the potential pitfalls might be. In this post, I hope to start to plug that gap.

Before we start, we need a way of structuring the problem. There are (at least) two ways of doing this:

  1. Treat the event as something which may or may not happen at any point in time
  2. Model the event as something that is guaranteed to happen, and thus predict if such a thing happens in the next block of time (e.g. 1 month) as opposed to after that block

For the first case, we may use a standard classifier (decision trees, logistic regression etc.) structured carefully. In the second and more restrictive case, we can use techniques from survival analysis to help us put some extra structure onto the problem (see an upcoming post on survival analysis and machine learning).

The rest of this article features some tips and tricks for tackling problems with the first of these structures. I will not talk about model choice or its technicalities (at least not in this post), focusing instead on structuring the problem and the data, and leaving it open to the data scientist to choose an appropriate machine learning model.

Notation-wise, I will assume that my target variable (y) is structured so that y=1 means the event happens within k units of time from a starting point, and y=0 means it does not. This k units of time will be referred to as the target observation period, or TOP. I will refer to any (potentially unbounded) time period in which we observe independent variables as the IVOP (Independent Variable Observation Period).
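To make the notation concrete, here is a minimal sketch of building y, assuming a pandas event log with hypothetical column names and a 90-day TOP:

```python
import pandas as pd

# Hypothetical event log: one row per event occurrence per entity
events = pd.DataFrame({
    "entity_id": [1, 1, 2, 3],
    "event_time": pd.to_datetime(
        ["2019-01-05", "2019-03-02", "2019-02-20", "2019-06-01"]),
})

entities = pd.Index([1, 2, 3], name="entity_id")  # the population to model
t0 = pd.Timestamp("2019-01-01")                   # start of the TOP
k = pd.Timedelta(days=90)                         # length of the TOP

# y = 1 if the entity has at least one event inside (t0, t0 + k], else 0
in_top = events[(events["event_time"] > t0) & (events["event_time"] <= t0 + k)]
y = pd.Series(entities.isin(in_top["entity_id"]).astype(int), index=entities)
```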

1. Choose the observation period for your target variable carefully

It’s often the case that getting the target variable right is one of the hardest parts of a machine learning problem, and the element of time can make this all the more difficult. You need to determine not only what event you are looking for and how to detect it, but also the time period over which you want to look for the event in question to occur.

If I were to build a model to predict whether an individual customer will buy groceries in the next three months, I would, of course, find my model predicting a 1 for every customer (unless I had massively overfit). On the other hand, if I were to predict for the next 3 seconds, I would have a very sparse target set (i.e. mostly 0s), and probably end up building a model which predicts a constant 0 (and one which has no real-world application either).

You generally want to balance choosing a time period which makes sense for the question you are trying to answer against ensuring that you have a workable number of training examples in both classes. For instance, if I want to send a customer an offer for a grocery voucher valid that week, my model should naturally predict grocery shopping over a period of one week. Fortunately, modern techniques for dealing with class imbalance allow you to focus more on the problem statement than on ensuring an even distribution of your target variable.
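As one illustration (a sketch, not the only option), scikit-learn’s class_weight setting reweights the loss function rather than resampling the data; the dataset here is synthetic, standing in for a sparse target:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a sparse target: roughly 5% positives,
# as you might get from a short TOP
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# class_weight="balanced" up-weights the rare class in the loss,
# so a mostly-0 target need not force a constant-0 model
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict(X_test).mean())  # fraction of predicted positives
```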

2. Be wary of subtle variability in your independent variables

In the same way that your target variable needs to be built with a careful consideration of time, so do your independent variables. I will talk about three types of independent variable here: static, dynamic and period-dependent.

Static independent variables are those which do not change over time, for example the country of birth of a person or the manufacturer of a computer. These are very convenient variables to use: once they’re built, you do not need to recompute them for individual data points or worry about their time element.

In reality though, almost no variables are truly static, and some which seem like they should be can turn out to be red herrings. For example, the CEO of a company or the gender of a person should not be considered static variables. For an even more subtle example, the age of a customer at the first point in time they ever went to, say, a grocery store is not necessarily static: if a person has never been to a grocery store before the observation period, but visits one during it, the variable will change from NULL (or another default value) to a numeric figure.

The red herrings above are in fact dynamic variables, which need to be recomputed every time you score or retrain your model. These can pose a logistical nightmare, especially as they are the kinds of variables that organisations often maintain in irregularly updated fact tables, meaning a fully up-to-date version of the variable may not be available at the time of scoring or training.
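One way to guard against stale or future fact-table rows is a point-in-time (“as-of”) join, so that each entity only sees the latest fact known at its scoring date. Here is a sketch using pandas merge_asof, with hypothetical table and column names:

```python
import pandas as pd

# Hypothetical, irregularly updated fact table: one row per recorded change
ceo_facts = pd.DataFrame({
    "company_id": [1, 1, 2],
    "valid_from": pd.to_datetime(["2017-03-01", "2019-02-15", "2018-06-01"]),
    "ceo": ["A. Smith", "B. Jones", "C. Patel"],
}).sort_values("valid_from")

# The entities to score, each with a scoring (as-of) date
snapshot = pd.DataFrame({
    "company_id": [1, 2],
    "as_of": pd.to_datetime(["2019-01-01", "2019-01-01"]),
}).sort_values("as_of")

# merge_asof picks, per company, the latest fact row at or before as_of,
# so a later update (the 2019-02-15 CEO change) cannot leak in
features = pd.merge_asof(
    snapshot, ceo_facts,
    left_on="as_of", right_on="valid_from", by="company_id")
```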

We also need to talk about period-dependent variables, which deserve a section all to themselves.

3. Choose your IVOP carefully

Period-dependent variables are a special type of dynamic variable: aggregates (or, more generally, functions) of other data points over a particular timeframe, for example the number of times a customer entered a supermarket over six weeks. For these variables, you need to choose the observation period carefully, using some of the thought process from section one (i.e. balancing interpretability and sparsity of the variable).
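As a sketch of building such a variable, again with hypothetical column names: count supermarket visits over a six-week IVOP ending at a reference date.

```python
import pandas as pd

# Hypothetical visit log: one row per supermarket visit
visits = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "visit_time": pd.to_datetime(
        ["2019-05-01", "2019-05-20", "2019-03-01", "2019-05-25"]),
})

ref_date = pd.Timestamp("2019-06-01")  # end of the IVOP
ivop = pd.Timedelta(weeks=6)           # length of the IVOP

# Count visits falling inside the six-week window (ref_date - ivop, ref_date]
window = visits[(visits["visit_time"] > ref_date - ivop)
                & (visits["visit_time"] <= ref_date)]
n_visits_6w = window.groupby("customer_id").size().rename("n_visits_6w")
```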

Period-dependent variables pose one other challenge: what do you do if the entity (e.g. the customer) did not exist in your population over that full IVOP (they may be a new customer)? The way I deal with these generally depends on the situation:

  1. I may discard the entities from the training (and hence also any scoring) set if I can so that I don’t risk bias entering my model from imputation techniques.
  2. If this isn’t an option, I would extrapolate from the period for which the entity is observable (see the sketch below). However, if an entity is only available for a small fraction of the observation period, you may end up with some highly skewed values after extrapolation, which can be misleading for any machine learning model.
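Here is a minimal sketch of the second option, a simple linear extrapolation; the numbers are made up to show how a short observation window can produce exactly the kind of skew described above:

```python
import pandas as pd

# Hypothetical per-customer summary: observed visit count, plus the number
# of days the customer actually existed inside a 42-day (six-week) IVOP
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "n_visits_observed": [6, 1, 2],
    "days_observed": [42, 40, 3],  # customer 3 joined three days ago
})

ivop_days = 42

# Linear extrapolation up to the full IVOP length
df["n_visits_extrapolated"] = (
    df["n_visits_observed"] * ivop_days / df["days_observed"])

# Customer 3 extrapolates to 28 visits from only three days of evidence.
# One mitigation is to carry the observed fraction as an extra feature
df["obs_fraction"] = df["days_observed"] / ivop_days
```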

4. Be hyper-aware of target leakage

Target leakage (the situation where information from the future, or from the TOP, finds its way into your independent variables) needs very careful attention, especially when you are using a lot of dynamic variables. I have heard of data scientists obtaining absurdly high accuracy on these kinds of problems, only to find out that they had fallen victim to this. It may sound relatively simple to avoid, yet some instances of target leakage can be quite subtle.

The key thing to check here is that all of your dynamic variables are observed in a time period which doesn’t overlap in any way with the observation period of your target variable. One check for this is to try building your independent variable set on a dataset where no future information is available, and checking that your distributions roughly match.
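One way to mechanise that check, as a sketch with hypothetical column names: build the feature set twice, once on data truncated at the cutoff (the start of the TOP) and once on the full table, and assert that the results are identical.

```python
import pandas as pd

def build_features(events: pd.DataFrame, cutoff: pd.Timestamp) -> pd.Series:
    """Event counts per entity, using only rows strictly before the cutoff."""
    past = events[events["event_time"] < cutoff]
    return past.groupby("entity_id").size().rename("n_events")

events = pd.DataFrame({
    "entity_id": [1, 1, 1, 2],
    "event_time": pd.to_datetime(
        ["2019-01-10", "2019-01-25", "2019-02-05", "2019-01-20"]),
})
cutoff = pd.Timestamp("2019-02-01")  # the TOP starts here

# If the two results differ, the feature logic is pulling information
# from the TOP or beyond, i.e. target leakage
truncated = build_features(events[events["event_time"] < cutoff], cutoff)
full = build_features(events, cutoff)
assert truncated.equals(full), "feature set depends on post-cutoff data"
```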

I once worked with a table which, although it was supposed to refer only to activity for one particular day, also included some information about the first half hour of the next day due to lags in the update process. This meant we built a model which perfectly predicted events in the first half hour of the day, without realising that it didn’t predict the other events well at all.

5. Mix up your time periods in your training set

This is a nice way to address any seasonality without needing to model it explicitly. If you can get observation data for a full year (for example), as well as the TOP afterwards to make the target variable available, then it is a good idea to look at different time periods for different data points (at random, or stratified by month). You can then add another variable to indicate the observation time within the year, allowing you to eventually ‘machine-learn’ the seasonal effects.

There are a few things to be careful of here. The main one is to ensure that, for each entity, you choose a consistent observation period for all of the independent variables as well as the target variable; this can often be a bit of a coding nightmare. On top of this, you will need to ensure that each entity is included at most once, to keep your sample IID (i.e. each data point independent and identically distributed).
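A sketch of this sampling setup, with assumed window lengths (a six-week IVOP and a four-week TOP), drawing one random reference date per entity so the sample stays at one row per entity:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
entities = pd.DataFrame({"entity_id": range(1000)})

# One random reference date per entity, spread across the year,
# so every season is represented in the training set
year = pd.date_range("2019-01-01", "2019-12-31", freq="D")
entities["ref_date"] = rng.choice(year.to_numpy(), size=len(entities))

# The same ref_date anchors both windows for a given entity:
# IVOP = (ref_date - 6 weeks, ref_date], TOP = (ref_date, ref_date + 4 weeks]
entities["ivop_start"] = entities["ref_date"] - pd.Timedelta(weeks=6)
entities["top_end"] = entities["ref_date"] + pd.Timedelta(weeks=4)

# Month of observation as an extra feature, so the model can
# learn the seasonal effects
entities["obs_month"] = entities["ref_date"].dt.month
```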

6. Modularise your code early

With all of the above in mind, it becomes abundantly clear that keeping your code well-structured is very important for avoiding the potential pitfalls. This is especially the case when you want to experiment with different observation period lengths for your target or independent variables. I could easily write a whole article on this, so for now I will leave it to the reader; my biggest piece of (general) advice here is to keep your code well-structured from the start and remain disciplined as the code base grows.
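As a small illustration of the kind of structure I mean (a sketch, with hypothetical names): drive every window off the same parameters, so that changing a TOP or IVOP length is a one-line change.

```python
import pandas as pd

def make_dataset(events: pd.DataFrame, ref_dates: pd.Series,
                 ivop: pd.Timedelta, top: pd.Timedelta) -> pd.DataFrame:
    """One row per entity: an IVOP feature plus the TOP target, all driven
    by the same window parameters."""
    rows = []
    for entity_id, ref in ref_dates.items():
        ev = events.loc[events["entity_id"] == entity_id, "event_time"]
        rows.append({
            "entity_id": entity_id,
            # Period-dependent feature: events in (ref - ivop, ref]
            "n_events_ivop": int(((ev > ref - ivop) & (ev <= ref)).sum()),
            # Target: any event in (ref, ref + top]
            "y": int(((ev > ref) & (ev <= ref + top)).any()),
        })
    return pd.DataFrame(rows)

events = pd.DataFrame({
    "entity_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2019-01-10", "2019-03-05", "2019-02-20"]),
})
ref_dates = pd.Series(pd.to_datetime(["2019-02-01", "2019-02-01"]),
                      index=[1, 2])
data = make_dataset(events, ref_dates,
                    ivop=pd.Timedelta(weeks=6), top=pd.Timedelta(weeks=4))
```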

As you can see, these time-based problems are really non-trivial and add significant layers of complexity to your standard machine learning problem. I hope these tips give some guidance on the key considerations one must have when tackling these challenges.

Ben Houghton

Principal Data Scientist for Analytical Innovation at Quantexa