Cross-validation tools for time series

Developing machine learning models for time series data requires special care, mainly because the usual machine learning assumption that the samples are independent generally does not hold.

In the present post, I will review cross-validation techniques suitable for machine learning models dealing with (a certain type of) time series data, inspired by Marcos Lopez de Prado’s excellent Advances in Financial Machine Learning. Scikit-learn style classes implementing the cross-validation algorithms discussed below can be found on GitHub. For reasons that will become clear, they do not quite follow the standard scikit-learn template.

Time series data

In a time series dataset, each sample is at the very least tagged with a timestamp. This is for instance the case for a series of measurements of a quantity at various times. In this post, we will however be interested in time series datasets having a bit more structure.

Consider the following problem.

  1. At certain times, the machine learning model has to make predictions, resulting in certain actions.
  2. At later times, the response (computed from the results of the actions) is known.

Between the time when a prediction is made and the time when the corresponding response is obtained, more predictions might have to be made. This situation is rather common, as it occurs whenever machine learning algorithms are used to determine actions that do not have immediate effects. Here are some examples.

  • In a systematic trading strategy, we may have fixed some rules determining when to close open trading positions. Given these rules, we would like to use a machine learning algorithm to open positions in a way that maximizes profit. At a certain time, we need to decide whether to open a position or not. The return of the position (the response) will however not be known until the position is closed according to the closing rules, at a time that is not known in advance.
  • Suppose a Google Maps-like service tries to predict traffic conditions and the associated travel time on a trip where several routes are available. The machine learning model will send its users on the route it believes to be the fastest at prediction time, but the effective travel time of each user (the response) is not available until a later time, when they complete their trip.

Time series datasets associated with such problems have two timestamps attached to each sample: a prediction time, when the machine learning model has to make a prediction, and an evaluation time, when the response becomes available and the prediction error can be computed.
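As a minimal sketch of such a dataset, the two timestamps can be stored as two aligned pandas Series. The dates and the holding_days values below are made up for illustration, and the names pred_times and eval_times simply anticipate the ones used later in this post.

```python
import pandas as pd

# Hypothetical toy dataset: five predictions made one day apart,
# each with a response that becomes known 1 to 3 days later.
pred_times = pd.Series(pd.date_range("2024-01-01", periods=5, freq="D"))
holding_days = pd.Series([2, 1, 3, 1, 2])
eval_times = pred_times + pd.to_timedelta(holding_days, unit="D")

# Sample i overlaps sample i+1 when its response is still unknown
# at the moment the next prediction has to be made.
overlaps_next = eval_times.iloc[:-1].to_numpy() > pred_times.iloc[1:].to_numpy()
print(overlaps_next.tolist())
```

Overlaps of this kind are exactly what makes neighbouring samples dependent on each other.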

We will discuss cross-validation strategies adapted to this kind of problem.

Problems with standard cross-validation algorithms

Recall that cross-validation involves training the model only on a part of the dataset, the train set, while using another part of the dataset, the validation set, to evaluate its performance. The idea is that while it is easy to overfit the train set, the performance on the validation set will match the performance on new data, as long as the validation set is independent of the train set. Moreover, by using many different splits between training and validation set, one can compute performance statistics that hopefully provide a good idea of the performance on new data. Cross-validation is also often used to optimize the values of the hyperparameters of the machine learning model.

Perhaps the best-known cross-validation techniques are leave-p-out and K-fold cross-validation. Leave-p-out cross-validation randomly selects p samples as the validation set, and uses the rest of the samples as the train set. K-fold cross-validation randomly partitions the sample set into K equal-size subsets, and uses each of them in turn as the validation set and all the others as the train set.

Two problems arise when applying these classical cross-validation algorithms to time series data:

  • Time series data is time-ordered. In real-world applications, we use past observations to predict future observations. The randomization in the standard cross-validation algorithms does not preserve the time ordering, and we end up making predictions for some samples using a model trained on later samples. We will see that in itself, this is not such a big problem. The next problem is much more serious.
  • Time series data is often strongly correlated along the time axis (think about the Google Maps example: a traffic jam affects all the users on the same route at a given time). The randomization makes it likely that for each sample in the validation set, numerous strongly correlated samples exist in the train set. This defeats the very purpose of having a validation set: the model essentially “knows” about the validation set already, leading to inflated performance metrics on the validation set in case of overfitting.

Walk-forward cross-validation

Clearly, the main source of these problems is the randomization. However, if we simply drop the randomization and arrange for the validation set samples to be posterior to all the training set samples, then for a given validation set size, there is only a single possible train/validation split.

Walk-forward cross-validation solves this problem by restricting the full sample set differently for each split. We first split our dataset into k equal blocks of contiguous samples and decide that the train set will always consist of p contiguous blocks. The splits are then as follows:

  1. Train set: blocks 1 to p, validation set: block p+1
  2. Train set: blocks 2 to p+1, validation set: block p+2
  3. And so on, until the last split: train set: blocks k-p to k-1, validation set: block k
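The splits listed above can be sketched in a few lines of plain Python. The function name walk_forward_splits and its parameters are illustrative choices, and the samples are assumed to be sorted by prediction time already.

```python
def walk_forward_splits(n_samples, k, p):
    """Yield (train, validation) index lists for walk-forward cross-validation.

    The samples are assumed to be sorted by prediction time. They are cut
    into k contiguous blocks, and each train set spans p consecutive blocks.
    """
    # Block boundaries; the first n_samples % k blocks get one extra sample.
    base, extra = divmod(n_samples, k)
    bounds = [0]
    for b in range(k):
        bounds.append(bounds[-1] + base + (1 if b < extra else 0))

    # The validation block is always the one right after the train blocks.
    for first in range(k - p):
        train = list(range(bounds[first], bounds[first + p]))
        valid = list(range(bounds[first + p], bounds[first + p + 1]))
        yield train, valid


splits = list(walk_forward_splits(n_samples=10, k=5, p=2))
for train, valid in splits:
    print(train, valid)
```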

In this way, by “walking forward” in the full sample set, one can construct k-p splits. This is a big improvement over the classical cross-validation algorithms, but some problems remain.

  1. Near the split point, we may have training samples whose evaluation time is posterior to the prediction time of validation samples. Such overlapping samples are unlikely to be independent, leading to information leaking from the train set into the validation set.
  2. It is impossible to make k-p very large, leading to a large variance in the performance statistics.
  3. The most relevant part of the dataset, namely the most recent one, is also the one that is used least. This can be a problem for time series subject to regime changes over long periods of time, as is often the case for financial time series.

Purging

Problem 1 is rather easily solved through purging. Purging involves dropping from the train set any sample whose evaluation time is posterior to the earliest prediction time in the validation set. This ensures that predictions on the validation set are free of look-ahead bias.
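Here is a minimal sketch of purging in the walk-forward setting, where all train samples precede the validation set. The helper name purge_train is hypothetical, and times are represented as plain numbers to keep the example short.

```python
import numpy as np


def purge_train(train_idx, valid_idx, pred_times, eval_times):
    """Drop train samples whose evaluation time is later than the
    earliest prediction time in the validation set."""
    first_valid_pred = pred_times[valid_idx].min()
    keep = eval_times[train_idx] <= first_valid_pred
    return train_idx[keep]


# Toy data: one prediction per time step, responses known 1 to 3 steps later.
pred_times = np.arange(8)
eval_times = pred_times + np.array([1, 3, 1, 3, 2, 1, 1, 1])

train_idx = np.arange(0, 5)
valid_idx = np.arange(5, 8)
purged = purge_train(train_idx, valid_idx, pred_times, eval_times)
print(purged.tolist())
```

In this toy example, samples 3 and 4 are purged because their responses only become known at time 6, after the first validation prediction at time 5.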

Combinatorial cross-validation

Problems 2 and 3 can be solved, provided we abandon the requirement that all the samples in the train set precede the samples in the validation set. This is not as problematic as it may sound. The crucial point is to ensure that the samples in the validation set are reasonably independent from the samples in the train set. If this condition is satisfied, the validation set performance will still be a good proxy for the performance on new data.

Combinatorial K-fold cross-validation is similar to K-fold cross-validation, except that we take the validation set to consist of j < K blocks of samples. We then have K choose j possible different splits. This allows us to easily create a large number of splits, just by taking j = 2 or 3, addressing Problem 2. As all the blocks of samples are treated equally, this addresses Problem 3 as well.
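At the level of blocks, the enumeration of the splits can be sketched with the standard library alone. The function name combinatorial_splits is illustrative, and purging and embargoing are deliberately left out at this stage.

```python
from itertools import combinations
from math import comb


def combinatorial_splits(n_blocks, n_valid_blocks):
    """Yield (train_blocks, valid_blocks) for every choice of validation blocks."""
    for valid_blocks in combinations(range(n_blocks), n_valid_blocks):
        train_blocks = tuple(b for b in range(n_blocks) if b not in valid_blocks)
        yield train_blocks, valid_blocks


# K = 6 blocks with j = 2 validation blocks per split.
splits = list(combinatorial_splits(n_blocks=6, n_valid_blocks=2))
print(len(splits), comb(6, 2))
print(splits[0])
```

With K = 6, plain K-fold gives 6 splits, while j = 2 already gives 15 and j = 3 gives 20.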

It is however clear that we cannot use combinatorial K-fold cross-validation as it stands. We have to make sure that the samples in the train set and in the validation set are independent. We already saw that purging helps reduce their dependence. However, when there are train samples occurring after validation samples, this is not sufficient.

Embargoing

We obviously also need to prevent the overlap of train and validation samples at the right end(s) of the validation set. But simply dropping any train sample whose prediction time occurs before the latest evaluation time in the preceding block of validation samples may not be sufficient. There may be correlations between the samples over longer periods of time. Again in the Google Maps example, a traffic jam may last longer than the time it takes a user to complete the trip.

In order to deal with such long-range correlations, we can define an embargo period after each right end of the validation set. If the prediction time of a train sample falls into the embargo period, we simply drop the sample from the train set. The required embargo period has to be estimated from the problem and dataset at hand.
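Purging and embargoing around a single contiguous validation block can be sketched as follows. The function name purge_and_embargo and the embargo parameter are hypothetical, times are plain numbers, and the treatment of exact ties at the boundaries is one possible convention among several.

```python
import numpy as np


def purge_and_embargo(train_idx, valid_idx, pred_times, eval_times, embargo):
    """Keep only train samples that lie safely before or after the
    validation block. valid_idx is assumed to be one contiguous block."""
    valid_start = pred_times[valid_idx].min()  # earliest validation prediction
    valid_end = eval_times[valid_idx].max()    # latest validation evaluation

    # Left side (purging): the response must be known before validation starts.
    before = eval_times[train_idx] <= valid_start
    # Right side (purging + embargo): the prediction must be made after the
    # validation responses are known and the embargo period has elapsed.
    after = pred_times[train_idx] >= valid_end + embargo
    return train_idx[before | after]


pred_times = np.arange(12)
eval_times = pred_times + 2
valid_idx = np.arange(4, 8)
train_idx = np.concatenate([np.arange(0, 4), np.arange(8, 12)])

no_embargo = purge_and_embargo(train_idx, valid_idx, pred_times, eval_times, embargo=0)
with_embargo = purge_and_embargo(train_idx, valid_idx, pred_times, eval_times, embargo=1)
print(no_embargo.tolist(), with_embargo.tolist())
```

When the validation set consists of several disjoint blocks, as in combinatorial cross-validation, the same filter would have to be applied once per block.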

Another nice feature of combinatorial cross-validation is that, since each block of samples appears the same number of times in a validation set, we can group the validation predictions (arbitrarily) into sets of predictions covering the full dataset (keeping in mind that these predictions have been made by models trained on different train sets). This is very useful to extract performance statistics over the whole dataset.
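To make the counting concrete: with K blocks and j validation blocks per split, a given block is in the validation set of (K-1) choose (j-1) splits, since the other j-1 validation blocks are picked among the remaining K-1. The sketch below checks this for K = 6 and j = 2 and builds one arbitrary grouping into full-dataset paths. The variable names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import comb

n_blocks, n_valid_blocks = 6, 2  # K and j in the text
n_paths = comb(n_blocks - 1, n_valid_blocks - 1)

# paths[t] maps each block to the split whose model predicts it in path t.
# The t-th appearance of a block in a validation set goes to path t,
# which is only one of many possible groupings.
paths = [dict() for _ in range(n_paths)]
appearances = Counter()
for split_id, valid_blocks in enumerate(combinations(range(n_blocks), n_valid_blocks)):
    for block in valid_blocks:
        paths[appearances[block]][block] = split_id
        appearances[block] += 1

print(n_paths, dict(appearances))
```

Each of the resulting paths covers every block exactly once, so performance statistics can be computed on each path over the whole dataset.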

Notes on the implementation of the cross-validation algorithms

Plain walk-forward cross-validation is available in scikit-learn. An implementation of purged walk-forward cross-validation is also available in Marcos Lopez de Prado’s book.
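For reference, the scikit-learn class that comes closest to plain walk-forward cross-validation is, to my knowledge, TimeSeriesSplit. By default its train set grows with each split; setting max_train_size keeps the window at a fixed size. The sample counts below are arbitrary.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)

# Three splits, each trained on at most the 3 samples preceding the validation set.
cv = TimeSeriesSplit(n_splits=3, max_train_size=3)
splits = [(train.tolist(), valid.tolist()) for train, valid in cv.split(X)]
for train, valid in splits:
    print(train, valid)
```

Note that TimeSeriesSplit only sees the sample order. It has no notion of evaluation times, so it cannot purge overlapping samples.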

I implemented purged walk-forward cross-validation and purged embargoed combinatorial cross-validation. The code can be found on GitHub. The docstrings of the various classes and functions should be reasonably self-explanatory. The API is as similar to the scikit-learn API as possible. The main differences are:

  • The split method takes as arguments not only the predictor values X, but also the prediction times pred_times and the evaluation times eval_times of each sample.
  • To stay as close to the scikit-learn API as possible, this data is passed as separate parameters. But in order to ensure that they are properly aligned, X, pred_times and eval_times are required to be pandas DataFrames/Series sharing the same index.

As with the scikit-learn splitters, split is a generator that yields pairs of numpy arrays containing the positional indices of the samples in the train and validation sets, respectively.
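To illustrate the calling convention only, here is a self-contained toy splitter with the same kind of split signature. The class PurgedWalkForwardSketch and its constructor parameters are hypothetical and are not the classes from the repository. It assumes the samples are sorted by prediction time.

```python
import numpy as np
import pandas as pd


class PurgedWalkForwardSketch:
    """Hypothetical toy splitter, not the class from the repository."""

    def __init__(self, n_splits=5, n_train_blocks=2):
        self.n_splits = n_splits              # number of contiguous blocks (k)
        self.n_train_blocks = n_train_blocks  # blocks per train set (p)

    def split(self, X, pred_times, eval_times):
        # The three inputs must be aligned on the same pandas index.
        if not (X.index.equals(pred_times.index) and X.index.equals(eval_times.index)):
            raise ValueError("X, pred_times and eval_times must share the same index")
        blocks = np.array_split(np.arange(len(X)), self.n_splits)
        pred = pred_times.to_numpy()
        evl = eval_times.to_numpy()
        for first in range(self.n_splits - self.n_train_blocks):
            train = np.concatenate(blocks[first:first + self.n_train_blocks])
            valid = blocks[first + self.n_train_blocks]
            # Purge train samples still unresolved when validation starts.
            train = train[evl[train] <= pred[valid].min()]
            yield train, valid


index = pd.date_range("2024-01-01", periods=10, freq="D")
X = pd.DataFrame({"feature": np.arange(10.0)}, index=index)
pred_times = pd.Series(index, index=index)
eval_times = pd.Series(index + pd.Timedelta(days=2), index=index)

cv = PurgedWalkForwardSketch(n_splits=5, n_train_blocks=2)
splits = [(tr.tolist(), va.tolist()) for tr, va in cv.split(X, pred_times, eval_times)]
for train, valid in splits:
    print(train, valid)
```

The yielded indices are positional, so they can be used with X.iloc regardless of what the shared index looks like.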