An overview of time-aware cross-validation techniques

Antoine Hue (ELCA) · Published in ELCA IT · 8 min read · Dec 2, 2021

Author: Matthias Ramirez

To successfully apply machine learning-based solutions, we need reliable ways to estimate the performance of ML models. Assessing the generalisation power of such statistical models is crucial during the development phase in many ways:

  • Finding the best approach to a prediction problem by performing model selection
  • Optimising the model’s performance with hyperparameter tuning
  • Assessing model robustness
  • Performing a final performance estimation by gauging how well the selected model performs on unseen data

Obviously, measuring the predictive power of a model based solely on its performance on the training set is not sufficient because of potential overfitting. The available data is usually split into three independent sets: one for training, one for validating the model and fine-tuning its hyperparameters, and finally one for testing the model once it is fully trained. This way, we can get an idea of how our model behaves when facing new, never-seen-before data.

Photo by Brooke Campbell on Unsplash

Such a simple way of splitting reaches its limits when the number of data points is small: a small validation set might not be representative enough and will only yield an inaccurate estimate of the model’s performance. As the saying goes, more data is usually better, so we would like to use as much of our data as possible. This is where cross-validation comes into play.

As significant as this method is for most machine learning applications, cross-validation needs to be adapted when using temporal data.

The basics of cross-validation

Principle

The purpose of cross-validation is to get an unbiased estimation of the machine learning model’s prediction error as well as other performance metrics. It involves iterating over several steps, each of which consists in 1) splitting the dataset into complementary subsets, 2) training the model on one of them (the training set), and 3) testing the model on the remaining set (the validation set). The different metrics are usually averaged over the cross-validation rounds.

A third subset called the test set is often kept aside in order to give a final idea of the predictive performance once cross-validation is over and the hyperparameters (or the model itself) are selected.

Cross-validation can be either exhaustive (the dataset is split in all possible ways) or non-exhaustive. For the sake of simplicity, we will focus on a non-exhaustive method: k-fold cross-validation.

The example of k-fold cross-validation

In the case of k-fold cross-validation, the train-validation split occurs k times. The original dataset is split into k subsets of equal size, which are sequentially used as the validation set while the remaining k-1 are used for training.

Example of k-fold cross-validation with k=3 folds
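
To make the procedure concrete, here is a minimal sketch of plain k-fold cross-validation built on scikit-learn’s KFold (the Ridge estimator, the toy data and the mean-squared-error metric are arbitrary choices for illustration):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy data; any estimator and metric could be substituted here
X = np.random.rand(30, 4)
y = np.random.rand(30)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = []
for train_index, val_index in kf.split(X):
    # Train on k-1 folds, validate on the remaining one
    model = Ridge().fit(X[train_index], y[train_index])
    scores.append(mean_squared_error(y[val_index], model.predict(X[val_index])))

# The metric is averaged over the k cross-validation rounds
print("Mean validation MSE:", np.mean(scores))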

Maintaining the representativeness of the whole dataset in each fold is crucial. If the dataset contains several classes, stratified cross-validation ensures the class distribution is preserved in each subset. Compared to regular cross-validation, stratification reduces both bias and variance.
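
For classification problems, here is a quick sketch of what stratification looks like with scikit-learn’s StratifiedKFold (the imbalanced toy labels below are made up for illustration):

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(12, 2)
y = np.array([0] * 8 + [1] * 4)  # imbalanced toy labels (two thirds vs one third)

skf = StratifiedKFold(n_splits=3)
for train_index, val_index in skf.split(X, y):
    # Each validation fold keeps roughly the same class proportions as the full dataset
    print("Validation labels:", y[val_index])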

Time-aware cross-validation

Why things are different with time series

What distinguishes temporal data from other kinds of data is — obviously — time. More precisely, it is the fact that time flows in one and only one direction, which means temporal data is oriented. When the task is time series prediction, we have no other choice but to learn from the past to be able to predict the future. We need to preserve this temporal dependency in our cross-validation.

Moreover, because of this strong dependency of the data along the temporal axis, selecting one specific portion of it as our test set can be problematic: that portion might not be representative of what could take place in the future.

Here are the two changes one should apply to “vanilla” cross-validation:

  1. Respect the chronological order of the data: the training set should occur before the validation set, which should occur before the test set
  2. Do not choose the test set arbitrarily: the test set should move over time
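
As a minimal sketch of the first point, a purely chronological split (the 70/15/15 proportions are made up, and the data is assumed to be already ordered by time) could look like this:

import numpy as np

# Toy time series, assumed already ordered by time
series = np.arange(100)

# Chronological split: training comes before validation, which comes before test
n_train, n_val = 70, 15  # made-up 70/15/15 proportions
train = series[:n_train]
val = series[n_train:n_train + n_val]
test = series[n_train + n_val:]

print(len(train), len(val), len(test))  # 70 15 15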

Nested cross-validation

Nested cross-validation consists of two loops:

  • The outer loop represents going forward in time: the dataset that is split into the training and test sets grows bigger at each step. At each step, the inner loop runs on the training set, and once the best candidate model has been selected and trained, its “true” performance can be estimated on the test set.
  • The inner loop consists in splitting the training set into the training and validation subsets and performing hyperparameter optimisation. Time does not go forward in this loop.

The prediction error is then estimated as the average of prediction errors obtained in the outer loop.

N.B.: One should keep the temporal order between all sets (training → validation → test)!
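
To make the two loops concrete, here is a rough sketch of time-aware nested cross-validation built on scikit-learn’s TimeSeriesSplit (the Ridge estimator, the hyperparameter grid and the toy data are arbitrary choices for illustration):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy data, assumed already ordered by time
X = np.random.rand(60, 3)
y = np.random.rand(60)

outer_cv = TimeSeriesSplit(n_splits=4)
outer_errors = []

for train_index, test_index in outer_cv.split(X):  # outer loop: time moves forward
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    # Inner loop: hyperparameter optimisation on the training set only,
    # keeping the training -> validation order within each split
    best_alpha, best_err = None, np.inf
    inner_cv = TimeSeriesSplit(n_splits=3)
    for alpha in [0.1, 1.0, 10.0]:  # made-up hyperparameter grid
        errs = []
        for tr_idx, val_idx in inner_cv.split(X_train):
            model = Ridge(alpha=alpha).fit(X_train[tr_idx], y_train[tr_idx])
            errs.append(mean_squared_error(y_train[val_idx], model.predict(X_train[val_idx])))
        if np.mean(errs) < best_err:
            best_alpha, best_err = alpha, np.mean(errs)

    # Retrain on the whole training window with the selected hyperparameter,
    # then estimate the "true" error on the held-out test window
    final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
    outer_errors.append(mean_squared_error(y_test, final_model.predict(X_test)))

print("Estimated prediction error:", np.mean(outer_errors))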

A first attempt at performing time-aware nested cross-validation: not quite right!

The example above looks correct at first sight because the temporal order of the data is respected. However, one can notice that the test and validation sets overlap from one iteration of the outer loop to the next, which implies that they are not fully independent: information leaks during model selection. Because most of the test set is also used for validation, the test error will be biased.

This can be solved by removing the overlap: rather than using sliding windows, we should opt for tumbling windows as they induce a separation between the sets across each fold.

Another issue arises when not all folds have the same size, but this can be tackled by using a weighted average instead of the simple mean for the performance metrics.
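
A quick sketch of such a weighted average (the per-fold errors and test-window sizes below are made-up values):

import numpy as np

# Per-fold errors and the number of samples in each test window (toy values)
fold_errors = np.array([0.12, 0.09, 0.15])
fold_sizes = np.array([20, 35, 25])

# Weight each fold's error by its share of the total number of test samples
weighted_error = np.average(fold_errors, weights=fold_sizes)
print(weighted_error)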

A correct way of performing time-aware cross-validation

Limitations of time-aware cross-validation

Unfortunately, time-aware cross-validation has its own flaws that can be serious impediments in certain cases.

Compared to the same volume of data in time-agnostic settings (i.e. classical k-fold), the tumbling windows of time-aware cross-validation imply working with smaller splits, which can massively impact the measured performance depending on the time series’ characteristics (seasonal patterns, autocorrelation, etc.). Indeed, smaller splits mean only short-term dependencies are considered. Things would completely break down if the patterns with the most predictive power lay outside the frame of the considered time windows.

Tips and tricks

How to split datasets correctly in Python

Splitting temporal data has to be done with care because mistakes are as easy to make as they are hard to detect, especially when the data is transformed using sliding windows. One important misstep to avoid is future data leakage, i.e. letting future data leak into the training set.

Rather than manipulating indices directly, which can be tricky, one can use scikit-learn’s TimeSeriesSplit class, which handles this by itself.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

ts_cv = TimeSeriesSplit(n_splits=5)

# The actual values of X and y do not matter in this example
X = np.random.rand(20)
y = np.random.rand(20)

for train_index, test_index in ts_cv.split(X):
    print("Train:", train_index, "Test:", test_index)
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

>>> Train: [0 1 2 3 4] Test: [5 6 7]
>>> Train: [0 1 2 3 4 5 6 7] Test: [8 9 10]
>>> Train: [0 1 2 3 4 5 6 7 8 9 10] Test: [11 12 13]
>>> Train: [0 1 2 3 4 5 6 7 8 9 10 11 12 13] Test: [14 15 16]
>>> Train: [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16] Test: [17 18 19]

By default, the training window keeps growing over time (expanding windows), but a few tweaks can be made to get tumbling training sets:

""" Following the same pattern as sklearn.model_selection.TimeSeriesSplit """class TumblingWindowsSplit(): 
def __init__(self, n_splits):
self.n_splits = n_splits
def get_n_splits(self, X, y, groups):
return self.n_splits
def split(self, X, y=None, groups=None):
n_samples = len(X)
fold_size = int(n_samples/self.n_splits)
indices = np.arange(n_samples)
# Perform a 80-20% split
ratio = 0.8
for i in range(self.n_splits):
start_idx = i * fold_size
stop_idx = start_idx + fold_size
split_idx = int(ratio * (stop_idx - start_idx)) + start_idx
# Yield train_indices, test_indices
yield indices[start_idx:split_idx], indices[split_idx:stop_idx]
tw_cv = TumblingWindowsSplit(n_splits=5)
# The actual values of X and y do not matter in this example
X = np.random.rand(40)
y = np.random.rand(40)
for train_index, test_index in tw_cv.split(X):
print("Train:", train_index, "Test:", test_index)
X_train, y_train = X[train_index], y[train_index]
X_test, y_test = X[test_index], y[test_index]
>>> Train: [0 1 2 3 4 5] Test: [6 7]
>>> Train: [8 9 10 11 12 13] Test: [14 15]
>>> Train: [16 17 18 19 20 21] Test: [22 23]
>>> Train: [24 25 26 27 28 29] Test: [30 31]
>>> Train: [32 33 34 35 36 37] Test: [38 39]

A simple trick to spot future data leakage

Data leakage is a risk that has to be avoided when training models, as it might produce overly optimistic, biased results, whereby models would be “cheating” and would therefore become useless when deployed in production. This is particularly true when dealing with temporal data: future data does not exist yet!

Although data leakage may take many different forms, sometimes in non-obvious ways for which reviewing the code is not enough, we can use the temporal structure of our data to our advantage and spot future data leakage easily thanks to the “zero test” trick.

This test, as described in Datapred’s blog article “Advanced cross-validation tips for time series”, consists in running the model training/testing step twice: once with the normal target, and once with a modified target whose values are set to 0 after a certain point in time.

zero_test_target = target.copy()
zero_test_target[Zero_date:] = 0

In the context of time series forecasting, the model is trained to predict the future values of the time series with a given horizon value, i.e. its output at time t corresponds to the predicted value of the signal at time t + horizon.

If the training process is correct and free of data leakage, the model’s predictions on the modified target will not start to differ (i.e. drop towards zero) before Zero_date + horizon, simply because the model sees nothing different in its input before reaching Zero_date. If the predictions start diverging before Zero_date + horizon, it means the model has somehow been made aware that the data is going to drop to zero: this is future leakage!
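
A rough sketch of how the zero test could be automated (the train_and_predict helper is a placeholder for whatever the actual pipeline provides; it is assumed to return one forecast per time step, aligned with the time of the predicted value):

import numpy as np

def zero_test(target, train_and_predict, zero_date, horizon):
    """Compare predictions on the original target and on a copy zeroed after zero_date."""
    zero_test_target = target.copy()
    zero_test_target[zero_date:] = 0

    preds_original = train_and_predict(target)
    preds_zeroed = train_and_predict(zero_test_target)

    # Without leakage, predictions can only start to differ from zero_date + horizon onward
    diverged = ~np.isclose(preds_original, preds_zeroed)
    if diverged.any() and np.argmax(diverged) < zero_date + horizon:
        print(f"Divergence at t={np.argmax(diverged)}: possible future data leakage!")
    else:
        print("No divergence before zero_date + horizon: no sign of future leakage.")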

The figure below perfectly illustrates this phenomenon: the red curve shows the model’s prediction for the actual, untouched target. The blue curve shows the modified target, for which values are all set to zero starting from date Zero_date (here t = 80). Once the model is re-trained using this new target, we notice its predictions (green curve) start diverging at time t=Zero_date to finally drop to zero once the horizon is reached. This means the model knew the future values of the target.

Illustration of the zero test to detect future data leakage; source: https://www.datapred.com/blog/advanced-cross-validation-tips
