Time Series: The problem with resampling

Yann Hallouard
TotalEnergies Digital Factory
Nov 24, 2020

Data Science is clearly taking a growing place in our daily lives, and new use cases appear every day. If I had to define it, I would say that Data Science sits at the crossroads of Mathematics, Physics and Computer Science: it aims at revealing patterns in data and turning them into new insights that help people decide.

As I said above, Data Science is a field that moves daily, so what better transition to move on to the topic of Time Series Analysis?

This article aims to make people aware of the data leakage that time series resampling can introduce. Here you can find a great introduction to data leakage by Rafael Pierre.

After reading this article you’ll have:

  • An approach to resampling a time series
  • A warning about the risk of leakage when using some interpolation methods

Finding patterns in time series is easier when they have constant time steps. Constant time steps are required by many analytical methods, such as ARIMA processes or recurrent neural networks, and they also help decision trees or classical neural networks find patterns in the data.

To illustrate this article, I created the following time series.
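The series itself isn’t reproduced here, but a comparable irregularly sampled series could be generated as follows (a minimal sketch: the gap range, the trend and the noise level are my own assumptions, and X matches the variable name used later in the article):

import numpy as np
import pandas as pd

# Irregular acquisition times: random gaps between a few hours and a few days
rng = np.random.default_rng(0)
gaps = pd.to_timedelta(rng.uniform(0.1, 5.0, size=200), unit="D")
index = pd.Timestamp("2020-01-01") + gaps.cumsum()

# A noisy trend used as the measured signal
values = np.linspace(0, 10, len(index)) + rng.normal(0, 0.5, size=len(index))
X = pd.Series(values, index=index, name="value")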

For most use cases, the data provided isn’t clean, even more so when the granularity decreases. Data points are often acquired by manual input, when a variation is detected or when an event occurs. Sometimes people decide to change the acquisition period, which leads to different time steps in the series. These are problems you generally have to deal with to get a clean time series ready to be processed as a stream.

Let’s look at the distribution of the time delta between time steps.

Time delta between time steps, in number of days
Distribution of the time delta between time steps, in number of days
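These plots can be reproduced directly from the index; a small sketch, assuming the series is stored in X and matplotlib is available:

import matplotlib.pyplot as plt

# Time delta between consecutive timestamps, expressed in days
deltas = X.index.to_series().diff().dt.total_seconds() / (24 * 3600)

print(deltas.describe())   # summary of the gaps
deltas.hist(bins=30)       # distribution of the time deltas
plt.show()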

It appears that the time steps are not constant and that values aren’t acquired at fixed timestamps, so we need to resample the data. Let’s take a look at the different resampling methods.

The work that we have to do here is to align the time series to get fixed timestamps and resample the time series to get regular time steps.

Fixed Timestamps: Align the Time Series

Regular Time Steps: Resample the Time Series

There are different resampling strategies:

- Under sampling: take a bigger granularity

- Over sampling: take a smaller granularity (and so create new values)

These two methods will handle the alignment problem. By resampling you create new timestamps with a fixed periodicity (such as every minute, 30 minutes, hour, day…). But this raises the following questions: which value do we assign to each new timestamp? The closest? The last recorded value? The next one? If several values were recorded between two new timestamps, should we take the last one, or the mean? There is no single right answer to those questions, but there are some important aspects to keep in mind. This article is a warning about leakage in time series, and it’s very easy to create leakage while resampling…

Align the Time Series

If several recorded values are between two new time stamps, we have to choose the way to aggregate them:

  • Mean
  • Sum
  • Last value
  • Next value
  • Min value
  • Max value
  • ..

In a streaming case it is recommended to use the forward fill method, which assigns the last aggregated value.

The default aggregation method with forward fill is to use the last value:

Fake_to_resample_resampled = Fake_to_resample.resample('D').ffill(limit=1)

But depending on the use case, other aggregation methods should be used:

Fake_to_resample_resampled = Fake_to_resample.resample('D').mean().ffill(limit=1)
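The same pattern applies to the other aggregation functions listed above, for example:

# Other aggregations follow the same pattern
Fake_to_resample.resample('D').min().ffill(limit=1)   # min value per day
Fake_to_resample.resample('D').max().ffill(limit=1)   # max value per day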

Choosing a Sampling frequency

The second step in resampling a time series is to choose how to assign a value to each new timestamp. The nearest method, which consists of assigning the closest measured value to a newly created timestamp, and the backward fill method can bring values from the future into the past. Using values from the future is not only an issue that makes our model “cheat”, it is also a production problem: we can’t feed our model values from an event that hasn’t occurred yet.
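A tiny example makes the problem concrete (the timestamps and values are made up for illustration):

import pandas as pd

# Two real measurements, seven minutes apart
s = pd.Series([1.0, 5.0],
              index=pd.to_datetime(["2020-01-01 10:00", "2020-01-01 10:07"]))

# Resampling to 5-minute steps with the nearest method: the new 10:05
# timestamp receives the value measured at 10:07, a value that did not
# exist yet at 10:05.
print(s.resample('5T').nearest())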

To illustrate the impacts of over and under sampling, I’ll use this cosine sampled at two different frequencies.
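The cosine isn’t reproduced in the article either; here is a sketch of how a comparable series could be built, assuming (as described below) a 1-minute acquisition frequency at first and one point every 8 hours over the last three days:

import numpy as np
import pandas as pd

# One sample per minute at first, then one sample every 8 hours for 3 days
fine_index = pd.date_range("2020-01-01", "2020-01-07", freq="T")
coarse_index = pd.date_range("2020-01-07 08:00", "2020-01-10", freq="8H")
index = fine_index.append(coarse_index)

# A cosine with a one-day period as the underlying signal
t = (index - index[0]).total_seconds()
Fake_to_resample = pd.Series(np.cos(2 * np.pi * t / 86400), index=index)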

Under sampling

If the frequency used on the last three days of this series is defined as the new recording frequency, it could be interesting to resample the whole series at this new acquisition frequency. Resampling this way preserves the full, longer history.

Fake_to_resample_under = Fake_to_resample.resample('6H').ffill()

The drawback of this strategy is that we lose a lot of information.

Over sampling

If the frequency used over the last three days was just a mistake or a sensor malfunction, then, since we have two different acquisition frequencies, it could be interesting to choose the first one.

When a time series is over sampled, a lot of timestamps are created with no acquired value close to them. Here we need to choose an interpolation strategy to fill those NaNs:

Linear (fill NaN values with a straight line between two measured values)

Nearest (fill NaN values with the closest measured value in time)

Forward Fill (fill NaN values with the last measured value)

Backward Fill (fill NaN values with the next measured value, which by definition is a leak; see the sketch below)

Backward and forward fill process
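For completeness, here is what a backward fill looks like in pandas; it is shown only to make the leak visible, not as a recommendation:

# Backward fill copies the *next* observation into earlier timestamps:
# every filled point uses a value that did not exist yet at that time.
Fake_to_resample_bfill = Fake_to_resample.resample('T').bfill()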

Linear interpolation

Fake_to_resample_linear = Fake_to_resample.resample('T').ffill(limit=1).interpolate(method='linear')
Time series resampled with a one minute frequency with linear interpolation method

Nearest interpolation

Fake_to_resample_nearest = Fake_to_resample.resample('T').ffill(limit=1).interpolate(method='nearest')
Time series resampled with a one minute frequency with nearest interpolation method

Forward Fill interpolation

Fake_to_resample_ffill = Fake_to_resample.resample('T').ffill()
Time series resampled with a one minute frequency with forward fill interpolation method

Comparison

Impact of interpolation methods

It is clear that the linear and nearest interpolations bring values from the future into the past and therefore create leakage. Forward fill appears to be the best interpolation method because NaN values are filled with the last recorded value: no value is used before it has actually been recorded. The nearest, linear and backward fill interpolation methods can be very dangerous; only use them if you are sure they do not introduce any leakage in your use case.
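One way to sanity check a resampling pipeline is to verify that every resampled value could have been reproduced from observations recorded at or before its own timestamp. The helper below is my own sketch, not something from the article:

import pandas as pd

def uses_future_information(original: pd.Series, resampled: pd.Series) -> bool:
    """True if some resampled value cannot be found among the observations
    recorded at or before its own timestamp (future value or interpolation)."""
    for ts, value in resampled.dropna().items():
        past = original.loc[:ts]          # everything known at time ts
        if not (past == value).any():     # value was never observed so far
            return True
    return False

# Forward fill only copies past values, so it should pass the check;
# nearest, linear and backward fill should fail it.
print(uses_future_information(Fake_to_resample, Fake_to_resample_ffill))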

Between Over & Under Sampling

If the time delta between time steps is not constant and will keep changing in the future, it can be interesting to find a compromise between over and under sampling.

Fake_to_resample_min = Fake_to_resample.resample('1H').ffill()

The smallest acquisition time delta was 1 minute and the biggest 8 hours; resampling with a 1-hour period gives the following result. Since there is no statistically perfect choice, the best resampling frequency is always something to discuss with subject matter experts.

We do not lose much insight in the first part, and we create eight times fewer fake data points in the second part.

Application

Let’s look at the result of this resampling on the first created series.

X_resampled = X.resample('D').ffill()
Series resampled with a daily frequency and forward fill interpolation method

Using mean as aggregation function instead of last:

X_resampled = X.resample('0.5D').mean().ffill()
Series resampled with a half daily frequency and forward fill interpolation method and mean aggregation function

As a conclusion, resampling a time series can be dangerous, especially when dealing with streaming data. Using interpolation or aggregation functions that bring values from the future is not forbidden, but we must be cautious not to introduce leakage.
