Time Series Split with Scikit-learn

Keita Miyaki
Keita Starts Data Science
3 min read · Aug 15, 2019

In time series machine learning analysis, our observations are not independent, so we cannot split the data randomly as we do in non-time-series analysis. Instead, we usually split observations along the time axis.


We split data into a training set and a test set in everyday machine learning analyses, and oftentimes we use scikit-learn’s random splitting function.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In time series analysis, however, we cannot use this simple command, since observations in time series datasets are not independent. The characteristics of time series data — an autoregressive nature, trend, seasonality, or cyclicality — make a random split invalid. As a simple example, if your observations are autocorrelated, having an observation at time t in the training set and another at time t+1 in the test set causes trouble: a model that has seen the former naturally predicts the latter well, so the test score is too optimistic about the predictive power of the model.
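To see the problem concretely, here is a minimal sketch (on a synthetic trending series, not the author’s data): a random split lets the model interpolate between neighbouring time steps, while a chronological split forces it to extrapolate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
t = np.arange(n)
y = 0.05 * t + rng.normal(0, 1, size=n)  # upward-trending series
X = t.reshape(-1, 1)

# Random split: test points are interleaved with training points
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
random_score = RandomForestRegressor(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Chronological split: the test set is the final 25% of the series
cut = int(n * 0.75)
chrono_score = (RandomForestRegressor(random_state=0)
                .fit(X[:cut], y[:cut])
                .score(X[cut:], y[cut:]))

print(random_score)   # optimistic: the model only has to interpolate
print(chrono_score)   # realistic here: the model must extrapolate
```

With a trend the gap is dramatic: the forest cannot predict values beyond anything it has seen, so the chronological score collapses while the random-split score stays high.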

Instead, we carve the test set off the end of the dataset. If you have observations spanning 10 years, for example, you may use the first 7 years for training and the last 3 years for testing the model. The code is simple.

cutoff = int(X.shape[0] * 0.7)
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]
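The same tail-of-the-series split can also be obtained from train_test_split itself by turning shuffling off (shuffle=False is a parameter of the function; the toy arrays here are just for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# shuffle=False preserves the original order, so the test set is the tail
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False)

print(X_test.ravel())  # [7 8 9]
```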

In this way we can train our models with fewer concerns about validity. The same applies to hyper-parameter tuning through cross-validation. The standard approach for non-time-series data is k-fold cross-validation: we split the training set into k segments, train a model with a given set of hyper-parameters on k-1 of them, and measure performance on the remaining segment, repeating this k times over the different combinations of segments. scikit-learn’s GridSearchCV is handy for this. Yet, for the same reason stated above, the default k-fold splits are not valid for time series data.

(Illustration from the scikit-learn documentation)

Scikit-learn offers a splitter for time-series validation, TimeSeriesSplit. It cuts the training data into contiguous segments: in the first fold we train the model with a set of hyper-parameters on the first segment and validate on the second; in the next fold we train on the first two segments and validate on the third; and so on. With n_splits=k we get k train/validation folds. You can pass a TimeSeriesSplit instance to GridSearchCV through its cv argument, or run the grid search manually, as below.
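It may help to see the folds TimeSeriesSplit produces on a toy array first (twelve points and three splits; the numbers are only for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on everything before the validation window
splits = list(tscv.split(X_demo))
for fold, (tr, val) in enumerate(splits, 1):
    print(f"Fold {fold}: train={tr.tolist()}, val={val.tolist()}")
# Fold 1: train=[0, 1, 2], val=[3, 4, 5]
# Fold 2: train=[0, 1, 2, 3, 4, 5], val=[6, 7, 8]
# Fold 3: train=[0, 1, 2, 3, 4, 5, 6, 7, 8], val=[9, 10, 11]
```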

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
score = []
for i, (tr_index, val_index) in enumerate(tscv.split(X_train), 1):
    X_tr, X_val = X_train[tr_index], X_train[val_index]
    y_tr, y_val = y_train[tr_index], y_train[val_index]
    for mf in np.linspace(100, 150, 6):
        for ne in np.linspace(50, 100, 6):
            for md in np.linspace(20, 40, 5):
                for msl in np.linspace(30, 100, 8):
                    rfr = RandomForestRegressor(
                        max_features=int(mf),
                        n_estimators=int(ne),
                        max_depth=int(md),
                        min_samples_leaf=int(msl))
                    rfr.fit(X_tr, y_tr)
                    score.append([i, mf, ne, md, msl,
                                  rfr.score(X_val, y_val)])
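Alternatively, GridSearchCV accepts any splitter through its cv argument, so the manual loop can be replaced by passing the TimeSeriesSplit instance directly. A minimal sketch on synthetic data (the arrays and the small grid here are assumptions, not the author’s setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] + rng.normal(0, 0.1, size=200)

param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # chronological folds instead of k-fold
)
search.fit(X_train, y_train)
print(search.best_params_)
```

GridSearchCV then refits the best combination on the whole training set, and the validation scores per fold are available in search.cv_results_.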
