Validation techniques for Time-series and Non-time-series datasets

Vikash
Published in TheCyPhy · 5 min read · Jun 21, 2020

I believe one of the most important tasks in creating a machine learning model is the selection of a validation technique. Choosing a proper validation technique helps you understand your model and obtain an unbiased estimate of its generalization performance.

A validation technique defines a specific way to split the available data into training and test sets in order to measure the accuracy of a machine learning model.

The selection of a validation technique depends on various factors, such as the size and type of the data set. A validation technique can be either a time-series or a non-time-series technique. Time-series techniques preserve the temporal order of the data and must be used when the data is time-sensitive, i.e. when it shows a pattern related to time. Non-time-series techniques ignore the temporal order of the data. In this article I will focus on a popular non-time-series technique, cross-validation, and a time-series technique, walk-forward validation, using the EEG Eye State data set.

Non-time-series techniques

Non-time-series techniques vary in the way the data set is split into train and test sets: the split can be based on random sampling, with or without replacement. Cross-validation is a resampling procedure that divides the data set into many disjoint parts of approximately equal size, and it is used to evaluate machine learning models on a limited data set. K-fold cross-validation is the most widely used non-time-series technique; it uses random sampling strategies to construct several training and test sets, over which the accuracy of the model is averaged.

Leave-one-out cross-validation is a k-fold cross-validation technique in which k equals N, the number of data points in the set. That means a model is trained N separate times, each time on all the data except one point, and a prediction is made for that point. The problem with this technique is that, while its estimate has low bias, it has high variance: when the model is deployed in production and given a genuinely new test set, the accuracy can drop and the error rate can increase relative to what the estimate suggested.
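Leave-one-out validation can be sketched with scikit-learn's `LeaveOneOut` splitter. This is a minimal example on a small synthetic data set (not the EEG Eye State data), using a KNN classifier as in the rest of the article:

```python
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Small synthetic stand-in data set (not the EEG Eye State data)
X, y = make_classification(n_samples=30, n_features=4, random_state=0)

loo = LeaveOneOut()  # one split per data point: k = N = 30
correct = 0
for train_idx, test_idx in loo.split(X):
    # Train on all points except one, then predict that held-out point
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])

accuracy = correct / len(X)  # one prediction per data point
print(f"LOOCV accuracy: {accuracy:.2f}")
```

Note that the model is re-trained N times, which is why this technique becomes expensive on larger data sets.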

K-fold cross-validation uses random sampling strategies to construct several training and test sets. The data set is divided into k subsets; each time, one of the k subsets is used as the test set and the remaining k-1 subsets are put together to form the training set. The average error across all k trials is then computed. The disadvantage is that it reduces realism and can overestimate model accuracy, because (in the case of classification) it assumes that the instances of each class are distributed evenly between the training and test sets.
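As a sketch, k-fold cross-validation with a KNN model is a few lines in scikit-learn. Again this uses a synthetic data set rather than the EEG Eye State data:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in data set (not the EEG Eye State data)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# shuffle=True randomly samples points into 5 disjoint folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=kf)

print("Per-fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Each of the 5 scores comes from training on 4 folds and testing on the remaining one; the mean is the cross-validated accuracy estimate.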

Stratified k-fold cross-validation: the problem with plain k-fold cross-validation is that it does not guarantee an equal distribution of class labels across the train and test sets. To solve this disproportion we use the stratified k-fold cross-validation technique, in which the folds are selected so that the mean response value is approximately equal in all folds; for classification this means each fold contains roughly the same proportion of instances of each class.
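The effect of stratification is easy to see on an imbalanced synthetic data set: scikit-learn's `StratifiedKFold` keeps the class proportions of every test fold close to those of the full data set.

```python
from collections import Counter
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification

# Imbalanced synthetic data: roughly an 80/20 class split
X, y = make_classification(n_samples=100, weights=[0.8, 0.2], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    # Every test fold preserves the ~80/20 class proportions
    print(Counter(y[test_idx]))
```

With plain `KFold` on the same data, an unlucky shuffle could leave a fold with almost no minority-class instances; stratification rules that out.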

Time series validation techniques

The walk-forward technique is a widely used time-series technique. In walk-forward validation, the data set is divided into parts, i.e. the smallest units that can be ordered. The parts are chronologically ordered, and in each run all the data available before the part to predict is used as the training set, while the part to predict is used as the test set. Afterwards, the model accuracy is computed as the average across runs. The walk-forward technique essentially re-trains the model each time new data becomes available.

Time-series validation techniques have both advantages and disadvantages. One main advantage is that they replicate a realistic usage scenario. Another is that they are fast and inexpensive, as the number of runs equals the number of parts. A further advantage is that they are not affected by any bias related to the randomness with which the training and test sets are generated. A disadvantage is that the earliest runs train on very little data, so their accuracy estimates can be noisy.
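Walk-forward splitting can be sketched with scikit-learn's `TimeSeriesSplit`, which trains on all parts before the part to predict. This example uses synthetic ordered data standing in for a real time series such as the EEG Eye State data:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic ordered data standing in for a time series
rng = np.random.RandomState(0)
X = rng.randn(120, 4)
y = (X[:, 0] + 0.1 * rng.randn(120) > 0).astype(int)

# Each run trains on all earlier parts and tests on the next part
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede test indices: temporal order preserved
    assert train_idx.max() < test_idx.min()
    model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print("Mean walk-forward accuracy:", np.mean(scores))
```

Unlike k-fold, no future observation ever leaks into a training set here, which is exactly the property the article argues for.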

Comparison of the accuracy of a KNN model using k-fold cross-validation and walk-forward validation on the EEG Eye State data set

[Figure: Accuracy using KNN with k-fold cross-validation]
[Figure: Accuracy using KNN with walk-forward validation]

Conclusion

This article highlights the importance of preserving the temporal order of a time-series data set when validating a prediction model. The EEG Eye State data set is a time-series data set, and the images above show the difference in the accuracy of the prediction model between a time-series validation technique (walk-forward) and a non-time-series technique (k-fold cross-validation). When choosing a validation technique we must consider the prediction model's usage scenario, the type of research question, and the conclusions to be drawn. Validation techniques that do not consider the temporal order of the data may be appropriate for testing the performance of a model in general, but on time-series data they introduce a misleading positive bias toward the applied model.

References

[1] Davide Falessi, Likhita Narayana, Jennifer Fong Thai, Burak Turhan, Preserving Order of Data When Validating Defect Prediction Models.

[2] Syed Hasan Adil, Mansoor Ebrahim, Kamaran Raza, Prediction of Eye State Using KNN Algorithm.

For complete code visit here

Thank you for reading!

If you are, like me, passionate about data science and machine learning, please feel free to add me on LinkedIn or follow me on Twitter.
