Predicting the future using Machine Learning part III

Cross validation + implementation in Python

Mina Suntea
Analytics Vidhya
3 min read · Jan 8, 2021



As promised in the previous part of this series, I will be discussing a method for choosing a generic model that is less prone to overfitting than the order-2 polynomial model. This method is called Cross Validation.

Cross Validation

The Cross Validation method is a method wherein the data is split into a training set and a validation set, given a ratio. So, for example, with a ratio of 0.6, 60% of the data is used as a training set to train the model and the remaining 40% is used as a validation set on which the fitted model is evaluated. Implementing the Cross Validation method in Python gives:

Training and validation set for both the X and R parts of the data

The validation_split function returns the training and validation set for both the X and R part of the dataset. Mind you, we are still going to use the dataset I used in part I and part II of this series, which results in:
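A minimal sketch of what such a validation_split function might look like, assuming the samples X and targets R are NumPy arrays (the shuffling with a fixed seed is my own addition for reproducibility):

```python
import numpy as np

def validation_split(X, R, ratio=0.6, seed=0):
    """Split X (samples) and R (targets) into a training and a validation set.

    `ratio` is the fraction of the data used for training; the rest
    becomes the validation set.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle before splitting
    cut = int(ratio * len(X))              # index where the split happens
    train_idx, val_idx = idx[:cut], idx[cut:]
    return X[train_idx], X[val_idx], R[train_idx], R[val_idx]
```

With ratio=0.6 and 10 samples, this returns 6 training and 4 validation points.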

Chosen ratio of 0.6, which results in 60% training and 40% validation.

With this new method to split the data, polynomials of different orders can now be fitted repeatedly on the training set while checking which one produces the lowest cost on the validation set. The set of weights that produces the lowest cost on the validation set generalizes best to new data and is therefore the best overall fit on the dataset.

To find the best fit I wrote a function called best_poly_fit, in which I iterate over a large range of polynomial orders, like 1 to 50. Each iteration, a D_matrix is computed from the training samples and fitted to the corresponding training values with the poly_fit function, after which the cost of this fit on the validation set is computed with the poly_cost function. In every loop this cost is compared to the minimal cost so far; if it is lower, it replaces the minimum and the corresponding weights are kept, and both are returned at the end. So my best_poly_fit function looks like this:
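A sketch of how best_poly_fit could look; the D_matrix, poly_fit and poly_cost helpers are my own reconstructions based on the description above, using a least-squares fit and a mean-squared-error cost:

```python
import numpy as np

def D_matrix(x, order):
    # Design matrix with columns x^0, x^1, ..., x^order.
    return np.vander(x, order + 1, increasing=True)

def poly_fit(D, r):
    # Least-squares weights for the polynomial.
    return np.linalg.lstsq(D, r, rcond=None)[0]

def poly_cost(w, x, r):
    # Mean squared error of the fit with weights w on (x, r).
    D = D_matrix(x, len(w) - 1)
    return np.mean((D @ w - r) ** 2)

def best_poly_fit(x_train, r_train, x_val, r_val, max_order=50):
    best_w, min_cost = None, np.inf
    for order in range(1, max_order + 1):
        D = D_matrix(x_train, order)        # features of the training samples
        w = poly_fit(D, r_train)            # fit on the training set
        cost = poly_cost(w, x_val, r_val)   # evaluate on the validation set
        if cost < min_cost:                 # keep the best weights so far
            min_cost, best_w = cost, w
    return best_w, min_cost
```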

The only thing left to do now is to visualize the best fit found, along with the best weights and the minimal cost:
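A minimal plotting sketch, assuming matplotlib and weights w as returned by a best_poly_fit-style search (the function and file names are hypothetical):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend so this runs anywhere
import matplotlib.pyplot as plt

def plot_best_fit(x, r, w, cost, fname="best_fit.png"):
    """Plot the data together with the best polynomial fit found."""
    xs = np.linspace(x.min(), x.max(), 200)
    # Evaluate the polynomial with weights w on a dense grid.
    ys = np.vander(xs, len(w), increasing=True) @ w
    plt.scatter(x, r, label="data")
    plt.plot(xs, ys, color="red",
             label=f"order {len(w) - 1}, cost {cost:.4f}")
    plt.legend()
    plt.savefig(fname)
    plt.close()
```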

Compared to the linear fit and the order-2 polynomial fit, this fit is by far the best. The advantage of the Cross Validation method is that a model selected by training on only part of the dataset is less prone to overfitting or underfitting. With this article I have come to the end of the first main type of supervised learning algorithm, namely regression. In my following article I will cover another main type of supervised learning algorithm: classification.
