Tips to avoid the pitfall of overfitting in Linear Regression
- Split the data into a training set and a test set. The test set is not used to train the model.
- The residual sum of squares (RSS) is used to find the best model: the model parameters are calculated by minimizing the RSS on the training data. The resulting value is called the training error.
- However, splitting the data into training and test sets works well only when a sufficiently large data set is available.
- The prediction accuracy is then measured on the test set, which was not used to train the model. The residual sum of squares is calculated again on the test data, and it is now called the test error.
- During training, the model parameters are updated iteratively by an optimization algorithm to reduce the prediction error on the training set.
- It's interesting to see how the test error and the training error vary with model complexity (the complexity of the model increases with its polynomial degree, that is, complexity grows with model order).
- The training error decreases as the model order increases. The test error also dips at first as complexity increases, but after a certain point it starts increasing again. This can be observed in the image below.
- The choice of model has to be based on both the training error and the test error. It is also tricky to choose the right features to build the model for your predictions.
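
The points above can be tried out with a small experiment. The following is only a minimal sketch, assuming synthetic one-dimensional data (a noisy sine curve), a 70/30 random split, and polynomial models fitted with NumPy's `polyfit`. The helper names `rss` and `train_test_errors` are made up for this illustration and do not come from any library.

```python
import numpy as np


def rss(y_true, y_pred):
    """Residual sum of squares between observed and predicted values."""
    return float(np.sum((y_true - y_pred) ** 2))


def train_test_errors(x, y, degrees, test_fraction=0.3, seed=0):
    """Fit one polynomial per degree on the training split.

    Returns two lists (training RSS, test RSS), one entry per degree.
    """
    # Shuffle the indices once so every degree sees the same split.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    train_err, test_err = [], []
    for d in degrees:
        # Least-squares fit: the coefficients minimize the training RSS.
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=d)
        train_err.append(rss(y[train_idx], np.polyval(coeffs, x[train_idx])))
        # The test data is only used for evaluation, never for fitting.
        test_err.append(rss(y[test_idx], np.polyval(coeffs, x[test_idx])))
    return train_err, test_err


# Synthetic data, assumed purely for illustration.
data_rng = np.random.default_rng(42)
x = np.linspace(-1, 1, 60)
y = np.sin(3 * x) + data_rng.normal(scale=0.2, size=x.shape)

degrees = list(range(1, 9))
train_err, test_err = train_test_errors(x, y, degrees)
for d, tr, te in zip(degrees, train_err, test_err):
    print(f"degree={d}  training RSS={tr:.3f}  test RSS={te:.3f}")
```

In a run like this the training RSS should not increase as the degree grows, since each higher-degree model contains the lower-degree ones. How the test RSS behaves depends on the data and on the particular split, so the degree with the lowest test RSS is a reasonable candidate rather than a guaranteed best choice.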