The Ultimate Guide to Overfitting, Underfitting and Dimensionality Reduction

Sameer Sitoula · Published in Analytics Vidhya · Sep 5, 2020 · 7 min read

Let us consider that we are designing a machine learning model. A model is said to be a good machine learning model if it generalizes well to any new input data from the problem domain. This lets us make predictions on future data that the model has never seen. Now, suppose we want to check how well our machine learning model learns and generalizes to new data. For that we look at overfitting and underfitting, which are the major causes of poor performance in machine learning algorithms.

Underfitting:

A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data. (It’s just like trying to fit into undersized pants!) Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases the rules of the machine learning model are too simple and rigid to capture the structure in the data, and therefore the model will probably make a lot of wrong predictions. Underfitting can be avoided by gathering more data and, above all, by using a more complex or flexible model with more informative features.
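As a minimal sketch of what underfitting looks like in practice (assuming Python with NumPy and scikit-learn, on purely synthetic data), fitting a straight line to clearly non-linear data gives a poor score even on the training set:

```python
# Minimal underfitting sketch: a linear model fitted to quadratic data.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)  # quadratic trend + noise

model = LinearRegression().fit(X, y)        # a straight line cannot follow the curve
print("Training R^2:", model.score(X, y))   # low even on the training data -> underfitting
```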

Overfitting:

A statistical model is said to be overfitted when it learns the training data too closely, noise included (just like fitting ourselves into oversized pants!). When a model is trained on so much detail, it starts learning from the noise and inaccurate entries in our data set. The model then fails to categorize new data correctly, because of too much detail and noise. Overfitting is most often caused by non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. Solutions to avoid overfitting include using a linear algorithm if we have linear data, or constraining parameters such as the maximal depth if we are using decision trees.
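Conversely, here is a minimal sketch of overfitting under the same assumptions (Python with NumPy and scikit-learn, synthetic data): a very high-degree polynomial chases the noise in the training points and scores much worse on held-out data:

```python
# Minimal overfitting sketch: a degree-15 polynomial memorizes the training noise.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)
print("Train R^2:", model.score(X_train, y_train))  # high: the noise is memorized
print("Test R^2:", model.score(X_test, y_test))     # noticeably lower -> overfitting
```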

How to avoid Overfitting:

The commonly used methodologies are:

  • Cross-Validation:

A standard way to estimate the out-of-sample prediction error is to use 5-fold cross-validation (a short sketch combining several of these methods follows this list).

  • Early Stopping:

Its rules provide guidance on how many iterations can be run before the learner begins to overfit.

  • Pruning:

Pruning is extensively used while building tree-based models. It simply removes the nodes that add little predictive power for the problem at hand.

  • Regularization:

It introduces a cost term in the objective function that penalizes bringing in more features. It therefore pushes the coefficients of many variables towards zero, which reduces the cost term.
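As a minimal sketch of a few of these ideas (assuming Python with NumPy and scikit-learn, on synthetic data), the snippet below uses 5-fold cross-validation to estimate out-of-sample error, Ridge regression as an example of regularization, and a limited tree depth as a simple pruning-style constraint:

```python
# Minimal sketch: cross-validation, regularization, and depth-limited trees.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=200)

# Cross-validation: average the score over 5 held-out folds.
ridge = Ridge(alpha=1.0)  # L2 regularization shrinks the coefficients
print("Ridge CV R^2:", cross_val_score(ridge, X, y, cv=5).mean())

# Limiting max_depth restricts complexity, similar in spirit to pruning.
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
print("Tree CV R^2:", cross_val_score(tree, X, y, cv=5).mean())
```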

Good Fit in a Statistical Model:

Ideally, a model that makes predictions with zero error is said to have a good fit on the data. This situation is achievable at a spot between overfitting and underfitting. In order to understand it, we have to look at the performance of our model over time while it is learning from the training dataset. As training progresses, the model keeps learning, and the error on both the training and testing data keeps decreasing. If it learns for too long, however, the model becomes more prone to overfitting due to the presence of noise and less useful details, and its performance on the test data starts to drop. In order to get a good fit, we stop at a point just before the test error starts increasing. At this point the model is said to have good skill on the training dataset as well as on our unseen testing dataset.
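One way to find that stopping point is the early-stopping idea from the list above. Here is a minimal sketch (assuming Python with NumPy and scikit-learn, on synthetic data) that trains one epoch at a time and stops once the validation error stops improving:

```python
# Minimal early-stopping sketch: stop when validation error stops improving.
# Synthetic data, for illustration only.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X @ rng.normal(size=20) + rng.normal(scale=0.5, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
best_err, best_epoch, patience = np.inf, -1, 0
for epoch in range(200):
    model.partial_fit(X_train, y_train)                    # one pass over the training data
    err = mean_squared_error(y_val, model.predict(X_val))  # error on unseen data
    if err < best_err:
        best_err, best_epoch, patience = err, epoch, 0     # still improving
    else:
        patience += 1
        if patience >= 5:                                  # no improvement for 5 epochs
            break
print(f"Best validation MSE {best_err:.4f} at epoch {best_epoch}")
```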

Introduction to Dimensionality Reduction

In statistics, machine learning, and information theory, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Machine Learning:

In short, machine learning is a field of study that allows computers to “learn” like humans, without the need for explicit programming.

What is Predictive Modeling:

Predictive modeling is a probabilistic process that allows us to forecast outcomes, on the basis of some predictors. These predictors are basically features that come into play when deciding the final result, i.e. the outcome of the model.

Dimensionality Reduction

In machine learning classification problems, there are often too many factors on the basis of which the final classification is done. These factors are basically variables called features. The higher the number of features, the harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and hence redundant. This is where dimensionality reduction algorithms come into play.

Dimensionality reduction is the process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It can be divided into feature selection and feature extraction.

Why Dimensionality Reduction is Important in Machine Learning and Predictive Modeling

An intuitive example of dimensionality reduction can be discussed through a simple email classification problem, where we need to classify whether an e-mail is spam or not. This can involve a large number of features, such as whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, and so on. However, some of these features may overlap. Similarly, a classification problem that relies on both humidity and rainfall can be collapsed onto just one underlying feature, since the two are correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple two-dimensional space, and a 1-D problem to a simple line. The idea is that a 3-D feature space can be split into two 1-D feature spaces, and later, if the features are found to be correlated, the number of features can be reduced even further.

There are two components of dimensionality reduction:

  • Feature selection:

In this, we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves one of three approaches:

1. Filter

2. Wrapper

3. Embedded

  • Feature extraction:

This transforms the data in a high-dimensional space into a lower-dimensional space, i.e. a space with fewer dimensions (a short sketch contrasting the two components follows).
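As a minimal sketch of the difference (assuming Python with NumPy and scikit-learn, on synthetic data), feature selection keeps a subset of the original columns, here with a univariate filter, while feature extraction builds new features from combinations of the originals, here with PCA:

```python
# Minimal sketch: feature selection (filter) vs. feature extraction (PCA).
# Synthetic data, for illustration only.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=200)

# Feature selection: keep the 2 original features most related to y.
X_selected = SelectKBest(f_regression, k=2).fit_transform(X, y)

# Feature extraction: project onto 2 new directions of maximum variance.
X_extracted = PCA(n_components=2).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # both (200, 2), but built differently
```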

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Generalized Discriminant Analysis (GDA)

Dimensionality reduction may be either linear or non-linear, depending upon the method used. The prime linear method, called Principal Component Analysis, or PCA, is discussed below.

Principal Component Analysis

This method was introduced by Karl Pearson. It works on the condition that, while the data in a higher-dimensional space is mapped to a lower-dimensional space, the variance of the data in the lower-dimensional space should be maximal.

It involves the following steps (a short NumPy sketch follows the list):

  • Construct the covariance matrix of the data.
  • Compute the eigenvectors of this matrix.
  • Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data. Hence, we are left with a smaller number of eigenvectors, and some information may be lost in the process. However, the most important variance should be retained by the remaining eigenvectors.
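As a minimal sketch of these steps (assuming Python with NumPy, on synthetic data), the snippet below centers the data, builds the covariance matrix, keeps the eigenvectors with the largest eigenvalues, and projects the data onto them:

```python
# Minimal PCA sketch following the steps above, with plain NumPy.
# Synthetic data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]   # make two columns strongly correlated

X_centered = X - X.mean(axis=0)           # PCA works on zero-mean data
cov = np.cov(X_centered, rowvar=False)    # covariance matrix of the features

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
components = eigvecs[:, order[:2]]        # keep the top-2 eigenvectors

X_reduced = X_centered @ components       # project 5-D data down to 2-D
explained = eigvals[order[:2]].sum() / eigvals.sum()
print(X_reduced.shape, f"variance retained: {explained:.2%}")
```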

Advantages of Dimensionality Reduction

  • It helps in data compression, and hence reduces the required storage space.

  • It reduces computation time.
  • It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction

  • It may lead to some amount of data loss.

  • PCA tends to find linear correlations between variables, which is sometimes undesirable.
  • PCA fails in cases where mean and covariance are not enough to define datasets.
  • We may not know how many principal components to keep; in practice, some rules of thumb are applied.
