The Art of Dimensionality Reduction
Dimensionality reduction (DR) is one of the most critical steps in predictive modeling. The world is generating enormous amounts of high-dimensional data, so it is crucial to reduce the dimensional space of that data before modeling.
What is Dimensionality Reduction (DR)?
Suppose you want to solve a predictive modeling problem, and you start collecting data for it. You never know in advance exactly which features you will need or how many observations will be enough, so you go for the upper limit and collect every feature and observation you can.
Consequently, you realize you have collected a large amount of data, and the extra features are adding both noise and time.
- Noise: Some features turn out to be irrelevant to the model; they contribute nothing but noise.
- Time: The time I am talking about is computational time. For each extra feature, we need to calculate its gradient and optimize it.
We do DR for two reasons: first, to reduce noise, which makes our model more robust; second, to reduce computation time.
Hence, we will discuss some of the standard ways to reduce the dimensions of a dataset.
We will discuss:
1. Correlation coefficient
2. Principal Component Analysis
3. K Select: Feature Importance
1. Correlation coefficient
One of the assumptions of linear regression is that the features are independent of one another. If they are correlated, the resulting coefficient estimates may be misleading. This makes it essential to remove highly correlated features.
One intuition is that if two features are highly correlated, they provide roughly the same information to the model, so we can remove one of them. Doing so also reduces the dimensions of the dataset.
It is also worth noting that the most common metric used to measure correlation is the Pearson coefficient, which only captures the strength of a linear relationship. A feature that is related to another in a purely non-linear way can therefore have a Pearson coefficient close to zero.
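As a minimal sketch of this idea, the snippet below builds a toy pandas DataFrame, computes the absolute Pearson correlation matrix, and drops one feature from every pair whose correlation exceeds a threshold. The 0.9 cutoff and the toy data are illustrative choices, not fixed rules.

```python
import numpy as np
import pandas as pd

# Toy dataset: x2 is almost a copy of x1; x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# Absolute Pearson correlation between every pair of features.
corr = df.corr().abs()

# Keep only the strict upper triangle so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair correlated above the threshold.
threshold = 0.9
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)  # expected: ['x2']
```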
2. PCA
Principal Component Analysis (PCA) is a technique for projecting data into a lower-dimensional space such that the variance of the projected data is maximized. Equivalently, among all linear projections, PCA minimizes the mean squared distance between the data points and their projections.
In the typical illustration, data points are projected onto the direction of maximum variance, which is exactly the direction that makes the mean squared projection distance minimal.
We can use PCA to reduce the dimensionality of the data significantly. We can also check how much of the original variance is explained in the new space, and once the explained variance looks satisfactory, we can work in this projected space.
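Here is a minimal sketch with scikit-learn's PCA. The toy data, generated so that most of its variance lies in two underlying directions, is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 observed features driven by 2 latent directions plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + rng.normal(scale=0.1, size=(200, 5))

# Project onto the 2 directions of maximum variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance captured by each principal component.
print(pca.explained_variance_ratio_)
print("Total explained:", pca.explained_variance_ratio_.sum())
```

As a convenience, passing a float such as PCA(n_components=0.95) tells scikit-learn to keep just enough components to explain 95% of the variance.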
3. K Select
“SelectKBest” in the scikit-learn library selects the K best features: every feature is scored against the target with a statistical test, and the lowest-scoring features are dropped. (Dropping features simply because they have low variance is handled separately by scikit-learn's VarianceThreshold.)
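A minimal sketch on the Iris dataset, assuming the ANOVA F-test (f_classif) as the scoring function and k=2 as an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature against the target and keep the 2 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)
```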
Another way of selecting the K best features is by using the feature importance chart from decision trees or random forests.
We can train a tree-based model and read off the importance of each feature directly. This technique is widely used by data scientists, and it is a simple, efficient way of reducing the dimension of the data.
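The sketch below fits a random forest on the Iris dataset and ranks the features by their impurity-based importances; the dataset and the idea of keeping only the top-ranked features are illustrative, not a prescription.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
X, y = data.data, data.target

# Fit a forest and read off its impurity-based feature importances.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Rank features from most to least important; keep the top few.
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```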
Summary
DR is important! It not only reduces noise and computational cost but also tells us which features matter most.
In this article, we discussed three crucial DR techniques: Correlation, PCA, and K Select.
I hope you liked the article, and I believe you will apply these techniques to real-world problems.
For more content on machine learning and data science, do subscribe to my YouTube channel.
Keep Learning!
Keep Enjoying!