5 Question Series — Data Science & AI — 4

Asitdubey · Published in Analytics Vidhya · 6 min read · Aug 12, 2021

In this article I am going to write about dimensionality reduction: why it is important and what methods are available to reduce the dimension of the feature space.

Q1. What do we mean by the term “Curse of Dimensionality” and what is dimensionality reduction?

In any dataset we use to solve a problem, we have independent and dependent features. Independent features help in predicting the dependent values based on previous records. It is therefore important to select which independent features, and how many of them, we use. Unnecessary features with no predictive value reduce the accuracy of the model, so it is very important to select the valuable features that contribute most to the prediction. Increasing the number of features can increase the accuracy of the model, but only up to a point: after that, adding features stops contributing to accuracy and instead starts to hurt it, because features with no importance add noise, and every further addition makes things worse. This is known as the Curse of Dimensionality. To escape this curse, we reduce the dimension of the feature space, i.e., we drop the features that hold no importance for prediction, and this is where the concept of dimensionality reduction comes in.

Q2. What are the different methods for dimension reduction of features?

There are several methods:

· Univariate Selection

· Feature Importance

· Correlation Heat map Matrix

· Wrapper method

1. Forward Selection

2. Backward Selection

3. Recursive Feature elimination.

Let’s first discuss Wrapper methods:

Forward Selection — An iterative selection method that adds one extra feature at each iteration and checks the model accuracy. The cycle continues until the model accuracy saturates.

A → AB → ABC → ABCD → ABCDE

Backward Selection — It starts with all the features, uses a statistical test such as the Chi-Square test to evaluate them, and drops one feature at each iteration as long as the accuracy keeps improving.

ABCDE → ABCD → ABC → AB → A
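Both directions can be sketched with scikit-learn's SequentialFeatureSelector (my choice here; the article does not name a library), using the built-in breast cancer dataset and a logistic regression purely for illustration:

    # Forward and backward wrapper selection (a sketch, assuming scikit-learn >= 0.24).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    estimator = LogisticRegression(max_iter=2000)

    # direction="forward": start with no features and add one per iteration.
    forward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                        direction="forward", cv=3).fit(X, y)

    # direction="backward": start with all features and drop one per iteration.
    backward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                         direction="backward", cv=3).fit(X, y)

    print(forward.get_support(indices=True))   # indices of the selected features
    print(backward.get_support(indices=True))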

Recursive Feature Elimination — It fits the model on all the features, ranks them by their impact on the target, eliminates the least important one, and keeps repeating this until only the features with the strongest impact on the target variable remain.
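A minimal sketch of recursive feature elimination with scikit-learn's RFE; the random forest estimator and the number of features to keep are illustrative assumptions:

    # Recursive Feature Elimination: fit, rank features, drop the weakest,
    # and repeat until the requested number of features remains (a sketch).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    X, y = load_breast_cancer(return_X_y=True)
    rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
              n_features_to_select=5, step=1).fit(X, y)

    print(rfe.support_)   # boolean mask of the kept features
    print(rfe.ranking_)   # 1 = selected; larger ranks were eliminated earlier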

Note: Wrapper methods are computationally expensive, so they are practical only for smaller datasets.

Embedded Selection — It evaluates single features or subsets of features and checks the accuracy of each, trying different combinations of features. Whichever subset of features gives the maximum accuracy is chosen for the model.
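The article does not tie embedded selection to a specific library; one common embedded approach, shown here as a hedged sketch with scikit-learn, lets an L1-regularized model zero out unhelpful features while it trains:

    # Embedded selection sketch: an L1-penalized model shrinks the coefficients
    # of unhelpful features to zero during training (assumes scikit-learn).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    selector = SelectFromModel(l1_model).fit(X, y)

    print(selector.get_support(indices=True))  # features kept by the L1 model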

Univariate Selection — Here we select the best features according to their importance ranking with the help of Python libraries such as scikit-learn's SelectKBest, using scoring functions like mutual information (information gain) or chi2 (Chi-Square); we set the value of k depending on how many top-ranked features we need.
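A minimal sketch with scikit-learn's SelectKBest, using chi2 and, as a stand-in for the information-gain scoring mentioned above, mutual_info_classif (my interpretation):

    # Univariate selection sketch: score each feature on its own against the
    # target and keep the k best (assumes scikit-learn; chi2 needs non-negative X).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    X, y = load_breast_cancer(return_X_y=True)

    top5_chi2 = SelectKBest(score_func=chi2, k=5).fit(X, y)
    top5_mi = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)

    print(top5_chi2.get_support(indices=True))  # indices of the top-5 features
    print(top5_mi.get_support(indices=True))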

Feature Importance — Here we get a score for each feature in our dataset and rank the features accordingly. We keep the features whose scores show they can predict the majority of outcomes; the higher the score, the more useful the feature is for prediction. Various techniques are used here, e.g., Extra Trees Classifier, PCA, LDA, t-SNE, and UMAP. Mostly we use PCA or an ensemble technique (Extra Trees Classifier) for the selection of features.
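A short sketch of score-based ranking with scikit-learn's ExtraTreesClassifier; the dataset and hyperparameters are illustrative assumptions:

    # Feature importance sketch with an ensemble of extremely randomized trees
    # (assumes scikit-learn); higher scores mean more useful for prediction.
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import ExtraTreesClassifier

    data = load_breast_cancer()
    model = ExtraTreesClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

    scores = pd.Series(model.feature_importances_, index=data.feature_names)
    print(scores.sort_values(ascending=False).head(10))  # top-ranked features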

Correlation Heat Map — Here we find out which independent features are highly correlated with each other and then remove one feature from each correlated pair. We do this because if two features are highly correlated, they carry the same information about the dependent variable, which ends up hurting the accuracy or performance of the model. We use the Pearson correlation and Spearman rank methods to find the correlation. After getting the correlation scores, we remove features using the VIF (Variance Inflation Factor) or by manually setting a threshold value.
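A sketch of the correlation filter, assuming pandas, numpy, and seaborn, with a hand-picked threshold of 0.9 used purely for illustration; VIF via statsmodels would be the alternative route mentioned above:

    # Correlation-based filtering sketch: draw the heat map, then drop one
    # feature from each highly correlated pair (manual threshold).
    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer

    data = load_breast_cancer()
    df = pd.DataFrame(data.data, columns=data.feature_names)

    corr = df.corr(method="pearson")           # or method="spearman"
    sns.heatmap(corr, cmap="coolwarm")
    plt.show()

    threshold = 0.9                            # illustrative cut-off; tune as needed
    upper = corr.abs().where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    df_reduced = df.drop(columns=to_drop)      # statsmodels' variance_inflation_factor is an alternative
    print(to_drop)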

IMP: follow Krish Naik's video on feature selection for an in-depth explanation. You can also follow this article for a mathematical explanation of the various methods.

Q3. What is Information Gain and Mutual Information?

Information Gain is computed from the entropy of the data, and by entropy (disorder or variance) we mean how the data gets split in a decision tree model; it measures the purity of a split. Information gain reduces the entropy and helps in constructing the decision tree. The entropy of a split lies between 0 and 1: 0 means a pure split with low disorder, while 1 means an impure split, the worst-case scenario in terms of accuracy. For effective classification we calculate the entropy and information gain for the candidate feature splits and select the split that gives lower entropy and higher information gain. Information gain is also used for feature selection, where it is calculated with respect to the target variable and is then known as mutual information. Mutual information measures the dependency of one variable on another and is also used to find the correlation between two variables. Mutual information is always non-negative: 0 means the two variables are independent, and the larger the value, the more the variables depend on each other.
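A minimal sketch of entropy and information gain for a categorical split, assuming pandas/numpy and a tiny made-up table; it mirrors the formulas given below:

    # Entropy and information gain for a categorical split (a sketch).
    import numpy as np
    import pandas as pd

    def entropy(labels):
        # H(S) = -sum p * log2(p) over the classes present in S
        p = pd.Series(labels).value_counts(normalize=True)
        return float(-(p * np.log2(p)).sum())

    def information_gain(df, feature, target):
        # Gain(S, A) = H(S) - sum_v (|Sv| / |S|) * H(Sv)
        h_s = entropy(df[target])
        weighted = sum(len(sub) / len(df) * entropy(sub[target])
                       for _, sub in df.groupby(feature))
        return h_s - weighted

    # Hypothetical toy data, only to show the calculation.
    toy = pd.DataFrame({"outlook": ["sun", "sun", "rain", "rain"],
                        "play":    ["no",  "no",  "yes",  "yes"]})
    print(information_gain(toy, "outlook", "play"))  # 1.0: a perfectly pure split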

Note: For the mathematical details of information gain and mutual information, read the article on Information Gain and Mutual Information for Machine Learning and follow Krish Naik's video on information gain.

Entropy is calculated as:

  • H(S) = -P(+) * log2(P(+)) - P(-) * log2(P(-))

Where H(S) is the entropy, P(-) is the probability of the negative class and P(+) is the probability of the positive class. The entropy lies in the range 0 to 1.

Information Gain is calculated as:

  • Gain(S, A) = H(S) - sum v in Values(A) (|Sv| / |S|) * H(Sv)

Where S is the parent set before the split and Sv is each subset obtained after splitting on attribute A.

Q4. What is Kullback–Leibler Divergence, or KL Divergence?

KL divergence is used in statistics for comparing two different probability distributions, typically an observed distribution against a reference distribution. It also turns up in the calculation of mutual information and entropy. The comparison can be made either by computing a statistical distance between the two distributions or by computing the divergence between them; KL divergence takes the divergence route. If for a random variable there are two distributions P and Q, then the divergence of P from Q is calculated as the sum, over every event, of the probability of the event under P times the log of the probability of the event under P over the probability of the event under Q:

  • KL(P || Q) = sum x in X P(x) * log(P(x) / Q(x))

If the probability of an event under P is large compared with its probability under Q, the divergence is large, and vice versa.

Note: you can follow the beautiful and detailed article on KL divergence written by Jason Brownlee.
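A small numeric sketch of the formula above, assuming numpy; the two example distributions are made up for illustration:

    # KL divergence for two discrete distributions:
    # KL(P || Q) = sum P(x) * log(P(x) / Q(x))
    import numpy as np

    def kl_divergence(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(p * np.log(p / q)))

    p = [0.10, 0.40, 0.50]
    q = [0.80, 0.15, 0.05]
    print(kl_divergence(p, q))  # divergence of P from Q
    print(kl_divergence(q, p))  # a different value: KL is not symmetric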

Q5. What is Jensen–Shannon Divergence, and how does it differ from KL Divergence?

JS Divergence is also used for comparing the difference between probability distributions. The only difference between JS and KL divergence is that in KL divergence the divergence of P from Q and the divergence of Q from P are not the same; in other words, the comparison made by KL divergence is not symmetric, whereas JS divergence makes it symmetric. I.e., in KL divergence we can write:

  • KL(P || Q) != KL(Q || P)

This is symmetrical in JS divergence; we can write it as:

  • JS(P || Q) == JS(Q || P)

JS divergence itself is calculated as:

  • JS(P || Q) = 1/2 * KL(P || M) + 1/2 * KL(Q || M)
  • M = 1/2 * (P + Q)

Note: the equations are taken from the KL divergence article written by Jason Brownlee.
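A matching sketch for JS divergence, reusing the same KL helper and example distributions (assumes numpy; the numbers are illustrative):

    # JS divergence built from KL divergence; symmetric by construction:
    # JS(P || Q) == JS(Q || P).
    import numpy as np

    def kl_divergence(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(p * np.log(p / q)))

    def js_divergence(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        m = 0.5 * (p + q)
        return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

    p = [0.10, 0.40, 0.50]
    q = [0.80, 0.15, 0.05]
    print(js_divergence(p, q), js_divergence(q, p))  # the two values are equal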

Bonus questions:

6. What does the p-value signify about the statistical data?

7. What is a hypothesis, and what is hypothesis testing?

8. What is a Confusion Matrix, and what are the types of errors?

To follow up on the above questions, read my articles:

For Hypothesis Testing:

Hope you liked it. If you want me to add or correct anything, do mention it in the comments and guide me towards more questions like this. For most of my work I take reference from Krish Naik Sir's videos and from StatQuest; these are two of the most productive and awesome data science channels on YouTube.

