Feature Selection and Analysis in Machine Learning and Data Science
When given arbitrary data to use for a machine learning model, it is important to identify what it represents so that features can be selected, processed and cleaned accordingly. The assumption here is that the data is tabular, with columns as ‘features’, and that nothing is known in advance about what the data represents: it could be a table of product prices, user information, temperatures at different times, etc.
At this point, missing data, or data which is unsuitable for modeling from the perspective of data analysis, can be identified and discarded to improve performance. Calculating certain statistical measures also makes it easier for a data scientist or domain expert to make quick decisions about which columns should be used for the machine learning model in a particular case.
For this purpose, this feature analysis algorithm was developed. It categorizes the input columns into three types, Identifier, Numerical (Int/Float) Feature and Categorical Feature, and displays vital statistical measures to be used for further analysis. It also displays the correlation between columns so that redundant ones can be dropped.
Stage 1 — Identifying Type of Column
- Identifier refers to a column whose values are unique and non-null. Such a column is usually not an attribute which gives us any information about a row in conjunction with other rows.
- Numerical Features are individual measurable properties or characteristics which are numerically expressed as integers if discrete, or floats if continuous.
- Categorical Features are individual measurable properties or characteristics which can take a value from a defined finite set of values. They can be further categorized into Nominal and Ordinal. Nominal categorical features do not have an ordering on the defined finite set of values while Ordinal features do.
Realistic Problems
In practice, categorical features are often converted into numbers; e.g., the colour of products might be encoded as integers. Doing so imposes an ordering on the feature, even when it might not have one. Discerning whether this ordering is imposed or inherent requires knowledge about the data. However, when such an encoding is done, it is usually done with integers, so one can be reasonably sure that a continuous numerical feature is not an encoding of a categorical feature. Therefore, only the features tagged as ‘numerical (int)’ need to be checked by an expert.
While further distinguishing Ordinal Features from Nominal Features would be beneficial, it is difficult to do so at this level of abstraction. If an ordinal feature does not have a numerical data type, it is indistinguishable from a nominal feature, as the ordering is unknown.
Approach Followed
For each column, count the total number of unique values and the number of null values. If the column has no repeated values and no null values, it is tagged as an Identifier column. Otherwise, it is tagged according to its data type as Categorical, Numerical (Int) or Numerical (Float). For these types, a frequency bar graph/histogram is plotted and displayed.
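As an illustration, here is a minimal sketch of this tagging logic in Python with pandas. The `classify_column` helper and the example column names are hypothetical, and real data may need more careful dtype handling:

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def classify_column(series: pd.Series) -> str:
    """Tag a column as Identifier, Numerical (Int/Float) or Categorical.

    Hypothetical helper following the Stage 1 rules described above.
    """
    n_null = series.isna().sum()
    n_unique = series.nunique(dropna=True)
    # Unique and non-null values -> Identifier
    if n_null == 0 and n_unique == len(series):
        return "Identifier"
    if is_float_dtype(series):
        return "Numerical (Float)"
    if is_integer_dtype(series):
        return "Numerical (Int)"
    return "Categorical"

# Example usage with a made-up table
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [23, 31, 31, 45],
    "height_m": [1.71, 1.64, 1.71, 1.75],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})
for name, col in df.items():
    print(name, "->", classify_column(col))
```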
Stage 2 — Calculating Statistical Measures for individual features
For Numerical Features (both Int and Float)
The following are calculated:
- Mean (μ)
- Median (M) and the other Quartiles (Q1, Q3)
- Mode (Mo) and the modal frequency
- Range, i.e., the Min and Max
- Standard Deviation
- Coefficient of Variation — The coefficient of variation (CV), also known as relative standard deviation (RSD), is a standardized measure of the dispersion of a probability distribution or frequency distribution. It is defined as the ratio of the standard deviation σ to the mean μ (CV = σ/μ) and is often expressed as a percentage.
- Skewness — Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
- Bowley Skewness — Bowley skewness is a quartile-based way to determine whether a distribution is positively or negatively skewed. It is very useful if the data contain extreme values (outliers) or if the distribution is open-ended.
- Pearson’s Coefficient of Skewness — This coefficient compares the sample distribution with a normal distribution: the larger its absolute value, the more the distribution differs from a normal one. Pearson’s first coefficient of skewness uses the mode; therefore, if the modal frequency is not high enough, the mode is not a stable measure of central tendency. A value of zero means no skewness at all, a large negative value means the distribution is negatively skewed, and a large positive value means it is positively skewed.
- Mean Absolute Deviation — The mean absolute deviation (MAD), also referred to as the “mean deviation” or sometimes “average absolute deviation”, is the mean of the data’s absolute deviations around the data’s mean: the average (absolute) distance from the mean.
The MAD is a simpler measure of variability than the standard deviation and is therefore easier to understand. As a measure of forecast accuracy, it is very closely related to the mean squared error (MSE), which is just the average squared error of the forecasts. Although the two methods are very closely related, MAD is more commonly used because it is easier to compute (avoiding the need for squaring). The standard formulas for the skewness measures above and the MAD are given after this list.
- Kurtosis — Kurtosis is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution (e.g., five or more standard deviations from the mean rather than the normal three). Distributions with low kurtosis exhibit tail data that are generally less extreme than the tails of the normal distribution.
- Percentage of Outliers — In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may indicate an experimental error; the latter are sometimes excluded from the data set, as an outlier can cause serious problems in statistical analyses. Calculating the percentage of outliers therefore gives important insights. Two checks are performed, sketched in code after this list:
- Check the percentage of values which lie outside (μ − 3σ, μ + 3σ).
- Check the percentage of values which lie outside (2.5·Q1 − 1.5·Q3, 2.5·Q3 − 1.5·Q1), which is equivalent to the Tukey fences (Q1 − 1.5·IQR, Q3 + 1.5·IQR) with IQR = Q3 − Q1.
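For reference, the standard textbook forms of the skewness measures and the MAD mentioned above, in the notation already introduced (quartiles Q1 and Q3, median M, mean μ, mode Mo, standard deviation σ), with Sk1 and Sk2 denoting Pearson’s first and second coefficients:

\[
\text{Bowley skewness} = \frac{Q_3 + Q_1 - 2M}{Q_3 - Q_1},
\qquad
Sk_1 = \frac{\mu - Mo}{\sigma},
\qquad
Sk_2 = \frac{3(\mu - M)}{\sigma},
\]
\[
\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i - \mu\rvert .
\]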
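A minimal sketch of the two outlier checks with NumPy (the function name is illustrative):

```python
import numpy as np

def outlier_percentages(values: np.ndarray) -> tuple[float, float]:
    """Percentage of points outside the 3-sigma band and the Tukey fences."""
    mu, sigma = values.mean(), values.std()
    q1, q3 = np.percentile(values, [25, 75])
    # Check 1: outside (mu - 3*sigma, mu + 3*sigma)
    sigma_out = np.mean((values < mu - 3 * sigma) | (values > mu + 3 * sigma))
    # Check 2: outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)
    iqr = q3 - q1
    tukey_out = np.mean((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr))
    return 100 * sigma_out, 100 * tukey_out

# Example: one extreme value in otherwise well-behaved data
x = np.append(np.random.default_rng(0).normal(0, 1, 999), 15.0)
print(outlier_percentages(x))
```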
For Categorical Features
To measure qualitative variation, Wilcox developed indexes which are analogous to quantitative measures. These indexes use the frequencies (f) of the K categories, together with the modal frequency (f_m), to quantify statistical dispersion. The following Wilcox indexes are calculated to gain an understanding of the extent of statistical dispersion of the categorical feature (a code sketch of two of them follows the list):
- Deviation from the Mode (DM) — According to Wilcox, ‘the measure can be thought of as an index of deviation from the modal frequency, analogous to the variance as a measure of deviation from the mean’.
- RanVR — Variation of the Range around the Mode. The formula is the lowest frequency divided by the modal frequency.
- ADA (average deviation analog) — An analog of the average or mean deviation, defined as the arithmetic mean of the absolute differences of each frequency from the mean frequency.
- VA (variance analog) — This is the same index as Mueller and Schussler’s Index of Qualitative Variation and Gibbs’ M2 index. It is distributed as a chi-square variable with K − 1 degrees of freedom.
- Shannon’s Entropy (HREL Index) — This index was originally developed by Claude Shannon for use in specifying the properties of communication channels. It is equivalent to the information entropy divided by log2(K) and is useful for comparing relative variation between frequency tables of multiple sizes.
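As an illustration, here is a minimal sketch of two of these indexes, RanVR and HREL, whose definitions are fully spelled out above (DM and VA are omitted, as their exact formulas are not reproduced in this text):

```python
import numpy as np
import pandas as pd

def ranvr(series: pd.Series) -> float:
    """RanVR: the lowest category frequency divided by the modal frequency."""
    freqs = series.value_counts().to_numpy()
    return freqs.min() / freqs.max()

def hrel(series: pd.Series) -> float:
    """HREL: Shannon entropy of the category proportions divided by log2(K)."""
    p = series.value_counts(normalize=True).to_numpy()
    k = len(p)
    if k < 2:
        return 0.0  # a single category has no variation
    return float(-np.sum(p * np.log2(p)) / np.log2(k))

colours = pd.Series(["red", "red", "blue", "green", "red", "blue"])
print(ranvr(colours), hrel(colours))
```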
Stage 3 — Calculating Correlation Matrix
A correlation matrix is a table showing correlation coefficients between sets of variables. It is calculated to observe the association between features, so that features which have a strong association among themselves can be identified. For machine learning models, only one feature out of a strongly correlated group needs to be selected.
As there are two types of features, Numerical and Categorical, three different correlation matrices need to be generated: Numerical/Numerical, Numerical/Categorical and Categorical/Categorical.
Numerical/Numerical Correlation Coefficient
The Pearson Correlation Coefficient is a measure of the strength and direction of the linear relationship between two variables. This is the best known and most commonly used type of correlation coefficient for numerical variables. It is defined as the covariance of the variables divided by the product of their standard deviations.
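With pandas, this matrix comes almost for free; a minimal sketch (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "height_m": [1.71, 1.64, 1.80, 1.75, 1.68],
    "weight_kg": [68.0, 59.5, 84.0, 75.5, 63.0],
    "age": [23, 31, 45, 38, 27],
})
# Pearson correlation matrix over the numerical columns
print(df.corr(method="pearson"))
```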
Categorical/Categorical Correlation Coefficient
Based on Pearson’s chi-squared statistic, Cramér’s V is a measure of association between two categorical variables. Cramér’s V varies from 0 (corresponding to no association between the variables) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.
Let a sample of size n of the simultaneously distributed variables A and B, for i = 1, …, r; j = 1, …, k, be given by the frequencies n_ij, the number of times the value pair (A_i, B_j) was observed. The chi-squared statistic is:

\[
\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{k}\frac{\left(n_{ij} - \frac{n_{i\cdot}\,n_{\cdot j}}{n}\right)^{2}}{\frac{n_{i\cdot}\,n_{\cdot j}}{n}},
\]

where n_i· and n_·j are the row and column totals. Cramér’s V is computed by taking the square root of the chi-squared statistic divided by the sample size and the minimum dimension minus 1:

\[
V = \sqrt{\frac{\chi^{2}/n}{\min(k-1,\,r-1)}} = \sqrt{\frac{\varphi^{2}}{\min(k-1,\,r-1)}},
\]
where:
- φ is the phi coefficient (φ² = χ²/n),
- χ² is derived from Pearson’s chi-squared test,
- n is the grand total of observations,
- k is the number of columns,
- r is the number of rows.
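A minimal sketch computing Cramér’s V from a contingency table with scipy (the bias-corrected variant of V is not shown):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Cramér's V between two categorical series via their contingency table."""
    table = pd.crosstab(a, b)
    chi2 = chi2_contingency(table)[0]  # Pearson chi-squared statistic
    n = table.to_numpy().sum()         # grand total of observations
    r, k = table.shape                 # number of rows and columns
    return float(np.sqrt((chi2 / n) / min(k - 1, r - 1)))

colour = pd.Series(["red", "red", "blue", "green", "red", "blue"])
size = pd.Series(["S", "M", "S", "L", "S", "M"])
print(cramers_v(colour, size))
```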
Numerical/Categorical Correlation Coefficient
Most methods to determine the correlation between numerical and categorical features require heavy computation, which would be quite costly with large data. Therefore, the following approach has been developed (a sketch follows this list):
- Calculate the range of the numerical feature for each category of the categorical feature.
- Use this to calculate the Jaccard similarity coefficient between the categories.
- Take the average of these coefficients and subtract it from 1 to get a measure of how correlated the features are: heavily overlapping ranges mean the numerical feature does not separate the categories, so the correlation is low.
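A minimal sketch of this approach, under the assumption that the Jaccard coefficient of two ranges is taken as the length of their overlap divided by the length of their union (the text does not spell out how Jaccard is applied to ranges, so this is one interpretation):

```python
from itertools import combinations

import pandas as pd

def interval_jaccard(lo1, hi1, lo2, hi2) -> float:
    """Jaccard coefficient of two intervals: overlap length / union length."""
    overlap = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    union = max(hi1, hi2) - min(lo1, lo2)
    return overlap / union if union > 0 else 1.0

def num_cat_correlation(num: pd.Series, cat: pd.Series) -> float:
    """1 minus the average pairwise Jaccard similarity of per-category ranges."""
    ranges = num.groupby(cat).agg(["min", "max"])
    pairs = list(combinations(ranges.itertuples(index=False), 2))
    if not pairs:
        return 0.0  # a single category gives no pairwise comparison
    avg = sum(interval_jaccard(a.min, a.max, b.min, b.max)
              for a, b in pairs) / len(pairs)
    return 1.0 - avg

prices = pd.Series([10.0, 12.0, 11.0, 40.0, 42.0, 41.0])
brands = pd.Series(["A", "A", "A", "B", "B", "B"])
print(num_cat_correlation(prices, brands))  # 1.0: the ranges do not overlap
```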