Feature Preprocessing for Machine Learning
In this blog, I will cover the basic approaches to feature preprocessing for various types of features and various types of models. For now, let us divide all machine learning models into tree-based and non-tree-based models. Tree-based models, such as a decision tree classifier, try to find the most useful split for each feature, so their behaviour and predictions are largely unaffected by monotonic feature transformations such as scaling. Non-tree-based models, on the other hand, do depend on these kinds of transformations; examples are nearest neighbours, linear models, and neural networks. We also have different types of features, namely numerical features, categorical features, and datetime features. Let us start with numerical features.
Numerical features
Preprocessing techniques used for numerical features are:
Scaling:
In the case of non-tree-based models, it is almost always necessary to rescale the data. For numerical variables, it is common to either normalize or standardize your data. What do these terms mean?
1. Normalization: This means transforming features by scaling each feature to a given range, typically [0, 1]. The transformation is given by:
X_norm = (X - X.min) / (X.max - X.min)
Scikit-learn provides the sklearn.preprocessing.MinMaxScaler class for this.
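A minimal sketch of how MinMaxScaler applies this formula to a toy column of values (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [10.0]])   # one numerical feature, as a column
scaler = MinMaxScaler()                # default feature_range is (0, 1)
X_norm = scaler.fit_transform(X)       # (X - X.min) / (X.max - X.min)
print(X_norm.ravel())                  # approximately [0. 0.444 1.]
```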
2. Standardization: Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data (e.g. Gaussian with zero mean and unit variance). It means centering the data around 0 and scaling with respect to the standard deviation:
X_stand = (X − μ) / σ
Scikit-learn provides the sklearn.preprocessing.StandardScaler class for this.
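A similar sketch for StandardScaler, again on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])    # one numerical feature
scaler = StandardScaler()              # centers to mean 0, scales to unit variance
X_stand = scaler.fit_transform(X)      # (X - mean) / std
print(X_stand.mean(), X_stand.std())   # approximately 0.0 and 1.0
```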
Scaling data with outliers: If your data contains many outliers, scaling using the mean and variance of the data is likely to not work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead. They use more robust estimates (the median and the interquartile range) for the center and range of your data.
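For illustration, a sketch contrasting the two scalers on data with an artificial outlier:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# 1000.0 is an outlier that distorts the mean and variance
X = np.array([[1.0], [2.0], [3.0], [1000.0]])

X_standard = StandardScaler().fit_transform(X)  # outlier dominates the scaling
X_robust = RobustScaler().fit_transform(X)      # uses median and IQR instead
```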
Rank:
Here, data values are replaced by their ranks, that is, by their indices in the sorted array. Linear models, KNN, and neural networks can benefit from this kind of transformation when we have no time to handle outliers manually, because ranking pulls outliers in towards the rest of the values.
The SciPy function scipy.stats.rankdata can be used for this.
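A quick sketch on made-up values, showing how ranking neutralises an outlier:

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([10.0, -2.0, 1e6, 3.0])   # 1e6 is an extreme outlier
ranks = rankdata(x)                     # [3. 1. 4. 2.]
# After ranking, the outlier is just "the largest value";
# its magnitude no longer matters.
```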
Log transform: This often helps non-tree-based models, and especially neural networks. It is simply a logarithmic transform, and can be implemented with NumPy's np.log (or np.log1p when the data contains zeros).
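A one-line sketch with illustrative values:

```python
import numpy as np

x = np.array([1.0, 10.0, 100.0, 10000.0])
x_log = np.log1p(x)    # log(1 + x); safe when the data contains zeros
```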
Raising to power: This can be used, for example, to take the square root of the data. It is useful because it drives overly large values closer to the feature's average value, while values near zero become a bit more distinguishable. Despite its simplicity, this transformation can improve a neural network's results significantly.
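And the corresponding sketch for a power transform (here a square root):

```python
import numpy as np

x = np.array([0.01, 1.0, 100.0, 10000.0])
x_sqrt = np.sqrt(x)    # same as x ** 0.5: large values shrink, small values spread out
```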
At this point, we have discussed numeric feature preprocessing, how the choice of model impacts it, and the most commonly used preprocessing methods. Let's now move on to categorical and ordinal features.
Categorical and Ordinal features
Now the question arises: what are categorical and ordinal features? A categorical feature is one that has two or more categories with no intrinsic ordering, for example gender (male, female) or hair color (blonde, brown, brunette, red, etc.). An ordinal feature is similar to a categorical one, the difference being that its categories have a clear ordering. For example, economic status with three categories (low, medium, and high), or educational experience (with values such as elementary school graduate, high school graduate, some college, and college graduate). Preprocessing methods used for these features are:
Label encoding:
This can be done in several ways. One is to encode categories in alphabetical (sorted) order, which is what the scikit-learn class sklearn.preprocessing.LabelEncoder does. Another is to encode categories in order of their appearance, using the pandas function pandas.factorize. Label encoding is mainly used for tree-based models; both variants are sketched below.
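A sketch comparing the two encodings on a toy column (the category names are made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "blue", "green", "blue"])

# Alphabetical (sorted) order: blue -> 0, green -> 1, red -> 2
print(LabelEncoder().fit_transform(colors))   # [2 0 1 0]

# Order of appearance: red -> 0, blue -> 1, green -> 2
codes, uniques = pd.factorize(colors)
print(codes)                                  # [0 1 2 1]
```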
Frequency encoding:
As the name suggests, each category is encoded by its frequency of appearance in the data. This works well for tree-based models.
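One common way to implement this with pandas is sketched below (the column and its values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "NY", "SF", "LA", "NY", "SF"]})

# Map each category to its relative frequency in the column
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)   # NY -> 0.5, SF -> 0.33, LA -> 0.17
```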
One-hot encoding:
In this method, a new column is created for each category, with a one in the appropriate place and zeros everywhere else. This works well for linear models, kNN, and neural networks. It has the added benefit that the values are already scaled, since the minimum is 0 and the maximum is 1. The sklearn.preprocessing.OneHotEncoder class can be used to implement this, as sketched below.
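A minimal sketch (the categories are illustrative; .toarray() converts the sparse output to a dense array):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["red"], ["blue"], ["green"]])  # one categorical feature, as a column

enc = OneHotEncoder()                  # returns a sparse matrix by default
print(enc.fit_transform(X).toarray())
# Columns follow sorted category order (blue, green, red):
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```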
Next, let us move on to datetime features.
Datetime features
Date and time can be an important source of features in machine learning. Many of these features are periodic: day of the week, day of the month, season, year, hour, and minute. It can also be useful to analyse a particular period or to calculate the difference between two timestamps. Python has a built-in module called datetime which makes working with date and time features easy.
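A small sketch of extracting such features with the standard library (the timestamp is made up):

```python
from datetime import datetime

ts = datetime(2020, 3, 15, 18, 30)

features = {
    "year": ts.year,
    "month": ts.month,
    "day": ts.day,
    "weekday": ts.weekday(),   # Monday is 0
    "hour": ts.hour,
    "minute": ts.minute,
}

# Difference between two timestamps
delta = datetime(2020, 4, 1) - ts
print(delta.days)              # 16
```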
Hope you find this beneficial. Please comment if you have any suggestions or questions. Thank you!!