Feature Transformation for Machine Learning, a Beginner's Guide
When first starting to learn how to optimise machine learning models I would often find, after getting to the model building stage, that I would have to keep going back to revisit the data to better handle the types of features present in the dataset. Over time I have found that one of the first steps to take before building the models is to carefully review the variable types present in the data, and to try to determine up front the best transformation process to take to achieve the optimal model performance.
In this post I am going to describe the process I take to identify and transform four common variable types. I am going to be using a dataset taken from the “machine learning with a heart” warm-up competition hosted on the https://www.drivendata.org/ website. The full dataset can be downloaded here: https://www.drivendata.org/competitions/54/machine-learning-with-a-heart/data/. DrivenData host regular online challenges that are based on solving social problems. I have recently started to engage in some of these competitions in an effort to use some of my skills for a good cause, and also to gain experience with datasets and problems that I don’t usually encounter in my day-to-day work.
Identifying Variable Types
In statistics numerical variables can be characterised into four main types. When starting a machine learning project it is important to determine the type of data that is in each of your features as this can have a significant impact on how the models perform. I have tried to give a simple description of the four types below.
- Continuous variables can take any value within a range, so they have an infinite number of possible values, as opposed to discrete variables, which can only take a countable set of distinct values. An example of a continuous variable would be the number of miles that a car has driven in its lifetime.
- Nominal variables are categorical variables that have 2 or more possible values, but in which the order of those values has no meaning. For example, we might use a numerical representation for types of cars: say the compact has a value of 1, the MPV has a value of 2 and the convertible has a value of 3. However, the fact that the compact has a value of 1 and the convertible has a value of 3 does not mean that the convertible group is in some way mathematically larger than the compact group. It is simply a numerical representation of the category.
- Dichotomous variables are again categorical but only have 2 possible values, usually 0 and 1. For example, we might categorise car ownership as 1 (meaning yes) or 0 (meaning no). When we convert variables into dummy columns (which we will do later in this post) the new features produced are also dichotomous.
- Ordinal variables are similar to nominal variables in that they have 2 or more possible values. The primary difference is that these values have a meaningful order or rank. In our car example this might be something like engine size, where the categories can be ordered in terms of power: 1.2, 1.6, 1.8.
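The four types above can be sketched in pandas as follows. This is only an illustration: the `cars` data frame and its column names are made up for this example and are not part of the competition dataset. It shows one way the difference between nominal and ordinal categories could be made explicit.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
cars = pd.DataFrame({
    "miles_driven": [12500.5, 48210.0, 3020.7],      # continuous
    "body_type": ["compact", "MPV", "convertible"],  # nominal
    "owned": [1, 0, 1],                              # dichotomous
    "engine_size": ["1.2", "1.8", "1.6"],            # ordinal
})

# Nominal: categories with no order between them
cars["body_type"] = pd.Categorical(cars["body_type"])

# Ordinal: categories with an explicit, meaningful order
cars["engine_size"] = pd.Categorical(
    cars["engine_size"], categories=["1.2", "1.6", "1.8"], ordered=True
)

print(cars["body_type"].cat.ordered)           # False
print(cars["engine_size"].cat.ordered)         # True
print(cars["engine_size"].cat.codes.tolist())  # [0, 2, 1]
```

The integer codes of the ordered categorical follow the rank of the engine sizes, which is the property that separates an ordinal feature from a nominal one.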
Preparing the data
I am going to use our machine learning with a heart dataset to walk through the process of identifying and transforming the variable types. I have downloaded and read the csv files into a Jupyter Notebook. Next I run the following function to get a snapshot of the composition of the data.
import pandas as pd

def quick_analysis(df):
    print("Rows and Columns:")
    print(df.shape)
    # Fraction of null values in each column
    print(df.isnull().mean())

quick_analysis(train)
This produces the following output.
This tells me that I have a small dataset of only 180 rows and 15 columns. One of the features is non-numeric, and will therefore need to be transformed before it can be used with most machine learning libraries. There are no null values so I don’t need to worry about treating those. Before processing the dataset I also drop the “patient_id” column for now, as it is non-numeric and will not be used in any of the training or prediction stages.
I then run the pandas describe function to produce some quick descriptive statistics.
To categorise the variable types in the dataset I run the following code, which produces histograms of all the numerical features. You can easily see from the resulting output which features are continuous and which are dichotomous. The continuous features display a continuous distribution pattern, whilst the dichotomous features have only two bars. The nominal and ordinal variables can sometimes be trickier to determine, and may require some further knowledge of the dataset or some specific domain knowledge. In the case of a machine learning competition such as this I would suggest referring to any data dictionary that may be supplied. If there isn’t one (as is the case here) then a combination of intuition and trial and error may be needed.
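import matplotlib.pyplot as plt

# Plot a histogram for every numeric column in the data frame
train.select_dtypes(include="number").hist(figsize=(11, 11))
plt.show()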
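As a complement to reading the histograms by eye, counting the distinct values in each column can give a rough first guess at the variable type. The sketch below is a heuristic only. The toy data frame and its column names are invented for illustration, and the cut-off of 5 distinct values is an arbitrary assumption rather than a rule.

```python
import pandas as pd

# Hypothetical toy frame; names and values are invented for illustration
df = pd.DataFrame({
    "resting_bp": [128, 110, 125, 152, 178, 130],
    "sex": [1, 0, 1, 1, 0, 1],
    "chest_pain": [2, 3, 4, 4, 1, 3],
})

# Number of distinct values in each column
unique_counts = df.nunique()

# Exactly two distinct values suggests a dichotomous feature
dichotomous = unique_counts[unique_counts == 2].index.tolist()

# A handful of distinct values suggests a nominal or ordinal feature
# (the threshold of 5 is an arbitrary assumption)
few_values = unique_counts[(unique_counts > 2) & (unique_counts <= 5)].index.tolist()

print(dichotomous)  # ['sex']
print(few_values)   # ['chest_pain']
```

Columns flagged as having only a few values still need a judgement call as to whether they are nominal or ordinal, which is where domain knowledge comes in.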
I have characterised the features into the four types in the table below. I can now make some decisions as to the transformation steps I will take in order to prepare the data for training and prediction.
As mentioned earlier in this post any non-numerical values need to be converted to integers or floats in order to be utilised in most machine learning libraries. For low cardinality variables the best approach is usually to turn the feature into one column per unique value, with a 0 where the value is not present and a 1 where it is. These are referred to as dummy variables.
This technique is also usually best applied to any nominal variables. As these have no intrinsic order, if we don’t apply this first, the machine learning algorithm may incorrectly look for a relationship in the order of these values.
Pandas has a nice function for this called get_dummies(). In the below code I have used this to convert all nominal and non-numeric features into new columns. You can see from the output that several new columns have been created and the original columns have been dropped.
dummy_cols = ['thal', 'chest_pain_type', 'num_major_vessels',
train = pd.get_dummies(train, columns = dummy_cols)
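To make the effect of get_dummies() concrete, here is a minimal sketch on a tiny, made-up data frame. The values in the `thal` column are hypothetical and chosen only for illustration. The `dtype=int` argument is my addition so that the new columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data; the category values are invented for illustration
toy = pd.DataFrame({
    "thal": ["normal", "reversible_defect", "normal"],
    "age": [45, 54, 77],
})

# One new 0/1 column per unique value; the original column is dropped
encoded = pd.get_dummies(toy, columns=["thal"], dtype=int)

print(encoded.columns.tolist())
# ['age', 'thal_normal', 'thal_reversible_defect']
print(encoded["thal_normal"].tolist())
# [1, 0, 1]
```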
The continuous variables in our dataset are at varying scales. For instance, if you refer back to the histograms above you can see that the variable “oldpeak_eq_st_depression” ranges from 0 to 6, whilst “max_heart_rate_achieved” ranges from 100 to 200. This poses a problem for many popular machine learning algorithms, which often use the Euclidean distance between data points to make the final predictions. Bringing all continuous variables onto the same scale can often result in an increase in the performance of machine learning models.
There are a number of methods for performing feature scaling in Python. My preferred method is to use the scikit-learn MinMaxScaler class, which transforms the scale so that all values in the features range from 0 to 1. I have included some code that does this below.
from sklearn import preprocessing

# Continuous columns to rescale to the 0 to 1 range
cols_to_norm = ['serum_cholesterol_mg_per_dl', 'max_heart_rate_achieved',
                'oldpeak_eq_st_depression', 'resting_blood_pressure']

n_test = train[cols_to_norm]
x = n_test.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Reuse the original index so the concat below lines up row by row
n_test = pd.DataFrame(x_scaled, columns=cols_to_norm, index=train.index)
l_test = train.drop(cols_to_norm, axis=1)
train = pd.concat([n_test, l_test], axis=1)
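For intuition, min-max scaling to the 0 to 1 range amounts to computing (x - min) / (max - min) for each column. The sketch below applies that formula by hand with pandas on made-up values. The shortened column names are hypothetical, and this is an illustration of the arithmetic rather than a replacement for the scikit-learn scaler.

```python
import pandas as pd

# Hypothetical toy values chosen so the arithmetic is easy to follow
toy = pd.DataFrame({
    "max_heart_rate": [100.0, 150.0, 200.0],
    "oldpeak": [0.0, 1.5, 6.0],
})

# x_scaled = (x - min) / (max - min), computed column by column
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled["max_heart_rate"].tolist())  # [0.0, 0.5, 1.0]
print(scaled["oldpeak"].tolist())         # [0.0, 0.25, 1.0]
```

After scaling, both columns span exactly 0 to 1, so neither dominates a distance calculation simply because of its units. If a separate test set is scaled as well, the minimum and maximum learned from the training data would normally be reused rather than recomputed.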
You will notice from the code above that I did not include the continuous variable “age” in the feature scaling transformation. The reason for this is that age is an example of a feature type that might benefit from transformation into a discrete variable. In this example we can use bucketing or binning to transform the feature into a list of meaningful categories.
In the code below I have specified intuitive categories based on the distribution in the data. This uses the pandas cut function, which takes the column to be binned, a list of bin edges and a list of labels (group_names). The function returns a categorical series, which I assign to a new “age_categories” column in the data frame. This column can then be turned into a number of dummy columns using the method previously described.
bins = [30, 40, 50, 60, 70, 80]
group_names = ['30-39', '40-49', '50-59', '60-69', '70-79']
# right=False makes each bin include its lower edge, so an age of 40 is labelled '40-49'
train['age_categories'] = pd.cut(train['age'], bins, labels=group_names, right=False)
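The binning step can be checked on a few made-up ages. One detail worth knowing is that pd.cut includes the right-hand edge of each bin by default, so an age of exactly 40 would fall into the bin labelled '30-39'. The sketch below passes `right=False` so that the labels match the intervals, and then converts the result into dummy columns. The `ages` series is hypothetical, and `dtype=int` is my addition to get 0 and 1 values.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and between the bin edges
ages = pd.Series([35, 40, 59, 60, 77], name="age")
bins = [30, 40, 50, 60, 70, 80]
group_names = ['30-39', '40-49', '50-59', '60-69', '70-79']

# right=False makes each bin include its lower edge and exclude its upper edge
age_categories = pd.cut(ages, bins, labels=group_names, right=False)
print(age_categories.tolist())
# ['30-39', '40-49', '50-59', '60-69', '70-79']

# Turn the categories into one 0/1 column per age group
age_dummies = pd.get_dummies(age_categories, prefix="age", dtype=int)
print(age_dummies.columns.tolist())
# ['age_30-39', 'age_40-49', 'age_50-59', 'age_60-69', 'age_70-79']
```

Any age outside the outermost bin edges becomes a missing value, so it is worth checking the minimum and maximum of the real column before fixing the bins.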
Once the age categories have been converted to dummy columns, what we have is a dataset where all columns are numeric. We have created several new features, and transformed existing features into formats that should help to improve the performance of any machine learning models we may now use. Feature transformation is an important first step in the machine learning process and it can often have a significant impact on model performance. I have outlined here the first steps I would take to think logically about how to treat the different variables I have. Once in the model building phase I will almost always go back and tweak the data using different methods to try to boost the accuracy of the models. However, I find that following these steps at the beginning often reduces the time I spend going back to the transformation stages.