Feature Engineering — Automation and Evaluation — Part 1

Maher Deeb
KI labs Engineering
9 min read · Jun 14, 2019

Introduction

This is the first post of a series that I am writing about feature engineering for data science. In this series, I want to explore different techniques for creating and evaluating features, with and without applying popular frameworks for creating and selecting features.

Data science pipeline
Building an end-to-end data science pipeline for a specific problem is an iterative process which involves a combination of different steps (figure 1). First, data scientists load data from various sources and prepare it for the preprocessing phase. Performing exploratory data analysis (EDA) is a crucial step to get to know the data. Moreover, EDA helps a lot when creating new features. In the first iteration, we choose a simple and time-efficient method to train the first baseline model. The evaluation process helps to find out where the baseline model stands with respect to accuracy and business value. Deploying a model at an early stage helps to monitor its performance in the production environment and to collect feedback for improving the model in the next iterations.

Figure 1: Data science pipeline

Feature engineering
In this article, I am going to “scratch the surface” of the feature engineering process. On Wikipedia, we can find the following definition of feature engineering:

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.

From the definition above, I want to highlight the following points:
1. Feature engineering is expensive and time-consuming because it depends on the problem and domain knowledge.
2. Feature engineering can be a manual process, but in some cases…
3. The feature engineering process can be automated.

Creating new features
There are many methods for creating new features, including brainstorming, previous experience, domain knowledge, and, not to forget, automated feature engineering tools such as Featuretools. How can I tell whether the new features are useful or I am wasting my time? Here, I do not use any feature selection tool because I want to explore how the new features affect the model that I am using, so that I have a better understanding of the quality of the features.

Here, I dig deep to find out how to tell if the newly created features may improve the model or not.

What are you going to read here?
In the first part of this series, I explore the feature transformation technique and show how it improves the accuracy of a simple linear regression model by more than 2%. I use the dataset from the Kaggle competition "House Prices: Advanced Regression Techniques" to explore some simple techniques for generating new features automatically. Eventually, I present a simple idea that helps to evaluate the new features. In this article, I focus only on continuous numeric features. The dataset can be found here.
The code for this article is in Python. It is important to mention that this article is not about techniques for getting the best model ever.

Feature Transformation

Feature-Target relationship
It is rare to have a linear relationship between a feature and the target that I want to predict. If I use a linear model to represent a nonlinear relationship, I may have an under-fitting problem: the model is not complex enough to represent the relationship between the feature and the target. Before moving to more complicated models, such as tree-based models or neural networks, I can try to introduce the possible nonlinear relationship between the features and the target into the original data without increasing the complexity of the model. At this point, many questions and challenges pop up.

Challenges
How can I know what the relationship between a specific feature and the target is? How can I guarantee that the accuracy of my model will improve if I perform a transformation?
One of the popular methods to discover a nonlinear relationship is to plot the feature’s values against the target’s values. If we notice a particular shape in this plot, such as a curve, we can try to approximate it with a well-known mathematical function. However, if I have a ton of features, following this plot-based procedure to choose the best transformation for each one can take forever. Besides that, if a feature-target relationship is already linear, applying a transformation can destroy this relationship and reduce the quality of the model.
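
As a minimal sketch of such a diagnostic plot (the file name follows the Kaggle download, and “GrLivArea” is just one example feature):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Path is an assumption: the competition download names the file "train.csv".
df_train = pd.read_csv("train.csv")

# A visible curve in this scatter plot hints at a nonlinear
# feature-target relationship worth approximating with a transformation.
plt.scatter(df_train["GrLivArea"], df_train["SalePrice"], s=5, alpha=0.5)
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
```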

Solutions
Although feature transformation seems to be a difficult challenge, it can be overcome quite easily; statistics helps a lot here. It is possible to use the correlation between the features and the target before and after applying a transformation to tell whether there is a chance to improve the accuracy of the model, even before training it.
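
For a single feature, this check can be as simple as comparing Pearson correlations. The square-root transform below is only an example, not necessarily the best choice for “LotArea”:

```python
import numpy as np
import pandas as pd

df_train = pd.read_csv("train.csv")

feature = df_train["LotArea"]
target = df_train["SalePrice"]

# Pearson correlation with the target, before and after a square-root
# transformation; an increase suggests the transform may help a linear model.
print("original:", feature.corr(target))
print("sqrt    :", np.sqrt(feature).corr(target))
```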

Which transformation should I choose? Well, the best way to know is to try. There are some well-known transformations to try first, such as the power transformation

$$x \mapsto x^a,$$

where $a$ is a real number and $a \neq 0$.

Another transformation to try is the logarithm:

$$x \mapsto \log(x)$$
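
In code, these candidates can be kept in a small dictionary. The exponents below are an assumption, since the exact values to try depend on the data:

```python
import numpy as np

# Candidate transformations: x -> x**a with a != 0, plus the logarithm.
TRANSFORMS = {
    "x^0.5": lambda x: np.power(x, 0.5),
    "x^2": lambda x: np.power(x, 2),
    "x^3": lambda x: np.power(x, 3),
    "log(x)": np.log,
}
```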

Applying Feature Engineering to the "House Prices" dataset

Reading data
When downloading the data from the link above, I get two datasets: the train and the test datasets. The train dataset contains 81 columns and 1460 samples (rows), while the test dataset has 80 columns and 1459 rows. The additional column in the train dataset is the target (label) “SalePrice”, which I want to predict by solving a regression problem.

The dataset contains both numeric and categorical features. There are 43 categorical features with string labels; the rest of the features are numeric. Additionally, a description of the features is given in a text file that can be downloaded together with the data. Here I focus only on the training dataset. Before submitting to Kaggle, I would need to apply the same steps to the test dataset as well. In the code below, I load the training dataset using the Pandas framework.
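
A minimal sketch of the loading step (adjust the path to wherever the data was downloaded):

```python
import pandas as pd

# Load the "House Prices" training data.
df_train = pd.read_csv("train.csv")

print(df_train.shape)  # (1460, 81): 80 feature columns plus the target
```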

The target and the evaluation criteria

Kaggle defines the evaluation process as follows: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

Therefore, I extract the target “SalePrice” from the train dataframe and apply the log transformation to it. I then drop the target column and the Id column, which is unique for each sample, from the train dataframe.
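
A sketch of that step, using NumPy’s log to match the evaluation metric:

```python
import numpy as np
import pandas as pd

df_train = pd.read_csv("train.csv")

# Log-transform the target, mirroring Kaggle's RMSE-on-log-prices metric.
y = np.log(df_train["SalePrice"])

# Drop the target and the per-sample Id from the feature table.
X = df_train.drop(columns=["SalePrice", "Id"])
```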

The Features
Numeric features can have continuous values, such as lengths and areas. Categorical numeric features have discrete numeric values. For example, the dataset here contains a feature called “MSSubClass”, which identifies the type of dwelling involved in the sale and has only 15 categories (values 20, 30, 40, etc.). However, some numeric categorical features have a large number of categories, and in such cases it is hard to distinguish between continuous numeric and categorical numeric features. Therefore, I define a threshold for the maximum number of categories a categorical feature should have: if a numeric feature has more categories than the threshold, I treat it as continuous. For this article, I set the threshold to 30, and I consider only the continuous numeric features selected this way in the next steps:
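
A sketch of this selection rule, with the unique-value count standing in for the number of categories:

```python
import pandas as pd

df_train = pd.read_csv("train.csv")
X = df_train.drop(columns=["SalePrice", "Id"])

# A numeric column counts as continuous only if it has more unique
# values than the threshold; otherwise it is treated as categorical.
THRESHOLD = 30
numeric_cols = X.select_dtypes(include="number").columns
continuous_cols = [col for col in numeric_cols if X[col].nunique() > THRESHOLD]

print(continuous_cols)
```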

Preprocessing
Before training the model or applying feature transformation, the data is normalized using the “StandardScaler” from the Scikit-learn framework. After the normalization, the features have zero mean and a standard deviation of 1. I decided to replace the missing values (NaN) with the mean values, which are zero after the normalization. This may not be the best setup for the current dataset, but it is enough for the goal of the experiment in this article. If you are interested in testing different setups, e.g., using a different model type or filling in missing values differently, please go ahead.
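
A sketch of this preprocessing on an illustrative subset of columns. Note that scikit-learn’s StandardScaler ignores NaNs when fitting and keeps them in the output, so filling with 0 afterwards is equivalent to mean imputation:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_train = pd.read_csv("train.csv")
cols = ["LotFrontage", "LotArea", "GrLivArea"]  # illustrative subset

# Scale each column to zero mean and unit standard deviation;
# NaNs are disregarded while fitting and preserved in the result.
X_scaled = pd.DataFrame(
    StandardScaler().fit_transform(df_train[cols]), columns=cols
)

# Filling NaNs with 0 after scaling is equivalent to mean imputation.
X_scaled = X_scaled.fillna(0.0)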

Feature transformation
I use the function “apply_transformation” below to apply the transformations above to all features automatically. After applying each transformation, I calculate the correlation between the features and the target and compare it to the original case where no transformation is applied. Table 1 shows that applying some transformations to specific features improves the correlation significantly: check “LotArea” (13% correlation improvement) and “OpenPorchSF” (9% correlation improvement) as examples. In contrast, applying a transformation to other features did not improve the correlation, and in some cases the correlation even decreased: check “BsmtFinSF1” and “GarageArea” as examples.
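
The function below is a reconstruction of the idea rather than the exact original code; in particular, shifting each feature to be strictly positive before applying powers and logs is my assumption:

```python
import numpy as np
import pandas as pd

def apply_transformation(X, y, transforms):
    """Return a table of feature-target correlations per transformation."""
    results = {}
    for name, func in transforms.items():
        correlations = {}
        for col in X.columns:
            # Shift so that fractional powers and the log are defined.
            shifted = X[col] - X[col].min() + 1.0
            correlations[col] = func(shifted).corr(y)
        results[name] = correlations
    return pd.DataFrame(results)

transforms = {
    "original": lambda x: x,
    "x^0.5": np.sqrt,
    "x^2": np.square,
    "log(x)": np.log,
}

df_train = pd.read_csv("train.csv")
y = np.log(df_train["SalePrice"])
X = df_train[["LotArea", "OpenPorchSF", "BsmtFinSF1", "GarageArea"]].fillna(0.0)
print(apply_transformation(X, y, transforms).round(3))
```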

Table 1: The correlation between the chosen features and the target “SalePrice” without and with applying different transformations

Evaluate feature transformation

Baseline model
I use a simple linear model to predict the target “SalePrice”. The Scikit-learn framework provides several implementations of linear models. To avoid over-fitting, I chose the linear model known as “Ridge”, which supports regularization.

Validation Process

Here I use both R² and the mean squared error (MSE) to evaluate the model. When I speak about the accuracy of the model, I mean R².

Since the dataset is small, splitting it into training and testing datasets may not be sufficient for a reliable accuracy estimate: the testing dataset, in this case, is small and cannot represent the whole dataset. Therefore, I use the KFold method, which is implemented in Scikit-learn, to evaluate the quality of the model. The quality changes dramatically with every new split, as the KFold results below confirm. I use the mean values of R² and the mean squared error calculated over 5 folds to evaluate the accuracy of the model before and after adding new features. After running the code below, I get the accuracy of the baseline model as follows:
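
A sketch of a cross-validation loop that produces output in this shape; the shuffling, random seed, and alpha value are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

df_train = pd.read_csv("train.csv")
y = np.log(df_train["SalePrice"])

# Continuous numeric features only, scaled and mean-imputed as above.
X = df_train.drop(columns=["SalePrice", "Id"]).select_dtypes(include="number")
X = X.loc[:, X.nunique() > 30]
X = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns).fillna(0.0)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
r2_scores, mse_scores = [], []
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    model = Ridge(alpha=10.0)  # alpha is an assumption; it is tuned separately
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    predictions = model.predict(X.iloc[test_idx])
    mse = mean_squared_error(y.iloc[test_idx], predictions)
    r2 = r2_score(y.iloc[test_idx], predictions)
    mse_scores.append(mse)
    r2_scores.append(r2)
    print(f"foldnr. {fold}")
    print(f"Mean squared error linear: {mse:.4f}")
    print(f"R2 linear: {r2 * 100:.1f}")

print(f"mean R2 5 Folds: {np.mean(r2_scores) * 100:.2f}")
print(f"mean MSE 5 Folds: {np.mean(mse_scores):.4f}")
```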

foldnr. 1
Mean squared error linear: 0.0310
R2 linear: 82.2
foldnr. 2
Mean squared error linear: 0.0672
R2 linear: 62.0
foldnr. 3
Mean squared error linear: 0.0282
R2 linear: 81.0
foldnr. 4
Mean squared error linear: 0.0335
R2 linear: 77.1
foldnr. 5
Mean squared error linear: 0.0238
R2 linear: 84.1
mean R2 5 Folds: 77.27
mean MSE 5 Folds: 0.0368

It is important to mention that I search for the best “alpha” value (the regularization factor in the code above) that gives the best model accuracy. This applies to the next steps as well.
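
One way to run that search is scikit-learn’s RidgeCV; the log-spaced grid below is an assumption, as the search method is not fixed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

df_train = pd.read_csv("train.csv")
y = np.log(df_train["SalePrice"])
X = df_train.select_dtypes(include="number").drop(columns=["SalePrice", "Id"]).fillna(0.0)

# RidgeCV evaluates every alpha in the grid and keeps the best one.
model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
print("best alpha:", model.alpha_)
```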

Train the model with feature transformation
From table 1 above, I choose only the transformations that improved the correlation between the feature and the target, as shown in table 2 below. First, I apply each of these transformations to its feature separately to observe how the accuracy of the model varies when only that particular feature is transformed. In a second step, I apply some of those transformations together to obtain the highest accuracy. I present the results in table 2 below. The value “1” in the table means that no transformation was applied and the original (normalized) value was used.

Table 2: Model’s accuracy variation after applying transformations. “1” means no transformation was applied

Most of the features that showed a better correlation with the target after applying the transformations also contributed to improving the quality of the model. On the other hand, applying a transformation to some features reduced the accuracy of the model. Such results may not be easy to interpret, since many factors affect the outputs. For instance, the optimization may get stuck in a local minimum, or the dependency between the features may increase after applying the transformation, which means that the transformation does not provide the model with new information. Moreover, the regularization factor changes dramatically after applying the transformations: if a transformation improves the accuracy of the model significantly (by more than 1%), the regularization factor “alpha” is small compared to other cases.

Best Feature combination
Let us ignore the features that reduced the accuracy of the model, combine the transformations that improved it, and train the model again. As I present below, I was able to improve the accuracy of the model by about 2.25% by applying some simple transformations.
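
A sketch of the combination step; the feature-to-transformation mapping below is hypothetical, standing in for the winning entries of table 2:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping; the real choices come from table 2.
BEST_TRANSFORMS = {
    "LotArea": np.log,
    "OpenPorchSF": np.sqrt,
}

def combine_transformations(X, best_transforms):
    """Apply each feature's best transformation (shifted positive first)."""
    X = X.copy()
    for col, func in best_transforms.items():
        X[col] = func(X[col] - X[col].min() + 1.0)
    return X

df_train = pd.read_csv("train.csv")
X = combine_transformations(
    df_train[["LotArea", "OpenPorchSF", "GrLivArea"]], BEST_TRANSFORMS
)
```

Feeding the transformed features into the same 5-fold loop as above yields the figures below.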

foldnr. 1
Mean squared error linear: 0.0321
R2 linear: 81.5
foldnr. 2
Mean squared error linear: 0.0561
R2 linear: 68.3
foldnr. 3
Mean squared error linear: 0.0257
R2 linear: 82.7
foldnr. 4
Mean squared error linear: 0.0288
R2 linear: 80.3
foldnr. 5
Mean squared error linear: 0.0226
R2 linear: 84.9
mean R2 5 Folds: 79.52
mean MSE 5 Folds: 0.0331

Conclusion

Feature engineering is an essential component in an end-to-end data science pipeline. It is an iterative process which every data scientist should master to improve the accuracy of the models — even simple models. Feature engineering can be expensive and time-consuming.

In this article, I applied feature transformation to continuous numeric features. The correlation between the features and the target that I want to predict could be an important indicator to select the best transformations for the features. Nonlinear feature transformation can destroy the relationship between the feature and the target if the actual relationship is closer to linear than nonlinear.

In the second part of this series, I explore the feature mapping technique to find out how the interaction between the features can improve the model by about 5%.
