Feature Engineering — Automation and Evaluation — Part 2

Maher Deeb
KI labs Engineering
5 min read · Jun 25, 2019

In the previous article, I applied feature transformation to improve the linear relationship between the features and the target that I wanted to predict. I used a dataset from the Kaggle competition “House Prices: Advanced Regression Techniques” to illustrate and evaluate the new features, and a simple regression model known as Ridge to predict the house price “SalePrice”. Applying feature transformation improved the accuracy of the model by more than 2%.

In this part, I continue using the same dataset to create new features by applying another feature engineering technique: mapping features. As in the previous article, I use both R² and MSE to evaluate the model. When I speak about the accuracy of the model, I mean R².

Mapping features

In addition to feature transformation, it is possible to try to capture the nonlinear relationship between the features and the target by mapping features. To explain what I mean by mapping features, let us take the linear relationship between the target “y” and the feature set [x], which in standard notation looks as below:
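$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$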

where [w] is the set of weights that are obtained by training the model with an optimization method such as Gradient Descent.

After adding new complex terms by multiplying the values of the features together (element-wise multiplication), the function can look as below:
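$$y = w_0 + w_1 x_1 + \dots + w_n x_n + w_{n+1}\, x_1 x_2 + w_{n+2}\, x_1 x_3 + \dots$$

(shown here with pairwise mixing terms as an example; higher mixing degrees add terms such as x1·x2·x3 in the same way)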

According to the equation above, mapping features generates complex polynomial functions that cover some of the nonlinear transformations I mentioned in part 1. Mapping features also helps to discover how the interaction between the features influences the model’s accuracy.

Implementation

The function below maps the given features together automatically. I have to define the highest polynomial degree “map_degree” and the highest degree of the mixing terms “terms_mix_degree”. For example, if I define the highest polynomial degree as 3, I get for each feature x:
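$$x, \quad x^2, \quad x^3$$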

If the highest degree of the mixing terms is 3, then, for example with three features x1, x2 and x3, I get combinations of the following form:
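$$x_1^{\,i}\; x_2^{\,j}\; x_3^{\,k}$$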

where i, j and k are zero or positive integers. If the number of features is large, it is important to apply the function to a limited number of features only; otherwise, the number of generated features becomes enormous. The total number of features can be calculated using the combination rule “C^R(n, r)”: the number of ways to choose a sample of “r” elements from a set of “n” distinct objects, where order does not matter and replacements are allowed:
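$$C^{R}(n, r) = \binom{n + r - 1}{r} = \frac{(n + r - 1)!}{r!\,(n - 1)!}$$

For example, mixing r = 3 terms out of n = 5 features with replacement gives C^R(5, 3) = 7! / (3!·4!) = 35 possible terms.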

It is possible to limit the features that are considered for the feature mapping process by defining a list “features_numbers_list”, which contains the indexes of the desired features.

If “map_degree = 3” and “terms_mix_degree = 3”, the equation above has to be applied for each “map_degree = 1, 2, 3” and each “terms_mix_degree = 1, 2, 3”. The total number of features is the sum of the results.
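A simplified sketch of such a mapping function is shown below. It uses itertools.combinations_with_replacement to generate the mixing terms; the parameter names follow the description above, but the exact signature and internals of my original “features_mapping” are simplified here:

```python
from itertools import combinations_with_replacement

import numpy as np


def features_mapping(df, features_numbers_list, map_degree=2, terms_mix_degree=2):
    """Create polynomial and mixing (interaction) features.

    df                    : dataframe with the original features
    features_numbers_list : indexes of the columns to map
    map_degree            : highest polynomial degree for a single feature
    terms_mix_degree      : highest number of features multiplied together
    """
    df_mapped = df.copy()
    selected_cols = [df.columns[i] for i in features_numbers_list]

    # Pure polynomial terms: x^2, ..., x^map_degree for every selected feature.
    for col in selected_cols:
        for degree in range(2, map_degree + 1):
            df_mapped[f"{col}_pow_{degree}"] = df[col] ** degree

    # Mixing terms: element-wise products of 2 .. terms_mix_degree features,
    # chosen with replacement, e.g. x1*x2, x1*x1*x2, x1*x2*x3, ...
    for r in range(2, terms_mix_degree + 1):
        for combo in combinations_with_replacement(selected_cols, r):
            if len(set(combo)) == 1:
                continue  # pure powers are already covered above
            new_col = "_x_".join(combo)
            if new_col not in df_mapped.columns:
                df_mapped[new_col] = np.prod([df[c].values for c in combo], axis=0)

    return df_mapped
```

scikit-learn offers a similar, more general implementation in sklearn.preprocessing.PolynomialFeatures, which can be a good starting point if full control over the generated terms is not needed.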

Application

The function below applies the “features_mapping” function to the given datasets. For simplicity, I rename the columns to “col_i”, where “i” is zero or a positive integer. The output is a dataframe that contains both the original features and the features newly created by element-wise multiplication.
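The application step could look roughly as follows; the renaming to “col_i” follows the description above, while the exact original function differs in detail:

```python
def apply_features_mapping(df_train, df_test, features_numbers_list,
                           map_degree=2, terms_mix_degree=2):
    """Rename the columns to col_i and map the selected features
    for both the training set and the test set."""
    mapped = []
    for df in (df_train, df_test):
        df = df.copy()
        # Rename columns to col_0, col_1, ... for uniform handling.
        df.columns = [f"col_{i}" for i in range(df.shape[1])]
        mapped.append(
            features_mapping(df, features_numbers_list,
                             map_degree=map_degree,
                             terms_mix_degree=terms_mix_degree)
        )
    return mapped[0], mapped[1]
```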

In this example, I tested multiple combinations of “map_degree” and “terms_mix_degree”. The results are shown in Table 1.

There are interesting results here:
1. Adding more features increases the risk of over-fitting. Therefore, using regularization or similar methods to avoid over-fitting is important.
2. Testing multiple setups is very important to obtain the best combination of the parameters that I am using to create new features.

Table 1: The accuracy of the model for multiple combinations of “map_degree” and “terms_mix_degree”.

The chosen model

Let us consider the best feature mapping setup from Table 1: I set both “map_degree” and “terms_mix_degree” to 2. Below are the results of the best model with the regularization factor alpha = 197. I obtained an accuracy improvement of about 5% by mapping features together.

foldnr. 1
Mean squared error linear: 0.03
R2 linear: 0.83
foldnr. 2
Mean squared error linear: 0.03
R2 linear: 0.82
foldnr. 3
Mean squared error linear: 0.02
R2 linear: 0.83
foldnr. 4
Mean squared error linear: 0.03
R2 linear: 0.81
foldnr. 5
Mean squared error linear: 0.03
R2 linear: 0.81
mean R2 5 Folds: 82.11
mean MSE 5 Folds: 0.0284
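A minimal sketch of the 5-fold evaluation that produces output in this form could look as follows; it assumes the mapped features X and the target y are already prepared as NumPy arrays:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold


def evaluate_ridge(X, y, alpha=197, n_folds=5, random_state=42):
    """Cross-validate a Ridge model and report MSE and R2 per fold."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    mse_scores, r2_scores = [], []

    for fold_nr, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
        model = Ridge(alpha=alpha)
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])

        mse_scores.append(mean_squared_error(y[test_idx], y_pred))
        r2_scores.append(r2_score(y[test_idx], y_pred))

        print(f"foldnr. {fold_nr}")
        print(f"Mean squared error linear: {mse_scores[-1]:.2f}")
        print(f"R2 linear: {r2_scores[-1]:.2f}")

    print(f"mean R2 {n_folds} Folds: {np.mean(r2_scores) * 100:.2f}")
    print(f"mean MSE {n_folds} Folds: {np.mean(mse_scores):.4f}")
    return np.mean(r2_scores), np.mean(mse_scores)
```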

Conclusion

Mapping features captures the influence of the interaction between the features on the accuracy of the model. Testing multiple setups helps to figure out the optimal number of features for improving the accuracy of the current model. For the given dataset, I managed to improve the accuracy of a simple linear model by about 5% compared to the baseline model.

In all cases, it is important to keep an eye on over-fitting and to use regularization methods to avoid a severe degradation of the model quality after adding new features.

In the next part, I explore some techniques to create new features based on statistics. I use the open-source tool Featuretools to automate the feature engineering process.
