From feature engineering to feature learning in machine learning

Antoine Hue
Published in Analytics Vidhya
6 min read · Oct 28, 2019

Machine learning encompasses many aspects, from data acquisition to visualisation. In this article, we will illustrate two of them, feature engineering and feature learning, using a simple example based on the Galton dataset.

Galton is one of the parents of linear regression. In 1886 he published a paper entitled “Regression towards Mediocrity in Hereditary Stature”, in which he questioned why children’s heights seem only loosely linked to their parents’ heights. In his own words:

“It appeared from these experiments that the offspring did not tend to resemble their parent seeds in size, but to be always more mediocre than they — to be smaller than the parents, if the parents were large; to be larger than the parents, if the parents were very small.”

Part of Galton’s legacy is a dataset of children’s heights versus their parents’ heights, which we will use in this article to predict children’s heights with linear regression in Python.

We will progress in three experiments:

  • An initial linear regression on the parents’ heights using Scikit-Learn
  • Some simple feature engineering to account for the children’s gender
  • A neural network that learns this feature, using a two-layer model in Keras

You will find the full code of this article in this notebook: HTML / Jupyter

Linear regression on the two parents

As a first step, let’s perform linear regression “out of the box”, using the parents’ heights as input and the children’s height as the label, i.e. the variable to predict.

Anticipating the comparison with the neural network later on, the data is normalized before the regression: the mean is removed and the variance is scaled to 1 for each feature.
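For reference, a minimal sketch of how these scalers could be set up, assuming the data has already been split into train and test DataFrames (df_train, df_test) with ‘Mother’, ‘Father’ and ‘Height’ columns:

# Sketch: one scaler for the inputs, one for the label (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scalerX = StandardScaler().fit(df_train[['Mother', 'Father']])
scalerY = StandardScaler().fit(df_train[['Height']])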

Code excerpt in Python using Scikit-Learn’s regression:

from sklearn import linear_model

trainX_scaled = scalerX.transform(df_train[['Mother', 'Father']])
trainY_scaled = scalerY.transform(df_train[['Height']])

model1 = linear_model.LinearRegression()
model1.fit(trainX_scaled, trainY_scaled)
b1 = model1.intercept_
w1 = model1.coef_.reshape(-1)  # flatten to a 1-D array of weights

With the result: intercept b1 = 0.0, weights w1 = (0.184, 0.217)
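As a sanity check, the fitted model can be applied to a held-out test set and the predictions brought back to the original scale (df_test is the assumed test split, scaled with the same scalers):

# Sketch: predict children's heights on the test set and undo the normalization
testX_scaled = scalerX.transform(df_test[['Mother', 'Father']])
pred1 = scalerY.inverse_transform(model1.predict(testX_scaled))  # back to original units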

Initial regression’s predictions on the test set

The bivariate regression has fitted an optimal plane that does not seem to match the labels very well: the reference points in blue are spread widely around the predictions in orange.

For more explanations on linear regression, see the Bivariate function approximation notebook: HTML / Jupyter

Linear regression model taking gender into account

We may have the intuition that the model is ignoring a major piece of information: the height distribution depends on the children’s sex.

So let’s do some feature engineering and create two models:

  • A model fitted on the girls of the training dataset, which will predict the height of girls
  • A similar model for boys

# girls_train and boys_train are boolean masks selecting the rows by sex
trainX_girls_scaled = scalerX.transform(df_train[['Mother', 'Father']][girls_train])
trainY_girls_scaled = scalerY.transform(df_train[['Height']][girls_train])
model2_girl = linear_model.LinearRegression()
model2_girl.fit(trainX_girls_scaled, trainY_girls_scaled)

trainX_boys_scaled = scalerX.transform(df_train[['Mother', 'Father']][boys_train])
trainY_boys_scaled = scalerY.transform(df_train[['Height']][boys_train])
model2_boy = linear_model.LinearRegression()
model2_boy.fit(trainX_boys_scaled, trainY_boys_scaled)

These models are then applied to the test dataset according to the sex of the child whose height is to be predicted.
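A possible sketch of this per-sex dispatch, assuming girls_test and boys_test are boolean masks over the test set (mirroring girls_train and boys_train above):

# Sketch: route each test child to the model matching its sex, then measure the error
import numpy as np
from sklearn.metrics import mean_squared_error

pred2_scaled = np.empty((len(df_test), 1))
pred2_scaled[girls_test] = model2_girl.predict(testX_scaled[girls_test])
pred2_scaled[boys_test] = model2_boy.predict(testX_scaled[boys_test])

pred2 = scalerY.inverse_transform(pred2_scaled)  # back to original units
mse2 = mean_squared_error(df_test['Height'], pred2)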

With the results:

  • Fitting girls: intercept = -0.773, weights = 0.207, 0.283
  • Fitting boys: intercept = 0.710, weights = 0.209, 0.268
  • Mean squared error on the test dataset = 5.123

The intercepts are no longer 0, as the normalization was performed on the full dataset, mixing girls and boys.

The mean squared error has decreased by 60%, confirming our intuition about combining the two models.

Feature learning using a neural network

Would the machine be intelligent enough to discover this result by itself, that is, to learn the feature combining the parents’ heights and the children’s sex?

We are now switching to a neural network, and consequently to gradient-based optimization, i.e. “machine learning”.

We will use Keras. It provides a standard and simplified programming interface to TensorFlow and other machine learning frameworks.

Our network uses two layers:

  • The first layer is made of two neurons, each taking four features as input: the mother’s and father’s heights, and the sex encoded with one-hot encoding
  • The second layer combines the two outputs of the previous neurons to produce the final prediction

from tensorflow import keras  # or simply: import keras, depending on the setup

model = keras.models.Sequential([
    # Layer 1: two linear neurons over the four input features
    keras.layers.Dense(2, activation='linear',
                       input_shape=[4],
                       kernel_regularizer=keras.regularizers.l2(0.0001)),
    # Layer 2: one linear neuron combining the two outputs of layer 1
    keras.layers.Dense(1, activation='linear',
                       kernel_regularizer=keras.regularizers.l2(0.0001))
])
model.compile(optimizer='adam',
              loss=keras.losses.mean_squared_error,
              metrics=['mse'])

The one-hot scheme encodes the sex on two bits (two values), as in the table below. This is a bit of feature engineering needed to obtain a linear relationship between the sex and the prediction.

One-hot encoding of the sex
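One possible way to build the four-feature input is to concatenate the scaled parents’ heights with a one-hot encoding of the sex; the ‘Gender’ column name below is an assumption about the dataset:

# Sketch: scaled parents' heights + one-hot encoded sex = 4 features per child
import numpy as np
import pandas as pd

sex_onehot = pd.get_dummies(df_train['Gender'])  # two 0/1 columns, one per sex
trainX_nn = np.hstack([trainX_scaled, sex_onehot.to_numpy()])  # shape (n_samples, 4)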

All neurons behave like the predict function of the linear regression: they compute the weighted sum of their inputs and add an intercept (bias).

Consequently, the number of coefficients is:

  • 2 × 4 weights + 2 biases for the first layer
  • 2 weights + 1 bias for the second layer

Another difference from the previous experiment: the coefficients are identified iteratively by minimizing the cost function (the mean squared error), and this minimization is based on the computation of the gradient of that function.
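A minimal training sketch, where the number of epochs and the batch size are illustrative choices rather than the notebook’s exact settings:

# Sketch: iterative gradient-based fit on the four-feature input built above
history = model.fit(trainX_nn, trainY_scaled,
                    epochs=500, batch_size=32, verbose=0)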

Gradient descent loss function and the mean square error (MSE)

For more explanations on gradient descent with Keras, see the Bivariate function approximation with Keras notebook: HTML / Jupyter

The mean squared error of this experiment is 5.167, which is very close to the one obtained with the independent regressions for girls and boys.

Measured and predicted heights as a function of each parent’s height, using a 2-layer neural network

The machine has been able to learn the feature!

Does this mean the machine has deliberately identified the relationship between sex and height?

Not exactly as we would expect, as we will show in the next section.

Comparison of the two “gendered” models

The prediction using the two alternative linear regression models could also be implemented as a neural network with two layers:

  • The first layer has two neurons, one for each linear regression model. Only one model will “activate”, that is, produce a non-null output, depending on the sex (one-hot encoded)
  • The second layer is a simple adder of the two outputs of layer 1

Using this topology, we can now compare the two solutions, using the first child of the test sample as an example (and one example optimization run).
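The trained coefficients can be read back from the Keras model for this side-by-side comparison, for instance:

# Sketch: extract the weights and biases of both Dense layers
(W1, b1), (W2, b2) = [layer.get_weights() for layer in model.layers]
print("Layer 1 weights:\n", W1, "\nbiases:", b1)
print("Layer 2 weights:\n", W2, "\nbiases:", b2)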

Coefficient comparison for the combined regression model and the 2-layer neural net

In the tables above, we see that the coefficients of the two networks are quite different in magnitude and sign. For the trained network, the coefficients of neuron 2 are very small: it acts as a correction on neuron 1.

Theory says that, in this context, the linear regression optimum is unique. In the case of the neural network, however, there are clearly several optima: different weight combinations can produce the same predictions.

Conclusion

Starting from a simple experiment, we have taken a few engineering steps to improve it:

  • Feature engineering, to one-hot encode the sex and combine two linear regression models
  • Feature learning, using a neural network and gradient-descent optimization to learn the relationship between the sex and the height

If you enjoyed this post, visit my GitHub repository for more notebooks on “Learning data science step by step”.

And add a star to the project to raise its visibility.
