#04 Feature Engineering: Principles for choosing the right features


Akira Takezawa
6 min read · Feb 6, 2019



Hola! Welcome to “Short-Cut Machine Learning Series”.

This post is for readers who want to know:

  • Reason: why is feature engineering so important?
  • Big Picture: must-know skills for feature selection and extraction
  • Code: the simplest Python code possible

Why should you read this?

As a machine learning engineer, you will be expected to show your skill in the following 3 steps in particular:

  1. Feature Engineering: select and extract the features to feed the model
  2. Model Selection: compare and choose the best ML model
  3. Generalization: tune your hyperparameters

So today’s topic is one of the most important ones: “Feature Engineering”. Let’s get started!

  1. Why is Feature Engineering always necessary for ML?
  2. The overall picture for Feature Engineering
  3. Type A: Feature Selection
  4. Type B: Feature Extraction (Dimensionality Reduction)
  5. References

1. Why is Feature Engineering necessary for ML?


A machine learning algorithm treats every feature it is given equally. As a result, it can pick up correlations even from features that have no logical relationship to the labels.
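To make this concrete, here is a small sketch on purely synthetic data (the sizes and the seed are arbitrary assumptions of mine, not from any real dataset). It generates 200 random features that have nothing to do with the label and checks how strongly the "best" of them correlates with it by chance.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, arbitrary choice
n_samples, n_features = 50, 200      # hypothetical sizes for illustration

y = rng.normal(size=n_samples)                       # the "label"
X_noise = rng.normal(size=(n_samples, n_features))   # features unrelated to y

# Pearson correlation between each noise feature and the label
corr = np.array([np.corrcoef(X_noise[:, j], y)[0, 1] for j in range(n_features)])
max_abs_corr = np.abs(corr).max()
print(f"largest |correlation| among pure-noise features: {max_abs_corr:.2f}")
```

With few samples and many features, some noise column will almost always look related to the label, which is exactly what a model can overfit to.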

And here are 4 vital benefits of feature engineering.

  • Improving the accuracy of the ML model
  • Reducing overfitting
  • Speeding up computation
  • Making the ML process easier to understand

2. The overall picture for Feature Engineering

To obtain the benefits above, we basically try to reduce the number (dimension) of features. In machine learning, there are two ways to do that.

  • Type A: Feature Selection
  • Type B: Feature Extraction (Dimensionality Reduction)

You have probably already heard mysterious words like PCA or LDA. Those techniques belong to Type B.

5 Keywords: prerequisite concepts

Before going deeper, you had better google a few terms and keep at least their names in mind so that you can follow this feature engineering topic.

  • Correlation
  • Dimensionality Reduction
  • PCA
  • LDA
  • SVD

Code: Scikit-learn Modules for Feature Engineering

Well, are you ready to start? Don’t worry, the coding part of feature engineering is already well simplified by the scikit-learn library. (Merci Google!)

# Feature Selection
from sklearn import feature_selection
# Feature Extraction (Dimensionality Reduction)
from sklearn import decomposition

3. Type A: Feature Selection

In the machine learning process, there are two points at which we can select useful features for our model: before training the model, and while training it. I will explain two must-know methods for each.

3–1. before training model

  • Statistical method: Removing features with low variance
  • Filter method: Univariate feature selection

3–2. while training model

  • Wrapper method: Recursive feature elimination
  • Embedded method: L1-based feature selection

Now I will briefly explain each feature selection technique with a few lines of code. I recommend reading their official definitions as well for a deeper understanding (the theory is not my main focus here).

3–1. before training model

I will use the Boston dataset for this part. So let’s get started!

# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
boston = load_boston()
# X has 13 features
X, y = boston.data, boston.target

Statistical method: Removing features with low variance


This means removing features whose variance is below a threshold. Because such a feature barely changes from sample to sample, it carries little information for our prediction model.

# we can decide the criterion for low variance with "threshold"
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
sel_X = sel.fit_transform(X)
print(X.shape, sel_X.shape)
>>> (506, 13) # Before
>>> (506, 11) # After

Now you can see features are reduced from 13 to 11!
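The threshold 0.8 * (1 - 0.8) comes from the variance of a 0/1 (Bernoulli) feature, p(1 - p): it drops boolean features that take the same value in more than 80% of the samples. Here is a minimal sketch on a toy boolean matrix (the data is made up for illustration, in the style of the scikit-learn user guide).

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# toy data: 6 samples, 3 boolean features (hypothetical values)
X_toy = np.array([[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 1, 0],
                  [0, 1, 1]])

sel = VarianceThreshold(threshold=0.8 * (1 - 0.8))
X_reduced = sel.fit_transform(X_toy)

print(sel.variances_)     # per-column variance, p * (1 - p) for 0/1 data
print(sel.get_support())  # boolean mask of the columns that are kept
print(X_toy.shape, X_reduced.shape)
```

The first column is 0 in 5 of the 6 rows, so its variance is 5/36 (about 0.139), which is below 0.16, and only that column is dropped.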

Filter method: Univariate feature selection


This means scoring the relationship between each explanatory variable (X) and the objective variable (y), one feature at a time, and then keeping only the features with the highest scores.

# we can choose the number of features by "k"
from sklearn.feature_selection import SelectKBest, f_regression
sel = SelectKBest(score_func=f_regression, k=7)
sel_X = sel.fit_transform(X, y)
print(X.shape, sel_X.shape)
>>> (506, 13) # Before
>>> (506, 7) # After
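To see which columns SelectKBest kept and why, you can inspect scores_ and get_support(). Below is a sketch on a synthetic regression problem; make_regression and its parameters are my own assumption, chosen only so the example runs without the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# synthetic data: 10 features, only 3 of which actually drive y
X_syn, y_syn = make_regression(n_samples=200, n_features=10, n_informative=3,
                               noise=10.0, random_state=0)

sel = SelectKBest(score_func=f_regression, k=3)
X_sel = sel.fit_transform(X_syn, y_syn)

# F-statistic of each feature against y (higher = stronger linear relationship)
for idx, (score, kept) in enumerate(zip(sel.scores_, sel.get_support())):
    print(f"feature {idx}: F = {score:8.2f}  kept = {kept}")
print(X_syn.shape, X_sel.shape)
```

Because each feature is scored on its own, this filter is fast, but it cannot notice features that are only useful in combination.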

3–2. while training model

I will use the Iris dataset for this part. So let’s get started!

from sklearn.datasets import load_iris
iris = load_iris()
# X has 4 features
X, y = iris.data, iris.target

Wrapper method: Recursive feature elimination


During training, the features with the smallest weights are removed step by step and the model is refit each time. The loop continues until only the specified number of features (n_features_to_select) remains.

# we can choose the number of features by "n_features_to_select"
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
estimator = SVR(kernel="linear")
sel = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)
sel_X = sel.transform(X)
print(sel.ranking_, X.shape, sel_X.shape)
# sel.ranking_ holds one rank per feature (4 values here); the selected features have rank 1
>>> (150, 4) # Before
>>> (150, 2) # After

Now you can see features are reduced from 4 to 2!
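The loop that RFE runs can also be sketched by hand. This is a simplified illustration, not scikit-learn's implementation; using LogisticRegression and the squared-coefficient importance are my own choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
n_features_to_select = 2
remaining = list(range(X.shape[1]))   # column indices still in play
eliminated = []                       # order in which columns were dropped

while len(remaining) > n_features_to_select:
    # refit the model on the features that are still left
    model = LogisticRegression(max_iter=1000).fit(X[:, remaining], y)
    # one weight per class and feature: sum the squares over the classes
    importance = (model.coef_ ** 2).sum(axis=0)
    weakest = int(np.argmin(importance))
    eliminated.append(remaining.pop(weakest))

print("eliminated (in order):", eliminated)
print("selected columns:", remaining)
```

Refitting after every removal is what makes this a wrapper method: the importance of a feature is re-evaluated in the context of the features that remain.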

Embedded method: L1-based feature selection

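# An L1 penalty drives the weights of unhelpful features to exactly zero;
# SelectFromModel then keeps only the features whose weights are non-zero.
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
sel = SelectFromModel(lsvc, prefit=True)
sel_X = sel.transform(X)
# lsvc.coef_ holds the learned (sparse) weights
print(X.shape, sel_X.shape)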
>>> (150, 4) # Before
>>> (150, 2) # After
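The idea behind the embedded method is that the selection happens inside training itself: the L1 penalty pushes some weights to exactly zero, and those features are dropped. Here is a sketch that makes the sparsity visible; the two C values and max_iter are arbitrary choices of mine, and how many features remain depends on C.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

for C in (0.01, 1.0):   # smaller C = stronger L1 penalty
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    selector = SelectFromModel(lsvc, prefit=True)
    X_sel = selector.transform(X)
    n_zero = int(np.sum(lsvc.coef_ == 0))   # weights driven exactly to zero
    print(f"C={C}: zero weights = {n_zero}/{lsvc.coef_.size}, "
          f"kept features = {X_sel.shape[1]}")
```

Comparing the two lines of output shows how the strength of the penalty controls the number of surviving features.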

4. Type B: Feature Extraction

Here is the list of Feature Extraction Methods:

  • PCA (Principal Component Analysis)
  • LDA (Linear Discriminant Analysis)
  • SVD (Singular Value Decomposition)
  • t-SNE (t-Distributed Stochastic Neighbor Embedding)
  • Word2Vec (Natural language processing)

In this article, I will compare PCA and SVD.

Fortunately, both are available in scikit-learn’s sklearn.decomposition module. I will use the iris dataset here.

# load the scikit-learn built-in dataset "iris"
from sklearn import datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
print(X_iris)
>>> output
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
.... # total 150 rows.

PCA (Principal Component Analysis)

  1. Subtract the mean of each feature from the data
  2. Scale each dimension by its standard deviation (optional standardization)
  3. Compute the covariance matrix
  4. Compute the eigenvectors of the covariance matrix that correspond to the K largest eigenvalues

Finally, these eigenvectors are the principal components of the features.
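The steps above can be sketched in NumPy. This is a minimal illustration, not how scikit-learn computes PCA internally; I skip the scaling step (step 2) because sklearn's PCA only centers the data, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
k = 2                                      # number of components to keep

X_centered = X - X.mean(axis=0)            # step 1: subtract the mean
cov = np.cov(X_centered, rowvar=False)     # step 3: covariance matrix (4 x 4)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh handles symmetric matrices

order = np.argsort(eigvals)[::-1]          # step 4: largest eigenvalue first
components = eigvecs[:, order[:k]]         # top-k eigenvectors
X_projected = X_centered @ components      # data in the new k-dim space

explained_ratio = eigvals[order] / eigvals.sum()
print(explained_ratio[:k])
print(X_projected.shape)
```

The printed ratios should agree with explained_variance_ratio_ from sklearn.decomposition.PCA, which is a handy sanity check for the hand-written version.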

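from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # 2 components, matching the output below
X_pca = pca.fit_transform(X_iris)

The contribution rate (explained variance ratio) indicates how much of the variance in the original features each new dimension retains. The closer the cumulative rate (the sum over the kept components) is to 1, the better the dimensionality reduction preserves the features. Here the two components together retain about 98% of the variance.

# Contribution rate of each component
print(pca.explained_variance_ratio_)
>>> [0.92461872 0.05306648] # the 2 new dimensions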
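To get the cumulative rate, sum the per-component ratios. Here is a short sketch using np.cumsum; the 0.95 target is an arbitrary threshold I chose for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca_full = PCA().fit(X)                      # keep all 4 components

ratios = pca_full.explained_variance_ratio_  # variance share of each component
cumulative = np.cumsum(ratios)               # cumulative contribution rate

target = 0.95                                # hypothetical target
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(cumulative)
print(f"components needed to reach {target:.0%}: {n_needed}")
```

On iris the first component alone stays below the target, so two components are needed.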

SVD (Singular Value Decomposition)

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(2)
X_svd = svd.fit_transform(X_iris)  # reduce iris to 2 dimensions
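One practical difference when comparing the two: TruncatedSVD does not center the data before decomposing it, while PCA does. Here is a sketch to check this on iris; algorithm="arpack" is my own choice, made to avoid the randomized default solver.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X = load_iris().data
X_centered = X - X.mean(axis=0)   # what PCA does internally before the SVD

pca = PCA(n_components=2).fit(X)
svd_raw = TruncatedSVD(n_components=2, algorithm="arpack").fit(X)
svd_centered = TruncatedSVD(n_components=2, algorithm="arpack").fit(X_centered)

print("PCA                  :", pca.singular_values_)
print("SVD on raw data      :", svd_raw.singular_values_)
print("SVD on centered data :", svd_centered.singular_values_)
```

On centered data the two methods find the same singular values, so the difference you see on raw data comes from the mean, not from the decomposition itself.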

