#04 Feature Engineering: Principles for choosing the right features

The fundamental principles of feature generation

Akira Takezawa
Coldstart.ml
6 min read · Feb 6, 2019


(Cover photo: Unsplash)

Hola! Welcome to “Short-Cut Machine Learning Series”.

This article is for anyone who wants to know …

  • Reason: why is feature engineering so important?
  • Big Picture: must-know skills for Feature Selection and Extraction
  • Code: the simplest Python code ever

— — —

Why should you read this?

As a machine learning engineer, you will be required to show your skill particularly in the following three steps:

  1. Feature Engineering: select and extract features to feed the model
  2. Model Selection: compare and choose the best ML model
  3. Generalization: tune your hyperparameters

So today’s topic is one of the most important ones: “Feature Engineering”. Let’s get started!

— — —

Menu

  1. Why is Feature Engineering always necessary for ML?
  2. The overall picture for Feature Engineering
  3. Type A: Feature Selection
  4. Type B: Feature Extraction (Dimensionality Reduction)
  5. References

1. Why is Feature Engineering always necessary for ML?

(Image source: https://www.pinterest.com/pin/173881235598124187/)

Basically, a machine learning algorithm evaluates every feature it is given impartially. As a result, it may find correlations even with features that have no logical relationship to the labels.

Here are four vital benefits of feature engineering (a small sketch illustrating the point follows the list).

  • Improving the accuracy of the ML model
  • Solving the overfitting problem
  • Speeding up computation
  • Making the ML process easier to understand
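
Here is a minimal sketch of that point (my own illustration, not from the original article): it appends random, logically unrelated features to the Iris data and compares cross-validated accuracy with and without them.

# Illustration only: noise features that have no logical relation to the labels
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 20))])  # add 20 noise columns

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())        # 4 informative features
print(cross_val_score(clf, X_noisy, y, cv=5).mean())  # 4 informative + 20 noise features

With the noise columns added, the cross-validated score is typically a little lower, which is exactly why we care about choosing features before feeding the model.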

— — —

2. The overall picture for Feature Engineering

In order to obtain the above-mentioned benefits, we basically try to reduce the number (dimension) of features. In machine learning, there are two ways to do this.

  • Type A: Feature Selection
  • Type B: Feature Extraction (Dimensionality Reduction)

I’m assuming you have already heard mysterious words like PCA or LDA. Those techniques fall under this concept.

5 Keywords: prerequisite concepts

Before going deeper, you had better google these words and keep at least their names in mind to follow this feature engineering topic (a small example of the first one appears below the list).

  • Correlation
  • Dimensionality Reduction
  • PCA
  • LDA
  • SVD
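
As a tiny warm-up for the first keyword, here is the pairwise correlation of the Iris features computed with numpy (my own illustration, not part of the original article):

# Pairwise Pearson correlation of the four iris features (illustration only)
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
corr = np.corrcoef(X, rowvar=False)  # 4 x 4 correlation matrix
print(np.round(corr, 2))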

Code: Scikit-learn Module for Feature Engineering

Well, are you ready to start? Don’t worry, the coding part of feature engineering is already well simplified by the scikit-learn library. (Merci, scikit-learn!)

# Feature Selection
sklearn.feature_selection
# Feature Extraction (Dimensionality Reduction)
sklearn.decomposition

3. Type A: Feature Selection

In the process of machine learning, there are two points at which we can select efficient features for our model: before training the model and while training the model. I will explain two must-know methods for each.

3–1. before training model

  • Statistical method: Removing features with low variance
  • Filter method: Univariate feature selection

3–2. while training model

  • Wrapper method: Recursive feature elimination
  • Embedded method: L1-based feature selection

Now I will briefly explain each feature selection technique with a few lines of code. I recommend reading the official definitions as well for a deeper understanding (they are not my main concern here).

3–1. before training model

I will use the Boston dataset for this part. So let’s get started!

from sklearn.datasets import load_boston
boston = load_boston()
# X has 13 features
X, y = boston.data, boston.target

Statistical method: Removing features with low variance

(Image source: https://en.wikipedia.org/wiki/Variance)

This means removing features whose variance is low. Since their variation is small, such features barely affect our prediction model.

# we can decide the criterion for low variance with "threshold"
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
sel_X = sel.fit_transform(X)
print(X.shape, sel_X.shape)
>>> (506, 13) (506, 11)  # before, after

Now you can see features are reduced from 13 to 11!
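
As an optional check (my own addition, continuing from the snippet above), you can ask the selector which columns it dropped:

# continues from the VarianceThreshold example above
import numpy as np
print(np.array(boston.feature_names)[~sel.get_support()])  # the dropped features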

Filter method: Univariate feature selection

(Image source: https://joomik.github.io/Housing/)

This means measuring the statistical relationship between each explanatory variable (X) and the objective variable (y), then keeping only the k features with the strongest relationship.

# we can choose the number of features to keep with "k"
from sklearn.feature_selection import SelectKBest, f_regression
sel = SelectKBest(score_func=f_regression, k=7)
sel_X = sel.fit_transform(X, y)
print(X.shape, sel_X.shape)
>>> (506, 13) (506, 7)  # before, after
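
If you want to see what drove the selection (again my own addition, continuing from the snippet above), the univariate F-scores are stored on the selector:

# continues from the SelectKBest example above
for name, score in zip(boston.feature_names, sel.scores_):
    print(name, round(score, 1))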

3–2. while training model

I will use the Iris dataset for this part. So let’s get started!

from sklearn.datasets import load_iris
iris = load_iris()
# X has 4 features
X, y = iris.data, iris.target

Wrapper method: Recursive feature elimination

(Image source: KDnuggets)

While training, the features with the smallest parameter weights are removed one after another, and the loop continues until the number of features drops to the specified n.

# we can choose the number of features to keep with "n_features_to_select"
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
estimator = SVR(kernel="linear")
sel = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)
sel_X = sel.transform(X)
# sel.ranking_ holds one rank per original feature (1 = selected)
print(X.shape, sel_X.shape)
>>> (150, 4) (150, 2)  # before, after

Now you can see features are reduced from 4 to 2!
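
As a quick follow-up (my own addition, continuing from the snippet above), you can map the selector's support mask back to the iris feature names:

# continues from the RFE example above
import numpy as np
print(np.array(iris.feature_names)[sel.support_])  # the two surviving features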

Embedded method: L1-based feature selection

(Image source: Alvaro Neuenfeldt Júnior)

An L1 penalty drives the coefficients of unimportant features to exactly zero, so SelectFromModel simply keeps the features whose coefficients are non-zero.

from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
sel = SelectFromModel(lsvc, prefit=True)
sel_X = sel.transform(X)
print(X.shape, sel_X.shape)
>>> (150, 4) (150, 2)  # before, after
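
To see the embedded selection at work (my own addition, continuing from the snippet above), inspect the sparsified coefficients and the resulting mask:

# continues from the L1-based selection example above
print(lsvc.coef_)         # many entries are exactly zero thanks to the L1 penalty
print(sel.get_support())  # boolean mask of the features that were kept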

4. Type B: Feature Extraction

(Figure: copyright of Akira Takezawa)

Here is the list of Feature Extraction Methods:

  • PCA (Principal Component Analysis)
  • LDA (Linear Discriminant Analysis)
  • SVD (Singular Value Decomposition)
  • TSNE (t-Distributed Stochastic Neighbor Embedding)
  • Word2Vec (Natural language processing)

So in this article, I will compare PCA and SVD with visualization.

Fortunately, PCA and SVD are both available in scikit-learn’s sklearn.decomposition module. I will use the Iris dataset here.

# load scikit-learn built in dataset "iris"
from sklearn import datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
print(X_iris)
>>> output
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
.... # total 150 rows.

PCA (Principal Component Analysis)

  1. Subtract the mean from every feature
  2. Scale each dimension by its variance
  3. Compute the covariance matrix of X
  4. Compute the K eigenvectors of the covariance matrix with the largest eigenvalues

Finally, these eigenvectors are the principal components of the features.
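
If you want to see those steps concretely, here is a minimal numpy sketch of my own (it skips the optional scaling in step 2, just as scikit-learn’s PCA does by default):

# Hand-rolled PCA on iris (illustration only)
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)             # step 1: subtract the mean
cov = np.cov(X_centered, rowvar=False)      # step 3: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition
order = np.argsort(eigvals)[::-1]           # sort eigenvalues, largest first
components = eigvecs[:, order[:2]]          # step 4: top-2 eigenvectors
X_reduced = X_centered @ components         # project onto the principal components
print(X_reduced.shape)                      # (150, 2)

In practice, scikit-learn wraps all of this in a couple of lines: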

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris)

The cumulative contribution rate (the running sum of the explained variance ratios) indicates how well the reduced dimensions describe the original features. The closer it gets to 1, the more successful the dimensionality reduction is.

# variance explained by each new component
pca.explained_variance_ratio_
>>> [0.92461872 0.05306648] # the new 2 dimensions for the features
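
To get the cumulative contribution rate itself (a small addition of my own, continuing from the snippet above), take the running sum of those ratios:

# continues from the PCA example above
import numpy as np
print(np.cumsum(pca.explained_variance_ratio_))  # cumulative contribution rate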

SVD (Singular Value Decomposition)

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X_iris)
(Visualization: copyright of Akira Takezawa)
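
The original visualization is not reproduced here, so below is a minimal matplotlib sketch of my own that compares the two 2D projections side by side:

# Side-by-side scatter plots of the PCA and TruncatedSVD projections (illustration only)
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_pca = PCA(n_components=2).fit_transform(X_iris)
X_svd = TruncatedSVD(n_components=2).fit_transform(X_iris)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, X_2d, title in zip(axes, [X_pca, X_svd], ["PCA", "TruncatedSVD"]):
    ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_iris)
    ax.set_title(title)
plt.show()

Note that TruncatedSVD, unlike PCA, does not center the data first; that is the main practical difference between the two.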
