#04 Feature Engineering: Principles for choosing the right features
The fundamental principles of feature engineering
Hola! Welcome to the “Short-Cut Machine Learning Series”.
This article is for anyone who wants to know …
- Reason: why is feature engineering so important?
- Big Picture: must-know skills for Feature Selection and Extraction
- Code: the simplest Python code possible
— — —
Why should you read this?
As a machine learning engineer, you will be expected to demonstrate your skill particularly in the following three steps:
- Feature Engineering: select and extract features to feed the model
- Model Selection: compare and choose the best ML model
- Generalization: tune your hyperparameters
So today’s topic is one of the most important ones: “Feature Engineering”. Let’s get started!
— — —
Menu
- Why is Feature Engineering always necessary for ML?
- The overall picture for Feature Engineering
- Type A: Feature Selection
- Type B: Feature Extraction (Dimensionality Reduction)
- References
1. Why is Feature Engineering necessary for ML?
Basically, a machine learning algorithm treats every feature it is given equally. As a result, it can pick up correlations even with features that have no logical relationship to the labels.
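Here is a tiny sketch of that problem with made-up data: a plain linear regression still assigns a non-zero weight to a column of pure noise, simply because it evaluates every column it is given.
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(0)
useful = rng.normal(size=200)   # genuinely related to the target
noise = rng.normal(size=200)    # logically unrelated to the target
y = 3 * useful + rng.normal(scale=0.1, size=200)
X = np.column_stack([useful, noise])
model = LinearRegression().fit(X, y)
print(model.coef_)  # the noise column still receives a small non-zero weight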
Here are four vital benefits of feature engineering:
- Improving the accuracy of the ML model
- Reducing overfitting
- Speeding up computation
- Making the ML process easier to interpret
— — —
2. The overall picture for Feature Engineering
In order to obtain the benefits listed above, we basically try to reduce the number (dimensionality) of the features. In machine learning, there are two ways to do this:
- Type A: Feature Selection
- Type B: Feature Extraction (Dimensionality Reduction)
I’m assuming you have already heard mysterious words like PCA or LDA. Those techniques fall under this concept.
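To make the difference concrete before diving in, here is a minimal sketch on the iris data (the same dataset used later in this article): feature selection keeps a subset of the original columns, while feature extraction builds new columns out of combinations of the originals.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
X, y = load_iris(return_X_y=True)
# Type A: keep 2 of the 4 original columns
X_sel = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
# Type B: create 2 new columns that mix all 4 originals
X_ext = PCA(n_components=2).fit_transform(X)
print(X_sel.shape, X_ext.shape)
>>> (150, 2) (150, 2)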
5 Keywords: prerequisite concepts
Before going deeper, you should google these terms and keep at least their names in mind; they will help you follow the rest of this feature engineering topic.
- Correlation
- Dimensionality Reduction
- PCA
- LDA
- SVD
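As one example, the “Correlation” keyword can be checked in a single NumPy call; the arrays below are made up purely for illustration.
import numpy as np
rng = np.random.default_rng(0)
y = np.arange(100, dtype=float)
x = 2 * y + rng.normal(scale=5, size=100)   # strongly related to y
z = rng.normal(size=100)                    # unrelated noise
print(np.corrcoef(x, y)[0, 1])              # close to 1
print(np.corrcoef(z, y)[0, 1])              # close to 0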
Code: Scikit-learn modules for Feature Engineering
Well, are you ready to start? Don’t worry, the coding part of feature engineering is already well simplified by the scikit-learn library. (Merci, scikit-learn developers!)
# Feature Selection
sklearn.feature_selection
# Feature Extraction (Dimensionality Reduction)
sklearn.decomposition
3. Type A: Feature Selection
In the machine learning process, there are two points at which we can select efficient features for our model: before training the model and while training the model. For each, I will explain two must-know feature selection methods.
3–1. before training model
- Statistical method: Removing features with low variance
- Filter method: Univariate feature selection
3–2. while training model
- Wrapper method: Recursive feature elimination
- Embedded method: L1-based feature selection
Now I will briefly explain each feature selection technique with a few lines of code. For a deeper understanding, I also recommend reading the official documentation of each method (that is beyond the scope of this article).
3–1. before training model
I will use the Boston housing dataset for this part. So let’s get started!
# note: load_boston was removed in scikit-learn 1.2,
# so this part needs an older scikit-learn version (or another regression dataset)
from sklearn.datasets import load_boston
boston = load_boston()
# X has 13 features
X, y = boston.data, boston.target
Statistical method: Removing features with low variance
The idea is to drop features whose variance is very low: if a feature barely changes across samples, it carries almost no information for the prediction model.
# we can decide the criterion for low variance with "threshold"
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
sel_X = sel.fit_transform(X)
print(X.shape, sel_X.shape)
>>> (506, 13) (506, 11) # 13 features before, 11 after
Now you can see that the features have been reduced from 13 to 11!
Filter method: Univariate feature selection
This means scoring the relationship between each explanatory variable (X) and the target variable (y) with a univariate statistical test, and then keeping only the k most relevant features.
# we can choose the number of features with "k"
from sklearn.feature_selection import SelectKBest, f_regression
sel = SelectKBest(score_func=f_regression, k=7)
sel_X = sel.fit_transform(X, y)
print(X.shape, sel_X.shape)
>>> (506, 13) (506, 7) # 13 features before, 7 after
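If you also want to know which columns survived and how they scored, the fitted selector exposes that information (output not shown here):
# indices of the selected columns and their univariate scores
print(sel.get_support(indices=True))
print(sel.scores_)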
3–2. while training model
I will use the Iris dataset for this part. So let’s get started!
from sklearn.datasets import load_iris
iris = load_iris()
# X has 4 features
X, y = iris.data, iris.target
Wrapper method: Recursive feature elimination
While training, the feature with the smallest parameter weight is removed at each step, and the loop continues until only the specified number of features (n) remains.
# we can choose the number of features with "n_features_to_select"
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
estimator = SVR(kernel="linear")
sel = RFE(estimator, n_features_to_select=2, step=1).fit(X, y)
sel_X = sel.transform(X)
# sel.ranking_ assigns rank 1 to the selected features
print(X.shape, sel_X.shape)
>>> (150, 4) (150, 2) # 4 features before, 2 after
Now you can see features are reduced from 4 to 2!
Embedded method: L1-based feature selection
The L1 penalty drives some of the model’s coefficients to exactly zero, and SelectFromModel keeps only the features whose coefficients remain non-zero.
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
sel = SelectFromModel(lsvc, prefit=True)
sel_X = sel.transform(X)
print(X.shape, sel_X.shape)
>>> (150, 4) (150, 3) # 4 features before, 3 after (the exact count can vary by scikit-learn version)
4. Type B: Feature Extraction
Here is the list of Feature Extraction Methods:
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
- SVD (Singular Value Decomposition)
- TSNE (t-Distributed Stochastic Neighbor Embedding)
- Word2Vec (Natural language processing)
So in this article, I will compare PCA and SVD with simple code.
Fortunately, scikit-learn provides both of them in its sklearn.decomposition module. I will use the iris dataset here.
# load scikit-learn built in dataset "iris"
from sklearn import datasets
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
print(X_iris)
>>> output
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
.... # total 150 rows.
PCA (Principal Component Analysis)
- Subtract the mean from the data (center every feature)
- Scale each dimension by its variance
- Compute the covariance matrix of the centered data
- Compute the K largest eigenvectors of that covariance matrix
Finally, these eigenvectors are the principal components of the features.
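Here is a minimal NumPy sketch of those four steps, purely for illustration (scikit-learn’s PCA centers the data but does not scale it, so its numbers will differ slightly):
import numpy as np
def pca_by_hand(X, k):
    # 1. subtract the mean from each feature
    Xc = X - X.mean(axis=0)
    # 2. scale each dimension (standardize by its standard deviation)
    Xc = Xc / Xc.std(axis=0)
    # 3. compute the covariance matrix of the centered data
    cov = np.cov(Xc, rowvar=False)
    # 4. take the eigenvectors belonging to the k largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:k]
    # project the data onto those principal components
    return Xc @ eigvecs[:, top]
X_reduced = pca_by_hand(X_iris, 2)
print(X_reduced.shape)
>>> (150, 2)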
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_iris)
The cumulative contribution rate (explained variance ratio) indicates how much of the original information the reduced dimensions can describe. The closer its sum is to 1, the more successful the dimensionality reduction of the features is.
# contribution rate of each new dimension
pca.explained_variance_ratio_
>>> [0.92461872 0.05306648] # the 2 new dimensions for the features
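To get the cumulative value mentioned above, just accumulate the ratios, for example with NumPy:
import numpy as np
# cumulative contribution rate of the kept dimensions
print(np.cumsum(pca.explained_variance_ratio_))
>>> [0.92461872 0.9776852 ]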
SVD (Singular Value Decomposition)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
X_svd = svd.fit_transform(X_iris)
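TruncatedSVD also exposes explained_variance_ratio_, so you can evaluate it the same way as PCA (output omitted here):
# how much variance the 2 SVD components keep
print(svd.explained_variance_ratio_)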
How should you use PCA and LDA differently?
1. PCA is for …
- data that is not uniformly distributed (for example, roughly normally distributed data)
- Unsupervised: it can be applied to both classification and clustering problems
2. LDA is for …
- Uniformly distributed data
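LDA has not appeared in code yet, so here is a minimal sketch with scikit-learn’s LinearDiscriminantAnalysis on the same iris data; unlike PCA and SVD, it is supervised and needs the labels (y_iris):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# supervised: fit_transform takes the class labels as well
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_iris, y_iris)
print(X_lda.shape)
>>> (150, 2)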
— — —