#05 Model Application: How to compare and choose the best ML model

Comparing models and how to choose one

Akira Takezawa
Coldstart.ml
7 min read · Feb 14, 2019


Hola! Welcome to the #ShortcutML series: an ML cheat note for everyone!

TL;DR

If you are not a fan of reading articles, here is an excellent explanation on YouTube that I recommend instead:

https://www.youtube.com/watch?v=CPqOCI0ahss&list=PLM2zuuevnHbhrh2Y6j-q-fvC6XALwfj1c&index=2&t=0s

Menu

  1. Logistic Regression
  2. SVM
  3. Naive Bayes
  4. Decision Tree
  5. Random Forest
  6. Gradient Boosting Tree (XGBoost)
  7. K-nearest neighbor algorithm (KNN)
  8. Neural Network (MLPClassifier)

Premise

  • Each machine learning model has its own underlying formula

1. Logistic Regression

Why Logistic Regression should be the last thing you learn when becoming a Data Scientist

Strong Area:

  • Linear model
  • Binary classification

The core idea:

  • Event Occurs Probability
  • Odds Ratio

Simplest Code:
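A minimal sketch with scikit-learn's LogisticRegression; the breast-cancer toy dataset and the StandardScaler pipeline are just stand-ins for your own X and y:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data standing in for your own X, y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C, solver and penalty are the main knobs listed below
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(C=1.0, solver="lbfgs", penalty="l2"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```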

Main Hyperparameters:

  • {C: 0.0001, 10000} = inverse of regularization strength
  • {solver: newton-cg, lbfgs, liblinear, sag, saga}
  • {penalty: l1, l2}

Remarks:

Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe the properties of population growth in ecology: rising quickly and maxing out at the carrying capacity of the environment.
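In formula form, the logistic function is sigmoid(x) = 1 / (1 + e^(-x)); it squashes any real-valued input into the range (0, 1), which is why its output can be read as the probability that the event occurs.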

2. SVM


Strong Area:

  • Complex Non-linear classification
  • Multi-Class classification

The core idea:

  • Kernel Methods
  • Margin Maximization
  • Hard Margin vs Soft Margin by C

Simplest Code:
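A minimal sketch with scikit-learn's SVC; the iris toy dataset stands in for your own data, and a StandardScaler is added because RBF kernels are sensitive to feature scale:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy multi-class data standing in for your own X, y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel, C and gamma are the main knobs listed below
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```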

Main Hyperparameters:

  • {kernel: rbf, linear} = shape of the decision boundary (rbf for non-linear, linear for linearly separable data)
  • {C: 0.0001, 10000} = Regularization: sensitivity to misclassification (small C = soft margin, large C = hard margin)
  • {gamma: 0.0001, 10000} = how far the influence of a single training example reaches (kernel coefficient for rbf)

Remarks:

A non-linear classification method based on kernel functions. By maximizing the margin, it achieves a two-class classifier with high generalization performance even with relatively little data. However, training time becomes long on large datasets.

3. Naive Bayes

Naive Bayes Theorem

Strong Area:

  • Text data
  • Word-based classification

Core idea:

  • Bayes Theorem
  • Conditional Probability
  • Human-like Estimation

Simplest Code:
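A minimal sketch with GaussianNB, again with a toy dataset standing in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian NB assumes roughly normal, mutually independent features
clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```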

Main Hyperparameters:

  • (almost nothing to tune; var_smoothing is the only common knob in scikit-learn's GaussianNB)

Remarks:

Naive Bayes applies Bayes' theorem to predict the class with the highest posterior probability. It is based on the assumption that each feature affects the outcome independently of the others.

Gaussian Naive Bayes can only be used when the features roughly follow normal distributions. In addition, the features should be independent of each other; otherwise a strong bias is placed on particular features. So don't use it for complicated, highly correlated data.
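Because the strong area here is text and word-based classification, a small illustrative sketch with a bag-of-words representation and MultinomialNB may also help; the tiny corpus and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, just to show word-based classification
texts = ["free prize money now", "meeting at noon tomorrow",
         "win money free offer", "schedule the project meeting"]
labels = [1, 0, 1, 0]  # 1 = spam-like, 0 = normal

vec = CountVectorizer()          # turn texts into word-count vectors
X = vec.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["free money offer"])))
```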

4. Decision Tree

Decision Tree in Python, with Graphviz to Visualize

Strong Area:

  • Complex Non-linear classification
  • Classification
  • Regression

The core idea:

  • Entropy: tells us which features are most informative about the target value
  • Gini index
  • Cross-Entropy

Simplest Code:
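A minimal sketch with DecisionTreeClassifier; iris stands in for your own data, and max_depth is kept shallow for the reason given in the remarks below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" or "entropy"; a shallow max_depth limits overfitting
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```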

Main Hyperparameters:

  • max_depth
  • min_samples_split
  • min_samples_leaf
  • max_features

Remarks:

A non-linear model that classifies data by repeatedly splitting it in two, from the top down, using one explanatory variable and a threshold at each node. The explanatory variable and threshold for each split are chosen using criteria such as Gini impurity and entropy.

Decision trees fall into overfitting very easily, so to prevent this we need to keep the maximum depth of the tree shallow. Deep trees also take more time to process.

5. Bagging: Random Forest

APPLYING RANDOM FOREST (CLASSIFICATION)

Strong Area:

  • Complex Non-linear classification
  • Continuous values (in case of regression trees)

Core idea:

  • Ensemble Learning
  • Bagging (parallel)
  • Weak learner and Strong learner

Simplest Code:
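A minimal sketch with RandomForestClassifier on a stand-in dataset; the arguments mirror the hyperparameters listed below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators (number of trees) and max_depth are usually the first knobs to tune
clf = RandomForestClassifier(n_estimators=100, max_depth=None,
                             max_features="sqrt", bootstrap=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```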

Main Hyperparameters:

  • n_estimators = number of trees
  • max_features = max number of features considered for splitting a node
  • max_depth = max number of levels in each decision tree
  • min_samples_split = min number of data points placed in a node before the node is split
  • min_samples_leaf = min number of data points allowed in a leaf node
  • bootstrap = method for sampling data points (with or without replacement)

Remarks:

Random forest is popular because both its generalization performance and the parallelism of its training are high. It also handles outliers, non-linear data, and unbalanced data very well.

6. Boosting: Gradient Boosting Tree

http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html

Strong Area:

  • Continuous values (in case of regression trees)
  • Complex Non-linear classification

Core idea:

  • Ensemble Learning
  • Boosting (weighted or hierarchical)
  • Error rate
  • Delta

Simplest Code:
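A minimal sketch with XGBClassifier, assuming the xgboost package is installed; iris is a stand-in dataset and the arguments mirror the hyperparameters listed below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes `pip install xgboost`

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators, max_depth, min_child_weight and gamma match the list below
clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1,
                    min_child_weight=1, gamma=0.0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```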

Main Hyperparameters:

  • n_estimators = number of trees
  • min_child_weight
  • max_depth
  • gamma

Remarks:

Boosting is a method for improving accuracy using multiple weak learners. By adding weak learners one after another, each correcting the errors of the previous ones, the prediction accuracy is gradually improved. However, it is important to stop at the appropriate point, because overfitting builds up at the same time. Since the trees are built sequentially and cannot be trained in parallel, computation time also tends to grow.
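To illustrate stopping at the right point, here is a sketch of early stopping using scikit-learn's GradientBoostingClassifier (as a stand-in for XGBoost); n_iter_no_change holds out a validation fraction and stops adding trees once the validation score stops improving:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Ask for up to 500 trees, but stop once 10 rounds pass without improvement
# on an internal 20% validation split
clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                 validation_fraction=0.2, n_iter_no_change=10,
                                 random_state=0)
clf.fit(X, y)
print(clf.n_estimators_)  # number of trees actually fit before stopping
```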

7. K-Nearest Neighbor(KNN)

https://japaneseclass.jp/trends/about/KNN

Strong Area:

  • Multi-class Classification Problem

Core idea:

  • Majority Vote

Simplest Code:
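A minimal sketch with KNeighborsClassifier on a stand-in dataset; n_neighbors (K) is essentially the only knob:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)  # K = 5 neighbors
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```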

Main Hyperparameters:

  • n_neighbors

Remarks:

KNN is often called one of the "laziest" algorithms, and the concept is very simple. The process breaks down into 5 steps:

  1. Map all the training data in N-dimensional feature space
  2. Put your test data point into the same space
  3. Decide the value of K, the number of nearest neighbors to consider
  4. Count how many of the K nearest neighbors belong to each class
  5. Decide the class of the test data by majority vote

The code for KNN is greatly simplified by scikit-learn. Then one question should come to mind. Yes, the only real issue in KNN is:

How can we find suitable K?

Here is the answer. In general, a rule of thumb for choosing K is the square root of the number of training samples. For example, if you have 100 samples, a good starting K is 10 (the square root of 100).

However, here is a more careful method to find the best K for your task. The code is below:
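A sketch of that search; the iris stand-in dataset and the 1–30 range of candidate K values are my own placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a model for every candidate K and record accuracy on the held-out split
k_range = range(1, 31)
accuracies = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))

best_k = k_range[accuracies.index(max(accuracies))]
print("best K:", best_k)

# Visualize how accuracy changes with K
plt.plot(list(k_range), accuracies)
plt.xlabel("K (n_neighbors)")
plt.ylabel("accuracy")
plt.show()
```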

In this code, I simply fit the model for every candidate K and measure its accuracy. After that, I pick the K with the best accuracy. The change in accuracy depending on the K value is visualized below:

KNN accuracy by K

Finally, we fit the best K for our KNN algorithm:
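Reusing best_k and the train/test split from the search above, the final fit might look like this:

```python
# Refit KNN with the best K found by the search above
# (assumes best_k, X_train, X_test, y_train, y_test from the previous snippet)
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(X_train, y_train)
print(final_knn.score(X_test, y_test))
```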


Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern