#05 Model Application: How to compare and choose the best ML model

Comparing models and how to choose one

Akira Takezawa
Coldstart.ml
7 min read · Feb 14, 2019


Hola! Welcome to the #ShortcutML series: an ML cheat note for everyone!

TL;DR

If you are not a fan of reading articles, here is an excellent explanation on YouTube that I recommend instead:

https://www.youtube.com/watch?v=CPqOCI0ahss&list=PLM2zuuevnHbhrh2Y6j-q-fvC6XALwfj1c&index=2&t=0s

Menu

  1. Logistic Regression
  2. SVM
  3. Naive Bayes
  4. Decision Tree
  5. Random Forest
  6. Gradient Boosting Tree (XGBoost)
  7. K-nearest neighbor algorithm (KNN)
  8. Neural Network (MLPClassifier)

Premise

  • Each machine learning model has its own underlying formula

1. Logistic Regression

Why Logistic Regression should be the last thing you learn when becoming a Data Scientist

Strong Area:

  • Linear model
  • Binary classification

The core idea:

  • Event Occurs Probability
  • Odds Ratio

Simplest Code:
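A minimal sketch with scikit-learn's LogisticRegression; the breast-cancer toy dataset and the StandardScaler pipeline are just stand-ins for your own X and y:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data standing in for your own X, y
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C, solver and penalty are the main knobs listed below
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(C=1.0, solver="lbfgs", penalty="l2"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split
```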

Main Hyperparameters:

  • {C: 0.0001, 10000} = inverse of regularization strength
  • {solver: newton-cg, lbfgs, liblinear, sag, saga}
  • {penalty: l1, l2}

Remarks:

Logistic regression is named for the function used at the core of the method, the logistic function.

The logistic function, also called the sigmoid function, was developed by statisticians to describe the properties of population growth in ecology: rising quickly and maxing out at the carrying capacity of the environment.
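In formula form, the logistic function is sigmoid(x) = 1 / (1 + e^(-x)); it squashes any real-valued input into the range (0, 1), which is why its output can be read as the probability that the event occurs.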

2. SVM


Strong Area:

  • Complex Non-linear classification
  • Multi-Class classification

The core idea:

  • Kernel Methods
  • Margin Maximization
  • Hard Margin vs Soft Margin by C

Simplest Code:
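A minimal sketch with scikit-learn's SVC; the iris toy dataset stands in for your own data, and a StandardScaler is added because RBF kernels are sensitive to feature scale:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy multi-class data standing in for your own X, y
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kernel, C and gamma are the main knobs listed below
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```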

Main Hyperparameters:

  • {kernel: rbf, linear} = shape of the decision boundary (rbf for non-linear, linear for linearly separable data)
  • {C: 0.0001, 10000} = Regularization: sensitivity to misclassification (small C = soft margin, large C = hard margin)
  • {gamma: 0.0001, 10000} = how far the influence of a single training example reaches (kernel coefficient for rbf)

Remarks:

A non-linear classification method based on kernel functions. By maximizing the margin, it achieves a two-class classifier with high generalization performance even with relatively little data. However, training time becomes long on large datasets.

3. Naive Bayes

Naive Bayes Theorem

Strong Area:

  • Text data
  • Word-based classification

Core idea:

  • Bayes Theorem
  • Conditional Probability
  • Human-like Estimation

Simplest Code:
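A minimal sketch with GaussianNB, again with a toy dataset standing in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian NB assumes roughly normal, mutually independent features
clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```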

Main Hyperparameters:

  • (almost nothing to tune; var_smoothing is the only common knob in scikit-learn's GaussianNB)

Remarks:

Naive Bayes applies Bayes' theorem to predict the class with the highest posterior probability. It is based on the assumption that each feature affects the outcome independently of the others.

Gaussian Naive Bayes can only be used when the features roughly follow normal distributions. In addition, the features should be independent of each other; otherwise a strong bias is placed on particular features. So don't use it for complicated, highly correlated data.
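Because the strong area here is text and word-based classification, a small illustrative sketch with a bag-of-words representation and MultinomialNB may also help; the tiny corpus and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus, just to show word-based classification
texts = ["free prize money now", "meeting at noon tomorrow",
         "win money free offer", "schedule the project meeting"]
labels = [1, 0, 1, 0]  # 1 = spam-like, 0 = normal

vec = CountVectorizer()          # turn texts into word-count vectors
X = vec.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vec.transform(["free money offer"])))
```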

4. Decision Tree

Decision Tree in Python, with Graphviz to Visualize

Strong Area:

  • Complex Non-linear classification
  • Classification
  • Regression

The core idea:

  • Entropy: tells us which features are most informative about the target value
  • Gini index
  • Cross-Entropy

Simplest Code:
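A minimal sketch with DecisionTreeClassifier; iris stands in for your own data, and max_depth is kept shallow for the reason given in the remarks below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion can be "gini" or "entropy"; a shallow max_depth limits overfitting
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```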

Main Hyperparameters:

  • max_depth
  • min_samples_split
  • min_samples_leaf
  • max_features

Remarks:

A non-linear model that classifies data by repeatedly splitting it in two, from the top down, using one explanatory variable and a threshold at each node. The explanatory variable and threshold for each split are chosen using criteria such as Gini impurity and entropy.

Decision trees fall into overfitting very easily, so to prevent this we need to keep the maximum depth of the tree shallow. Deep trees also take more time to process.

5. Bagging: Random Forest

APPLYING RANDOM FOREST (CLASSIFICATION)

Strong Area:

  • Complex Non-linear classification
  • Continuous values (in case of regression trees)

Core idea:

  • Ensemble Learning
  • Bagging (parallel)
  • Weak learner and Strong learner

Simplest Code:
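A minimal sketch with RandomForestClassifier on a stand-in dataset; the arguments mirror the hyperparameters listed below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators (number of trees) and max_depth are usually the first knobs to tune
clf = RandomForestClassifier(n_estimators=100, max_depth=None,
                             max_features="sqrt", bootstrap=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```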

Main Hyperparameters:

  • n_estimators = number of trees
  • max_features = max number of features considered for splitting a node
  • max_depth = max number of levels in each decision tree
  • min_samples_split = min number of data points placed in a node before the node is split
  • min_samples_leaf = min number of data points allowed in a leaf node
  • bootstrap = method for sampling data points (with or without replacement)

Remarks:

Random forest is popular because both its generalization performance and the parallelism of its training are high. It also handles outliers, non-linear data, and unbalanced data very well.

6. Boosting: Gradient Boosting Tree

http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html

Strong Area:

  • Continuous values (in case of regression trees)
  • Complex Non-linear classification

Core idea:

  • Ensemble Learning
  • Boosting (weighted or hierarchical)
  • Error rate
  • Delta

Simplest Code:
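A minimal sketch with XGBClassifier, assuming the xgboost package is installed; iris is a stand-in dataset and the arguments mirror the hyperparameters listed below:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # assumes `pip install xgboost`

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators, max_depth, min_child_weight and gamma match the list below
clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1,
                    min_child_weight=1, gamma=0.0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```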

Main Hyperparameters:

  • n_estimators = number of trees
  • min_child_weight
  • max_depth
  • gamma

Remarks:

Boosting is a method for improving accuracy using multiple weak learners. By adding weak learners one after another, each correcting the errors of the previous ones, the prediction accuracy is gradually improved. However, it is important to stop at the appropriate point, because overfitting builds up at the same time. Since the trees are built sequentially and cannot be trained in parallel, computation time also tends to grow.
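To illustrate stopping at the right point, here is a sketch of early stopping using scikit-learn's GradientBoostingClassifier (as a stand-in for XGBoost); n_iter_no_change holds out a validation fraction and stops adding trees once the validation score stops improving:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)

# Ask for up to 500 trees, but stop once 10 rounds pass without improvement
# on an internal 20% validation split
clf = GradientBoostingClassifier(n_estimators=500, learning_rate=0.1,
                                 validation_fraction=0.2, n_iter_no_change=10,
                                 random_state=0)
clf.fit(X, y)
print(clf.n_estimators_)  # number of trees actually fit before stopping
```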

7. K-Nearest Neighbor(KNN)

https://japaneseclass.jp/trends/about/KNN

Strong Area:

  • Multi-class Classification Problem

Core idea:

  • Majority Vote

Simplest Code:
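A minimal sketch with KNeighborsClassifier on a stand-in dataset; n_neighbors (K) is essentially the only knob:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)  # K = 5 neighbors
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```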

Main Hyperparameters:

  • n_neighbors

Remarks:

KNN is often called one of the "laziest" algorithms, and the concept is very simple. The process breaks down into 5 steps:

  1. Map all the training data in N-dimensional feature space
  2. Put your test data point into the same space
  3. Decide the value of K, the number of nearest neighbors to consider
  4. Count how many of the K nearest neighbors belong to each class
  5. Decide the class of the test data by majority vote

The code for KNN is greatly simplified by scikit-learn. Then one question should come to mind. Yes, the only real issue in KNN is:

How can we find suitable K?

Here is the answer. In general, a rule of thumb for choosing K is the square root of the number of training samples. For example, if you have 100 samples, a good starting K is 10 (the square root of 100).

However, here is a more careful method to find the best K for your task. The code is below:
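A sketch of that search; the iris stand-in dataset and the 1–30 range of candidate K values are my own placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit a model for every candidate K and record accuracy on the held-out split
k_range = range(1, 31)
accuracies = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies.append(knn.score(X_test, y_test))

best_k = k_range[accuracies.index(max(accuracies))]
print("best K:", best_k)

# Visualize how accuracy changes with K
plt.plot(list(k_range), accuracies)
plt.xlabel("K (n_neighbors)")
plt.ylabel("accuracy")
plt.show()
```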

In this code, I simply fit the model for every candidate K and measure its accuracy. After that, I pick the K with the best accuracy. The change in accuracy depending on the K value is visualized below:

KNN accuracy by K

Finally, we fit the best K for our KNN algorithm:
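Reusing best_k and the train/test split from the search above, the final fit might look like this:

```python
# Refit KNN with the best K found by the search above
# (assumes best_k, X_train, X_test, y_train, y_test from the previous snippet)
final_knn = KNeighborsClassifier(n_neighbors=best_k)
final_knn.fit(X_train, y_train)
print(final_knn.score(X_test, y_test))
```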


Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern