#06 Model Validation: The Only Practical Metrics List You Need to Know

A summary of model evaluation metrics

Akira Takezawa
Coldstart.ml
7 min read · Mar 30, 2019



Hola! Welcome to the #ShortcutML Series: a cheat note for everyone!

This is for anyone who wants to know …

  • Reason: applying a model isn’t the end of an ML project
  • Big Picture: a summary of the many validation metrics out there
  • Code: the simplest Python code for each metric

— — —

Why should you read this?

(Image source: Choosing the Right Metric for Evaluating Machine Learning Models — Part 1, by Alvira Swalin)

Before we jump into the main topic: when do we evaluate our model? The answer is not just once. Generally, we use model validation metrics at two points in a real data science workflow:

  1. Model Comparison: select the best ML model for your task
  2. Model Improvement: tune the hyperparameters of the chosen model

To get a clearer picture of the difference between these two, let me walk through the workflow of an ML implementation. After you have prepared the features X for your target y, you will usually line up several ML models as candidates.

Then how do you finally choose one for your task? This is the first point where you use model validation metrics. Scikit-learn provides shortcut methods for comparing models, such as cross_val_score and cross_validate.

Next, after you pick the model with the best score, you move on to hyperparameter tuning to squeeze out more accuracy and better generalization. This is the second point where you’ll use these metrics.

In this article, I’ll put together a cheat note of model evaluation metrics. So let’s get started!

— — —

Menu

  1. Cross-Validation
  2. Metrics for Regression
  3. Metrics for Classification
  4. Metrics for Clustering
  5. Additional: Learning Curve Visualization

1. Cross-Validation for model comparison

Visual Representation of Train/Test Split and Cross Validation. H/t to my DSI instructor, Joseph Nelson

The starting point for why and how we split data is generalization. The goal of building a machine learning model is real-world use on unknown, future data, so a model that merely overfits past data is useless.

The biggest difference between the two methods below is how they handle the training data: the holdout method fixes a single training set, while cross-validation repeatedly resamples the training data to build a more generalized picture of model performance.

1. Holdout Method
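A minimal sketch of the holdout method with train_test_split (the Iris dataset and the 70/30 split ratio here are my own choices, not the original gist):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the data as a fixed test set (random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```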

2. Cross-Validation Method

Visual representation of K-Folds. Again, H/t to Joseph Nelson

2–1. cross_val_score: the simplest coding method

We can set the number of splits with the parameter cv. Five folds is usually considered a standard choice.
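A minimal cross_val_score sketch, assuming a LogisticRegression on Iris as a stand-in model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5: split the data into 5 folds and return one score per fold
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```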

2–2. cross_validate: I recommend this customizable one
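A sketch of cross_validate, which can return several metrics plus fit/score times in one call (the two scoring names below are just examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Evaluate two metrics at once and also keep the training scores
results = cross_validate(model, X, y, cv=5,
                         scoring=["accuracy", "f1_macro"],
                         return_train_score=True)
print(results["test_accuracy"].mean())
print(results["test_f1_macro"].mean())
```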

2. Metrics for Regression

TL;DR: In most cases, we use R2 or RMSE.


I’ll use the Boston house prices dataset.

Model 1: Linear Regression
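A sketch of Model 1. Note that load_boston was removed in scikit-learn 1.2, so on recent versions you would substitute another dataset such as fetch_california_housing; variable names like y_pred_lr are my own:

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)  # predictions to feed into the metrics below
```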

Model 2: Decision Tree Regressor
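A matching sketch for Model 2, with the same load_boston caveat; max_depth=5 is an arbitrary choice:

```python
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

dt = DecisionTreeRegressor(max_depth=5, random_state=0)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
```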

Now we are ready to evaluate our two models and choose one!

1. R2: Coefficient of Determination

when to use: as a default, scale-free summary of how much of the target’s variance the model explains; convenient for comparing models on the same dataset (1.0 is perfect, and it can even go negative for very poor fits).
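A tiny self-contained example of r2_score (the numbers are made up):

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(r2_score(y_true, y_pred))  # 1.0 would be a perfect fit
```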

2. MSE: Mean Squared Error

when to use: when large errors should be punished much harder than small ones; squaring makes it sensitive to outliers, and it is the quantity many regressors optimize directly.
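A tiny mean_squared_error example with the same made-up numbers:

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_squared_error(y_true, y_pred))  # lower is better
```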

3. RMSE: Root Mean Square Error

when to use: the same sensitivity to large errors as MSE, but expressed in the same units as the target, which makes it easier to interpret; a common default for regression.
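RMSE is just the square root of MSE, so a minimal sketch looks like this:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# Take the square root of MSE to get back to the target's units
print(np.sqrt(mean_squared_error(y_true, y_pred)))
```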

4. MAE: Mean Absolute Error

when to use: when you want an average error in the target’s units that treats all errors linearly and is therefore more robust to outliers than MSE/RMSE.
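A tiny mean_absolute_error example:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))
```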

3. Metrics for Classification

(Image: scikit-learn documentation, https://sklearn.org/modules/svm.html)

The overall picture for a classification problem:

  1. Binary classification (one vs. one): e.g. paid user vs. free user
  2. Multi-class classification (one vs. rest): e.g. premium member vs. paid vs. free

I’ll use the Iris dataset as a multi-class classification problem.

Model 1: SVM
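A sketch of an SVM classifier on Iris; the RBF kernel, the 70/30 split, and variable names like y_pred_svm are my own assumptions (probability=True is only needed for the probability-based metrics later):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf", gamma="scale", probability=True, random_state=0)
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_test)
```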

Model 2: Naive Bayes
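A matching Gaussian Naive Bayes sketch under the same split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
```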

Now we are ready to evaluate our two models and choose one!

1. Accuracy:

when to use: as a quick first check when the classes are roughly balanced; it is misleading on imbalanced data, where always predicting the majority class already scores high.
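A tiny accuracy_score example with made-up labels:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
```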

2. Precision:

when to use: when false positives are expensive (e.g. flagging a legitimate email as spam); it answers “of everything predicted positive, how much really was positive?”
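A tiny precision_score example; for multi-class data you pass an averaging strategy:

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# average="macro" gives the unweighted mean of per-class precision
print(precision_score(y_true, y_pred, average="macro"))
```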

3. Recall or Sensitivity:

when to use: when false negatives are expensive (e.g. missing a disease or a fraud case); it answers “of all actual positives, how many did we catch?”
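The recall_score call mirrors precision_score:

```python
from sklearn.metrics import recall_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(recall_score(y_true, y_pred, average="macro"))
```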

4. F Score:

when to use: when you want a single number that balances precision and recall (F1 is their harmonic mean), especially on imbalanced classes.
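A tiny f1_score example:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

print(f1_score(y_true, y_pred, average="macro"))
```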

5. Confusion Matrix

when to use: when you want to see exactly which classes get confused with which, rather than one aggregate number; accuracy, precision and recall can all be read off it.
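A tiny confusion_matrix example:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]

# rows = true class, columns = predicted class
print(confusion_matrix(y_true, y_pred))
```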

(Figure: confusion matrix example from the scikit-learn official documentation.)

6. ROC: Receiver Operating Characteristic Curve

Note: for a multi-class problem like Iris, you need to binarize the labels and wrap the model in OneVsRestClassifier; otherwise the ROC computation doesn’t work…

Now we can plot the ROC curve.

when to use: when your classifier outputs scores or probabilities and you want to see the trade-off between true positive rate and false positive rate across all thresholds, instead of committing to a single cutoff.
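A sketch of a multi-class ROC curve following the OneVsRestClassifier approach mentioned above; the linear-kernel SVC and the plotting details are my own choices:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Binarize the labels: one column per class (required for multi-class ROC)
y_bin = label_binarize(y, classes=[0, 1, 2])
X_train, X_test, y_train, y_test = train_test_split(
    X, y_bin, test_size=0.3, random_state=0)

clf = OneVsRestClassifier(SVC(kernel="linear", random_state=0))
y_score = clf.fit(X_train, y_train).decision_function(X_test)

# One ROC curve per class
for i in range(3):
    fpr, tpr, _ = roc_curve(y_test[:, i], y_score[:, i])
    plt.plot(fpr, tpr, label=f"class {i}")
plt.plot([0, 1], [0, 1], "k--")          # chance-level diagonal
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```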

7. AUC: Area Under Curve

when to use: when you want a single, threshold-independent number summarizing the ROC curve; handy for comparing classifiers, especially on imbalanced data (1.0 is perfect ranking, 0.5 is random).
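A tiny roc_auc_score example with made-up binary labels and predicted scores:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

print(roc_auc_score(y_true, y_score))
```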

8. Multi-class logarithmic loss

Log loss works on predicted probabilities rather than hard labels, and in this multi-class setup the probabilities come from the OneVsRestClassifier model.

when to use: when well-calibrated probabilities matter, not just the predicted label; log loss strongly penalizes confident but wrong predictions.
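A tiny log_loss example with made-up class probabilities (one row per sample, one column per class):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 2]
y_prob = [[0.8, 0.1, 0.1],
          [0.2, 0.7, 0.1],
          [0.1, 0.2, 0.7]]

print(log_loss(y_true, y_prob))  # lower is better
```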

4. Metrics for Clustering


Basically, in a real clustering task (I mean unsupervised clustering), we have no way to measure accuracy or precision, because nobody knows the true labels.

However, as part of a classification task we sometimes run clustering on data whose labels we do know (“supervised” clustering) to understand the character of the data. (This happens in real jobs as well.)

So I’ll quickly introduce some metrics for this kind of label-aware clustering, just so you know they exist (their priority is quite low, though).

OK, I used only the features of the Iris dataset (not the labels) for the clustering problem.

As a representative model for the clustering problem, this time I used K-means.

The resulting cluster assignments are stored in y_means.
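A minimal K-means sketch that produces y_means as described above; n_clusters=3 matches the three Iris species:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Use only the features; the true labels y are kept aside for the scores below
X, y = load_iris(return_X_y=True)

kmeans = KMeans(n_clusters=3, random_state=0)
y_means = kmeans.fit_predict(X)  # cluster id for each sample
```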


1. Homogeneity score, Completeness Score, V-measure Score
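A tiny example of the three scores with made-up labels; on Iris you would pass the true labels y and the cluster assignments y_means from the sketch above (the cluster ids themselves don’t matter, only how samples are grouped):

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

y_true  = [0, 0, 1, 1, 2, 2]   # known class labels
y_means = [1, 1, 0, 0, 2, 2]   # cluster ids from K-means

print(homogeneity_score(y_true, y_means))   # does each cluster contain only one class?
print(completeness_score(y_true, y_means))  # does each class end up in one cluster?
print(v_measure_score(y_true, y_means))     # harmonic mean of the two
```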

5. Additional: Learning Curve Visualization
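A sketch of scikit-learn’s learning_curve with a linear SVC on Iris as a stand-in model; the shuffled cross-validation splitter, the training sizes, and the plotting details are my own choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Training and validation scores for increasing amounts of training data
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel="linear"), X, y, cv=cv,
    train_sizes=np.linspace(0.3, 1.0, 5),
    shuffle=True, random_state=0)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(train_sizes, valid_scores.mean(axis=1), "o-", label="cross-validation score")
plt.xlabel("Training examples")
plt.ylabel("Score")
plt.legend()
plt.show()
```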

Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern