Model evaluation and selection: AI Saturdays Nairobi

Felicity Mecha
AI Saturdays Nairobi
5 min read · Sep 16, 2019

When creating models in machine learning, the key questions are: which algorithm should I choose and why? How do I know that it is performing well? What are the key metrics that tell me my model is giving the desired results? These were the main questions we were asking ourselves in AI Saturdays week 6. Six weeks in, we had played around with linear regression and k-nearest neighbors in scikit-learn, and it was time to look at how to evaluate and optimize the results we got.

Confusion matrix

So to kick off, we looked at different ways to quantify error. The standout tool was the confusion matrix, which we explored with the example of a model that classifies people as sick or healthy. Its four entries are listed below.

TP — True positive: a sick person correctly classified as sick

FP — False positive: a healthy person wrongly classified as sick

TN — True negative: a healthy person correctly classified as healthy

FN — False negative: a sick person wrongly classified as healthy

Treating sick as the positive class and healthy as the negative class, every prediction falls into one of these four cells, and the counts can be used to compute two key metrics: precision, the fraction of predicted positives that are actually positive (TP / (TP + FP)), and recall, the fraction of actual positives that the model catches (TP / (TP + FN)).

To decide whether your model needs a high precision or a high recall score, you need to establish which kind of mistake is more costly. The medical model above, for example, is a high-recall system: it needs to catch all the sick people, because sending a sick person home undiagnosed can cost a life, even if that means some healthy people get flagged for further tests. Other models, such as spam filters, prioritize precision instead, since wrongly flagging a legitimate email is worse than letting a few spam messages through.
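As a rough sketch of how these numbers come out of scikit-learn (the labels below are made up purely for illustration, with 1 = sick and 0 = healthy):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions: 1 = sick (positive), 0 = healthy (negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
```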

F1 Score

The F1 score is another metric for scoring the performance of a model. It combines precision and recall into a single number by taking their harmonic mean: F1 = 2 · (precision · recall) / (precision + recall).

To bias the score towards precision or recall, the more general Fβ score introduces a weighting factor β:

Fβ = (1 + β²) · (precision · recall) / (β² · precision + recall)

where β = 0 reduces the score to precision alone, β → infinity weights it entirely towards recall, and β = 1 recovers the ordinary F1 score.
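Scikit-learn exposes both of these directly as f1_score and fbeta_score; a minimal sketch on the same made-up labels as above:

```python
from sklearn.metrics import f1_score, fbeta_score

# Same made-up labels as in the confusion matrix example
y_true = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

print("F1  :", f1_score(y_true, y_pred))               # beta = 1: precision and recall weighted equally
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1 leans towards precision
print("F2  :", fbeta_score(y_true, y_pred, beta=2))    # beta > 1 leans towards recall
```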

ROC Curve

A receiver operating characteristic curve, i.e., ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm.
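As an illustrative sketch (the synthetic data and the logistic regression classifier are assumptions chosen only for the example), scikit-learn's roc_curve returns the FPR and TPR at every threshold, and roc_auc_score summarizes the whole curve as a single number:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# FPR and TPR at each threshold; plotting tpr against fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```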

R² Score

R-squared (R²) measures how much better your model is than a baseline model that simply predicts the mean of the target. It is calculated as:

R² = 1 − SSE / SST

Where: SSE — the sum of squared errors of the model

SST — the sum of squared errors of the baseline model
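A minimal sketch of computing R² with scikit-learn (the synthetic regression data and linear model are stand-ins chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic regression data, just to show the metric
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# 1 - SSE/SST: 1.0 is a perfect fit, 0.0 is no better than predicting the mean
print("R2:", r2_score(y_test, y_pred))
```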

Model Complexity Graph

This graph plots the training and validation error against the complexity of the model, and can help diagnose whether a model is overfitting or underfitting.
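One way to sketch such a graph is scikit-learn's validation_curve; here the complexity knob is k for a k-nearest neighbors classifier, and the data set and parameter range are assumptions made purely for the example:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import validation_curve

# Synthetic data, only to illustrate the shape of a model complexity graph
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# For k-NN, a small k means a more complex (more flexible) model
param_range = [1, 3, 5, 10, 20, 50]
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=param_range, cv=5,
)

for k, tr, va in zip(param_range, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"k={k:2d}  train={tr:.2f}  validation={va:.2f}")
```

A large gap between the training and validation scores points to overfitting, while low scores on both point to underfitting.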

Overfitting and underfitting

Overfitting: high variance. The model fits the training data too closely, noise included, and fails to generalize to new data.

Underfitting: high bias. The model is too simple to capture the underlying pattern, so it performs poorly even on the training data.

Bias: bias measures how far the model's average prediction is from the true values. Algorithms with high bias make strong simplifying assumptions, which helps them learn fast and makes them easy to understand, but also makes them less flexible: they lose the ability to capture complex patterns, and this results in underfitting.

Variance: variance measures how much the model's predictions change when the training data changes. Ideally, the values the model predicts should stay roughly the same when we move from one training data-set to another; a model with high variance, however, is very sensitive to the particular data it was trained on, so its predictions swing with every new data-set, which is the hallmark of overfitting.

Cross Validation

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.

Hyperparameter optimization using grid search

The traditional way of performing hyperparameter optimization has been grid search, or a parameter sweep, which is simply an exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm. A grid search must be guided by some performance metric, typically measured by cross-validation on the training set or by evaluation on a held-out validation set. The point of tuning hyperparameters this way is to find settings that reduce both underfitting and overfitting.
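A minimal sketch with scikit-learn's GridSearchCV, using a k-nearest neighbors classifier and a small, arbitrary parameter grid chosen just for the example:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, just to show the mechanics of a grid search
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# A manually specified subset of the hyperparameter space
param_grid = {
    "n_neighbors": [1, 3, 5, 11, 21],
    "weights": ["uniform", "distance"],
}

# Every combination in the grid is scored with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```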

K-fold cross validation

In this method the data is divided into k segments (folds). In each iteration one of the segments is used as test data and the rest as training data, and the results are averaged over the k runs. This mitigates the problem that the particular choice of training and test split can strongly influence the measured accuracy of your model.
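A short sketch with scikit-learn's KFold and cross_val_score (the data and the logistic regression model are placeholders for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5 folds: each segment is used as test data exactly once, the other 4 for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```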

Practical

Check out our GitHub repo for the Boston housing data experiment and solution.

References:

https://en.wikipedia.org/wiki/Hyperparameter_optimization

https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76

https://towardsdatascience.com/coefficient-of-determination-r-squared-explained-db32700d924e

https://en.wikipedia.org/wiki/Cross-validation_(statistics)
