How To Evaluate Your Machine Learning Models? — Classification Evaluation Metrics

Grace Zhang
8 min read · Nov 26, 2018

Choosing the right evaluation metric for your machine learning models is important. Why? Because it determines how you judge model performance and, in turn, how you select better models and parameters.

Although you could simply try every metric available, doing so wastes time and the differing outputs are easy to misread. In this article, I will focus on evaluation metrics for classification models, the pros and cons of each, and how to choose the appropriate metric for your machine learning models.

Classification Evaluation Metrics Comparison

Classical Accuracy

Accuracy is probably the very first evaluation measure you think of: the ratio of the number of correct predictions to the total number of predictions.

Accuracy is commonly used because it is simple and straightforward. However, it is often not a great choice, for two reasons. First, it gives you only a single number without telling you what types of errors your model is making. Second, it is heavily affected by imbalanced classes. Here is an example of the second point:

Imagine you have a test dataset with 10 observations, 9 of which belong to class A. If we always predict the most frequent class and measure performance using accuracy, we achieve an accuracy of 90%. However, this does not mean our algorithm is any good. In the real world, data are often imbalanced, so classical accuracy is not that useful and we need additional metrics.
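Here is a minimal sketch of that 9-to-1 scenario in Python, assuming scikit-learn is available; the labels are made up purely to mirror the example above:

```python
from sklearn.metrics import accuracy_score

# The 10-observation test set from the example: 9 of class "A", 1 of class "B".
y_true = ["A"] * 9 + ["B"]

# A "model" that always predicts the most frequent class.
y_pred = ["A"] * 10

# Accuracy looks great even though class "B" is never identified.
print(accuracy_score(y_true, y_pred))  # 0.9
```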

Logarithm Loss

Another way to evaluate your machine learning models is with loss functions. Generally speaking, a loss function represents the price paid for inaccurate predictions. For classification problems, the logarithmic loss (log loss) function is the most frequently used.

Log loss quantifies the accuracy of a classifier by penalizing incorrect classifications. It can be used whenever the raw output of the classifier is a probability vector.

$$
\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log\left(p_{ij}\right)
$$

Above is the log loss function for a multi-class classification model, where N is the number of samples, M is the number of classes, y_ij is a binary indicator of whether label j is the correct classification for instance i, and p_ij is the model's predicted probability of assigning label j to instance i.

One good thing about the log loss function is that it penalizes heavily for being confident about a prediction that turns out to be wrong. For a binary classification model, when the true label is 1, the log loss tends to infinity as the predicted probability of label 1 tends to 0.

Another advantage of the log loss function is that it is convex and can be globally minimized using stochastic gradient descent methods.

A side note: log loss is often used to evaluate models during training.
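As a rough illustration of the "confident but wrong" penalty, the sketch below evaluates a few hand-picked probabilities with scikit-learn's log_loss; the numbers are invented for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 1, 1]                 # all true labels are 1
# Predicted probabilities [P(class 0), P(class 1)]:
# confident and correct, unsure, confident and wrong.
y_prob = np.array([[0.01, 0.99],
                   [0.50, 0.50],
                   [0.99, 0.01]])

for p in y_prob:
    # Per-sample log loss is -log(predicted probability of the true class).
    print(f"P(y=1) = {p[1]:.2f} -> log loss {-np.log(p[1]):.2f}")

# Average log loss over the three samples; the confident-but-wrong case dominates.
print("average log loss:", log_loss(y_true, y_prob, labels=[0, 1]))
```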

Before we dive into ROC-AUC and the F score, there are a few terms I would like to introduce; a short sketch showing how to compute them from raw counts follows the list.

  1. True Positive Rate (TPR): the proportion of actual positives that are correctly predicted as positive, i.e. TP/(TP+FN). It is also called recall or sensitivity.
  2. False Positive Rate (FPR): the proportion of actual negatives that are incorrectly predicted as positive, i.e. FP/(FP+TN). Specificity equals 1 − FPR.
  3. True Negative Rate (TNR): the proportion of actual negatives that are correctly predicted as negative, i.e. TN/(TN+FP). This is the specificity.
  4. False Negative Rate (FNR): the proportion of actual positives that are incorrectly predicted as negative, i.e. FN/(TP+FN).
  5. Precision: the proportion of true positives among all observations classified as positive, i.e. TP/(TP+FP).
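Here is a minimal sketch of these definitions in Python; the counts are made up purely for illustration:

```python
# Illustrative counts only; in practice these come from a confusion matrix.
TP, FP, TN, FN = 40, 10, 45, 5

tpr = TP / (TP + FN)        # True Positive Rate (recall / sensitivity)
fpr = FP / (FP + TN)        # False Positive Rate
tnr = TN / (TN + FP)        # True Negative Rate (specificity = 1 - FPR)
fnr = FN / (TP + FN)        # False Negative Rate (1 - TPR)
precision = TP / (TP + FP)  # Precision

print(f"TPR={tpr:.2f} FPR={fpr:.2f} TNR={tnr:.2f} FNR={fnr:.2f} precision={precision:.2f}")
```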

ROC-AUC

ROC stands for receiver operating characteristic. An ROC plot is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. “It shows how many true positive classifications can be gained as you allow for more false positives.” It is most commonly used to visualize the performance of a binary classifier.

The ideal scenario is a perfect classification: zero false positives with a true positive rate of 1. Random guessing falls on the diagonal line running from the bottom-left to the top-right corner of the ROC plot. Hence, points above the diagonal represent classifiers that are better than random guessing, and points below the diagonal are worse than random.

One advantage of the ROC curve is that it visualizes a classifier's behavior across all possible classification thresholds (each threshold corresponds to one FPR/TPR point), thus providing nuanced detail about how the classifier behaves.

However, it is not straightforward to compare one ROC curve against another. This is where AUC comes into play. AUC stands for area under the curve, and it measures the area under the ROC curve. It condenses the performance of a classifier into a single number bounded between 0 and 1: the closer the AUC is to 1, the better the classifier performs.

ROC-AUC can also be extended to multi-class classification problems. All you have to do is adopt the “one versus all” approach: take one class as the positive class and compare it against all the other classes combined, repeating this for each class.

An advantage of ROC-AUC is that it is not sensitive to imbalanced classes. If you think of AUC as the probability that a randomly chosen positive observation is ranked above a randomly chosen negative observation, the original class distribution of the data does not really matter.
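Below is a minimal sketch of computing the ROC points and the AUC with scikit-learn; the labels and scores are made up for illustration:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.5]

# (FPR, TPR) pairs at every threshold -- the points that make up the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))

# Single-number summary: the area under the ROC curve.
print("AUC:", roc_auc_score(y_true, y_score))

# For multi-class problems, scikit-learn's roc_auc_score supports the
# one-versus-rest extension via multi_class="ovr" on a probability matrix.
```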

F Score

The F score (also called the F-beta score) is a combined measure of recall and precision; it takes both into consideration when evaluating a model. Below is the generalized formula, with the F-1 special case written alongside.
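$$
F_\beta = (1 + \beta^2)\cdot\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision} + \text{recall}},
\qquad
F_1 = \frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}}
$$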

The F-1 score is the special case of the F-beta score with β = 1: the harmonic mean of precision and recall. The harmonic mean is always less than or equal to the arithmetic mean and is dominated by the smaller of its inputs, so the F-1 score will be small if either precision or recall is small.

Depending on whether you value precision or recall more, you can choose β accordingly. If you weigh recall higher than precision, choose a β greater than 1; if you weigh precision higher than recall, choose a β less than 1. When would you prefer one over the other? For situations such as disease recognition, you would definitely value recall over precision, since missing a true case is very costly. Precision is more critical when the cost of a false positive is high. For instance, in spam detection, you want to avoid falsely classifying an important email as spam so that users do not lose valuable information. Hence, the F score is especially useful for problems with unequal costs and benefits.
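As a quick illustration of how β shifts the emphasis, the sketch below scores the same hypothetical predictions with scikit-learn's fbeta_score at several values of β; the labels are invented so that precision is high and recall is low:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels: the model is precise but misses most of the positives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print("precision:", precision_score(y_true, y_pred))  # 1.0
print("recall:   ", recall_score(y_true, y_pred))     # 0.4

for beta in (0.5, 1, 2):
    # beta > 1 weighs recall more heavily; beta < 1 weighs precision more heavily.
    print(f"F_{beta}:", round(fbeta_score(y_true, y_pred, beta=beta), 3))
```

With these labels the score drops as β grows, because larger β puts more weight on the weak recall.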

Confusion Matrix

We have discussed why accuracy is not always preferred. A confusion matrix provides a clear breakdown of the correct and incorrect classifications for each class. The main diagonal of the confusion matrix contains the counts of correct classifications, while the off-diagonal cells contain the classifier's errors: the false positives and false negatives.

Interestingly, if you look closely at the confusion matrix, you will notice that most of the metrics discussed in this article (accuracy, TPR, FPR, TNR, FNR, precision, and therefore the F score) can be derived directly from it. A confusion matrix is therefore a great starting point when you want to evaluate how your model performs.

For a binary problem, the confusion matrix has the following layout:

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
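Below is a minimal sketch of building a confusion matrix with scikit-learn, on made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a binary problem.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# scikit-learn's convention: rows are true classes, columns are predicted classes,
# so for labels [0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)

tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```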

What’s next?

Now that we have some evaluation metrics in mind, it is time to consider what a reasonable baseline to compare model performance against would be. This certainly varies across use cases, but there are some general guidelines we can follow.

I am currently reading the book Data Science for Business, and the related paragraphs in Chapter 7 inspired me a lot. Here are some suggestions given by the author.

One good baseline is the majority classifier, a naive classifier that always chooses the majority class of the training dataset.

In some applications there are multiple simple averages that one may want to combine.

A slightly more complex alternative is a model that only considers a very small amount of feature information.

Beyond comparing simple models, it is often useful to implement simple, inexpensive models based on domain knowledge or “received wisdom” and evaluate their performance.
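As one possible way to set up the majority-class baseline in practice, the sketch below compares scikit-learn's DummyClassifier against a simple logistic regression; the dataset and model choices here are just for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the majority class of the training set.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# A simple "real" model to compare against the baseline.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

for name, clf in [("majority-class baseline", baseline), ("logistic regression", model)]:
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")  # the baseline's constant scores give AUC = 0.5
```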

To conclude, it is critical to pick the right evaluation metric. This requires careful thought, and it depends on many different factors: the distribution and characteristics of your data, your business objectives, and so on. I hope this article provides some help the next time you evaluate a model!

I hope you enjoyed this article. As always, please let me know if you have any questions, comments, or suggestions. Thanks for reading :)

About Me

I am a master's student in Data Science at the University of San Francisco. I am most passionate about Machine Learning, and I enjoy hiking in my spare time. You can also find me on LinkedIn.
