Performance Evaluation Metrics

Aryan Lala · Published in Analytics Vidhya · Apr 2, 2021

After building a machine learning model, we should understand how well it is performing. To evaluate the performance of our model, we use performance evaluation metrics. Most people, when they start with machine learning, make the mistake of reporting their model's results directly without evaluating its performance first. These metrics also guide how we adjust and tune our model to get the best results.

The choice of performance metric is crucial, as it affects how the performance of our model is measured and compared against benchmark results. This choice depends on the type of model and its application. Now let's look at the different types of metrics.

Contents:

  1. Confusion Matrix
  2. Accuracy
  3. Precision
  4. Recall
  5. Specificity
  6. F1-Score
  7. Logarithmic Loss
  8. AUC — ROC Curve
  9. Mean Squared Error
  10. Mean Absolute Error
  11. Root Mean Square Error
  12. Gini coefficient
  13. Cohen’s Kappa Coefficient

1. Confusion Matrix: This metric is used for classification problems and is considered the simplest tool for performance evaluation. A confusion matrix is a simple table with two dimensions, 'actual' and 'predicted', and both dimensions have the same set of classes. In a simple binary classification problem, there are two classes: positive (1) and negative (0).

Consider the example of a cancer classification problem. Here there are two classes: the person has cancer (y = 1), and the person does not have cancer (y = 0).

Confusion Matrix:

                         Predicted Positive (1)    Predicted Negative (0)
Actual Positive (1)      True Positive (TP)        False Negative (FN)
Actual Negative (0)      False Positive (FP)       True Negative (TN)

There are four quadrants in the confusion matrix and they represent the following:

  • True Positive (TP): The number of cases that were positive (+) and correctly classified as positive (+).

Example: The person actually has cancer and the model correctly predicted that the person has cancer.

  • False Positive (FP): The number of cases that were negative (-) and incorrectly classified as positive (+). This is also known as a Type 1 error.

Example: The person does not have cancer, but the model incorrectly predicted that the person has cancer.

  • False Negative (FN): The number of cases that were positive (+) and incorrectly classified as negative (-). This is also known as a Type 2 error.

Example: The person actually has cancer, but the model incorrectly predicted that the person does not have cancer.

  • True Negative (TN): The number of cases that were negative (-) and correctly classified as negative (-).

Example: The person does not have cancer and the model correctly predicted that the person does not have cancer.

For a perfect classifier there should be no errors, that is, both Type 1 and Type 2 errors should be zero: False Positives = False Negatives = 0. So, to improve the model's performance, we should minimize these errors. The confusion matrix forms the basis for the other performance metrics.
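As a quick illustration, here is a minimal sketch of computing a confusion matrix with scikit-learn on some hypothetical labels (the labels are made up for illustration, not from a real dataset):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels (1 = has cancer, 0 = no cancer)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For labels [0, 1], scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3
```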

2. Accuracy: Accuracy is the measure of correct predictions made by our model. It is equal to the number of correct predictions divided by the total number of predictions made by the model.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy gives meaningful results when the classes are roughly of equal size. It should not be relied on when the classes are imbalanced. For example, suppose there are 100 people in the dataset, of which 94 do not have cancer and 6 do. If our model predicts that none of them have cancer, its accuracy would be 94% despite the fact that it does not correctly identify a single cancer patient. Therefore, precision and recall are preferred over accuracy in such cases.
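The sketch below reproduces this imbalance pitfall with hypothetical labels, assuming scikit-learn:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: 94 healthy people (0) and 6 cancer patients (1)
y_true = [0] * 94 + [1] * 6
# A trivial model that predicts "no cancer" for everyone
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.94, even though no cancer case is detected
```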

3. Precision: It is the ratio of true positive samples to all samples classified as positive. It is also known as Positive Predictive Value (PPV).

In our example, precision denotes: of all the patients we predicted to have cancer (y = 1), what fraction actually has cancer?

4. Recall: It is the ratio of true positive samples to all samples that are actually positive. It is also called True Positive Rate (TPR) or sensitivity.

In our example, recall denotes: of all the patients that actually have cancer, what fraction did the model correctly detect as having cancer?
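A minimal sketch of both metrics, assuming scikit-learn and reusing the hypothetical labels from the confusion-matrix example:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP); Recall (TPR) = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```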

5. Specificity: It is the fraction of negative samples correctly classified by the model. It is also known as the True Negative Rate (TNR).

In our example, specificity denotes: of all the patients that do not have cancer, what fraction did the model correctly identify as not having cancer?
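scikit-learn has no dedicated specificity function, so one way (a sketch under that assumption) is to derive it from the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # TNR = TN / (TN + FP)
print(specificity)            # 3 / (3 + 1) = 0.75
```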

6. F1-Score: It is a combination of precision and recall, which is why it is a popular performance metric. The F1-score is equal to the harmonic mean of precision (p) and recall (r).

The best value of the F1-score is 1 and the worst is 0. Since the F1-score combines precision and recall, it balances the two: a high F1-score implies that both precision and recall are high.
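A minimal sketch, again assuming scikit-learn and the same hypothetical labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# F1 = 2 * p * r / (p + r), the harmonic mean of precision (p) and recall (r)
print(f1_score(y_true, y_pred))  # 2 * 0.75 * 0.75 / (0.75 + 0.75) = 0.75
```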

7. Logarithmic Loss: This metric is widely used in Kaggle competitions. It is also known as Log Loss or Cross-Entropy Loss. Log Loss penalizes false classifications, and the more confident a wrong prediction is, the larger the penalty, so it quantifies the quality of the classifier's probability estimates.

Log Loss = -(1/N) Σᵢ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]

Here pᵢ is the predicted probability of the positive class (y = 1) for sample i, and N is the number of samples in the dataset.

The model's performance is measured on prediction outputs that are probability values between 0 and 1. Since this metric is a loss, it should be minimized; an ideal model would have a log loss of 0.
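A minimal sketch with hypothetical probabilities, assuming scikit-learn:

```python
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.4]

print(log_loss(y_true, y_prob))  # lower is better; a perfect model scores 0
```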

8. AUC — ROC Curve: The Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve helps us visualize the performance of our model. It is an important evaluation metric.

The ROC curve is plotted with TPR on the y-axis and FPR on the x-axis, that is, a graph of TPR against FPR at various classification thresholds. Both axes have values from 0 to 1.

The ideal model is at point (TPR = 1, FPR = 0) and the worst one is at (TPR = 0, FPR = 1). Thus, for a better model TPR should be high and FPR should be low.

ROC Curve — Wikipedia

To summarize the ROC curve in a single number, we calculate the AUC. The area under the curve can also be used to compare two or more classifiers: a better-performing model has a greater AUC than the others. For a random-guessing classifier, the ROC curve is a straight line from (0, 0) to (1, 1), giving an AUC equal to 0.5; a good classifier should therefore have an AUC greater than 0.5.
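A minimal sketch of computing the ROC curve points and the AUC, assuming scikit-learn and hypothetical predicted probabilities:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points of the ROC curve
print(roc_auc_score(y_true, y_prob))              # ≈ 0.89 for these values
```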

9. Mean Squared Error: Mean Squared Error (MSE) is equal to the average squared difference (distance/error) between the predicted and target (actual) values. For a better model, MSE should be as small as possible.

MSE is a very popular metric for regression problems. Squaring the error is beneficial because it always gives a positive value, so positive and negative errors do not cancel out in the sum. Squaring also emphasizes larger differences, which can be both good and bad: it pushes the trained model to avoid large errors, but a single outlier can dominate the metric.

10. Mean Absolute Error: Mean Absolute Error (MAE) is equal to the average absolute difference (distance) between the predicted and target (actual) values.

MSE and MAE are also among the most common loss functions for regression problems.

11. Root Mean Square Error: Root Mean Square Error (RMSE) is a modification of MSE; it is simply the square root of MSE. Like MSE and MAE, RMSE is zero for an ideal model.
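A minimal sketch covering MSE, MAE, and RMSE on hypothetical regression predictions, assuming scikit-learn and NumPy:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical target values and model predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)    # RMSE is just the square root of MSE
print(mse, mae, rmse)  # 0.375, 0.5, ~0.61
```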

12. Gini coefficient: No, it is not related to Disney’s Aladdin if that’s what you are thinking.


This metric is used to compare the quality of different models. The Gini coefficient's value ranges from 0 to 1, where 0 denotes perfect equality and 1 represents perfect inequality. It is often used with imbalanced classes, and a higher value indicates a model that separates the classes better.

The Gini coefficient can be derived from the area under the ROC curve:

Gini = 2 · AUC − 1
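A minimal sketch using that relation, assuming scikit-learn and the hypothetical probabilities from the ROC example:

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]

auc = roc_auc_score(y_true, y_prob)
gini = 2 * auc - 1  # Gini coefficient derived from AUC
print(gini)         # ≈ 0.78 for these values
```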

13. Cohen’s Kappa Coefficient: Also known as the Kappa score, it measures the degree of agreement between two evaluators; in ML model evaluation, the two evaluators are the predicted and the actual output values. Agreement is simply the ratio of the number of matching values to the total number of values. Some portion of the agreement could be due to chance, so we introduce a chance-agreement term (ChanceAgree) into the Kappa score:

Kappa = (Agree − ChanceAgree) / (1 − ChanceAgree)

This metric is mostly used in multi-class classification problems. The chance agreement is computed from the class proportions of the two evaluators: for each class, multiply the fraction of samples that each evaluator assigns to that class, then sum these products over all classes.
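A minimal sketch on hypothetical multi-class labels, assuming scikit-learn:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical actual and predicted labels for a 3-class problem
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 0, 2, 2, 0, 0]

# 1 means perfect agreement, 0 means agreement no better than chance
print(cohen_kappa_score(y_true, y_pred))
```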

Thank you for reading my article! I hope it helps you.
