Data Science: Machine Learning Model Metrics

Cem ÖZÇELİK
9 min read · Jan 15, 2022


This article is written by Alparslan Mesri and Cem ÖZÇELİK.

Several metrics are used to measure the performance of the models built in machine learning, which is a part of the data science world. These metrics are used differently in classification and regression models. In this article, we conducted a joint study on the metrics used to measure the performance of machine learning models.

What is a Confusion Matrix?

The Confusion Matrix is a performance indicator used in classification algorithms. The actual outcomes and the predictions of the established model are compared and put into a table. The rows of the confusion matrix represent the actual classes (Actual), and the columns represent the prediction results of the established model (Predicted).

Confusion Matrix

Let’s explain this with an example. Assume that we are predicting which passengers died and which survived the accident on the RMS Titanic. The Actual 0 row, the first row of the table, represents the passengers who actually died in the accident. Likewise, the Actual 1 row represents the passengers who actually survived. The Predicted 0 column represents the model’s prediction that a passenger on board died, and the Predicted 1 column represents the prediction that the passenger survived. In our model, a passenger we predicted to survive may actually have died, or a passenger we predicted to have died may actually have survived.

We have explained what the rows and columns of our table represent. Now, let’s explain what the abbreviations TN, FP, FN, and TP seen in the image stand for.

TN (True Negative): True Negative is the number of passengers who lost their lives in the accident and whom the model also correctly predicted to have lost their lives.

FP (False Positive): False Positive is the number of passengers who lost their lives in the accident but whom the model predicted to have survived.

FN (False Negative): False Negative is the number of passengers who survived the accident but whom the model predicted to have died.

TP (True Positive): True Positive is the number of passengers who survived the accident and whom the model also correctly predicted to have survived.
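As a minimal sketch of how these four counts are typically obtained in practice (the labels below are made up for illustration and are not from the Titanic dataset), scikit-learn’s confusion_matrix can be used:

```python
# Hypothetical labels for illustration: 0 = died, 1 = survived.
from sklearn.metrics import confusion_matrix

y_actual = [0, 0, 0, 1, 1, 1, 0, 1]     # what really happened
y_predicted = [0, 0, 1, 1, 0, 1, 0, 1]  # what the model predicted

# Rows are the actual classes, columns are the predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_actual, y_predicted)
print(cm)              # [[3 1]
                       #  [1 3]]

tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```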

Classification problems can be binary or multi-class. If the problem we are working on is binary (died or survived), as in the Titanic problem, the confusion matrix above is sufficient. In multi-class problems, for example when determining the champion team in a football league, there are many classes. In this case, the confusion matrix must be extended with additional rows and columns.

We have learned the basics of the confusion matrix. Now let’s look at some metrics derived from it, such as Precision, Recall, and Accuracy.

Accuracy

It is a widely used criterion to measure how successful the model is. It expresses the ratio of the number of correctly classified samples (TP+TN) to the total number of samples.

Accuracy Score Formula

Let’s continue with our Titanic example. In the dataset we used, there were 1470 passengers aboard the RMS Titanic. Let’s assume that the outcomes for these 1470 passengers are as follows:

Confusion Matrix

The accuracy value of the binary classification model we established, according to the confusion matrix image above, is:

Accuracy: (400+1000) / (400+1000+45+25) = 95%

As can be seen, according to the accuracy metric, our model predicts with 95% accuracy which Titanic passengers survived the accident and which lost their lives.
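As a quick sketch, here is the same calculation in Python, assuming the 1470 passengers split into 400 true positives, 1000 true negatives, 45 false positives and 25 false negatives (which count corresponds to TP versus TN is an assumption here; accuracy only depends on the correct/incorrect split):

```python
# Assumed counts from the confusion matrix image above.
tp, tn, fp, fn = 400, 1000, 45, 25

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 95.24%
```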

To explain another performance indicator, the precision metric, we will proceed from a different example.

Precision

For example, let’s say you work in a bank, and suppose the bank receives loan applications from various customers. In this case, let’s assume that we are trying to prevent bad loans in order to avoid losses to the bank’s reserves. The bad loan we are talking about here is a loan given to a customer that is not repaid to the bank.

We build a model to predict who will not pay the loan back and who will pay it back. In such a situation:

TP: The customer did not pay the loan back, and we predicted that the customer would not pay it back.

FP: The customer paid the loan back, but we predicted that the customer would not pay it back.

The reason we use precision instead of accuracy in this scenario is that our bank is sensitive to losses arising from bad loans: giving credit to the right customers, rather than giving credit to as many people as possible, largely prevents our potential loss. In this case, we use the precision metric to make sure we are giving credit to the right customers.

Precision Formula

When to use this metric: precision is the right choice when the cost of an FP is high. In our example, if the cost of a bad loan is very important for the bank and the profit we make from the loans we give is relatively low, precision should be used as the metric.

Let’s assume our model predicted that 30 loan applicants would not pay their loans back. To test the model, we gave loans to these 30 customers, and only 10 of them failed to pay. In this situation, our model’s precision is calculated as:

10 / (10 + 20) = 33%
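A minimal sketch of the same precision calculation in Python, with TP = 10 (flagged applicants who really defaulted) and FP = 20 (flagged applicants who actually paid back):

```python
# Precision = TP / (TP + FP): of all applicants flagged as "will not pay back",
# how many actually defaulted.
tp, fp = 10, 20

precision = tp / (tp + fp)
print(f"Precision: {precision:.0%}")  # Precision: 33%
```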

Recall

Recall Formula

When to use this metric: recall is the right choice when the cost of an FN is high. For example, let’s say you build a model that predicts who has cancer and who does not. In such a situation:

TP: The person has cancer, and we predicted that they do.

FN: The person has cancer, but we predicted that they do not.

In such a case, our FP cost is relatively low. Telling a patient “you have cancer” when they do not is unpleasant, but telling a patient “you do not have cancer” when they actually do is a much more costly prediction.
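As a minimal sketch with hypothetical labels (1 = has cancer), recall can be computed with scikit-learn’s recall_score:

```python
# Recall = TP / (TP + FN): of all people who really have cancer,
# how many did the model catch.
from sklearn.metrics import recall_score

y_actual = [1, 1, 1, 1, 0, 0, 0, 0]
y_predicted = [1, 1, 1, 0, 0, 1, 0, 0]  # one real cancer case is missed (FN)

print(recall_score(y_actual, y_predicted))  # 3 / (3 + 1) = 0.75
```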

F1 Score

As the examples show, recall and precision, the two important performance metrics we use when evaluating model performance, each provide benefits in different situations, but they tend to trade off against each other. To handle this trade-off, we can use the F1 score as another performance metric.

F1 Score Formula

The working logic of the F1 score is to take the harmonic mean of recall (the true positive rate) and precision; the harmonic mean penalizes extreme values, so a model cannot score well by excelling at only one of the two while producing many FP or FN errors. This criterion is a measure of how well the classification model we have built performs and is often used to compare classification models.
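A minimal sketch of the F1 calculation, using illustrative precision and recall values that are not tied to any of the examples above:

```python
# F1 = 2 * (precision * recall) / (precision + recall), the harmonic mean.
precision, recall = 0.50, 0.80

f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # 0.62
```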

ROC Curve (Receiver Operating Characteristic Curve)

The ROC curve is a widely used method for evaluating the performance of models built for classification problems. Although it may look theoretically complex, it is in fact calculated from two simple metrics.

TPR (True Positive Rate): The true positive rate is a synonym of recall, which we discussed earlier. It can be given by this formula:

True Positive Rate Formula

FPR (False Positive Rate): The false-positive rate is calculated as follows:

False Positive Rate Formula

After calculating these two metrics, we plot a curve with FPR on the x-axis and TPR on the y-axis and calculate the area under the curve. This area is called the AUC (Area Under the Curve).
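As a quick sketch, TPR and FPR can be computed from the confusion-matrix counts; here we reuse the hypothetical 400/1000/45/25 split assumed in the accuracy example above:

```python
# Assumed counts, as in the accuracy example above.
tp, tn, fp, fn = 400, 1000, 45, 25

tpr = tp / (tp + fn)  # True Positive Rate (recall)
fpr = fp / (fp + tn)  # False Positive Rate
print(round(tpr, 3), round(fpr, 3))  # 0.941 0.043
```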

ROC Curve

When a classification model makes random predictions, the area under its line is 0.5, as represented by the red line in the image above. The larger the area under the line representing the model’s TPR/FPR trade-off on the ROC curve, the higher the model’s success rate. In summary, the better a model performs, the larger the area under the line representing it will be. As can be seen in the image above, the area under the blue curve is larger than the area under the orange line, so the blue curve belongs to a model with better performance.
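A minimal sketch of how the ROC curve points and the AUC are typically obtained with scikit-learn, using synthetic labels and predicted probabilities chosen only for illustration:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_actual = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_actual, y_scores)  # points of the ROC curve
auc = roc_auc_score(y_actual, y_scores)               # area under that curve

print(fpr)  # x-axis values
print(tpr)  # y-axis values
print(auc)
```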

In regression models, a dependent variable Y is estimated using a number of independent X variables. The results obtained from these models may not fully match reality, or we may get wrong results. Therefore, the first question to ask here is how wrong the obtained result is. In other words, what is the distance between the results obtained and reality?
We described the metrics that measure the performance of machine learning models built for classification problems in the previous sections. In this section, we will explain the metrics that show the performance of regression models.

R Square and Adjusted R Square:

The R Square criterion used in a regression model shows how much of the variation in the dependent variable can be explained by the independent variables, X, used in the established regression model.

The R Square measure is the square of the correlation coefficient, and R Square does its job without taking the overfitting problem, also called over-learning, into account. Too many independent variables in the regression model may cause the model to fit the training data very closely. However, this highly fitted model may not achieve the same success in the testing phase. In this case, we can apply the Adjusted R Square method: Adjusted R Square penalizes the addition of extra independent variables to the model and thus helps prevent overfitting.
We can calculate R Squared and Adjusted R Squared as follows:
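A minimal sketch, with made-up numbers and assuming a model with p = 2 independent variables:

```python
# R Squared via scikit-learn, Adjusted R Squared via the usual correction:
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
from sklearn.metrics import r2_score

y_actual = [3.0, 5.0, 7.5, 9.0, 11.0]     # made-up observed values
y_predicted = [2.8, 5.3, 7.0, 9.4, 10.6]  # made-up model predictions

r2 = r2_score(y_actual, y_predicted)

n = len(y_actual)  # number of observations
p = 2              # assumed number of independent variables in the model
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(r2, 3), round(adjusted_r2, 3))
```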

Mean Squared Error (MSE):

After calculating the R Square criterion, let’s consider another performance criterion for our regression model, the “Mean Squared Error”.

The mean squared error is a criterion that shows how far the results obtained from the established regression model are from the true values. It is difficult to infer much from a single result on its own, so comparing the MSE values of different models helps us choose the best regression model.

The Formula For Mean Squared Error
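A minimal sketch of the MSE calculation with made-up numbers, using scikit-learn’s mean_squared_error:

```python
# MSE is the average of the squared differences between actual and predicted values.
from sklearn.metrics import mean_squared_error

y_actual = [3.0, 5.0, 7.5, 9.0, 11.0]
y_predicted = [2.8, 5.3, 7.0, 9.4, 10.6]

print(mean_squared_error(y_actual, y_predicted))
```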

Root Mean Squared Error (RMSE)

The root mean squared error is the square root of the mean squared error value explained in the previous section. RMSE appears in studies more often than MSE, because the MSE value is expressed in squared units and can therefore be hard to interpret and compare across models, whereas RMSE is in the same units as the target variable. Note, however, that MSE, and hence RMSE, is sensitive to outliers.
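Continuing the sketch above, RMSE is simply the square root of the MSE:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = [3.0, 5.0, 7.5, 9.0, 11.0]
y_predicted = [2.8, 5.3, 7.0, 9.4, 10.6]

# RMSE brings the error back to the same units as the target variable.
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))
print(rmse)
```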

Thank you for reading. See you in our next article.
