Classification Performance Metrics with Python Sklearn

Haneul Kim · Published in Analytics Vidhya · 6 min read · Feb 7, 2021

Today we are going to go through the breast_cancer dataset from Sklearn to understand different types of performance metrics for classification problems and why one is sometimes preferred over another.

Even though it is a simple topic, the answer does not always come immediately when someone asks “what is precision/recall?”. By going over examples, I hope everyone, including myself, gets a firm grasp on this topic.

Most people just use accuracy as the performance metric in classification problems. Even though that is sometimes okay, it should be avoided when working with an imbalanced dataset.

Metrics are what we use to compare different models, so that we can choose the most appropriate model for our problem.

Using an inappropriate metric can therefore lead to selecting a non-optimal model. An imbalanced dataset is one in which the class labels are unevenly distributed.

Here are key topics we will cover today:

  • Confusion Matrix
  • Precision, Recall, F1-score
  • ROC curve, ROC AUC curve

First, let’s import the breast_cancer dataset along with our dependencies. We will also exclude some “Malignant” patients from the dataset to mimic how rare malignant patients are in the real world.
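The original code snippet isn’t reproduced here, but a minimal sketch of this step could look like the following (variable names and the random_state are my assumptions, not the author’s exact code):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the dataset and convert it to a pandas dataframe
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target  # in this dataset: 0 = malignant, 1 = benign

# Keep only 30 malignant patients to mimic the real-world rarity of malignant cases
malignant = df[df["target"] == 0].sample(n=30, random_state=42)
benign = df[df["target"] == 1]
df = pd.concat([malignant, benign]).reset_index(drop=True)

# Double-check the count for each label
print(df.groupby("target")["target"].count())
```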

We’ve converted the data loaded from sklearn to a pandas dataframe and excluded all but 30 malignant patients. To double-check the count for each label, we’ve grouped by the target value and counted.

Target count

We can see that out of 387 patients, 30 are malignant and 357 are benign. By convention, the class we want to detect is called the “positive/1” class. Since we want to detect “Malignant” patients, malignant should normally get the positive/1 label, but in this dataset it is the opposite: malignant is 0 and benign is 1.

Let’s create two models and compare their performance using accuracy.

Model 1 (base classifier): simply classify every patient as “benign”. This is analogous to what often happens in reinforcement learning: the model finds the fastest/easiest way to improve its score. If classifying every patient as “benign” already yields good performance (when measured by accuracy), the model will likely stay there.

Model 2 (rf_clf): a Random Forest Classifier, chosen because we didn’t want to scale the X-variables…
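Below is a rough sketch of how the two models might be set up. The 30% train/test split, random_state, and default hyperparameters are my assumptions, not necessarily the author’s exact choices:

```python
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Model 1 (base classifier): always predicts "benign" (label 1)
base_clf = DummyClassifier(strategy="constant", constant=1)
base_clf.fit(X_train, y_train)

# Model 2 (rf_clf): random forest, no feature scaling required
rf_clf = RandomForestClassifier(random_state=42)
rf_clf.fit(X_train, y_train)
```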

accuracy_score is the proportion of patients the model classified correctly, whether as malignant or benign.
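A quick sketch of comparing the two models with sklearn’s accuracy_score (the exact scores will vary with the split):

```python
from sklearn.metrics import accuracy_score

y_pred_base = base_clf.predict(X_test)
y_pred_rf = rf_clf.predict(X_test)

print("base accuracy:", accuracy_score(y_test, y_pred_base))
print("rf accuracy:  ", accuracy_score(y_test, y_pred_rf))
```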

We can see that even our base classifier (all patients are benign) gets pretty high accuracy, so we might be tempted to choose it to classify breast cancer. However, since the base classifier labels every patient as benign, some of those patients will actually be malignant, which can have detrimental consequences. In this case we should put more importance on correctly identifying cancer, even if that means classifying some benign patients as malignant, because believing you have cancer when you actually don’t is better than not knowing you have cancer.

This is where the Confusion Matrix comes into play. In the matrix shown here, each row represents a predicted class and each column an actual class (note that sklearn’s confusion_matrix uses the opposite orientation: rows are actual classes and columns are predicted classes). So in our example, the first row contains all patients predicted as having cancer, and each prediction is counted as True if it matches the actual class, else False.

One trick I always use is to first look at the predicted label (Negative/Positive), then look at whether the prediction was True/False (correct or not).

Here are the key terms in a confusion matrix:

  • True Negative: predicted malignant, actually malignant
  • True Positive: predicted benign, actually benign
  • False Negative: predicted malignant, actually benign
  • False Positive: predicted benign, actually malignant

The confusion matrix is also offered by sklearn: simply pass the actual y and the predicted y into the confusion_matrix() function.
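Continuing the sketch from above:

```python
from sklearn.metrics import confusion_matrix

# Note: sklearn returns rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred_base))
print(confusion_matrix(y_test, y_pred_rf))
```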

Comparing the confusion matrices of the two models we’ve created, the base model leads to detrimental consequences because it cannot identify any malignant patients. The random forest model, even though it identifies one fewer benign patient, reduces the number of missed malignant patients, so it is the preferred model for our problem.

A confusion matrix is a useful way to compare models. However, if we are comparing many models in an automated way, it would be difficult to draw multiple confusion matrices and compare them. We therefore need a quantitative measure, and this is where Precision and Recall come in.

Precision: of all positive predictions, how many are correct? Out of all benign predictions, how many are actually benign?

  • Precision = TP/(TP + FP)

Recall: of all actual positives, how many are predicted as positive? Out of all benign patients, how many were predicted as benign?

  • Recall = TP/(TP + FN)

It is easy to get these mixed up; the difference is in the denominator. For precision, the predictions are in the denominator; for recall, the actual positives (the ground truth) are in the denominator.

precision_score and recall_score are both offered as functions in sklearn. We can double-check them by computing precision and recall from the values in our confusion matrix.
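Continuing the sketch:

```python
from sklearn.metrics import precision_score, recall_score

print("base precision:", precision_score(y_test, y_pred_base))
print("base recall:   ", recall_score(y_test, y_pred_base))
print("rf precision:  ", precision_score(y_test, y_pred_rf))
print("rf recall:     ", recall_score(y_test, y_pred_rf))
```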

Note that depending on the problem, sometimes you want high precision (identifying cancer, in our setup) and sometimes high recall (identifying shoplifters). A perfect model would have both high precision and high recall, but in real life there is a trade-off, which can be adjusted by changing the classification threshold. For example, if we lower the threshold from 0.5 to 0.4, samples with scores between 0.4 and 0.5 are now classified as “1”, which increases recall and decreases precision.

e.g. precision of the base model = TP/(TP + FP) = 106/(106 + 11) ≈ 0.906
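As an illustration (not from the original post), here is a hypothetical sketch of adjusting the threshold manually using the random forest’s predicted probabilities:

```python
from sklearn.metrics import precision_score, recall_score

# Probability of the positive class ("benign" = 1) from the random forest
proba = rf_clf.predict_proba(X_test)[:, 1]

for threshold in [0.5, 0.4]:
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}")
```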

Now we have to compare two metrics. Even though that is easier than comparing full confusion matrices, we can make it simpler still by combining the two into a single number: the F1-score.

  • The score ranges from [0, 1] and is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall). The harmonic mean gives more weight to the lower of the two values.
  • It favors classifiers with similar precision and recall scores, which is why it is also referred to as the “balanced F-score”.

Just like the other metrics, f1_score is offered as a function in sklearn.
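Continuing the sketch:

```python
from sklearn.metrics import f1_score

print("base F1:", f1_score(y_test, y_pred_base))
print("rf F1:  ", f1_score(y_test, y_pred_rf))
```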

Previously we talked about adjusting the threshold to obtain the precision and recall that fit our problem. In order to do that we need to summarize the confusion matrices that each threshold produces, and this is what the ROC curve is used for.

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (recall, also called sensitivity) against the False Positive Rate (1 − TNR) for each threshold we want to compare.

Let’s graph the ROC curve for each classifier to compare which threshold we should choose, and then use the AUC (Area Under the ROC Curve) score to compare the ROC curves of the two classifiers.
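A sketch of how this could be done with sklearn’s roc_curve and roc_auc_score (the plotting details are my own, not the author’s exact code):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Scores for the positive class ("benign" = 1)
proba_rf = rf_clf.predict_proba(X_test)[:, 1]
proba_base = base_clf.predict_proba(X_test)[:, 1]  # constant for every patient

for name, proba in [("base", proba_base), ("rf", proba_rf)]:
    fpr, tpr, thresholds = roc_curve(y_test, proba)
    auc = roc_auc_score(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc:.3f})")

plt.plot([0, 1], [0, 1], "k--", label="TPR = FPR")  # diagonal reference line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```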

The black diagonal line representing TPR = FPR is covered by our base classifier, which means the proportion of correctly classified benign (positive) patients is the same as the proportion of malignant (negative) patients incorrectly classified as benign; in other words, the base classifier has no discriminative power.

We’ve graphed the curve using 3 thresholds [2, 1, 0], which plot the points [(0,0), (0.2,1), (1,1)] respectively. From the graph we can see that a threshold of 1 yields the best combination of TPR and FPR, so it is the preferred threshold.

Lastly, we can see that our random forest classifier has a much higher AUC score, so we can say that it performs better than the base classifier.

In conclusion, choosing the correct performance metric is as important as having a decent model, since using an incorrect metric may guide you in the wrong direction. Even though we only covered binary classifiers, these metrics extend to multi-class classification problems as well. I hope this blog was helpful and that everyone chooses the appropriate metric for their model!
