Matthews Correlation Coefficient: A Metric for Imbalanced Class Problems
In data science and machine learning, we often encounter imbalanced class problems, where one class has significantly more instances than another. This makes accuracy a misleading metric. For instance, if class A represents 90% of the data, and class B only 10%, always predicting class A would yield 90% accuracy, which seems high but is uninformative about the model’s performance on class B.
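The 90/10 scenario above is easy to reproduce. This is a minimal sketch (the class labels and split are illustrative, not from the article) showing that a model which always predicts the majority class reaches 90% accuracy while never detecting class B:

```python
# Illustrative data: 90% class A, 10% class B.
y_true = ["A"] * 90 + ["B"] * 10
# A degenerate "model" that always predicts the majority class.
y_pred = ["A"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.9 — looks high, yet class B is never predicted
```

The 0.9 accuracy says nothing about performance on class B, which is exactly the failure mode MCC is designed to expose.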
One metric that addresses this issue is the Matthews Correlation Coefficient (MCC), introduced by Brian W. Matthews in 1975. Before diving into the calculation of MCC, let’s revisit the concept of a confusion matrix.
A confusion matrix has four cells, created by a combination of the predicted values against the actual values. Two of these cells represent correct predictions (True Positives and True Negatives), and the other two represent incorrect predictions (False Positives and False Negatives).
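The four cells can be tallied directly from paired predictions and labels. Below is a minimal sketch, assuming a binary problem where we pick one label (here class "B") as the positive class; the function name and label choices are illustrative:

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive="B"):
    """Tally the four confusion-matrix cells for a binary problem."""
    counts = Counter(TP=0, TN=0, FP=0, FN=0)
    for t, p in zip(y_true, y_pred):
        if p == positive:
            # Predicted positive: correct -> TP, wrong -> FP
            counts["TP" if t == positive else "FP"] += 1
        else:
            # Predicted negative: correct -> TN, wrong -> FN
            counts["TN" if t != positive else "FN"] += 1
    return counts

# Example: one of each outcome
print(confusion_counts(["A", "A", "B", "B"], ["A", "B", "B", "A"]))
```

Each prediction lands in exactly one cell, so the four counts always sum to the number of samples.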
Calculating the Matthews Correlation Coefficient
The Matthews Correlation Coefficient is calculated as follows: