Negative value of r-square in the data science world… Is it a myth?

Uncover the possibilities and consequences of a sub-zero r-square value in linear regression

Amol Marathe
Data Science Insights and Predictions
3 min read · Sep 4, 2020


It is a well-known fact that r-square is fundamentally built from the squared errors of a machine-learning model. The obvious question, then, is: how on earth could a quantity built from squares be negative? To your surprise, yes, r-square CAN be less than zero. In this article, we will not only rationalize the reasons behind this apparent paradox but also unveil its consequences and learn more about the coefficient of determination for a linear regression model.

To begin with, you must understand the theory behind linear regression. As you might already know, the primary goal of linear regression is to find the function (or line) that best fits the given input data points. The quality of this fit is a measure of how closely the line generalizes all the data points. To eliminate any confusion about the terms, let us be clear that this measure of fit is nothing but the r-square value, also called the coefficient of determination.

To evaluate the coefficient of determination, we compare the squared error of the regression line (y-hat) with the squared error of the mean of y in the dataset: r-square is one minus the ratio of the two. To put it in other words, it estimates how far or how close each data point lies from the regression line as compared to its distance from the y-mean line. It follows that the larger the errors around the regression line relative to the overall variance of the data set, the lower the r-square value. The mathematical formula to calculate r-square is:

r-square or the coefficient of determination
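In plain terms, the formula is: r-square = 1 − (sum of squared errors around the regression line) / (sum of squared errors around the y-mean line). As a minimal sketch of that calculation, assuming NumPy is available (the article itself shows no code), it could be written as:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared errors around the regression line
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared errors around the y-mean line
    return 1.0 - ss_res / ss_tot
```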

Let us understand the formula clearly. There are two reasons to square the errors instead of using the raw error values (a short numeric sketch after this list makes both points concrete):

1. We want only positive values, since a data point can lie on either side of the regression line.

2. We want to penalize outlier points in the data for being too far from the regression line.
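Here is the promised sketch, using made-up residuals purely for illustration: squaring removes the sign of each error, and the outlier's contribution is inflated far more than that of the well-behaved points.

```python
residuals = [1.0, -1.0, 5.0]             # two small errors on opposite sides, one outlier

absolute_errors = [abs(r) for r in residuals]
squared_errors = [r ** 2 for r in residuals]

print(sum(absolute_errors))  # 7.0  -> the outlier accounts for only 5 of 7
print(sum(squared_errors))   # 27.0 -> the outlier alone contributes 25 of 27
```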

By construction, the regression line fits the data points at least as well as the y-mean line, which is a horizontal line because the y-mean is a constant for the entire data set. The data points therefore typically lie farther from the y-mean (horizontal) line than from the regression line, so the errors around the y-mean line exceed the errors around the regression line. This keeps the ratio in the r-square formula below one, making the r-square value positive. The following graph illustrates this case of a positive r-square value.

Regression line for the ‘positive’ value of r-square
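As a small sketch of this positive case (the synthetic data, random seed, and use of NumPy's polyfit are assumptions for illustration, not taken from the article), an ordinary least-squares line fitted to roughly linear data gives a ratio well below one and hence a positive r-square:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # noisy but clearly linear data

slope, intercept = np.polyfit(x, y, deg=1)              # ordinary least-squares fit
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # errors around the regression line
ss_tot = np.sum((y - y.mean()) ** 2)  # errors around the y-mean line
print(1 - ss_res / ss_tot)            # close to 1.0, certainly positive
```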

Interestingly, there is one rare case in which the regression line is estimated so poorly that its errors are even higher than those of the y-mean (horizontal) line. In such a situation, the ratio in the r-square formula exceeds one, causing r-square to become negative. Although this is a rare case in the data science context, the possibility of an extremely poor model cannot be denied, for example due to mistakes unknowingly committed by a novice data scientist or a model being evaluated on data it fits very badly. This situation can be illustrated in a graph as follows:

Regression line for the ‘negative’ value of r-square
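A brief sketch of the negative case, assuming scikit-learn's r2_score is available (the article names no library): predictions that track the target worse than a flat mean line push the ratio above one and r-square below zero.

```python
import numpy as np
from sklearn.metrics import r2_score  # assumption: scikit-learn is installed

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_mean = np.full_like(y_true, y_true.mean())   # the y-mean (horizontal) line
y_bad = np.array([5.0, 4.0, 3.0, 2.0, 1.0])    # a badly estimated, reversed trend

print(r2_score(y_true, y_mean))  # 0.0  -> exactly as good as the mean line
print(r2_score(y_true, y_bad))   # -3.0 -> errors exceed those of the mean line
```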

In conclusion, even though you may be surprised to see a negative value for r-square, the coefficient of determination, it is neither a mathematical impossibility nor a programming bug. It just denotes that the machine-learning model you chose (with its constraints) fits the data poorly and is an extreme example of an underfitted model.
