Why Linear Regression Does Not Work for Classification: Part II

Nikita Gupta · Published in Analytics Vidhya · 3 min read · Feb 20, 2020

Complete Analysis.

I suggest you check Part I before proceeding.

In Part I, we checked whether Linear Regression works for a qualitative response variable with more than two classes. Here, we will check whether it works for binary classification.

We will work with the Breast Cancer Prediction dataset this time. For simplicity, I have considered the mean perimeter of the tumor as the only feature in the data.

Based on the mean perimeter, we will use Linear Regression to determine whether a tumor is benign (non-cancerous) or malignant (cancerous).
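The article's data-loading code isn't reproduced here; as a minimal sketch, assuming the Wisconsin breast cancer data bundled with scikit-learn as a stand-in for the dataset used, the single-feature frame could be built like this:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer data bundled with scikit-learn
# (an assumed stand-in for the dataset used in the article).
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Keep only the single feature used here: the mean perimeter of the tumor.
df = df[["mean perimeter"]]

# scikit-learn encodes the target as 0 = malignant, 1 = benign;
# map it to the B/M labels shown in the plots below.
df["diagnosis"] = ["M" if t == 0 else "B" for t in cancer.target]

print(df.head())
```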

Our data looks something like this:

Breast Cancer data distribution (B represents benign and M represents malignant tumors)

To work with Linear Regression, let's encode B as 0 and M as 1.

Let’s run Linear Regression on this:

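The original code appears as an image; a minimal sketch that could reproduce this step, continuing the hypothetical df above and fitting on the full data (the article's train/test handling isn't shown), looks like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Feature matrix: the mean perimeter; target: B -> 0, M -> 1.
X = df[["mean perimeter"]].values
y = np.where(df["diagnosis"] == "M", 1, 0)

# Fit ordinary least squares on the 0/1 target.
lr = LinearRegression()
lr.fit(X, y)
y_pred = lr.predict(X)

print("coefficient:", lr.coef_)
print("Intercept:", lr.intercept_)
print("MSE:", mean_squared_error(y, y_pred))
print("R2_score:", r2_score(y, y_pred))
```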

coefficient: [0.01291051]
Intercept: -0.9710559009446079
MSE: 0.03373167568329971
R2_score: 0.747858980220586

Now, let's choose the other possible encoding, i.e. B as 1 and M as 0, and rerun the code.
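Continuing the same sketch, only the target encoding changes:

```python
# Flip the encoding: B -> 1, M -> 0, then refit and re-evaluate.
y_flipped = np.where(df["diagnosis"] == "B", 1, 0)
lr.fit(X, y_flipped)
y_pred = lr.predict(X)

print("coefficient:", lr.coef_)
print("Intercept:", lr.intercept_)
print("MSE:", mean_squared_error(y_flipped, y_pred))
print("R2_score:", r2_score(y_flipped, y_pred))
```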

coefficient: [-0.01291051]
Intercept: 1.9710559009446085
MSE: 0.03373167568329972
R2_score: 0.747858980220586

The Mean Squared Error and R-squared values are exactly the same for both encodings. Even the slope coefficient has the same magnitude; only its sign is flipped. This is expected: flipping the encoding replaces y with 1 − y, so the new fit is simply 1 − h(x), which negates the slope and maps the intercept −0.971 to 1.971.

Let’s plot the hypothesis h(x) for both encodings:

Linear Regression on binary classification (LEFT: B as 0 and M as 1; RIGHT: B as 1 and M as 0)

So, unlike classification with more than two classes, the encoding causes no problem here. We can pick either encoding and get equivalent results.

To make a prediction for a given tumor size x (using the encoding B as 0 and M as 1): if h(x) is greater than 0.5, we predict malignant; otherwise, we predict benign.
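As a sketch of that decision rule, continuing the earlier code with the first encoding:

```python
# Refit with the first encoding (B -> 0, M -> 1) and apply the 0.5 threshold.
lr.fit(X, y)
predicted = np.where(lr.predict(X) > 0.5, "M", "B")
actual = np.where(y == 1, "M", "B")

print("misclassified points:", (predicted != actual).sum())
```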

So this is what we get:

Actual and predicted results from Linear Regression with the first encoding (red circles mark points where the actual label differs from the predicted one)

It looks like we have correctly predicted every data point except one. But now let's change the data a bit.

Let's add another sample with a huge tumor size and run Linear Regression again:
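A sketch of this experiment under the same assumptions (the outlier's value below is illustrative, not the one used in the article):

```python
# Append one illustrative malignant sample with a very large mean perimeter
# (the exact value used in the article is not shown).
X_new = np.vstack([X, [[400.0]]])
y_new = np.append(y, 1)  # malignant -> 1 under the first encoding

lr.fit(X_new, y_new)
predicted_new = np.where(lr.predict(X_new) > 0.5, "M", "B")
actual_new = np.where(y_new == 1, "M", "B")
print("misclassified points:", (predicted_new != actual_new).sum())
```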

Linear Regression on binary classification (LEFT: actual data; RIGHT: actual data with some outliers)

Now, the rule "h(x) > 0.5 means malignant" gives predictions that look like this:

Linear Regression on binary classification (LEFT: actual data; RIGHT: actual data with some outliers), with a threshold of 0.5

To keep making correct predictions, we would need to change the threshold to something like h(x) > 0.3, but that is not how the algorithm should work.

Linear Regression on binary classification on the new data (with some outliers), with a threshold of 0.3

We cannot keep changing the decision rule every time the data changes. Instead, the hypothesis should be learned from the training data and then make correct predictions on data it hasn't seen before.

Conclusion

To conclude, Linear Regression is not suitable for binary classification either.

It can be made to act as a binary classifier by imposing a decision rule, such as predicting a malignant tumor when the hypothesis h(x) > 0.5. But, as we saw, this does not work in every scenario.

For the complete code, check this GitHub link.

Please comment with any suggestions, corrections, or criticism.

Thank You!
