Why Linear Regression Does Not Work for Classification: Part II
Complete Analysis.
I suggest reading Part I before proceeding.
In Part I, we checked whether Linear Regression works for a qualitative response variable with more than 2 classes. Here, we will check whether it works for binary classification.
We will work with the Breast Cancer Prediction Dataset this time. For simplicity, I have used the mean perimeter of the tumor as the only feature.
Based on the mean perimeter size, we will determine if a tumor is benign (non-cancerous) or malignant (cancerous) with Linear Regression.
Our data looks something like this:
To work with Linear Regression, let’s encode B as 0 and M as 1.
Let’s run Linear Regression on this:
coefficient: [0.01291051]
Intercept: -0.9710559009446079
MSE: 0.03373167568329971
R2_score: 0.747858980220586
Now, let’s choose the other possible encoding, i.e. B as 1 and M as 0, and rerun the code.
coefficient: [-0.01291051]
Intercept: 1.9710559009446085
MSE: 0.03373167568329972
R2_score: 0.747858980220586
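The Mean Squared Error and R-squared values are the same for both encodings (up to floating-point rounding). The slopes are equal in magnitude and opposite in sign, and the new intercept is 1 minus the old one: 1 - (-0.9711) = 1.9711. In other words, flipping the labels simply replaces h(x) with 1 - h(x).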
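The comparison of the two encodings can be reproduced in miniature. The sketch below is not the code behind the numbers reported above: it uses a hand-written least-squares fit in plain Python and a handful of made-up perimeter values, so its output will differ from the article's. The helper names (fit_line, mse, r2_score) are only illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def r2_score(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up mean perimeters, for illustration only (not the article's dataset).
perimeters = [70.0, 78.0, 85.0, 92.0, 110.0, 120.0, 130.0, 145.0]
labels_m1 = [0, 0, 0, 0, 1, 1, 1, 1]        # encoding: B = 0, M = 1
labels_b1 = [1 - y for y in labels_m1]      # encoding: B = 1, M = 0

slope_a, intercept_a = fit_line(perimeters, labels_m1)
slope_b, intercept_b = fit_line(perimeters, labels_b1)

print("B=0, M=1:", slope_a, intercept_a,
      mse(perimeters, labels_m1, slope_a, intercept_a),
      r2_score(perimeters, labels_m1, slope_a, intercept_a))
print("B=1, M=0:", slope_b, intercept_b,
      mse(perimeters, labels_b1, slope_b, intercept_b),
      r2_score(perimeters, labels_b1, slope_b, intercept_b))
```

On any data of this kind, the second slope should be the negative of the first, the two intercepts should add up to 1, and the error metrics should match, which is the same pattern as in the results above.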
Let’s plot the hypothesis h(x) for both encodings:
So, unlike classification with more than 2 classes, the choice of encoding is not a problem here. We can pick either encoding and get equivalent results.
With B encoded as 0 and M as 1, we make a prediction for a given tumor size x as follows: if h(x) is greater than 0.5, we predict a malignant tumor; otherwise, we predict benign.
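As a rough sketch, this decision rule can be written out using the slope and intercept reported earlier for the encoding B = 0, M = 1. The function names and the sample perimeter values are hypothetical and chosen only for illustration.

```python
# Coefficients reported earlier for the encoding B = 0, M = 1.
SLOPE = 0.01291051
INTERCEPT = -0.9710559009446079
THRESHOLD = 0.5


def h(perimeter):
    """Linear hypothesis h(x) = slope * x + intercept."""
    return SLOPE * perimeter + INTERCEPT


def predict(perimeter):
    """Return 'M' (malignant) if h(x) > 0.5, otherwise 'B' (benign)."""
    return "M" if h(perimeter) > THRESHOLD else "B"


# The mean perimeter at which h(x) crosses the threshold.
boundary = (THRESHOLD - INTERCEPT) / SLOPE
print("decision boundary at mean perimeter of about", round(boundary, 1))

# Hypothetical perimeter values, one on each side of the boundary.
for perimeter in (80.0, 130.0):
    print(perimeter, "->", predict(perimeter))
```

Because h(x) is a straight line, the rule h(x) > 0.5 is equivalent to a single cut-off on the mean perimeter, which for these coefficients lands at roughly 114.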
So this is what we get:
Looks like we have correctly predicted every single data point except one, but now let’s change the data a bit.
Let’s add another sample with a huge tumor size, and run Linear Regression again:
Now the rule "h(x) > 0.5 means malignant" gives predictions that look like this:
To keep making correct predictions, we would need to change the rule to something like h(x) > 0.3, but that’s not how the algorithm should work.
We cannot change the hypothesis each time our data is updated. Instead, the hypothesis should be learned from the training data and then make correct predictions for data it hasn’t seen before.
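The effect of one extreme sample can be illustrated with a small sketch. It again uses made-up perimeter values and a hand-written least-squares fit rather than the article's dataset or code, so the specific numbers are assumptions; only the direction of the shift is the point.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def count_errors(xs, ys, slope, intercept, threshold=0.5):
    """Count samples where the rule h(x) > threshold disagrees with the label."""
    predictions = [1 if slope * x + intercept > threshold else 0 for x in xs]
    return sum(p != y for p, y in zip(predictions, ys))


# Made-up data, for illustration only. Encoding: B = 0, M = 1.
perimeters = [70.0, 78.0, 85.0, 92.0, 110.0, 120.0, 130.0, 145.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

slope, intercept = fit_line(perimeters, labels)
boundary_before = (0.5 - intercept) / slope
errors_before = count_errors(perimeters, labels, slope, intercept)

# Add one malignant sample with a very large (hypothetical) perimeter and refit.
perimeters_out = perimeters + [400.0]
labels_out = labels + [1]

slope_out, intercept_out = fit_line(perimeters_out, labels_out)
boundary_after = (0.5 - intercept_out) / slope_out
errors_after = count_errors(perimeters_out, labels_out, slope_out, intercept_out)

print("boundary before:", boundary_before, "errors:", errors_before)
print("boundary after: ", boundary_after, "errors:", errors_after)
```

In this toy setup the outlier flattens the fitted line and pushes the 0.5 crossing point to a larger perimeter, so a malignant sample that was previously classified correctly falls on the benign side. Lowering the threshold would fix this particular dataset, but the right threshold would then depend on which samples happen to be in the training data.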
Conclusion
To conclude, Linear Regression is not suitable for binary classification either.
It can be used as a binary classifier by imposing a decision rule, such as predicting a malignant tumor when h(x) > 0.5. But this rule does not hold up in every scenario: a single extreme sample can shift the fitted line enough that the same threshold starts misclassifying points.
For complete code, check this GitHub link.
Please leave a comment with any suggestions, corrections, or criticism.
Thank You!