Why Does Linear Regression Not Work for Classification? Part I
Complete Analysis.
Part I: Is Linear Regression suitable for a Qualitative Response Variable with more than 2 classes? Why or why not?
Part II: Is Linear Regression suitable for a Qualitative Response Variable with 2 classes? Why or why not?
Suppose we are given the features of some flowers and we need to classify each flower as one of three species, ‘Iris-Setosa’, ‘Iris-Versicolor’ or ‘Iris-Virginica’, based on features like sepal length, sepal width, petal length, and petal width.
In practice, for this kind of data, we use Classification algorithms like Logistic Regression, Linear Discriminant Analysis, SVM, etc. But can we solve it using Linear Regression? Let’s check.
The above-mentioned data is the IRIS dataset, and you can check it here.
I also suggest reading about Linear Regression before going ahead.
The data looks something like this:
To work with Linear Regression, we need to choose a numeric encoding for the three species of flower:
Y = 1 if Iris-setosa, 2 if Iris-versicolor, 3 if Iris-virginica
All set!
We can now use Linear Regression to fit IRIS data with this encoding and predict Y.
Let’s run the code.
Code.
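The original code is not reproduced here, so the following is only a minimal sketch of what such a fit could look like with scikit-learn. The 80/20 train/test split, the stratification, and the `random_state` value are assumptions of this sketch, not details taken from the article, so the printed numbers need not match the ones reported next.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data  # sepal length, sepal width, petal length, petal width (cm)

# load_iris codes the species as 0, 1, 2 (setosa, versicolor, virginica);
# adding 1 gives the 1, 2, 3 encoding used in this article.
y = iris.target + 1

# Assumed split: 80% train, 20% test, fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Ordinary least squares on the numeric species codes.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Coefficients:", model.coef_)
print("Mean Squared Error:", mse)
print("R2_score:", r2)
```

With a different split or seed, the coefficients and scores will shift somewhat, but the fit under this encoding should stay strong.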
Coefficients: [-0.16079931 -0.03026038 0.27888762 0.53921169]
Mean Squared Error: 0.060240061710185554
R2_score: 0.9158135783553307
Well, we are getting quite good results with it.
To make predictions, I treat predicted values from 0.5 to 1.5 as 1 (Iris-setosa), from 1.5 to 2.5 as 2 (Iris-versicolor), and from 2.5 to 3.5 as 3 (Iris-virginica).
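This cutoff rule can be sketched in a few lines of NumPy. The helper name `to_class_code` and the sample predictions are made up for illustration, and the sketch assumes that values below 0.5 or above 3.5 are simply assigned to the nearest class, which the rule above does not specify.

```python
import numpy as np

SPECIES = {1: "Iris-setosa", 2: "Iris-versicolor", 3: "Iris-virginica"}

def to_class_code(y_pred):
    """Map continuous regression outputs to the codes 1, 2, 3.

    Cut points are 1.5 and 2.5. Anything below 1.5 becomes 1 and
    anything at or above 2.5 becomes 3 (an assumption for values
    outside the 0.5 to 3.5 range).
    """
    y_pred = np.asarray(y_pred, dtype=float)
    # np.digitize returns 0, 1, or 2 for the three intervals; shift to 1, 2, 3.
    return np.digitize(y_pred, bins=[1.5, 2.5]) + 1

# Illustrative (made-up) regression outputs.
example = np.array([0.93, 1.62, 2.49, 2.71, 3.8])
codes = to_class_code(example)
labels = [SPECIES[int(c)] for c in codes]

print(codes)
print(labels)
```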
But Wait!
We could have chosen an equally reasonable encoding like this:
Y = 1 if Iris-virginica, 2 if Iris-setosa, 3 if Iris-versicolor
Let’s rerun the code, keeping everything the same except this encoding.
And this time we get:
Coefficients: [ 0.22466196 -0.82830855 0.03601207 -0.78051466]
Mean Squared Error: 0.519702808980069
R2_score: 0.22688838333543437
This time, I treat predicted values from 0.5 to 1.5 as 1 (Iris-virginica), from 1.5 to 2.5 as 2 (Iris-setosa), and from 2.5 to 3.5 as 3 (Iris-versicolor).
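To see the effect of the encoding in isolation, one option is to fit both encodings on exactly the same train/test rows. The sketch below does that with scikit-learn; the split parameters and variable names are assumptions of this sketch, so the exact scores will differ from the ones quoted in this article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X, species = iris.data, iris.target  # 0=setosa, 1=versicolor, 2=virginica

# Each array is indexed by the load_iris species code.
encodings = {
    "first": np.array([1, 2, 3]),   # setosa=1, versicolor=2, virginica=3
    "second": np.array([2, 3, 1]),  # setosa=2, versicolor=3, virginica=1
}

# Assumed split, shared by both fits so only the encoding changes.
train_idx, test_idx = train_test_split(
    np.arange(len(species)), test_size=0.2, random_state=0, stratify=species
)

results = {}
for name, codes in encodings.items():
    y = codes[species]  # translate species into this encoding
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    results[name] = {
        "mse": mean_squared_error(y[test_idx], y_pred),
        "r2": r2_score(y[test_idx], y_pred),
    }
    print(name, results[name])
```

The features and rows are identical in both fits; only the numbers attached to the species change.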
So what just happened?
We were getting an R-squared of about 0.92 and now it is only about 0.23, and the Mean Squared Error increased from about 0.06 to about 0.52.
The reason is clearly the encoding, but why?
Unfortunately, a numeric encoding implies an ordering on the outcomes.
When we use the first encoding, we are implicitly stating that the difference between Setosa and Versicolor is the same as the difference between Versicolor and Virginica.
The second encoding instead suggests that the difference between Virginica and Setosa is the same as the difference between Setosa and Versicolor.
With this kind of qualitative values, this doesn’t make any sense.
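A quick way to see what each encoding quietly asserts is to list the gap between every pair of species codes. The dictionaries below restate the two encodings used in this article; the helper `pairwise_gaps` is a hypothetical name introduced only for this illustration.

```python
from itertools import combinations

first = {"setosa": 1, "versicolor": 2, "virginica": 3}
second = {"virginica": 1, "setosa": 2, "versicolor": 3}

def pairwise_gaps(encoding):
    """Return the absolute difference in codes for every pair of species."""
    return {
        (a, b): abs(encoding[a] - encoding[b])
        for a, b in combinations(sorted(encoding), 2)
    }

gaps_first = pairwise_gaps(first)
gaps_second = pairwise_gaps(second)

for pair in gaps_first:
    print(pair, "first:", gaps_first[pair], "second:", gaps_second[pair])
```

Under the first encoding, Setosa and Virginica are two units apart and every other pair is one unit apart. Under the second, the two-unit gap moves to Versicolor and Virginica. Nothing about the flowers justifies either choice.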
If the Qualitative Response Variable has naturally ordered values like low, medium, high, then a 1, 2, 3 encoding would be reasonable, provided we are willing to assume that the difference between low and medium is the same as the difference between medium and high.
These different encodings produce different linear models that ultimately lead to different sets of predictions on test observations.
Our first encoding is giving better results, but is the first encoding more reasonable to choose than the second one?
No. Both encodings are equally reasonable, and we have no logical basis for choosing the first over the second; one could pick either.
Conclusion
In practice, there is no natural way to convert a qualitative response variable with more than two classes into a quantitative variable and fit it with Linear Regression.
For these reasons, it is preferable to use a Classification method suited to a Qualitative Response Variable.
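As a point of comparison, here is a minimal sketch using Linear Discriminant Analysis, one of the classification methods mentioned at the start. It is trained directly on the species names, so no numeric encoding or ordering has to be chosen. The split parameters are again assumptions of this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data

# Use the species names themselves as labels; no 1, 2, 3 codes are needed.
y = iris.target_names[iris.target]

# Assumed split: 80% train, 20% test, fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Classes:", clf.classes_)
print("Accuracy:", accuracy)
```

Because the classifier treats the labels as unordered categories, renaming or reordering the species does not change which flowers end up grouped together.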