Why Linear Regression Does Not Work for Classification: Part I

Nikita Gupta · Analytics Vidhya · Feb 20, 2020

A complete analysis, in two parts:

Part I: Is Linear Regression suitable for a qualitative response variable with more than two classes? Why or why not?

Part II: Is Linear Regression suitable for a qualitative response variable with two classes? Why or why not?

Suppose we are given the features of some flowers and we need to classify each flower into one of three species, ‘Iris-Setosa’, ‘Iris-Versicolor’, or ‘Iris-Virginica’, based on features like sepal length, sepal width, petal length, and petal width.

In practice, for this kind of data, we use classification algorithms like Logistic Regression, Linear Discriminant Analysis, SVM, etc. But can we solve it using Linear Regression? Let’s check.

The data described above is the IRIS dataset, and you can check it here.

I also suggest reading about Linear Regression before going ahead.

The data looks something like this:

[Figure: IRIS flower species plotted against their petal length and petal width]

To work with Linear Regression, we need to choose an encoding for the three species of flower:

First encoding for the IRIS response variable:
Iris-setosa → 1
Iris-versicolor → 2
Iris-virginica → 3

All set!

We can now use Linear Regression to fit IRIS data with this encoding and predict Y.

Let’s run the code.

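Below is a minimal sketch of what this run could look like, assuming scikit-learn’s bundled copy of the IRIS data and a fit on the full dataset; the original code lives in the GitHub repo linked at the end, so details like a train/test split are assumptions here.

# Minimal sketch: fit Linear Regression to the IRIS data with the first encoding.
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

iris = load_iris()
X = iris.data              # sepal length, sepal width, petal length, petal width
y = iris.target + 1        # first encoding: setosa=1, versicolor=2, virginica=3

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

print("Coefficients:", model.coef_)
print("Mean Squared Error:", mean_squared_error(y, y_pred))
print("R2_score:", r2_score(y, y_pred))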

Coefficients: [-0.16079931 -0.03026038 0.27888762 0.53921169]
Mean Squared Error: 0.060240061710185554
R2_score: 0.9158135783553307

Well, we are getting quite good results with it.

To make predictions, I treat predicted values from 0.5 to 1.5 as 1 (Iris-setosa), 1.5 to 2.5 as 2 (Iris-versicolor), and 2.5 to 3.5 as 3 (Iris-virginica).
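One way to implement this thresholding (a sketch continuing the code above, not necessarily the author’s exact code) is to round each prediction to the nearest integer and clip it to the valid range of class labels.

import numpy as np

# Round continuous predictions to the nearest class label and clip to 1..3
# (1 = Iris-setosa, 2 = Iris-versicolor, 3 = Iris-virginica under the first encoding).
y_class = np.clip(np.rint(y_pred), 1, 3).astype(int)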

[Figure: Actual vs. predicted results from Linear Regression with the first encoding; red circles mark points where the predicted class differs from the actual class]

But Wait!

We could have chosen an equally reasonable encoding like this:

Second encoding for the IRIS response variable:
Iris-virginica → 1
Iris-setosa → 2
Iris-versicolor → 3

Let’s rerun the code, keeping everything the same except the encoding.
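Continuing the sketch above, the only change is the target encoding; the mapping below is inferred from the second encoding table and the class assignments described later.

# Second encoding: virginica=1, setosa=2, versicolor=3
# (load_iris labels: 0 = setosa, 1 = versicolor, 2 = virginica).
second_encoding = {0: 2, 1: 3, 2: 1}
y2 = [second_encoding[label] for label in iris.target]

model2 = LinearRegression().fit(X, y2)
y2_pred = model2.predict(X)

print("Coefficients:", model2.coef_)
print("Mean Squared Error:", mean_squared_error(y2, y2_pred))
print("R2_score:", r2_score(y2, y2_pred))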

And this time we get:

Coefficients: [ 0.22466196 -0.82830855  0.03601207 -0.78051466]
Mean Squared Error: 0.519702808980069
R2_score: 0.22688838333543437

This time, I treat predicted values from 0.5 to 1.5 as 1 (Iris-virginica), 1.5 to 2.5 as 2 (Iris-setosa), and 2.5 to 3.5 as 3 (Iris-versicolor).

[Figure: Actual vs. predicted results from Linear Regression with the second encoding]

So what just happened?

We were getting an R-squared of 0.91, and now it is only 0.22, while the Mean Squared Error increased from 0.06 to 0.52.

The reason is clearly the encoding, but why?

Unfortunately, such an encoding implies an ordering on the outcomes.

With the first encoding, we are implicitly stating that the difference between Setosa and Versicolor is the same as the difference between Versicolor and Virginica.

The second encoding, in turn, suggests that the difference between Virginica and Setosa is the same as the difference between Setosa and Versicolor.

With qualitative values like these, such an ordering does not make any sense.

If the qualitative response variable took ordered values like low, medium, and high, then a 1, 2, 3 encoding would be reasonable, because we could assume that the gap between low and medium is similar to the gap between medium and high.

These different encodings produce different linear models with different coefficients, which ultimately lead to different predictions on test observations.

Our first encoding gives better results, but is it really a more reasonable choice than the second one?

No. Both encodings are equally reasonable, and we have no logical reason to prefer the first over the second; one could choose either.

Conclusion

In practice, there is no natural way to convert a qualitative response variable with more than two classes into a quantitative variable that Linear Regression can sensibly fit.

For these reasons, it is preferable to use a suitable classification method for a qualitative response variable.

For the complete code, check this GitHub link.

Check Part II for Linear Regression with Binary Classification.

Please comment with any suggestions, corrections, or criticism.

Thank You!
