Dealing with Categorical Data

For Machine Learning - Multi-target Regression

Mulbah Kallen
Published in Analytics Vidhya
Sep 30, 2019 · 5 min read


In our previous section we were able to greatly improve our R² score by changing our single linear regression model into a multivariate linear regression model. This was done by adding the remaining predictors (features) into our linear regression model and refitting (retraining) it.

That was easily done because all of our additional predictors were numerical values, but what if they weren’t? Review the dataset below and notice that it contains both numerical and categorical data. In fact, our dataset is mostly categorical.

It is important to point out that categorical variables come in two types: Nominal and Ordinal.

An example of an Ordinal Categorical Variable would be size recorded as small, medium or large, since large > medium > small.
On the other hand, a Nominal Categorical Variable could be pet registered as cat, dog or bird: from the perspective of a machine learning model, Cat should not carry any intrinsic value lesser or greater than Dog or Bird.

This distinction matters when it comes time to train our machine learning model. Models are unable to deal with categorical data left as is; it must be converted into numbers. The danger lies in how we go about that conversion.

For Ordinal data there is no issue in assigning the categories the following numerical values:

L - 3
M - 2
S - 1

Our model will take into account the intrinsic nature of the size differences, L > M > S, via the corresponding 3 > 2 > 1.
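As a quick sketch, assuming our sizes live in a pandas DataFrame column named size (a hypothetical name), that mapping could be applied like so:

```python
import pandas as pd

# Hypothetical DataFrame with an ordinal "size" column
df = pd.DataFrame({"size": ["S", "M", "L", "M", "S"]})

# Map each category to a number that preserves the order L > M > S
size_order = {"S": 1, "M": 2, "L": 3}
df["size_encoded"] = df["size"].map(size_order)
print(df)
```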

On the other hand, assigning Cat the value of 1, Dog the value of 2 and Bird the value of 3 may cause our model to misinterpret the values of Cat, Dog and Bird, greatly affecting the error function of our model.

Error function: a method of evaluating how well a specific algorithm models the given data.

One solution to this is one-hot encoding. This method gives each value of a predictor (feature) its own binary column that our model can read accurately without giving one value more weight than another. This would look something like the example below.
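The original illustration isn’t reproduced here, but as a rough pandas sketch (the pet column is hypothetical), one-hot encoding looks like this:

```python
import pandas as pd

# Hypothetical "pet" column holding nominal categories
df = pd.DataFrame({"pet": ["cat", "dog", "bird", "dog"]})

# One-hot encode: each category becomes its own 0/1 column,
# so no category carries more numeric weight than another
encoded = pd.get_dummies(df, columns=["pet"])
print(encoded)  # columns: pet_bird, pet_cat, pet_dog
```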

Now let’s turn our eyes back to our original dataset. It seems that our target is also categorical, and it is not binary but multi-class. Again, the purpose of the dataset is to predict one of three scenarios: based on the patient’s current status, should they be sent to intensive care (I), the general hospital floor (A), or home (S)? If our target were binary we could use a linear regression model again once our dataset were one-hot encoded, but in this case that isn’t possible.

Our next option is to look into Logistic Regression. Although the word regression is in its name, logistic regression is truly a model for classification, not regression. Most often logistic regression models are designed to tackle binary classification problems, but scikit-learn allows us to tune the logistic regression classifier to handle multi-class targets.

After you’ve gone through and cleaned your dataset, it only takes a few lines of code to one-hot encode the data, initialize and fit a logistic regression model, and see how well it did.
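The original notebook isn’t reproduced here, but a minimal reconstruction of those few lines, assuming a cleaned DataFrame df with a target column named target holding I, A, or S (both names are assumptions), might look like this:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed: a cleaned DataFrame `df` with a "target" column (I, A, or S)
X = pd.get_dummies(df.drop(columns=["target"]))  # one-hot encode the predictors
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 'ovr' = One-vs-Rest for the multi-class target;
# 'balanced' reweights classes inversely to their frequencies
logreg = LogisticRegression(multi_class="ovr", class_weight="balanced",
                            solver="liblinear")
logreg.fit(X_train, y_train)
predictions = logreg.predict(X_test)
```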

Notice that our multi-class tuning, multi_class, is set to ‘ovr’, meaning One-vs-Rest.

class_weight is set to ‘balanced’ because of the disproportionate class counts in our target.

“The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data”.

This is done so that our model doesn’t assume that, because 70% of our target values are indeed A, it should just guess A 70% of the time.

Note the two outcomes when our logistic regression is balanced vs unbalanced. Our unbalanced predictions consist mostly of A, whereas our balanced predictions more accurately represent our true data.

[Prediction label distributions: UNBALANCED vs. BALANCED]
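One way to see that contrast yourself, continuing from the sketch above, is to fit a second, unweighted model and compare the spread of predicted labels (again, just an illustrative sketch):

```python
# Unweighted model for comparison
unbalanced = LogisticRegression(multi_class="ovr", solver="liblinear")
unbalanced.fit(X_train, y_train)

print(pd.Series(unbalanced.predict(X_test)).value_counts())  # skews toward A
print(pd.Series(predictions).value_counts())                 # closer to the true mix
print(y_test.value_counts())                                 # actual distribution
```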

Let’s view our final score.
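Continuing from the same sketch, that final check might look like the following; the classification report is my addition, not the original notebook’s, since per-class precision and recall say more than accuracy alone:

```python
from sklearn.metrics import classification_report

# Overall accuracy on the held-out test set
print(logreg.score(X_test, y_test))

# Per-class precision/recall/F1 for the I, A, and S labels
print(classification_report(y_test, predictions))
```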

There are a few things to think about here. Our model only predicted slightly better than the proportionality of A to the rest of our data. This should lead us to be a bit skeptical about our model, but we did balance our logistic regression model, so perhaps it is ok (80% or greater would be preferable). The goal from here forward would be to go back and see whether there are better ways to fine-tune our model and improve our score.

“logistic regression is for classification — and the problem we are dealing with is classification — logistic regression is the most simple linear model for classification” — Dave
