Simplified Diagram for Logistic Regression

Code Candy Logistic Regression : Encoding Categorical Variables in Python.

Anindita Sengupta
Analytics Vidhya


Writing data-mining code with scikit-learn in Python will inevitably lead you to a logistic regression problem whose data contains several categorical variables. Although scikit-learn supports both binary and multinomial logistic regression, it expects the inputs to be numerical in order to compute the result.

Here is a simple method for transforming a categorical variable into a numerical variable that scikit-learn can work with.

Let’s consider the data set below, where one may infer that a person’s experience determines the salary drawn. Both the independent and the dependent variable here are expressed as ranges. Passing them directly to scikit-learn’s Logistic Regression returns an error, because the library cannot convert the strings to floats. We therefore need to declare these columns as categorical variables in Python and transform their values into a numerical code for each category.

For example, imagine a data set containing information about the employee Sex.

Instead of having values such as {'male', 'female', 'female', 'male'} in our dataset, we encode them as the numerical values {1, 0, 0, 1} and use this numerical tuple in our modeling equation.
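As a minimal sketch of that idea, the snippet below encodes a toy Sex column with pandas. The DataFrame `df` and its column names are hypothetical and exist only for illustration. Note that pandas assigns codes in sorted category order, so 'female' becomes 0 and 'male' becomes 1.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the encoding
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# Declare the column as categorical, then read off its integer codes
df["sex"] = df["sex"].astype("category")
df["sex_code"] = df["sex"].cat.codes

codes = df["sex_code"].tolist()
print(codes)  # [1, 0, 0, 1]
```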

Consider these two columns to be part of the data set data_train. The following code declares the experience column to be of type category and stores its integer codes in a new column. The same steps apply to the salary column, and the encoded columns can then be used to fit the data to logistic regression.

# Declare the column as categorical, then store its integer codes
data_train['experience'] = data_train['experience'].astype('category')
data_train['exp_cat'] = data_train['experience'].cat.codes
Test = data_train['exp_cat']

Test.head() returns the following values instead of the categories displayed above.
0 104
1 19
2 106
3 112
4 8
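Since the codes are plain integers, it helps to be able to map them back to the original labels. The sketch below shows one way to recover that mapping. The range strings are made-up values, not the ones from the data set above. It also shows that the codes follow the sorted order of the strings, which need not match the numerical order of the ranges.

```python
import pandas as pd

# Hypothetical experience ranges, used only for illustration
experience = pd.Series(["0-5", "10-15", "5-10", "0-5"], dtype="category")

# Code i corresponds to the i-th entry of .cat.categories
code_to_label = dict(enumerate(experience.cat.categories))
print(code_to_label)

# Strings are sorted lexically, so "10-15" gets a lower code than "5-10"
encoded = experience.cat.codes.tolist()
print(encoded)
```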

A data set like this can now be used to build the 2-D array x in the code below and fit a Logistic Regression model.

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(x, y)
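To show how x and y might be assembled, here is a small end-to-end sketch. The rows of data_train and the names salary and sal_cat are assumptions made for illustration, as the original data is not reproduced here. The point is that x must be two-dimensional (hence the double brackets), while y is a one-dimensional column of class codes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training data with range-valued string columns
data_train = pd.DataFrame({
    "experience": ["0-5", "0-5", "5-10", "5-10", "10-15", "10-15"],
    "salary": ["low", "low", "mid", "mid", "high", "high"],
})

# Encode both columns as integer category codes
for col, code_col in [("experience", "exp_cat"), ("salary", "sal_cat")]:
    data_train[col] = data_train[col].astype("category")
    data_train[code_col] = data_train[col].cat.codes

x = data_train[["exp_cat"]]  # double brackets keep x two-dimensional
y = data_train["sal_cat"]    # one-dimensional target of class codes

lr = LogisticRegression().fit(x, y)
predictions = lr.predict(x)
print(predictions)
```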

This is a small tip for programming! Hope it is useful!
