Preprocessing Data for Logistic Regression

Ayşe Bat
My Data Science Journey
5 min readMar 1, 2019

As far as I have understood, preprocessing the data is an important part of data analysis. In this article, I will show how to prepare data for logistic regression using an absenteeism dataset, which records absenteeism at a company during working hours.

We’ll look at predicting absenteeism from work. More precisely we would like to know whether or not an employee can be expected to be missing for a specific number of hours in a given workday. Having such information in advance can improve our decision making.

We want to know how many working hours an employee could be away from work based on information such as how far they live from their workplace, how many children and pets they have, whether they have higher education, and so on. But first, we are going to prepare our data for logistic regression.

Data Preprocessing

#import the pandas module
import pandas as pd
#load the data
df = pd.read_csv('data/Absenteeism-data.csv')
df.head()
Absenteeism from work dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 12 columns):
ID 700 non-null int64
Reason for Absence 700 non-null int64
Date 700 non-null object
Transportation Expense 700 non-null int64
Distance to Work 700 non-null int64
Age 700 non-null int64
Daily Work Load Average 700 non-null float64
Body Mass Index 700 non-null int64
Education 700 non-null int64
Children 700 non-null int64
Pets 700 non-null int64
Absenteeism Time in Hours 700 non-null int64
dtypes: float64(1), int64(10), object(1)
memory usage: 65.7+ KB
  • The ID column is an individual identifier for each person and carries no numeric information. It has no value for predicting absenteeism hours, so we can drop it from the data frame.
df = df.drop(['ID'], axis = 1)
  • The Reason for Absence:
#Let's extract a list containing the distinct values from Reason for Absence
df['Reason for Absence'].unique()
array([26, 0, 23, 7, 22, 19, 1, 11, 14, 21, 10, 13, 28, 18, 25, 24, 6, 27, 17, 8, 12, 5, 9, 15, 4, 3, 2, 16])

How can we extract some meaning from these numeric values? As with the ID column, the values here do not actually have numeric meaning; they represent categories that are equally meaningful. So, we can turn these values into dummy variables: explanatory binary variables that equal 1 if a certain categorical effect is present and 0 if it is not.
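As a minimal illustration of what dummy encoding does (toy category codes, not the real column):

```python
import pandas as pd

# Hypothetical toy series of category codes, for illustration only
codes = pd.Series([26, 0, 23, 26])
dummies = pd.get_dummies(codes)

# One column per distinct code; each row has a 1 in exactly one column
print(dummies.columns.tolist())  # [0, 23, 26]
print(int(dummies.loc[0, 26]))   # 1 -- row 0 belongs to category 26
```

Applying the same idea to the real column gives us one binary column per absence reason.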

reason_columns = pd.get_dummies(df['Reason for Absence'])
reason_columns.head()

If we added all these dummy variables to the df data frame, we would end up with a dataset containing nearly 40 columns. Instead, we can reorganize this type of variable into groups:

  • Group 1: Related to various diseases.
  • Group 2: Related to pregnancy and giving birth.
  • Group 3: Related to poisoning or signs not elsewhere categorized.
  • Group 4: Represents light reasons for absence, such as a dental appointment, physiotherapy, a medical consultation, and others.
Grouping the reason for an absence
#max(axis=1) collapses each group of dummy columns into a single indicator column
reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type_3 = reason_columns.loc[:, 18:21].max(axis=1)
reason_type_4 = reason_columns.loc[:, 22:].max(axis=1) # from 22 to the end of the data frame
#Let's concatenate all the reason groups
df_reason = pd.concat([reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis=1)
#create a variable with the column names
reason_column_names = ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4']
#rename the columns
df_reason.columns = reason_column_names
df_reason.head()
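To see what max(axis=1) is doing here, consider a tiny dummy matrix (hypothetical values, with columns labelled by reason code as above):

```python
import pandas as pd

# Toy dummy matrix: three reason-code columns, two rows (assumed values)
dummies = pd.DataFrame({1: [1, 0], 2: [0, 0], 3: [0, 1]})

# The group indicator is 1 if any code in the slice is active
# for that row, and 0 otherwise
group = dummies.loc[:, 1:3].max(axis=1)
print(group.tolist())  # [1, 1]
```

Each row keeps a single 0/1 flag per group instead of one column per reason code.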

We still have the Reason for Absence column in df, and we need to add all those dummy variables to it. But keeping both would give us duplicate information, known as multicollinearity, which is something we should avoid in general. Therefore, in our case, we can drop the Reason for Absence column from the df data frame and then concatenate the data frames.

df  = df.drop(['Reason for Absence'], axis = 1)
df = pd.concat([df_reason, df], axis=1)
  • Date values have been stored as text. What we will do now is introduce a data type called timestamp: we will convert the strings in the Date column into timestamps with the help of the format parameter.
df['Date'] =pd.to_datetime(df['Date'], format='%d/%m/%Y')
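The format string matters for ambiguous dates: this dataset stores the day first, while pandas defaults to month-first parsing. A quick sketch with a made-up date:

```python
import pandas as pd

# Without a format, pandas reads '07/08/2015' month-first -> July 8
ambiguous = pd.to_datetime('07/08/2015')

# With '%d/%m/%Y' it is read day-first -> August 7
day_first = pd.to_datetime('07/08/2015', format='%d/%m/%Y')

print(ambiguous.month, day_first.month)  # 7 8
```

Passing an explicit format also makes the conversion fail loudly if a string does not match, rather than silently guessing.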

We will create the Month Value and Day of the Week column from the Date column.

list_months = []
for i in range(df.shape[0]): #df.shape[0] is the number of rows in the df dataframe
    list_months.append(df['Date'][i].month)
df['Month Value'] = list_months
#then we create a function and apply it to all values from the column of interest
def date_to_week(data_value):
    return data_value.weekday()
df['Day of the Week'] = df['Date'].apply(date_to_week)
df.head()
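The loop and the apply call work, but pandas also offers a vectorized alternative through the .dt accessor; here is a sketch on a hypothetical two-row frame standing in for df:

```python
import pandas as pd

# Hypothetical stand-in for df after the to_datetime conversion
demo = pd.DataFrame({'Date': pd.to_datetime(['26/08/2015', '07/07/2015'],
                                            format='%d/%m/%Y')})

demo['Month Value'] = demo['Date'].dt.month          # 1-12
demo['Day of the Week'] = demo['Date'].dt.dayofweek  # Monday=0 ... Sunday=6

print(demo['Month Value'].tolist())      # [8, 7]
print(demo['Day of the Week'].tolist())  # [2, 1]
```

The .dt version avoids the explicit Python loop and is typically faster on larger frames.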
  • Education column contains only the values 1, 2, 3, and 4:
    1: high school
    2: graduate
    3: postgraduate
    4: a master's or a doctorate

Let's use pandas' value_counts method to see how many times each value in the Education column occurs.

df['Education'].value_counts()
1 583
3 73
2 40
4 4
Name: Education, dtype: int64

We can see that nearly 600 people have only a high school education, while just over one hundred have a higher degree than that. Therefore, separating between graduate, postgraduate, and doctorate degrees becomes less relevant for this study, and it makes sense to combine them into a single category.

1 -> 0 high school
2 -> 1 graduate
3 -> 1 postgraduate
4 -> 1 master's or doctorate

df['Education'] = df['Education'].map({1:0, 2:1, 3:1, 4:1})
df['Education'].unique()
array([0, 1])
df['Education'].value_counts()
0 583
1 117
Name: Education, dtype: int64

The Transportation Expense, Distance to Work, Age, Daily Work Load Average, and Body Mass Index columns won't be manipulated in any way; they will be left unchanged.

As the last step, we should save our preprocessed data frame to a .csv file for the machine learning part of the study.

df_preprocessed = df.copy()
df_preprocessed.to_csv('data/absenteeism_preprocessed.csv', index=False)

Final Thought

You may find this study in more detail in the Udemy course, and you can also check my GitHub for the preprocessing part.

I have written this article to improve my data analytics skills, so I am still a learner. Please let me know of any additional information or leave a comment on this article.

Follow me on Twitter, LinkedIn, or Medium.
