Data Preprocessing in Python: All important steps explained

I have been working with data for the last three years, and over that period a lot has changed in the machine learning industry. What has not changed is data preprocessing and cleaning. Surprisingly, most practitioners are reluctant to work on, or even learn, that part.

Since I do it almost every day, I have decided to write down the common steps and practices that I follow when preprocessing data.

Loading Data

You can use any CSV of your choice. In our example, we will be using data on employees and their absence hours. We will have multiple features that can later be used to predict an employee's absenteeism. To begin, load the data and look at it carefully.

import pandas as pd

raw_csv_data = pd.read_csv("absenteeism_data.csv")
df = raw_csv_data.copy()
df

The copy() call creates a copy of the table so that the raw version is still available later if needed. You can skip that step if you want.

Getting full output data in Jupyter notebook

When I was new to Jupyter, I used to fret a lot about not getting the complete output in the notebook. Let's address that first.

pd.options.display.max_columns = None
pd.options.display.max_rows = None

These two lines remove the display limits so the notebook shows the complete data, which you can then inspect by scrolling.
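If you'd rather not change these settings globally, pandas also provides option_context, which applies them to a single block and restores the defaults afterwards. A minimal sketch:

# Show the full frame for one statement only; the defaults return after the block.
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)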

Understanding Data

When you are starting a completely new problem with data that is alien to you, this is what you should do (I mean, that's what I do).

df.info()

The info() function tells you whether you have any null values and shows the data type of each feature. If you are an ML practitioner, you know the importance of these details.

df.describe()

The describe() function computes summary statistics (count, mean, standard deviation, quartiles) for your features. This is not a compulsory step but can prove handy for understanding the data and the columns. Remember: the better your understanding of the data, the better your model performs.

Removing Redundancy

We will try to find the columns that have no impact on our dependent variable. The dependent variable is the one that tells us about the output or result. In our data, the last column is the dependent variable: it tells us whether, and for how long, the employee was absent.

I have figured that the ID column cannot in any way impact an employee's decision to take a leave, or their absence. So let's get rid of it.

df = df.drop(['ID'], axis=1)
df

Okay, nice, we dropped it. What other column do you think we should drop? I'm not dropping any others for now, but you can write your pick in the comments.
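By the way, if you want more than intuition, a quick correlation check against the target can hint at which numeric columns carry little signal. This is only a heuristic (it misses non-linear relationships), and the numeric_only flag assumes a reasonably recent pandas:

# Heuristic only: near-zero correlation with the target suggests a column
# deserves a closer look before dropping, not that it is truly useless.
df.corr(numeric_only=True)['Absenteeism Time in Hours'].sort_values()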

One-Hot Encoding

Let's be honest: this is a tough topic for everyone; even I struggle with it. If you can't comprehend it on the first go, remember that everyone else faces the same. So enjoy.

Look at the columns again.

There is a column, "Reason for Absence", with numerical values over a fairly large range, and every numerical value has something associated with it. Take a sample table as an example.

The table is not complete; I created a dummy table just to illustrate. Let's understand the values of this column.

sorted(df['Reason for Absence'].unique())

So basically we have values ranging from 0 to 28, and each of these values has a reason corresponding to it.

We cannot use the column in this form. A computer will interpret the values numerically and may conclude that 28 > 1, which is incorrect here: we are simply using different numbers to label different reasons for absence, so as labels, 28 and 1 are on equal footing. What matters is the frequency of each label (except 0): how many times the number 28 is repeated in our data is what decides its significance.

So we will use one-hot encoding. We will start by creating dummy variables from the column.

reason_columns = pd.get_dummies(df['Reason for Absence'], drop_first=True)
(The screenshot above is not complete; the full table has 700 rows.)

We created a matrix of shape (700, 28). There are 28 columns because drop_first=True dropped the indicator for 0, leaving one column for each of the remaining 28 classes. This way each class gets its own indicator instead of a raw numerical value. Take the 7 in the 3rd row as an example: the 1 under column 7 implies that the third row's original value was 7.
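If get_dummies still feels abstract, here is a toy illustration on a hypothetical mini-series, not our real data. Depending on your pandas version, the dummies print as 0/1 or True/False:

import pandas as pd

# A made-up series with three distinct values to show what one-hot encoding produces.
s = pd.Series([1, 7, 28, 7])
print(pd.get_dummies(s))
# Each distinct value becomes its own indicator column:
#    1   7   28
# 0  1   0   0
# 1  0   1   0
# 2  0   0   1
# 3  0   1   0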

However, 28 columns is a bit too much, isn't it?

Let's assume that some of these classes represent similar reasons, so we can club them together. Look at this chart to understand it better.

We can club all of these reasons under one class, and do the same for the other values. For example, out of the 28 values we can create 4 classes. It is important to point out that we lose information here: once we group the data under 4 classes, we will not be able to recover the exact cause later on. So this is only advisable when the reasons within a group really are similar, so that the loss of detail has little overall impact. Let me club the first 14 columns and create one column (class) for them.

reason_type_1 = reason_columns.loc[:, 1:14]
reason_type_1

We will collapse these 14 columns into one using the max() function. Once we apply it, we will know that some value in 1–14 was present, but the information about which one will be lost.

reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_1

Here the third row has the value 1, which means one of the values between 1 and 14 was present. But which one? As already explained, we have no idea now.

Note: if the column values are all very distinct, then we should not group the data.

Now let's group all the data into 4 classes.

reason_type_1 = reason_columns.loc[:, 1:14].max(axis=1)
reason_type_2 = reason_columns.loc[:, 15:17].max(axis=1)
reason_type_3 = reason_columns.loc[:, 18:22].max(axis=1)
reason_type_4 = reason_columns.loc[:, 23:28].max(axis=1)
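A quick optional sanity check, assuming the four groups were built exactly as above: since every row had a single reason, no row should land in two groups.

# The ranges 1-14, 15-17, 18-22, 23-28 do not overlap, so every row should
# activate at most one group; a max above 1 would signal a grouping mistake.
groups = (reason_type_1.astype(int) + reason_type_2.astype(int)
          + reason_type_3.astype(int) + reason_type_4.astype(int))
print(groups.max())  # expect 1 (0 is possible for rows whose reason was 0)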

Concatenating columns with the DataFrame

Let's first drop the column "Reason for Absence" from the table and add the four new columns we have created.

If we don't drop this column, we will suffer from a problem called multicollinearity: the dummy columns together carry the same information as the original column, so keeping both makes some columns perfectly predictable from the others. It is an important concept in statistics and is often asked about in interviews.
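Here is a minimal sketch of why this matters; it is not a formal treatment, just a way to see the linear dependence, often called the "dummy variable trap":

# With a full set of dummies (no drop_first), the indicators in every row
# sum to exactly 1: a hidden constant column, so each dummy is a linear
# combination of the others plus an intercept. That is multicollinearity.
print(pd.get_dummies(df['Reason for Absence']).sum(axis=1).unique())  # -> [1]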

df = df.drop(['Reason for Absence'], axis=1)

Let's concatenate the four new columns in place of the single "Reason for Absence" column. Remember, there were 28 dummy columns that could have replaced it, but we grouped them into 4.

df = pd.concat([df, reason_type_1, reason_type_2, reason_type_3, reason_type_4], axis=1)

Next, we need to rename these columns to make them more readable.

column_values = ['Date', 'Transportation Expense',
                 'Distance to Work', 'Age', 'Daily Work Load Average',
                 'Body Mass Index', 'Education', 'Children', 'Pets',
                 'Absenteeism Time in Hours', 'reason_1', 'reason_2', 'reason_3', 'reason_4']
df.columns = column_values
df

Let's reorder the table and bring our new reason columns to the front; that's where they belong.

column_names_reordered = ['reason_1', 'reason_2', 'reason_3', 'reason_4', 'Date',
                          'Transportation Expense', 'Distance to Work', 'Age',
                          'Daily Work Load Average', 'Body Mass Index', 'Education',
                          'Children', 'Pets', 'Absenteeism Time in Hours']
df = df[column_names_reordered]

This looks good. We are done with this column. Let's move on.

Dealing with date-time values

Are you even a data guy if date-time columns don't give you nightmares? How many times have you been asked to manipulate or work with date columns in interviews? Yes, exactly.

Let's work on making sense of the Date column. When we import the data, the Date column is loaded as strings. To apply date functions, we need to convert it to the datetime format.

df_reason_mod = df.copy()
df_reason_mod['Date'] = pd.to_datetime(df_reason_mod['Date'], format='%d/%m/%Y')

Yes, we did it. Now let's extract the month and the day of the week from this column and create two more columns.

list_months = []
for i in range(len(df_reason_mod)):
    list_months.append(df_reason_mod['Date'][i].month)
df_reason_mod['Month'] = list_months  # attach the extracted months as a new column
df_reason_mod

We have extracted the month from the Date column. Let's extract the day of the week now.

def date_to_Week(date_value):
    return date_value.weekday()

df_reason_mod["day_of the week"] = df_reason_mod["Date"].apply(date_to_Week)
df_reason_mod
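For reference, pandas can do both extractions without an explicit loop or apply. This sketch uses the built-in .dt accessor and assumes the Date column has already been converted with to_datetime:

# Vectorized equivalents of the loop and the apply above.
df_reason_mod['Month'] = df_reason_mod['Date'].dt.month
df_reason_mod['day_of the week'] = df_reason_mod['Date'].dt.weekday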

We can now drop the Date column.

df_reason_mod = df_reason_mod.drop(['Date'], axis=1)

So we did it, ML practitioners and data folks. Congrats.

Saving the data to CSV

Now that you have cleaned the data, you can send it to your fellow colleague, the ML engineer, and let them build the models.

df_preprocessed = df_reason_mod.copy()
df_preprocessed.to_csv('Absenteeism_preprocessed.csv', index=False)

Conclusion

These are the most common operations I perform while preprocessing data. One important step we did not cover is dealing with NULL values, because, luckily, our data had none.
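If your data does have missing values, here is a minimal sketch of how I would start. The column names come from our data, but the choices are purely illustrative; the right remedy always depends on the column:

# Count missing values per column to see what, if anything, needs attention.
df_preprocessed.isnull().sum()

# Two common remedies; pick per column, there is no universal answer:
df_preprocessed['Transportation Expense'] = df_preprocessed['Transportation Expense'].fillna(
    df_preprocessed['Transportation Expense'].median())  # impute a numeric column with its median
df_preprocessed = df_preprocessed.dropna(subset=['Age'])  # or drop rows missing a key field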
