Data Preprocessing: A Practical Guide

Understand it thoroughly by preprocessing the Titanic dataset

Bala Kowsalya
7 min read · Mar 5, 2019

Nowadays, collecting data has become easy: you can use a wide range of sensors to capture data from machines, send out a survey form to gather user opinions, or skip even that and download a huge dataset from sites like Kaggle for your experimentation. But wait, the data collected in those ways is rarely ready to put into analysis right away. You need to preprocess the data and make it fit for analysis.

Sharpen your data by Preprocessing ~ Dribbble

Data preprocessing is the first step in working with data, and it is where data scientists spend most of their time!

What is data preprocessing?

Data preprocessing is a technique used to convert raw data into a clean dataset.

We collect data from a wide range of sources, and most of the time it arrives in a raw format that is not suitable for analysis.

Quick fact: ‘Research says that data scientists spend around 80% of their time just preparing the data for processing, and 76% of data scientists view data preparation as the least enjoyable part of their work’ ~ Forbes

Why do we preprocess data?

We now know that data preparation is a tedious task. So why do we need to do this least enjoyable job?
Most of the time, we don’t get quality data. It often contains missing, noisy and inconsistent values, which can reduce the accuracy of the end result. Therefore, we need to prepare the data before processing it further and escape that bottleneck.
Do you all remember the story about sharpening the axe? 😁

No quality data! No quality results!

You can relate data preprocessing to that story. Here, data is our axe: the tool with which we are going to make our machine learn things and do the analysis that unravels so much information. You need to clean and sharpen that data before using it as a tool if you want a good outcome.

Getting Started: Preprocessing — Titanic Dataset

‘Learning should be relevant and practical, not just passive and theoretical.’
~ John Dewey

We should learn things by doing them. So, in that spirit, we are going to preprocess the Titanic dataset downloaded from Kaggle and make it ready for our analysis.

RMS Titanic ~ Unsplash

Keep this in mind before proceeding further: we should always get a thorough understanding of our dataset and of what we are going to do with it.

Dataset: Basic Info

What is it about?

Our dataset has information about the passengers of the RMS Titanic, which sank in the early morning of 15 April 1912 in the North Atlantic Ocean, four days into her maiden voyage from Southampton to New York City. It was a tragic end to the voyage.

What information does it have?

The dataset has 12 columns of passenger details:

  • PassengerId: Passenger’s unique ID
  • Survived: Survival status of the passengers (0 = No; 1 = Yes)
  • Pclass: Passenger class (1 = First; 2 = Second; 3 = Third)
  • Name: Passenger’s name
  • Sex: Sex of the Passenger
  • Age: Age of the Passenger
  • SibSp: Number of siblings/spouses aboard
  • Parch: Number of parents/children aboard
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin
  • Embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

What do we do now?

We now have an overview of the columns and values our dataset contains. Let’s head into processing the data, step by step.

Gearing Up!

It’s time for coding! Download the dataset from this link. The dataset was published as part of a Kaggle competition. It has three .csv files: train.csv, test.csv and gender_submission.csv. We are going to work on the train.csv data in this tutorial.

Open a new Jupyter Notebook (or any other IDE of your choice) to run our Python scripts.

Import the dataset

Firstly, import the packages needed to proceed further.
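A minimal sketch of the imports, assuming pandas and NumPy are installed (LabelEncoder is imported later, in the encoding step):

# pandas does the heavy lifting in this tutorial; numpy is handy for numeric work
import pandas as pd
import numpy as np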

Read the dataset using Pandas read_csv() and store it in a variable named training_set. Then display the first few rows with head(); by default, head() returns the first 5 rows of the dataset, but you can ask for any number, like head(10).
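A sketch of the loading step, assuming train.csv sits in the working directory:

# Read the training data into a DataFrame
training_set = pd.read_csv('train.csv')

# Preview the data; head() returns the first 5 rows by default, head(10) the first 10
training_set.head()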

Dataset — RMS Titanic Survival

Check the dataset info

Let's check the basic information about the dataset by running a few simple commands.

  • training_set.shape
It returns the number of rows and columns in the dataset.
(891, 12)
  • training_set.columns
    It returns column headings.
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
  • training_set.isnull().sum()
It returns the number of null values in each column.
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Preparing the dataset

From an overall understanding of the dataset, we can gather several insights to guide our journey.

Insights:

  • ‘Survived’ is the target variable, which we will predict once our preprocessing of the data is done. So, we retain that column.
  • Only the ‘Age’, ‘Cabin’ and ‘Embarked’ columns have missing values.
  • ‘PassengerId’, ‘Name’ and ‘Ticket’ don’t add much value in predicting the target variable.
  • ‘Parch’ (parents/children) and ‘SibSp’ (siblings/spouses) both describe family, so we can derive a new family-size column from them.
  • ‘Sex’, ‘Cabin’ and ‘Embarked’ are categorical data that need to be encoded into numerical values.

These are the insights I could gather from my point of view! Now let’s process the data in accordance with them.

Dropping of columns

In this step, we are going to drop the columns with the least priority: ‘PassengerId’ and ‘Ticket’. Use drop() to remove them, as sketched below.
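A minimal sketch of the drop; axis=1 tells pandas to drop columns rather than rows:

# Remove the low-priority columns and keep the result
training_set = training_set.drop(['PassengerId', 'Ticket'], axis=1)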

Now, let’s run training_set.info() and look at the state of our dataset.

> training_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB

We can see that only the ‘Cabin’, ‘Embarked’ and ‘Age’ columns have missing values. Let’s work on those now.

Creating new classes

  • ‘Cabin’: Though the Cabin column has 687 missing values, if you look carefully, each value begins with a character that denotes the deck. We are therefore going to create a column named ‘Deck’ to capture this information, which may be useful later in our prediction.
  • ‘Parch’ and ‘SibSp’ both describe family size, so let’s derive a new column named ‘FamilySize’ from them.
  • ‘Name’: Instead of dropping it right away, we extract only the passenger’s title from it.

Now, let's drop the Cabin and Name columns; we have extracted the needed information from both.
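Here is a minimal sketch of all three derivations and the drop. The ‘U’ placeholder for unknown decks and counting the passenger in FamilySize are assumptions of mine, not prescriptions:

# Deck: the first character of the Cabin value; 'U' marks an unknown deck (assumed placeholder)
training_set['Deck'] = training_set['Cabin'].str[0].fillna('U')

# FamilySize: siblings/spouses + parents/children + the passenger themselves
training_set['FamilySize'] = training_set['SibSp'] + training_set['Parch'] + 1

# Title: the honorific between the comma and the period in each name, e.g. 'Mr', 'Mrs'
training_set['Title'] = training_set['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Cabin and Name have served their purpose; drop them
training_set = training_set.drop(['Cabin', 'Name'], axis=1)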

This is how our dataset looks now.

Handling missing values

  • ‘Embarked’: Only two rows are missing values in the Embarked column.
    Embarked takes categorical values (C = Cherbourg; Q = Queenstown; S = Southampton), so we can simply impute the missing values with the most commonly occurring value, which is ‘S’ in this case.
  • ‘Age’: We are going to impute the missing values in the ‘Age’ column with the mean value of each group, as sketched below. Taking the mean of the whole column could make the data inconsistent, because the ages span several ranges.
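A minimal sketch of both imputations. Grouping the ages by ‘Pclass’ and ‘Sex’ is an assumption on my part; any sensible grouping works the same way:

# Embarked: fill the two missing values with the most frequent port, 'S'
training_set['Embarked'] = training_set['Embarked'].fillna('S')

# Age: fill each missing age with the mean age of its (Pclass, Sex) group
# (the grouping columns are an assumption; the text only says "mean value in each group")
training_set['Age'] = training_set.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda s: s.fillna(s.mean()))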

Encoding categorical features

Many machine learning algorithms cannot handle categorical values unless they are converted to numerical values. Fortunately, the Python libraries pandas and scikit-learn provide several approaches to handle this situation.
They are,
— Find and replace
— Label encoding
— One-hot encoding
— Custom binary encoding
— Using LabelEncoder from scikit-learn

Every method has its own advantages and disadvantages.

Initially, we are just going to map the categorical values to numerical ones using map().
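A sketch of the manual mapping; the integer codes themselves are an arbitrary choice:

# Map the two-category 'Sex' column by hand (the codes 0/1 are arbitrary)
training_set['Sex'] = training_set['Sex'].map({'male': 0, 'female': 1})

# The three-category 'Embarked' column can be mapped the same way
training_set['Embarked'] = training_set['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})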

Manually replacing categorical values is not the right choice when there are many categories.
Let’s do one conversion using LabelEncoder() from the sklearn.preprocessing module.
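A minimal sketch of that conversion. Applying it to the many-category ‘Title’ and ‘Deck’ columns is my choice here, consistent with the final column list shown below:

from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns an integer code to every distinct category it sees
le = LabelEncoder()
for col in ['Title', 'Deck']:
    training_set[col] = le.fit_transform(training_set[col].astype(str))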

This transforms the categorical data into numerical values.

Dataset ready…

Now our data is free from missing values, categorical values and unwanted columns, and it is ready for further processing.

training_set.info()

Survived      891 non-null int64
Pclass        891 non-null int64
Sex           891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked      891 non-null int64
Title         891 non-null int64
FamilySize    891 non-null int64
Deck          891 non-null int64
dtypes: float64(2), int64(9)
memory usage: 76.6 KB

Applaud yourself on completing this! 😍
I hope this article gives you an understanding of how to preprocess your data in practice.
Transform data into insights! 💡✔

Would you like to appreciate my work? Buy Me A Cup Of Coffee! 😊

#100daysofMLcoding
End of Day #9. Happy Learning!

