Understanding Data Preprocessing Using the Titanic Dataset

Sanya Raghuwanshi
All about Machine Learning!
7 min read · Sep 6, 2020

What is Data Pre-Processing?

We know from my last blog that data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven way of resolving such issues and prepares the raw data for further processing.

So in this blog we will walk through the implementation of data preprocessing on a dataset. I have decided to do my implementation using the Titanic dataset, which I downloaded from Kaggle. Here is the link to get the dataset: https://www.kaggle.com/c/titanic-gettingStarted/data

Note: Kaggle provides two datasets, train and test, and we will use both of them in this process.

What is the expected outcome?

The Titanic shipwreck was a massive disaster, so we will implement data preprocessing on this dataset to learn about the number of survivors and their details.

I will show you how to apply data preprocessing techniques to the Titanic dataset, with a tinge of my own ideas mixed in.

So let’s get started…

Importing all the important libraries

First, after downloading the datasets to our system, we import the libraries needed to perform the preprocessing steps. In my case I imported the NumPy, Pandas and Matplotlib libraries.

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing dataset using Pandas

To work on the data, you can load the CSV either in Excel or in pandas; I will load the CSV data in pandas. We will then use a few functions to view that data in the Jupyter notebook.

#importing dataset using pandas
df = pd.read_csv(r'C:\Users\KIIT\Desktop\Internity Internship\Day 4 task\train.csv')
df.shape   #(rows, columns)
df.head()  #first five rows

#Taking a look at the data format below
df.info()

Let’s take a look at the output that we get from the above code snippets:

If you look carefully at the pandas summary above, there are 891 rows in total, but Age has only 714 non-null values (so some are missing), Embarked is missing 2 values, and Cabin is missing a lot as well. The object data types are non-numeric, so we have to find a way to encode them as numerical values.
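As a quick check (a small sketch of my own, assuming df is the DataFrame loaded above), we can count the missing values per column directly:

#counting missing (NaN) values in each column of the training data
print(df.isnull().sum())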

Viewing the columns in the particular dataset

We print all the columns used in this dataset to get a better idea of the kind of data we are working with.

#Taking a look at all the columns in the data set
print(df.columns)

Defining values for independent and dependent data

Here we will declare the values of X and y for our independent and dependent data.

#independent data (features): everything except PassengerId and the target
X = df.iloc[:, 2:].values
#dependent data (target): 'Survived' is the second column of train.csv
y = df.iloc[:, 1].values

Dropping Columns which are not useful

Let’s try to drop some of the columns which may not contribute much to our machine learning model, such as Name, Ticket and Cabin.

So we will drop these three columns and then take a look at the newly generated data.

#Dropping columns which are not useful, so we drop 3 of them here according to our convenience
cols = ['Name', 'Ticket', 'Cabin']
df = df.drop(cols, axis=1)

#Taking a look at the newly formed data format below
df.info()

Dropping rows having missing values

Next, if we want, we can drop all rows in the data that have missing values (NaN). You can do it as the code below shows:

#Dropping the rows that have missing values
df = df.dropna()
df.info()

Problem with dropping rows having missing values

After dropping rows with missing values, we find that the dataset is reduced from 891 rows to 712, which means we are throwing data away. Machine learning models need as much training data as possible to perform well, so we want to preserve the data and make use of it as much as we can; we will see how later. (An imputation alternative is sketched below.)
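As a sketch of one such alternative (my own addition, not the approach the rest of this post follows), we could fill the missing values instead of dropping the rows, for example with the median for Age and the most frequent value for Embarked:

#Alternative sketch: impute instead of dropping (assumes df still contains the NaNs)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])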

Creating Dummy Variables

Now we convert Pclass, Sex and Embarked into dummy (one-hot) columns in pandas and drop the original columns after conversion.

#Creating dummy variables
dummies = []
cols = ['Pclass', 'Sex', 'Embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))
titanic_dummies = pd.concat(dummies, axis=1)

Looking at the result, we see 8 new dummy columns: 1, 2 and 3 for the passenger class, female and male for Sex, and C, Q and S for the port of embarkation.

And finally we concatenate these to the original data frame column-wise.

#Combining with the original dataset
df = pd.concat((df,titanic_dummies), axis=1)

Now that we have converted the Pclass, Sex and Embarked values into dummy columns, we drop the redundant original columns from the data frame and take a look at the new dataset.

df = df.drop(['Pclass', 'Sex', 'Embarked'], axis=1)

df.info()

Taking Care of Missing Data

Everything looks good except Age, which has lots of missing values. Let’s fill those missing ages, either with the median or with interpolation. Pandas has an interpolate() function that will replace all the missing NaNs with interpolated values.

#Taking care of the missing data by interpolate function
df['Age'] = df['Age'].interpolate()

df.info()

Now let’s observe the data columns. Notice that Age is now complete, with the missing entries filled by interpolated values. (A quick check is sketched below.)
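A quick sanity check (my own addition) confirms that no missing ages remain after interpolation:

#Sanity check: should print 0 missing values for Age after interpolate()
print(df['Age'].isnull().sum())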

Converting the data frame to NumPy

Now that we have converted all the data to numeric form, it’s time to prepare it for machine learning models. This is where scikit-learn and NumPy come into play:

X = the input set of features (the data frame minus the ‘Survived’ column)
y = the output, in this case ‘Survived’

Now we convert our data frame from pandas to NumPy arrays and assign the input and output.

#Converting the data frame to NumPy arrays and assigning input and output
X = df.values
y = df['Survived'].values

#Removing the 'Survived' column (index 1) from the feature matrix
X = np.delete(X, 1, axis=1)

Dividing data set into training set and test set

Now that we are ready with X and y, let’s split the dataset into a 70% training set and a 30% test set using scikit-learn’s model_selection, as in the code below; the print statements after it show the resulting shapes.

#Dividing data set into training set and test set (Most important step)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
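To confirm the split, we can print the shapes of the four resulting arrays (a small sketch; the exact row counts depend on how many rows were kept earlier):

#Checking the shapes of the resulting training and test arrays
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)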

Feature Scaling

Feature scaling is an important step of data preprocessing. It rescales the features so that they all lie on a similar scale, usually roughly -3 to +3 after standardization.

In our dataset some fields have small values and some have large values. If we apply our machine learning model without feature scaling, the model’s predictions suffer because features with small values are dominated by features with large values. So before applying the model we have to perform feature scaling.

We can perform feature scaling in two ways.

I. Standardization: x' = (x - mean(X)) / std(X)

II. Normalization: x' = (x - min(X)) / (max(X) - min(X))

#Using the concept of feature scaling (standardization)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
#Scaling only the columns from index 3 onwards, leaving the earlier columns untouched
X_train[:,3:] = sc.fit_transform(X_train[:,3:])
X_test[:,3:] = sc.transform(X_test[:,3:])
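If you preferred the Normalization option (II) instead, scikit-learn’s MinMaxScaler works the same way. This is just an alternative sketch, not what the walkthrough above uses; you would apply one scaler or the other, not both:

#Alternative sketch: normalization with MinMaxScaler instead of StandardScaler
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train[:,3:] = mms.fit_transform(X_train[:,3:])
X_test[:,3:] = mms.transform(X_test[:,3:])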

That’s all for today guys!

This is the final outcome of the whole process. For more such blogs, stay tuned!
