Basic Steps for Analyzing the Titanic Dataset

Sogo Ogundowole
Published in CodebagNG
7 min read · Oct 15, 2017

Data science makes it easier to find out what happened, why, and how. It replaces unnecessary worry with evidence about the factors actually responsible for a situation, and about how to control them (permit or prevent). These few steps helped me analyze the Titanic dataset and point out the factors associated with the death or survival of individuals on the ship.

Tools I used:
Titanic dataset: train.csv
Python 3
Jupyter notebook
Pandas
Seaborn
Matplotlib

Process Involved:
Set up your Jupyter notebook, locate the directory of your dataset, and start a kernel by selecting Python 3. Then import pandas, seaborn and matplotlib.

Note: I used Pclass and Class, Survived and Survival, df and data, and Embarked and Port interchangeably. Please stick to whichever one you're comfortable with.

In [1]:

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:

data=pd.read_csv('../input/train.csv')

In [3]:

data.head()

In [4]:

data.isnull().sum() #checking for total null values

Out[4]:

PassengerId      0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64

Warming Up

The %matplotlib inline line makes your plot outputs appear and be stored within the notebook. Access your file and assign it to a variable. Since the file is a .csv file, it has to be read with read_csv('filename.csv'). I named my dataframe 'df' and the dataset I'm working on is 'train.csv'. Do not bother about the second dataframe, as we'll only be looking into the analysis of the dataset. Next, we check out what our dataframe looks like:

df.sample(5)  # selects 5 random people in df
df.head()     # first 5 people in df (you can specify the number you want between the brackets)
df.tail()     # last 5 people in df
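To try these calls without the Kaggle file at hand, here is a minimal sketch that reads a tiny CSV from a string. The three rows are a hypothetical excerpt in the same column layout as train.csv, not the real data.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same column layout as train.csv.
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))         # first two rows
print(df.tail(1))         # last row
print(df.isnull().sum())  # missing values per column; Age has one here
```

With the real file, only the argument to read_csv changes.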

Next, change the names of the columns to ease your understanding and work with the columns better:
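One way to do the renaming is with DataFrame.rename. This is a sketch on a hypothetical two-row frame; the new names (Survival, Class, Port) follow the note at the top of this article, and any other names would work just as well.

```python
import pandas as pd

# Hypothetical two-row frame carrying the original Kaggle column names.
df = pd.DataFrame({'Survived': [0, 1], 'Pclass': [3, 1], 'Embarked': ['S', 'C']})

# Map the original names to the friendlier ones used in this article.
df = df.rename(columns={'Survived': 'Survival', 'Pclass': 'Class', 'Embarked': 'Port'})

print(list(df.columns))
```

Columns that are not listed in the mapping keep their original names.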

Then we get the figures and percentages of those who died and those who survived the wreck:

survived = df[df['Survival'] == 1]
died = df[df['Survival'] == 0]
print('Analysis of Survival')
print('{0} survived the wreck'.format(survived.shape[0]))
print('{0} did not survive the wreck'.format(died.shape[0]))
survived_percent = round(len(survived) / len(df) * 100, 2)
died_percent = round(len(died) / len(df) * 100, 2)
print('Survival Percentage = {0} %'.format(survived_percent))
print('Death Percentage = {0} %'.format(died_percent))

Survival vs Death
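The same figures can also be obtained more compactly with value_counts. The sketch below does not load the file; it rebuilds the Survival column from counts that I am assuming for the standard Kaggle train.csv (342 survivors, 549 deaths), so treat the numbers as an illustration.

```python
import pandas as pd

# Assumed counts for the standard Kaggle train.csv: 342 survivors, 549 deaths.
survival = pd.Series([1] * 342 + [0] * 549, name='Survival')

counts = survival.value_counts()  # absolute numbers per outcome
percentages = (survival.value_counts(normalize=True) * 100).round(2)

print(counts.to_dict())
print(percentages.to_dict())
```

On the real dataframe this is simply df['Survival'].value_counts(normalize=True).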

Analysing The Features
Out of 891 passengers in the training set, only 342 survived, i.e. only 38.4% of the training set survived the wreck. We need to dig deeper to get better insights from the data and see which categories of passengers survived and which didn't.

We will check the survival rate against the different features of the dataset, such as Sex, Port of Embarkation, Age, etc.

First let us understand the different types of features.

Types Of Features

Categorical Features:
A categorical variable is one that has two or more categories and each value in that feature can be categorized by them. For example, gender is a categorical variable having two categories (male and female). Now we cannot sort or give any ordering to such variables. They are also known as Nominal Variables.

Categorical Features in the dataset: Sex, Port.

Ordinal Features:

An ordinal variable is similar to a categorical variable, but the difference between them is that its values have a relative ordering. For example, if we have a feature like Height with values Tall, Medium, Short, then Height is an ordinal variable. Here we can sort the values relative to each other.

Ordinal Features in the dataset: Class

Continuous Feature:

A feature is said to be continuous if it can take any value between two points, i.e. between the minimum and maximum values in the feature's column.

Continuous Features in the dataset: Age
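The difference between nominal and ordinal features can be made explicit in pandas with an ordered Categorical. This is a small sketch using the hypothetical Height example from above, not a column of the Titanic dataset.

```python
import pandas as pd

# Hypothetical Height feature from the example above: Short < Medium < Tall.
height = pd.Series(pd.Categorical(
    ['Tall', 'Short', 'Medium', 'Short'],
    categories=['Short', 'Medium', 'Tall'],
    ordered=True,
))

# An ordered categorical supports sorting and comparisons by category order...
print(height.sort_values().tolist())
print((height > 'Short').tolist())

# ...while a nominal one (like Sex) has categories but no order.
sex = pd.Series(pd.Categorical(['male', 'female', 'female']))
print(sex.cat.ordered)
```

Note that the sort follows the declared category order, not alphabetical order.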

Analyzing The Features
Sex → Categorical Feature

data.groupby(['Sex','Survived'])['Survived'].count()

f,ax=plt.subplots(1,2,figsize=(18,8))
data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Survived vs Sex')
sns.countplot(x='Sex',hue='Survived',data=data,ax=ax[1]) #x passed by keyword; newer seaborn versions reject it as a positional argument
ax[1].set_title('Sex:Survived vs Dead')
plt.show()

The number of men on the ship is a lot higher than the number of women. Still, the number of women saved is about twice the number of men saved. The survival rate for women on the ship is around 75%, while that for men is around 18–19%.

This looks to be a very important feature for modeling. But is it the best? Let's check other features.
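The survival rates by sex quoted above follow directly from the grouped counts. As a worked check, the sketch below uses the counts I am assuming the groupby returns for the standard Kaggle train.csv; the helper name survival_rate is only for illustration.

```python
# Assumed counts for the standard Kaggle train.csv, in the shape returned by
# data.groupby(['Sex', 'Survived'])['Survived'].count().
counts = {
    ('female', 0): 81, ('female', 1): 233,
    ('male', 0): 468, ('male', 1): 109,
}

def survival_rate(sex):
    """Share of passengers of the given sex who survived, in percent."""
    survived = counts[(sex, 1)]
    total = counts[(sex, 0)] + survived
    return round(survived / total * 100, 1)

print(survival_rate('female'))
print(survival_rate('male'))
```

On the dataframe itself, data.groupby('Sex')['Survived'].mean() gives the same rates as fractions.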

Class → Ordinal Feature

#Pclass and Survived are the original column names for Class and Survival
pd.crosstab(data.Pclass,data.Survived,margins=True).style.background_gradient(cmap='summer_r')

f,ax=plt.subplots(1,2,figsize=(18,8))
data['Pclass'].value_counts().plot.bar(color=['#CD7F32','#FFDF00','#D3D3D3'],ax=ax[0])
ax[0].set_title('Number Of Passengers By Class')
ax[0].set_ylabel('Count')
sns.countplot(x='Pclass',hue='Survived',data=data,ax=ax[1])
ax[1].set_title('Pclass:Survived vs Dead')
plt.show()

We can clearly see that passengers of Class 1 were given a very high priority during the rescue. Even though the number of passengers in Class 3 was a lot higher, the share of survivors among them is very low, somewhere around 25%.

For Class 1 the share that survived is around 63%, while for Class 2 it is around 48%.

Checking survival rate with Sex and Pclass together:

pd.crosstab([data.Sex,data.Survived],data.Pclass,margins=True).style.background_gradient(cmap='summer_r')

sns.catplot(x='Pclass',y='Survived',hue='Sex',kind='point',data=data) #factorplot was renamed to catplot in seaborn 0.9
plt.show()

Looking at the factor plot, we can easily infer that survival for women from Pclass 1 is about 97%, as only 3 out of 94 women from Pclass 1 died.

It is evident that, irrespective of Pclass, women were given first priority during the rescue. Even men from Pclass 1 have a very low survival rate.

Looks like Pclass is also an important feature. Let's analyse other features.

Age → Continuous Feature

print('Oldest Passenger was of:',data['Age'].max(),'Years')
print('Youngest Passenger was of:',data['Age'].min(),'Years')
print('Average Age on the ship:',data['Age'].mean(),'Years')

Oldest Passenger was of: 80.0 Years
Youngest Passenger was of: 0.42 Years
Average Age on the ship: 29.69911764705882 Years

f,ax=plt.subplots(1,2,figsize=(18,8))
sns.violinplot(x="Pclass",y="Age", hue="Survived", data=data,split=True,ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0,110,10))
sns.violinplot(x="Sex",y="Age", hue="Survived", data=data,split=True,ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0,110,10))
plt.show()

Observations:

1) The number of children increases with Pclass, and the survival rate for passengers below age 10 (i.e. children) looks good irrespective of Pclass.

2) Survival chances for passengers aged 20–50 from Pclass 1 are high, and are even better for women.

3) For males, the survival chances decrease with an increase in age.

As we saw earlier, the Age feature has 177 null values. To replace these NaN values, we could assign them the mean age of the dataset.

But the problem is that there were many people of many different ages. We just can't assign a 4-year-old kid the mean age of 29 years. Is there any way to find out which age band a passenger lies in?

We can check the Name feature. Looking at it, we can see that the names carry a salutation like Mr or Mrs. Thus we can assign the mean ages of Mr and Mrs to the respective groups.

"What's In A Name?" → Feature

data['Initial']=data.Name.str.extract(r'([A-Za-z]+)\.',expand=False) #lets extract the Salutations

Okay, so here we are using the regex ([A-Za-z]+)\. What it does is look for a run of letters (A-Z or a-z) that is followed by a . (dot). So we successfully extract the Initials from the Name.
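To see what the pattern picks up, here is a small sketch that applies the same regex with Python's re module to a few names in the "Surname, Title. Given names" layout of the Name column. The helper name extract_initial is only for illustration.

```python
import re

# Same pattern as above: a run of letters immediately followed by a literal dot.
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')

def extract_initial(name):
    """Return the first salutation found in a name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

# Names in the "Surname, Title. Given names" layout used by the Name column.
names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
]
print([extract_initial(n) for n in names])
```

Like str.extract, the search returns only the first match, so a dot later in the name does not interfere.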

pd.crosstab(data.Initial,data.Sex).T.style.background_gradient(cmap='summer_r') #Checking the Initials with the Sex

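[Output table, values not preserved: counts per Sex (rows) for each Initial (columns): Capt, Col, Countess, Don, Dr, Jonkheer, Lady, Major, Master, Miss, Mlle, Mme, Mr, Mrs, Ms, Rev, Sir]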

Okay, so there are some other titles like Mlle or Mme that I treat as Miss. I will replace them with Miss, and group the other rare titles in the same way.

data['Initial']=data['Initial'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])

data.groupby('Initial')['Age'].mean() #lets check the average age by Initials

Initial
Master 4.574167
Miss 21.860000
Mr 32.739609
Mrs 35.981818
Other 45.888889
Name: Age, dtype: float64
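The two parallel lists in the replace call are easy to get out of step. As an alternative sketch, the same grouping can be written as a single dictionary; the name TITLE_GROUPS and the sample Series are hypothetical, while the pairs are the ones from the lists above.

```python
import pandas as pd

# Grouping taken from the two parallel lists above, written as one mapping.
TITLE_GROUPS = {
    'Mlle': 'Miss', 'Mme': 'Miss', 'Ms': 'Miss',
    'Dr': 'Mr', 'Major': 'Mr', 'Capt': 'Mr', 'Sir': 'Mr', 'Don': 'Mr',
    'Lady': 'Mrs', 'Countess': 'Mrs',
    'Jonkheer': 'Other', 'Col': 'Other', 'Rev': 'Other',
}

initials = pd.Series(['Mr', 'Mlle', 'Dr', 'Rev', 'Master'])  # hypothetical sample

# Titles that are not keys of the mapping (Mr, Master, ...) are kept as they are.
grouped = initials.replace(TITLE_GROUPS)
print(grouped.tolist())
```

Each title sits next to its replacement, which makes the grouping easier to review.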

Filling NaN Ages

Next, each passenger with a missing age is assigned the (rounded-up) mean age of their Initial group:

## Assigning the NaN Values with the Ceil values of the mean ages
data.loc[(data.Age.isnull())&(data.Initial=='Mr'),'Age']=33
data.loc[(data.Age.isnull())&(data.Initial=='Mrs'),'Age']=36
data.loc[(data.Age.isnull())&(data.Initial=='Master'),'Age']=5
data.loc[(data.Age.isnull())&(data.Initial=='Miss'),'Age']=22
data.loc[(data.Age.isnull())&(data.Initial=='Other'),'Age']=46
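The five hard-coded values can also be computed rather than typed in. The sketch below is an alternative, not what the article's cells do: it uses groupby with transform on a hypothetical mini-frame to fill each missing age with the rounded-up mean of its Initial group.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame; the real dataframe has 891 rows.
data = pd.DataFrame({
    'Initial': ['Mr', 'Mr', 'Mr', 'Master', 'Master', 'Miss'],
    'Age': [30.0, 36.0, np.nan, 4.0, np.nan, 22.0],
})

# Mean age per Initial, broadcast back to every row of that group.
group_mean = data.groupby('Initial')['Age'].transform('mean')

# Round up as the article does, then fill only the missing ages.
data['Age'] = data['Age'].fillna(np.ceil(group_mean))

print(data['Age'].tolist())
```

This way the fill values stay in sync with the data if the title grouping changes.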

data.Age.isnull().any() #So no null values left finally

Out[19]:

False

We have another plot for verification

f,ax=plt.subplots(1,2,figsize=(20,10))
data[data['Survived']==0].Age.plot.hist(ax=ax[0],bins=20,edgecolor='black',color='red')
ax[0].set_title('Survived= 0')
x1=list(range(0,85,5))
ax[0].set_xticks(x1)
data[data['Survived']==1].Age.plot.hist(ax=ax[1],color='green',bins=20,edgecolor='black')
ax[1].set_title('Survived= 1')
x2=list(range(0,85,5))
ax[1].set_xticks(x2)
plt.show()

Observations:

1) Toddlers (age < 5) were saved in large numbers (the Women and Child First policy).

2) The oldest passenger (80 years) was saved.

3) The maximum number of deaths was in the age group of 30–40.
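An observation like the third one can be checked numerically by binning ages into bands with pd.cut. This is a sketch on made-up ages, not the dataset, so the counts it prints say nothing about the Titanic itself.

```python
import pandas as pd

# Hypothetical ages of passengers who died; not taken from the dataset.
ages = pd.Series([2, 19, 25, 31, 33, 38, 39, 47, 62])

# 5-year bands from 0 up to 85, so that an 80-year-old still falls in a band.
bins = list(range(0, 90, 5))
bands = pd.cut(ages, bins=bins, right=False)  # left-closed intervals like [30, 35)

# Count per band, in age order, showing only the non-empty bands.
per_band = bands.value_counts().sort_index()
print(per_band[per_band > 0])
```

On the real dataframe, the same idea would be applied to data[data['Survived']==0].Age.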

In [21]:

sns.catplot(x='Pclass',y='Survived',col='Initial',kind='point',data=data) #factorplot was renamed to catplot in seaborn 0.9
plt.show()

The Women and Child first policy thus holds true irrespective of the class.

With these few basic lines we've been able to analyze the Titanic data set. Thanks for reading.
