Exploratory Data Analysis of Titanic Survival Problem

Part I — Analysis, Cleaning & Visualization

Revathi Suriyadeepan
Analytics Vidhya
11 min read · Dec 30, 2020


Exploratory Data Analysis (EDA) is a statistical approach for visualizing and analyzing data before forming a hypothesis or building a model.

This is Part I of my series on Getting Started with Data Science.

In this write-up, we will explore the Titanic — Machine Learning from Disaster competition from Kaggle. I present a summary of my analysis and insights in this blog. A detailed notebook containing a comprehensive study of the Titanic data is available on GitHub at reyscode/start-datascience.

Table of Contents

  1. Study of Variables
  2. Data Cleaning
  3. Feature Engineering
  4. Correlation Study
  5. Target Variable Analysis (Univariate Analysis)
  6. Bivariate Analysis
  7. Multivariate Analysis

Study of Variables

First, we will import the necessary packages and load the data set.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('../data/titanic-train.csv')
df

Above is the training dataset of the Titanic survival problem. It has 891 rows (one per passenger) and 12 columns (attributes of each passenger), including the target variable “Survived”.

Let us first look at the columns of the data, and then use the describe() and info() methods to get a basic idea of what we have in hand.

cols = df.columns
cols
df.describe()

Let us go through the output of describe() row by row and try to understand it.

Count: The first row, ‘count’, is the number of non-null entries in each column. For ‘PassengerId’ it shows 891 (equal to the number of passengers), which means ‘PassengerId’ is available for every passenger. Except for ‘Age’, all the numeric columns are available for each passenger. We will have to fix the missing values in the ‘Age’ column.

Mean: Mean of the column values. Consider the ‘Fare’ column: the average ‘Fare’ per passenger is USD 32.20.

Std: Standard deviation of the column values. A low standard deviation means that most of the numbers are close to the average, while a high standard deviation means that the numbers are more spread out. Note that ‘Fare’ has a high standard deviation.

Min: Minimum value of the column. For example, the lowest ‘Fare’ shows USD 0, which means ‘Fare’ is unavailable for some passengers. A model may not perform well with a placeholder of 0 for some passengers, so we will need to look at the ‘Fare’ column before modeling.

Max: Maximum value of the column. For example, the highest ‘Fare’ shows USD 512.33, while the mean of the ‘Fare’ column is USD 32.20. There is a huge spread in fares, which could be explained by the ‘Pclass’ passengers were traveling in.

25%, 50% & 75%: 1st, 2nd, and 3rd quartiles of the data. Quartiles are the quantiles that divide the ordered data points into four equal groups. The 1st quartile is the middle number between the smallest value and the median, the 2nd quartile is the median of the dataset, and the 3rd quartile is the middle value between the median and the highest value.
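Before moving on, here is a quick side check (one possible way, not part of the notebook summary above) that counts how many passengers have a zero fare, which is what the ‘Min’ row hinted at:

# number of passengers whose recorded fare is exactly 0
(df['Fare'] == 0).sum()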

df.info()

Using the info() method, we get the dtypes of the dataset, the non-null count per column, and the memory usage. To get the total number of null values per column, we use the isnull() method.

df.isnull().sum()

We already know that there are 177 missing values in the ‘Age’ column. From the above results, we see that there are also 687 missing values in ‘Cabin’ and 2 missing values in ‘Embarked’. We need to fix these null values before we move on to modeling.

Now let's see the heatmap of the null values.

sns.heatmap(df.isnull(), cmap = 'viridis', cbar=False)

Since the ‘Embarked’ column has only 2 null values, it's not visible in the heatmap.
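If the heatmap is hard to read, a numeric alternative (a small sketch along the same lines) is to look at the fraction of missing values per column:

# fraction of missing values in each column, sorted from most to least
df.isnull().mean().sort_values(ascending=False)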

Data Cleaning

Since the ‘Cabin’ column has the most NaN values, let's fix it first. The ‘Cabin’ column holds the cabin number of the passenger, or NaN for those who didn't have one. Let's create a new column ‘HasCabin’ which has 1 if there is a cabin and 0 for NaN.

def create_feat_has_cabin(df, colname):
    # NaN (a float) means no cabin => 0, a cabin string => 1
    def _is_nan(x):
        if isinstance(x, type(np.nan)):
            return 0
        return 1

    return df[colname].apply(_is_nan)

df['HasCabin'] = create_feat_has_cabin(df, 'Cabin')

Now, let's fill the NA values of the ‘Embarked’ column with ‘S’ (Southampton).

def fill_na_embarked(df, colname):
    # most passengers embarked at Southampton, so use 'S' as the default
    return df[colname].fillna('S')

df['Embarked'] = fill_na_embarked(df, 'Embarked')

Similarly, the ‘Age’ column has a lot of missing values. Hence we fill the missing values with random values centered around the mean and spread within one standard deviation. Let's get the mean and standard deviation first.

mean = df['Age'].mean()
sd = df['Age'].std()
print(mean,sd)

The mean of the ‘Age’ column is 29.48 and its standard deviation is 13.53. Hence we fill the missing values by choosing a random integer between 16 and 43 (roughly mean - sd and mean + sd).

def fill_na_age(df, colname):
    mean = df['Age'].mean()
    sd = df['Age'].std()
    def fill_empty(x):
        # replace a missing age with a random integer in [mean - sd, mean + sd)
        if np.isnan(x):
            return np.random.randint(mean-sd, mean+sd, ())
        return x
    return df[colname].apply(fill_empty).astype(int)

df['Age'] = fill_na_age(df, 'Age')

Feature Engineering

We have filled all the missing values in our data. In this section, we put on our creative hats and think up new features that could improve our yet-to-be-built model's performance.

First, let's create a new column ‘FamilySize’ by combining ‘SibSp’ (siblings & spouses) and ‘Parch’ (parents & children).

def create_feat_familly_size(df):
    # family size = siblings/spouses + parents/children + the passenger themselves
    return df['SibSp'] + df['Parch'] + 1

df['FamilySize'] = create_feat_familly_size(df)

Ok done!

What about the ones traveling solo? We create a new column named ‘IsAlone’ with 1 for solo travelers and 0 otherwise.

def create_feat_isalone(df, colname):
    # a family size of 1 means the passenger is traveling alone
    def _is_alone(x):
        if x==1:
            return 1
        return 0

    return df[colname].apply(_is_alone)

df['IsAlone'] = create_feat_isalone(df, 'FamilySize')

As we have seen earlier, the ‘Fare’ column contains 0s for some passengers and extremely high values for others. So let's split the fare into four quartile-based categories and store the result in a new column “CategoricalFare”.

def create_feat_categoricalFare(df, colname):
    # qcut splits the fares into four equal-sized (quartile) bins
    return pd.qcut(df[colname], 4, labels = [0, 1, 2, 3]).astype(int)

df['CategoricalFare'] = create_feat_categoricalFare(df, 'Fare')

We have already filled in the missing values of ‘Age’. Now let's split the ages into 5 categories and store them in a new column “CategoricalAge”.

def create_feat_categoricalAge(df, colname):
    # qcut splits the ages into five equal-sized (quantile) bins
    return pd.qcut(df[colname], 5, labels = [0, 1, 2, 3, 4]).astype(int)

df['CategoricalAge'] = create_feat_categoricalAge(df, 'Age')

Done!

Now let's look at the Name column. We have many titles [‘Mr’, ‘Mrs’, ‘Miss’, ‘Master’, ‘Don’, ‘Rev’, ‘Dr’, ‘Mme’, ‘Ms’, ‘Major’, ‘Lady’, ‘Sir’, ‘Mlle’, ‘Col’, ‘Capt’, ‘Countess’, ‘Jonkheer’]. So let's extract the title from each name, group the titles into four categories, namely Mr, Miss, Mrs & Rare, and store the result in a new column ‘Title’.
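As an aside, one way to see every title that appears in the ‘Name’ column (a quick sketch, not part of the main pipeline) is:

# extract the word followed by a period from each name and list the unique values
df['Name'].str.extract(r' ([A-Za-z]+)\.')[0].unique()

The extraction and categorization itself looks like this: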

import re

def create_feat_title(df, colname):
    def find_title(x):
        # grab the word that ends with a '.' (Mr., Mrs., Miss., ...)
        title_search = re.search(r' ([A-Za-z]+)\.', x)
        if title_search:
            title = title_search.group(1)
            if title in ['Mlle', 'Ms', 'Miss']:
                return 'Miss'
            elif title in ['Mme', 'Mrs']:
                return 'Mrs'
            elif title=='Mr':
                return 'Mr'
            else:
                return 'Rare'
        return ""

    return_title = df[colname].apply(find_title)
    dict_title = {'Miss': 1, 'Mrs':2, 'Mr':3, 'Rare':4}
    return return_title.replace(dict_title)

df['Title'] = create_feat_title(df, 'Name')

Now we change the ‘Sex’ column values of ‘male’ and ‘female’ to 1 and 0 and store the result in a new column ‘SexNumerical’. But why? Machine learning algorithms operate on numerical values. They do not understand “Male/Female” or “Yes/No”, but they do understand the difference between a 0 and a 1.

For the same reason, we might as well change the values of the Embarked column to numerical.

def create_feat_sex(df, colname):
    # 1 for male, 0 for female
    def sex(x):
        if x=='male':
            return 1
        return 0

    return df[colname].apply(sex)

df['SexNumerical'] = create_feat_sex(df, 'Sex')
df['Embarked'] = df.Embarked.replace({'S': 0, 'C' : 1, 'Q' : 2})

Alright, we are done with Data Cleaning and Feature Engineering. Let's check if there are any more null values present in the data frame.

df.isna().sum()

We can ignore the missing values in ‘Cabin’ since we have already created the new column ‘HasCabin’. Time to drop the columns we no longer need.

drop_list = ['PassengerId', 'Cabin', 'Ticket', 'SibSp', 'Name']
titanic = df.drop(drop_list, axis=1)

Correlation Study

We are done with Data Cleaning and Pre-processing.

Before visualizing the data, let's see the correlation between the variables.

corrmat = titanic.corr()
corrmat

Positive and negative values denote positive and negative correlation, respectively. The first row of the matrix shows the correlation of each variable with the target variable ‘Survived’.

For building a good predictive model, we are interested in variables that influence the target variable “Survived”, whether positively or negatively. So we should pay attention to coefficients with a large magnitude, both strongly positive and strongly negative ones.
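One quick way to rank the features by the strength of their relationship with ‘Survived’, regardless of sign, is a snippet like the following (it simply builds on the corrmat computed above):

# absolute correlation of every feature with the target, strongest first
corrmat['Survived'].drop('Survived').abs().sort_values(ascending=False)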

Let us take a look at the beautiful correlation heatmap rendered using the amazing and succinct seaborn.

colormap = plt.cm.Blues
plt.figure(figsize=(14,10))
sns.heatmap(titanic.corr(), cmap=colormap, annot=True, linewidths=0.2)

The first row contains the values that represent the correlation of each variable with the target variable. ‘HasCabin’ and ‘CategoricalFare’ are strongly (positively) correlated with the target variable, while ‘SexNumerical’ is strongly (negatively) correlated with it.

Target Variable Analysis (Univariate Analysis)

The study of the target variable is a significant step in data analysis that reveals the nature and distribution of the variable. Let's analyze our target variable “Survived”.

titanic['Survived'].value_counts()

From the above result, 342 out of the 891 passengers in the training data survived. Let's plot it using a count plot.

sns.countplot(x='Survived', data=titanic)
plt.title("Titanic Survived")
plt.show()

From the above plot, the number of people who survived is less than the number of people who died. Now let's see what percentage of passengers survived, using a pie chart.

explode = [0, 0.05]
titanic['Survived'].value_counts().plot.pie(autopct = '%1.2f%%', explode=explode)

From the above chart, only about 38% of the passengers in the training data survived. Clearly, there is an imbalance between the classes.
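The same proportion can be read directly from the data (a quick check):

# fraction of passengers in each class of the target variable
titanic['Survived'].value_counts(normalize=True)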

Bivariate Analysis

Let's analyze the ‘Pclass’ column, since it is strongly correlated with the target variable.

titanic['Pclass'].value_counts()
titanic.groupby(['Pclass', 'Survived'])['Survived'].count()

The above result shows the breakdown of passengers by ‘Pclass’ and ‘Survived’. But we still cannot see the survival percentage from these raw counts. So let's plot ‘Pclass’ against ‘Survived’ to get a better picture of the data.

sns.catplot(x='Pclass', y='Survived', data=titanic, kind='point')

What you see above is called a point plot. It shows point estimates and confidence intervals: the point estimates indicate the central tendency of a variable, while the confidence intervals indicate the uncertainty around that estimate. From the above plot, it is very clear that first-class passengers had the highest survival rate compared to passengers in the other classes.
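The survival rates behind this plot can also be computed directly (a small sketch):

# mean of the 0/1 target per class gives the survival rate for each Pclass
titanic.groupby('Pclass')['Survived'].mean()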

Let's see one more example of bivariate analysis by comparing Sex and Fare.

sns.catplot(x='Sex', y='Fare', data=titanic, kind='boxen')

The enhanced box plot shown above indicates that the fare of female passengers was, on average, higher than that of male passengers. It could be because of additional services offered to female passengers.
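A quick numeric check of the same comparison (one possible way) is the median fare per sex:

# median fare paid by female vs. male passengers
titanic.groupby('Sex')['Fare'].median()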

Multivariate Analysis

Multivariate analysis helps us mine for a deeper understanding of the relationships between variables than bivariate analysis does. The latter assumes that the relationship between a variable X and the target variable Y is independent of the rest of the variables, i.e., f(X, Y) doesn't depend on a third variable Z. This limiting assumption can be dangerous. For instance, “Women and children first” is a naval code of conduct followed since 1852, whereby the lives of women and children were to be saved first in a life-threatening situation. As we already know, “Survival” is highly correlated with “Sex”. But a third variable, “Age” (child or not), influences the relationship between “Survival” and “Sex”.
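We can check this interaction directly. The snippet below is a small sketch that assumes an age cutoff of 16 for “child” (a choice I'm making here, not taken from the notebook) and computes the survival rate split by sex and by whether the passenger is a child:

# survival rate by sex and a hypothetical 'child' flag (age < 16)
titanic.assign(IsChild=titanic['Age'] < 16).groupby(['Sex', 'IsChild'])['Survived'].mean()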

First, let us analyze the data with three variables, and then we will learn to visualize the relationship between four variables. Let's compare ‘Sex’ and ‘Age’.

sns.catplot(x='Sex', y='Age', data=titanic)

From the above graph, we can see that some very old men were aboard. But we can't get much information by comparing only age and sex.

So let's include a third parameter, “Pclass”, and try to understand it better.

sns.catplot(x='Sex', y='Age', data=titanic, kind='box', hue='Pclass')

From the above plot, we infer that most of the older people were traveling in first class, perhaps because they were wealthier. Passengers aged between 25 and 35 were mostly traveling in second and third class.

Maybe we can see it better using a violin plot.

sns.catplot(x='Pclass', y='Age', data=titanic, kind='violin', hue='Sex')

Now let's see how to compare four variables. Let's take ‘Age’ and ‘Fare’. Note that both are continuous variables.

sns.jointplot(x='Age', y='Fare', data=titanic, kind='hex')

Most passengers were aged between 20 and 40 and paid a fare of roughly $20 to $50, but beyond that the previous plot doesn't tell us much. Below is a better view of the data that includes “Sex” and “Pclass” along with “Age” and “Fare”.

sns.relplot(x='Age', y='Fare', data=titanic, row='Sex', col='Pclass')

From the above plots, we observe that more male passengers traveled in first class than female passengers. The fare for first-class female passengers was higher than for male passengers. There is no big difference in fare between second- and third-class passengers. Very few children traveled in first class, while third class had the most children. Most second- and third-class passengers were aged between 20 and 40.
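A compact numeric companion to these plots (one possible summary) is the median age and fare per class and sex:

# median age and fare for each (class, sex) combination
titanic.groupby(['Pclass', 'Sex'])[['Age', 'Fare']].median()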

— — — — — — — — — — — — — — — — — — — — — — — — — —

That pretty much wraps up this write-up. What you see above is a summary of a comprehensive analysis I’ve done over the past few weeks. The complete notebook is available at reyscode/start-datascience. During this period I’ve come a long way in using statistics and data visualization to understand relationships among variables in the data. I’m glad to admit that I feel comfortable using Statistics to understand patterns in data. I’m looking forward to using more advanced statistical methods (ARIMA, …) to derive insights from more complex data with entangled variables.

Please take a look at my follow-up blogs:

If you find this interesting, please leave a comment below. I’d be delighted to interact with you.
