[Kaggle] Titanic Survival Prediction — Top 3%

Tim Chan
Published in Analytics Vidhya
Nov 27, 2019 · 13 min read
(Source: https://www.britannica.com/story/the-unsinkable-titanic)

After studying Python for one month, I planned to work on projects to apply my knowledge. Kaggle is a great platform that holds machine learning competitions and provides real-world datasets. As my first attempt, I spent 10 days in total on this project. Thanks to online resources such as Stack Overflow and articles on Medium, which helped a lot!

Index

  1. Background
  2. Exploratory Data Analysis
  3. Imputation of Missing Data/ Outliers
  4. Data Transformation
  5. Feature Creation
  6. Feature Selection
  7. Model
  8. Submission

1. Background

The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Overview

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

Data Dictionary

Variable  Definition                                   Key
survival  Survival                                     0 = No, 1 = Yes
pclass    Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex       Sex
age       Age in years
sibsp     # of siblings / spouses aboard the Titanic
parch     # of parents / children aboard the Titanic
ticket    Ticket number
fare      Passenger fare
cabin     Cabin number
embarked  Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

Goal

It is your job to predict whether a passenger survived the sinking of the Titanic or not.
For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

Metric

Your score is the percentage of passengers you correctly predict. This is known as accuracy.
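As a quick illustration (not from the original notebook; the labels below are made up), accuracy can be computed with scikit-learn:

from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 of 5 predictions match, so accuracy = 0.8
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 0.8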

2. Exploratory Data Analysis

Import libraries:

import numpy as np
import os
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

Import data:

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

Data structure:

train_data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test_data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 13 columns):
Age 332 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
Fare 417 non-null float64
Name 418 non-null object
Parch 418 non-null int64
PassengerId 418 non-null int64
Pclass 418 non-null int64
Sex 418 non-null object
SibSp 418 non-null int64
Survived 0 non-null float64
Ticket 418 non-null object
Title 418 non-null object
dtypes: float64(3), int64(4), object(6)
memory usage: 45.7+ KB
  1. Total no. of rows: 891 for the train dataset and 418 for the test dataset.
  2. Cabin: more than 70% of the values are missing. A column missing that much data cannot provide meaningful information, so Cabin can be dropped (a quick check of the missing proportions is shown below).
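A supplementary check (not in the original notebook) of the per-column missing fractions:

# Fraction of missing values per column, highest first
train_data.isnull().mean().sort_values(ascending=False).head(3)
# Cabin ~0.77, Age ~0.20, Embarked ~0.002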
train_data.drop(['Cabin'], axis=1, inplace=True)
test_data.drop(['Cabin'], axis=1, inplace=True)

Survived

sns.countplot(train_data.Survived)
plt.show()

Overall probability of survival: ~38%.

Pclass

sns.countplot(train_data.Pclass)
plt.show()
sns.barplot(x='Pclass', y='Survived', data=train_data)
plt.show()

1. Passengers in Pclass 1 (Upper class) are more likely to survive.

2. Pclass is a good feature for prediction of survival.

Sex

sns.countplot(train_data.Sex)
plt.show()
sns.barplot(x='Sex', y='Survived', data=train_data)
plt.show()

1. Proportion of males to females: ~2/3 vs ~1/3.

2. Males are much less likely to survive, with only a ~20% chance of survival; females have a >70% chance.

3. Clearly, Sex is an important feature for predicting survival.

Age

plt.hist(train_data.Age, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('count')
plt.show()
sns.boxplot(x='Survived', y='Age', data=train_data)
plt.show()

1. Passengers are mainly aged 20–40.

2. Younger passengers tend to survive.

SibSp

sns.countplot(train_data.SibSp)
plt.show()
sns.barplot(x='SibSp', y='Survived', data=train_data)
plt.show()

1. Most passengers travelled without a sibling or spouse aboard.

2. Passengers with 1 sibling/spouse aboard are more likely to survive than those with none.

3. For passengers with more than 1 sibling/spouse, the counts are too small to provide any insight.

Parch

sns.countplot(train_data.Parch)
plt.show()
sns.barplot(x='Parch', y='Survived', data=train_data)
plt.show()

1. Over 70% of passengers travelled without parents/children.

2. Passengers travelling with parents/children are more likely to survive than those without.

Ticket

train_data.Ticket.head(10)

Output:
0 A/5 21171
1 PC 17599
2 STON/O2. 3101282
3 113803
4 373450
5 330877
6 17463
7 349909
8 347742
9 237736
Name: Ticket, dtype: object

Ticket numbers show no obvious structure on their own; they are revisited in Section 5, where shared ticket prefixes help identify family groups.

Fare

sns.distplot(train_data.Fare)
plt.show()
  1. The distribution is right-skewed. Outliers are observed.
  2. For those who survived, fares are relatively higher (see the supplementary boxplot below).
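The second observation can be verified with a boxplot of Fare against Survived (a supplementary sketch, not in the original):

# Compare fare distributions of survivors vs. non-survivors
sns.boxplot(x='Survived', y='Fare', data=train_data)
plt.show()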

Embarked

sns.countplot(train_data.Embarked)
plt.show()
sns.barplot(x='Embarked', y='Survived', data=train_data)
plt.show()

1. More than 2/3 of the passengers embarked at Port S (Southampton).

2. Passengers who embarked at Port C (Cherbourg) are more likely to survive.

3. Imputation of Missing Data/ Outliers

Age

train_data.Name.head(10)

Output:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th…
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
7 Palsson, Master. Gosta Leonard
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9 Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object

Let’s extract the titles (Mr./Mrs./Miss/Master) from the names of passengers. This can be done for both train and test datasets.

whole_data = train_data.append(test_data)
whole_data['Title'] = whole_data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
whole_data.Title.value_counts()
Output:
Mr 757
Miss 260
Mrs 197
Master 61
Dr 8
Rev 8
Col 4
Ms 2
Major 2
Mlle 2
Countess 1
Jonkheer 1
Don 1
Sir 1
Dona 1
Capt 1
Mme 1
Lady 1
Name: Title, dtype: int64

The common titles are Mr/Miss/Mrs/Master. Some of the other titles (Ms/Lady/Sir, etc.) can be mapped to these common titles. The remaining unclassified titles can be grouped into “Others”.

Common_Title = ['Mr', 'Miss', 'Mrs', 'Master']
whole_data['Title'].replace(['Ms', 'Mlle', 'Mme'], 'Miss', inplace=True)
whole_data['Title'].replace(['Lady'], 'Mrs', inplace=True)
whole_data['Title'].replace(['Sir', 'Rev'], 'Mr', inplace=True)
whole_data.loc[~whole_data.Title.isin(Common_Title), 'Title'] = 'Others'  # .loc avoids chained assignment

Let’s look at the relationship between titles and age in the train dataset.

train_data = whole_data[:len(train_data)]
test_data = whole_data[len(train_data):]
sns.boxplot(x='Title', y='Age', data=train_data)
plt.show()

Find the median Age for each title.
(Remark: use only the train dataset to avoid data leakage.)

AgeMedian_by_titles = train_data.groupby('Title')['Age'].median()
AgeMedian_by_titles
Output:
Title
Master 3.5
Miss 21.5
Mr 30.0
Mrs 35.0
Others 47.0
Name: Age, dtype: float64

Impute the missing Age values according to the titles.

for title in AgeMedian_by_titles.index:
    train_data.loc[(train_data.Age.isnull()) & (train_data.Title == title), 'Age'] = AgeMedian_by_titles[title]
    test_data.loc[(test_data.Age.isnull()) & (test_data.Title == title), 'Age'] = AgeMedian_by_titles[title]

Embarked

For train dataset, there are only 2 missing values. Simply impute the mode.

train_data['Embarked'].fillna(train_data.Embarked.mode()[0], inplace=True)

Fare

For the test dataset, there is only 1 missing value. Simply impute the median.

test_data['Fare'].fillna(test_data['Fare'].median(), inplace=True)

For the train dataset, there are 3 outliers (fare = 512.3292):

train_data.Fare.sort_values(ascending=False).head(5)

Output:
679 512.3292
258 512.3292
737 512.3292
341 263.0000
438 263.0000

Outliers should be handled so that they do not distort the distribution, which makes the model more robust.

Outliers can be capped at a maximum value, replaced with the median, or simply removed.

I chose to replace the outliers with the second-highest fare (i.e. 263).
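For reference, here is a sketch of the capping option using a quantile threshold (an alternative to the replacement below; capped_fare is illustrative only and is not used later):

# Alternative (illustration only): cap fares at the 99th percentile of the train set
fare_cap = train_data.Fare.quantile(0.99)
capped_fare = train_data.Fare.clip(upper=fare_cap)  # not applied in this pipeline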

train_data.loc[train_data.Fare>512, 'Fare'] = 263
train_data.Fare.sort_values(ascending=False).head(5)
Output:
341 263.0
438 263.0
88 263.0
679 263.0
258 263.0
Name: Fare, dtype: float64

Check for missing data.

train_data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
Age 891 non-null float64
Embarked 891 non-null object
Fare 891 non-null float64
Name 891 non-null object
Parch 891 non-null int64
PassengerId 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
SibSp 891 non-null int64
Survived 891 non-null float64
Ticket 891 non-null object
Title 891 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 90.5+ KB
test_data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 12 columns):
Age 418 non-null float64
Embarked 418 non-null object
Fare 418 non-null float64
Name 418 non-null object
Parch 418 non-null int64
PassengerId 418 non-null int64
Pclass 418 non-null int64
Sex 418 non-null object
SibSp 418 non-null int64
Survived 0 non-null float64
Ticket 418 non-null object
Title 418 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 42.5+ KB

4. Data Transformation

Encode strings as numbers for modelling.

Sex

train_data['Sex_Code'] = train_data['Sex'].map({'female':1, 'male':0}).astype('int')
test_data['Sex_Code'] = test_data['Sex'].map({'female':1, 'male':0}).astype('int')

Embarked

train_data['Embarked_Code'] = train_data['Embarked'].map({'S':0, 'C':1, 'Q':2}).astype('int')
test_data['Embarked_Code'] = test_data['Embarked'].map({'S':0, 'C':1, 'Q':2}).astype('int')

Group data into bins to make the model more robust and avoid over-fitting.

Age

train_data['AgeBin_5'] = pd.qcut(train_data['Age'], 5)
test_data['AgeBin_5'] = pd.qcut(test_data['Age'], 5)
sns.barplot(x='AgeBin_5', y='Survived', data=train_data)
plt.show()

Fare

train_data['FareBin_5'] = pd.qcut(train_data['Fare'], 5)
test_data['FareBin_5'] = pd.qcut(test_data['Fare'], 5)

Encode the Age and Fare bins into numbers for modelling.

label = LabelEncoder()
train_data['AgeBin_Code_5'] = label.fit_transform(train_data['AgeBin_5'])
test_data['AgeBin_Code_5'] = label.fit_transform(test_data['AgeBin_5'])
label = LabelEncoder()
train_data['FareBin_Code_5'] = label.fit_transform(train_data['FareBin_5'])
test_data['FareBin_Code_5'] = label.fit_transform(test_data['FareBin_5'])
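One caveat: qcut is applied to the train and test sets separately above, so their bin edges may differ slightly. A leakage-free alternative (a sketch, not what this notebook does) derives the edges from the train set and reuses them on the test set:

# Derive 5 quantile bin edges from the train set and apply them to the test set
_, fare_edges = pd.qcut(train_data['Fare'], 5, retbins=True)
fare_edges[0], fare_edges[-1] = -np.inf, np.inf  # cover values outside the train range
test_fare_codes = pd.cut(test_data['Fare'], bins=fare_edges, labels=False)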

5. Feature Creation

Alone

SibSp and Parch both describe family members aboard. For simplicity's sake, I decided to combine them into a single feature named FamilySize.

train_data['FamilySize'] = train_data.SibSp + train_data.Parch + 1
test_data['FamilySize'] = test_data.SibSp + test_data.Parch + 1
sns.countplot(train_data.FamilySize)
plt.show()

Since FamilySize=1 is the dominant value, the feature may not provide sufficient predictive power on its own. I decided to group the values and convert the feature into whether a passenger is travelling alone or not.

train_data['Alone'] = train_data.FamilySize.map(lambda x: 1 if x == 1 else 0)
test_data['Alone'] = test_data.FamilySize.map(lambda x: 1 if x == 1 else 0)
sns.countplot(train_data.Alone)
plt.show()
sns.barplot(x='Alone', y='Survived', data=train_data)
plt.show()

It is observed that passengers travelling alone are less likely to survive (~30% vs ~50%).

Title

Title was created for imputing the missing values of Age. It can also be used as a new feature.

sns.countplot(train_data.Title)
plt.show()
sns.barplot(x='Title', y='Survived', data=train_data)
plt.show()

It is obvious that passengers with the title Mr. are much less likely to survive than the others.

Let’s encode the features for modelling.

train_data['Title_Code'] = train_data.Title.map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Others':4}).astype('int')
test_data['Title_Code'] = test_data.Title.map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Others':4}).astype('int')

Connected Survival

In the Titanic movie, survivors often appeared in family groups, helping each other find a way out. In addition, families usually had children, who were given first priority for the lifeboats, and parents stayed close to take care of them.

To identify family groups, apart from the surnames of passengers (different families may share a surname), let's also look at Ticket.

train_data[['Name', 'Ticket']].sort_values('Name').head(20)

Output:
845 Abbing, Mr. Anthony C.A. 5547
746 Abbott, Mr. Rossmore Edward C.A. 2673
279 Abbott, Mrs. Stanton (Rosa Hunt) C.A. 2673
308 Abelson, Mr. Samuel P/PP 3381
874 Abelson, Mrs. Samuel (Hannah Wizosky) P/PP 3381
365 Adahl, Mr. Mauritz Nils Martin C 7076
401 Adams, Mr. John 341826
40 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) 7546
855 Aks, Mrs. Sam (Leah Rosen) 392091
207 Albimona, Mr. Nassef Cassem 2699
810 Alexander, Mr. William 3474
840 Alhomaki, Mr. Ilmari Rudolf SOTON/O2 3101287
210 Ali, Mr. Ahmed SOTON/O.Q. 3101311
784 Ali, Mr. William SOTON/O.Q. 3101312
730 Allen, Miss. Elisabeth Walton 24160
4 Allen, Mr. William Henry 373450
305 Allison, Master. Hudson Trevor 113781
297 Allison, Miss. Helen Loraine 113781
498 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 113781
834 Allum, Mr. Owen George 2223

It appears that passengers with the same surname also share the same ticket number.

Let’s extract the surnames and ticket prefixes and find the duplicated combinations. Since the train and test datasets may contain passengers from the same families, I decided to do this on the combined data.

whole_data = train_data.append(test_data)
whole_data['Surname'] = whole_data.Name.str.extract(r'([A-Za-z]+),', expand=False)
whole_data['TixPref'] = whole_data.Ticket.str.extract(r'(.*\d)', expand=False)
whole_data['SurTix'] = whole_data['Surname'] + whole_data['TixPref']
whole_data['IsFamily'] = whole_data.SurTix.duplicated(keep=False)*1
sns.countplot(whole_data.IsFamily)
plt.show()

Around 1/3 of the passengers are travelling with families.

Next, let’s dig out the families with children: simply list the SurTix values that belong to a family and include a child.

whole_data['Child'] = whole_data.Age.map(lambda x: 1 if x <= 16 else 0)
UniqueFamilyTixWithChild = whole_data[(whole_data.IsFamily==1) & (whole_data.Child==1)]['SurTix'].unique()
len(UniqueFamilyTixWithChild)

Output:
66

There are 66 families with 1 or more children.

Encode each family with children with a unique id (0 is assigned to everyone else).

whole_data['FamilyId'] = 0
x = 1
for tix in UniqueFamilyTixWithChild:
    whole_data.loc[whole_data.SurTix==tix, ['FamilyId']] = x
    x += 1

Let’s look at the survival data of each family with children.

whole_data['SurvivedDemo'] = whole_data['Survived'].fillna(9)
pd.crosstab(whole_data.FamilyId, whole_data.SurvivedDemo).drop([0]).plot(kind='bar', stacked=True, color=['black','g','grey'])
plt.show()

It is observed that a family usually either all survived (all green) or all perished (all black). This finding supports the concept of connected survival: for each family above, if at least one member survived, we assume the others could survive too.

whole_data['ConnectedSurvival'] = 0.5
Survived_by_FamilyId = whole_data.groupby('FamilyId').Survived.sum()
for i in range(1, len(UniqueFamilyTixWithChild)+1):
    if Survived_by_FamilyId[i] >= 1:
        whole_data.loc[whole_data.FamilyId==i, ['ConnectedSurvival']] = 1
    elif Survived_by_FamilyId[i] == 0:
        whole_data.loc[whole_data.FamilyId==i, ['ConnectedSurvival']] = 0
train_data = whole_data[:len(train_data)]
test_data = whole_data[len(train_data):]
sns.barplot(x='ConnectedSurvival', y='Survived', data=train_data)
plt.show()

The probability of survival is much higher for passengers who are:

  1. Travelling with family members
  2. In a family with 1 or more children
  3. In a family with 1 or more survivors

6. Feature Selection

train_data.columns

Output:
Index(['Age', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId', 'Pclass', 'Sex', 'SibSp', 'Survived', 'Ticket', 'Title', 'Sex_Code', 'AgeBin_5', 'FareBin_5', 'AgeBin_Code_5', 'FareBin_Code_5', 'FamilySize', 'Alone', 'Title_Code', 'Surname', 'TixPref', 'SurTix', 'IsFamily', 'Child', 'FamilyId', 'ConnectedSurvival'],
dtype='object')

First, drop the unused columns.

X_train = train_data.drop(['Age', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId', 'Sex', 'SibSp', 'Survived', 'Ticket', 'Title', 'AgeBin_5', 'FareBin_5', 'FamilySize', 'Surname', 'TixPref', 'SurTix', 'IsFamily', 'Child', 'FamilyId'], axis=1)
y_train = train_data['Survived']

Define the model as a RandomForestClassifier.

model = RandomForestClassifier(n_estimators=200, random_state=2)

Let’s look at the feature importance.

model.fit(X_train, y_train)
importance = pd.DataFrame({'feature': X_train.columns, 'importance': np.round(model.feature_importances_, 3)})
importance = importance.sort_values('importance', ascending=False).set_index('feature')
importance.plot(kind='bar', rot=0)
plt.show()

Choose the top 5 most important features for modelling (i.e. Title_Code, Sex_Code, ConnectedSurvival, Pclass and FareBin_Code_5). Keeping the number of features minimal helps avoid over-fitting.

final = ['Title_Code', 'Sex_Code', 'ConnectedSurvival', 'Pclass', 'FareBin_Code_5']
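RFECV was imported at the top but never used; as an optional cross-check (a sketch, assuming the random forest above as the estimator), recursive feature elimination with cross-validation can confirm the selection:

# Optional cross-check: recursive feature elimination with 5-fold CV
selector = RFECV(estimator=model, step=1, cv=5, scoring='accuracy')
selector.fit(X_train, y_train)
print(X_train.columns[selector.support_])  # features retained by RFECV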

7. Model

Tune model parameters.

grid_param = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5],
    'bootstrap': [True, False],
}
gd_sr = GridSearchCV(estimator=model,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)
gd_sr.fit(X_train[final], y_train)
best_parameters = gd_sr.best_params_
print(best_parameters)
Output:
{'bootstrap': True, 'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 300}

Set the model parameters after tuning.

model = RandomForestClassifier(n_estimators=300, bootstrap=True, criterion= 'entropy', min_samples_leaf=5, min_samples_split=2, random_state=2)

Calculate the accuracy of prediction using 5-fold cross-validation.

all_accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5)
all_accuracies
all_accuracies.mean()
Output:
[0.86592179 0.84357542 0.83707865 0.80898876 0.88700565]
0.8485140544303522

The mean accuracy is 0.8485.

8. Submission

X_test = test_data[final]
model.fit(X_train[final], y_train)
prediction = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': prediction.astype(int)})
output.to_csv('my_submission.csv', index=False)

The Kaggle score is 0.82296 (Top 3%).

Thank You!
