# [Kaggle] Titanic Survival Prediction — Top 3%

Nov 27, 2019 · 13 min read

After studying Python for one month, I wanted to apply my knowledge to real projects. Kaggle is a great platform that hosts machine learning competitions and provides real-world datasets. This was my first attempt, and I spent 10 days in total on the project. Thanks to online resources such as Stack Overflow and articles on Medium, which helped a lot!

# Index

1. Background
2. Exploratory Data Analysis
3. Imputation of Missing Data/ Outliers
4. Data Transformation
5. Feature Creation
6. Feature Selection
7. Model
8. Submission

# 1. Background

## The Challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

## Overview

The data has been split into two groups:

• training set (train.csv)
• test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.

## Data Dictionary

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way…
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way…
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

## Goal

It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

## Metric

Your score is the percentage of passengers you correctly predict. This is known as accuracy.
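As a small illustration of the metric, here is a minimal sketch of accuracy computed by hand. The labels are made up for the example and are not Titanic data, and the helper name `accuracy` is my own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels (made up for illustration)
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

score = accuracy(y_true, y_pred)  # 3 of the 4 predictions are correct
print(score)
```

In practice `sklearn.metrics.accuracy_score` does the same job.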

# 2. Exploratory Data Analysis

Import libraries:

```python
import numpy as np
import os
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
```

Import data:

```python
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
```

Data structure:

```python
train_data.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
```

```python
test_data.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 13 columns):
Age            332 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
Fare           417 non-null float64
Name           418 non-null object
Parch          418 non-null int64
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
SibSp          418 non-null int64
Survived       0 non-null float64
Ticket         418 non-null object
Title          418 non-null object
dtypes: float64(3), int64(4), object(6)
memory usage: 45.7+ KB
```

1. Total number of rows: 891 in the train dataset and 418 in the test dataset.
2. Cabin: more than 70% of the values are missing.

Since a column with less than 30% of its values present can’t provide meaningful information, Cabin can be dropped.

```python
train_data.drop(['Cabin'], axis=1, inplace=True)
test_data.drop(['Cabin'], axis=1, inplace=True)
```
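The missing-data share can also be computed directly instead of being read off `info()`. Below is a minimal sketch on a toy frame standing in for `train_data`; the values and the 0.7 threshold variable are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column, largest first
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)

# Columns whose missing fraction exceeds the (assumed) 70% threshold
threshold = 0.7
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()
print(to_drop)
```

Running `train_data.isnull().mean()` on the real data would give the same kind of per-column summary.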

Survived

```python
sns.countplot(x='Survived', data=train_data)
plt.show()
```

Overall probability of survival: ~38%

Pclass

```python
sns.countplot(x='Pclass', data=train_data)
plt.show()
```

```python
sns.barplot(x='Pclass', y='Survived', data=train_data)
plt.show()
```

1. Passengers in Pclass 1 (Upper class) are more likely to survive.

2. Pclass is a good feature for prediction of survival.

Sex

```python
sns.countplot(x='Sex', data=train_data)
plt.show()
```

```python
sns.barplot(x='Sex', y='Survived', data=train_data)
plt.show()
```

1. Proportion of male and female: ~2/3 vs ~1/3

2. Males are much less likely to survive, with only a ~20% chance of survival. Females have a >70% chance of survival.

3. Obviously, Sex is an important feature to predict survival.
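The bar heights in these plots are group means of Survived, so the same numbers can be read off with a `groupby`. A minimal sketch on toy rows (made up for illustration, not the real data):

```python
import pandas as pd

# Toy rows standing in for train_data (values are made up)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column is the survival rate within each group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)
```

On the real data, `train_data.groupby('Sex')['Survived'].mean()` gives the rates quoted above.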

Age

```python
plt.hist(train_data.Age, edgecolor='black')
plt.xlabel('Age')
plt.ylabel('count')
plt.show()
```

```python
sns.boxplot(x='Survived', y='Age', data=train_data)
plt.show()
```

1. Passengers are mainly aged 20–40.

2. Younger passengers tend to survive.

SibSp

```python
sns.countplot(x='SibSp', data=train_data)
plt.show()
```

```python
sns.barplot(x='SibSp', y='Survived', data=train_data)
plt.show()
```

1. Most passengers travel without siblings/spouses; 1 sibling/spouse is the next most common case.

2. Passengers with 1 sibling/spouse are more likely to survive than those with none.

3. For those with more than 1 sibling/spouse, there is too little data to draw any insight.

Parch

```python
sns.countplot(x='Parch', data=train_data)
plt.show()
```

```python
sns.barplot(x='Parch', y='Survived', data=train_data)
plt.show()
```

1. More than 70% of passengers travel without parents/children.

2. Passengers travelling with parents/children are more likely to survive than those not.

Ticket

```python
train_data.Ticket.head(10)
```

Output:

```
0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
5              330877
6               17463
7              349909
8              347742
9              237736
Name: Ticket, dtype: object
```

Fare

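```python
# Note: distplot is deprecated since seaborn 0.11; sns.histplot(train_data.Fare, kde=True) is the newer equivalent
sns.distplot(train_data.Fare)
plt.show()
```
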
1. The distribution is right-skewed. Outliers are observed.
2. For those who survived, their fares are relatively higher.

Embarked

```python
sns.countplot(x='Embarked', data=train_data)
plt.show()
```

```python
sns.barplot(x='Embarked', y='Survived', data=train_data)
plt.show()
```

1. More than 2/3 of passengers embarked at Port S (Southampton).

2. Passengers embarked at Port C are more likely to survive.

# 3. Imputation of Missing Data/ Outliers

Age

```python
train_data.Name.head(10)
```

Output:

```
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th…
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
```

Let’s extract the titles (Mr./Mrs./Miss/Master) from the names of passengers. This can be done for both train and test datasets.

```python
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
whole_data = pd.concat([train_data, test_data], sort=True)
whole_data['Title'] = whole_data.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
whole_data.Title.value_counts()
```

Output:

```
Mr          757
Miss        260
Mrs         197
Master       61
Dr            8
Rev           8
Col           4
Ms            2
Major         2
Mlle          2
Countess      1
Jonkheer      1
Don           1
Sir           1
Dona          1
Capt          1
Mme           1
Lady          1
Name: Title, dtype: int64
```

The common titles are Mr/Miss/Mrs/Master. Some of the other titles (Ms/Lady/Sir, etc.) can be merged into the common titles. The remaining unclassified titles can be grouped as “Others”.

```python
Common_Title = ['Mr', 'Miss', 'Mrs', 'Master']
whole_data['Title'] = whole_data['Title'].replace(['Ms', 'Mlle', 'Mme'], 'Miss')
whole_data['Title'] = whole_data['Title'].replace(['Lady'], 'Mrs')
whole_data['Title'] = whole_data['Title'].replace(['Sir', 'Rev'], 'Mr')
# Use .loc instead of chained indexing so the assignment reliably hits whole_data
whole_data.loc[~whole_data.Title.isin(Common_Title), 'Title'] = 'Others'
```
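To see what the regular expression and the grouping do on individual names, here is a minimal standard-library sketch. The helper `normalize_title` and the name "Doe, Dr. Jane" are hypothetical; the other names and the title mapping are the ones used above.

```python
import re

# Same pattern as above: a run of letters immediately followed by a period
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')

# Grouping used in this article
GROUPED = {'Ms': 'Miss', 'Mlle': 'Miss', 'Mme': 'Miss',
           'Lady': 'Mrs', 'Sir': 'Mr', 'Rev': 'Mr'}
COMMON = {'Mr', 'Miss', 'Mrs', 'Master'}


def normalize_title(name):
    """Extract the first title in a name and map it to one of five groups."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return 'Others'
    title = GROUPED.get(match.group(1), match.group(1))
    return title if title in COMMON else 'Others'


names = [
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
    'Doe, Dr. Jane',  # hypothetical passenger, to show the fallback group
]
titles = [normalize_title(n) for n in names]
print(titles)
```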

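Let’s look at the relationship between titles and age in the train dataset.

```python
# .copy() avoids SettingWithCopy problems when the slices are modified later
train_data = whole_data[:len(train_data)].copy()
test_data = whole_data[len(train_data):].copy()
sns.boxplot(x='Title', y='Age', data=train_data)
plt.show()
```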

Find the median Age for each title.
(Remark: only the train dataset is used, to avoid data leakage.)

```python
AgeMedian_by_titles = train_data.groupby('Title')['Age'].median()
AgeMedian_by_titles
```

Output:

```
Title
Master     3.5
Miss      21.5
Mr        30.0
Mrs       35.0
Others    47.0
Name: Age, dtype: float64
```

Impute the missing Age values according to the titles.

```python
for title in AgeMedian_by_titles.index:
    # .loc avoids chained assignment, which may silently write to a copy
    train_data.loc[(train_data.Age.isnull()) & (train_data.Title == title), 'Age'] = AgeMedian_by_titles[title]
    test_data.loc[(test_data.Age.isnull()) & (test_data.Title == title), 'Age'] = AgeMedian_by_titles[title]
```
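The same imputation can be written without a loop by mapping each row's title to its median. This is a minimal sketch on toy frames (values made up for illustration), not the code used for the submission:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train_data and test_data (values are made up)
train = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Miss', 'Miss', 'Mr'],
    'Age': [30.0, 40.0, 20.0, np.nan, np.nan],
})
test = pd.DataFrame({
    'Title': ['Miss', 'Mr'],
    'Age': [np.nan, 25.0],
})

# Medians come from the train rows only, then get applied to both frames
medians = train.groupby('Title')['Age'].median()

# map() looks up the median for each row's title; fillna() only touches NaN rows
train['Age'] = train['Age'].fillna(train['Title'].map(medians))
test['Age'] = test['Age'].fillna(test['Title'].map(medians))
print(train)
print(test)
```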

Embarked

For train dataset, there are only 2 missing values. Simply impute the mode.

```python
train_data['Embarked'] = train_data['Embarked'].fillna(train_data.Embarked.mode()[0])
```

Fare

For the test dataset, there is only 1 missing value. Simply impute the median.

```python
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())
```

In the train dataset, there are 3 outliers (fare 512.3292).

```python
train_data.Fare.sort_values(ascending=False).head(5)
```

Output:

```
679    512.3292
258    512.3292
737    512.3292
341    263.0000
438    263.0000
```

Outliers should be handled in order not to distort the distribution and thus make the model more robust.

Outliers can be capped at a maximum value, replaced with the median, or simply removed.

I chose to replace the outliers with the second-highest fare (i.e. 263).

```python
train_data.loc[train_data.Fare>512, 'Fare'] = 263
train_data.Fare.sort_values(ascending=False).head(5)
```

Output:

```
341    263.0
438    263.0
88     263.0
679    263.0
258    263.0
Name: Fare, dtype: float64
```
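For comparison, the three options mentioned above can be sketched side by side on a toy Series. The fares below are made up apart from the 263 cap and the 512.3292 outlier value; only the capping variant corresponds to what I actually did.

```python
import pandas as pd

# Toy fares (made up), with two extreme values at the end
fares = pd.Series([7.25, 8.05, 26.0, 263.0, 512.3292, 512.3292])
cap = 263.0

# Option 1: cap at a maximum value (the approach chosen in this article)
capped = fares.clip(upper=cap)

# Option 2: replace values above the cap with the median of the column
replaced = fares.where(fares <= cap, fares.median())

# Option 3: drop the rows above the cap
removed = fares[fares <= cap]

print(capped.tolist())
print(replaced.tolist())
print(removed.tolist())
```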

Check for missing data.

```python
train_data.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
Age            891 non-null float64
Embarked       891 non-null object
Fare           891 non-null float64
Name           891 non-null object
Parch          891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
SibSp          891 non-null int64
Survived       891 non-null float64
Ticket         891 non-null object
Title          891 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 90.5+ KB
```

```python
test_data.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 12 columns):
Age            418 non-null float64
Embarked       418 non-null object
Fare           418 non-null float64
Name           418 non-null object
Parch          418 non-null int64
PassengerId    418 non-null int64
Pclass         418 non-null int64
Sex            418 non-null object
SibSp          418 non-null int64
Survived       0 non-null float64
Ticket         418 non-null object
Title          418 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 42.5+ KB
```

# 4. Data Transformation

Encode strings as numbers for modelling.

Sex

```python
train_data['Sex_Code'] = train_data['Sex'].map({'female':1, 'male':0}).astype('int')
test_data['Sex_Code'] = test_data['Sex'].map({'female':1, 'male':0}).astype('int')
```

Embarked

```python
train_data['Embarked_Code'] = train_data['Embarked'].map({'S':0, 'C':1, 'Q':2}).astype('int')
test_data['Embarked_Code'] = test_data['Embarked'].map({'S':0, 'C':1, 'Q':2}).astype('int')
```

Group data into bins to make the model more robust and avoid over-fitting.

Age

```python
train_data['AgeBin_5'] = pd.qcut(train_data['Age'], 5)
test_data['AgeBin_5'] = pd.qcut(test_data['Age'], 5)
sns.barplot(x='AgeBin_5', y='Survived', data=train_data)
plt.show()
```

Fare

```python
train_data['FareBin_5'] = pd.qcut(train_data['Fare'], 5)
test_data['FareBin_5'] = pd.qcut(test_data['Fare'], 5)
```

Encode the Age and Fare bins into numbers for modelling.

```python
label = LabelEncoder()
train_data['AgeBin_Code_5'] = label.fit_transform(train_data['AgeBin_5'])
test_data['AgeBin_Code_5'] = label.fit_transform(test_data['AgeBin_5'])

label = LabelEncoder()
train_data['FareBin_Code_5'] = label.fit_transform(train_data['FareBin_5'])
test_data['FareBin_Code_5'] = label.fit_transform(test_data['FareBin_5'])
```
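One thing to be aware of: calling `pd.qcut` separately on train and test gives each dataset its own bin edges, so the same code may mean slightly different fare ranges. A possible alternative, which is not what I did for this submission, is to learn the edges on the train data and reuse them on the test data. A minimal sketch with made-up fares:

```python
import numpy as np
import pandas as pd

# Toy fares (made up for illustration)
train_fare = pd.Series([5.0, 7.0, 8.0, 10.0, 13.0, 20.0, 26.0, 30.0, 60.0, 263.0])
test_fare = pd.Series([3.0, 9.0, 500.0])

# Learn quantile edges on the train data; labels=False returns integer bin codes
train_code, edges = pd.qcut(train_fare, 5, labels=False, retbins=True)

# Open up the outer edges so test values outside the train range still get a bin
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to the test data
test_code = pd.cut(test_fare, bins=edges, labels=False)
print(train_code.tolist())
print(test_code.tolist())
```

With this approach the integer codes are produced directly, so a separate `LabelEncoder` step would not be needed.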

# 5. Feature Creation

Alone

SibSp and Parch both relate to family members. For simplicity, I decided to combine them into a single feature, FamilySize.

```python
train_data['FamilySize'] = train_data.SibSp + train_data.Parch + 1
test_data['FamilySize'] = test_data.SibSp + test_data.Parch + 1
sns.countplot(x='FamilySize', data=train_data)
plt.show()
```

Since FamilySize=1 is dominant, the individual family sizes may not provide sufficient predictive power. I decided to collapse the feature into a binary flag: travelling alone or not.

```python
train_data['Alone'] = train_data.FamilySize.map(lambda x: 1 if x == 1 else 0)
test_data['Alone'] = test_data.FamilySize.map(lambda x: 1 if x == 1 else 0)
sns.countplot(x='Alone', data=train_data)
plt.show()
```

```python
sns.barplot(x='Alone', y='Survived', data=train_data)
plt.show()
```

Passengers travelling alone are less likely to survive (~30% vs ~50%).

Title

Title was created to impute the missing values of Age. It can also be used as a new feature.

```python
sns.countplot(x='Title', data=train_data)
plt.show()
```

```python
sns.barplot(x='Title', y='Survived', data=train_data)
plt.show()
```

Clearly, passengers with the title Mr are much less likely to survive than the others.

Let’s encode the feature for modelling.

```python
train_data['Title_Code'] = train_data.Title.map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Others':4}).astype('int')
test_data['Title_Code'] = test_data.Title.map({'Mr':0, 'Miss':1, 'Mrs':2, 'Master':3, 'Others':4}).astype('int')
```

Connected Survival

In the Titanic movie, those who survived were often in family groups. They helped each other to find a way out. In addition, families usually had children, who were given first priority for the lifeboats, and parents would have accompanied them to take care of them.

To identify family groups, surnames alone are not enough (different families may share the same surname), so let’s also look at Ticket.

```python
train_data[['Name', 'Ticket']].sort_values('Name').head(20)
```

Output:

```
845                              Abbing, Mr. Anthony           C.A. 5547
746                      Abbott, Mr. Rossmore Edward           C.A. 2673
279                 Abbott, Mrs. Stanton (Rosa Hunt)           C.A. 2673
308                              Abelson, Mr. Samuel           P/PP 3381
874            Abelson, Mrs. Samuel (Hannah Wizosky)           P/PP 3381
365                   Adahl, Mr. Mauritz Nils Martin              C 7076
401                                  Adams, Mr. John              341826
40    Ahlin, Mrs. Johan (Johanna Persdotter Larsson)                7546
855                       Aks, Mrs. Sam (Leah Rosen)              392091
207                      Albimona, Mr. Nassef Cassem                2699
810                           Alexander, Mr. William                3474
840                      Alhomaki, Mr. Ilmari Rudolf    SOTON/O2 3101287
210                                   Ali, Mr. Ahmed  SOTON/O.Q. 3101311
784                                 Ali, Mr. William  SOTON/O.Q. 3101312
730                    Allen, Miss. Elisabeth Walton               24160
4                           Allen, Mr. William Henry              373450
305                   Allison, Master. Hudson Trevor              113781
297                     Allison, Miss. Helen Loraine              113781
498  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)              113781
834                           Allum, Mr. Owen George                2223
```

It appears that passengers from the same family share both the surname and the ticket number, while unrelated passengers with the same surname (e.g. Ali, Allen) hold different tickets.

Let’s extract the surnames and ticket numbers and find the duplicates. Passengers from the same family may be split across the train and test datasets, so I decided to do this on the combined data.

```python
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
whole_data = pd.concat([train_data, test_data])
whole_data['Surname'] = whole_data.Name.str.extract(r'([A-Za-z]+),', expand=False)
whole_data['TixPref'] = whole_data.Ticket.str.extract(r'(.*\d)', expand=False)
whole_data['SurTix'] = whole_data['Surname'] + whole_data['TixPref']
# keep=False flags every member of a duplicated surname+ticket group
whole_data['IsFamily'] = whole_data.SurTix.duplicated(keep=False)*1
sns.countplot(x='IsFamily', data=whole_data)
plt.show()
```
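To make the `duplicated(keep=False)` step concrete, here is a minimal sketch on five rows taken from the listing above. For brevity it concatenates the full Ticket string rather than the extracted `TixPref`.

```python
import pandas as pd

# A few Name/Ticket pairs from the sorted listing above
toy = pd.DataFrame({
    'Name': [
        'Abbott, Mr. Rossmore Edward',
        'Abbott, Mrs. Stanton (Rosa Hunt)',
        'Adams, Mr. John',
        'Ali, Mr. Ahmed',
        'Ali, Mr. William',
    ],
    'Ticket': ['C.A. 2673', 'C.A. 2673', '341826',
               'SOTON/O.Q. 3101311', 'SOTON/O.Q. 3101312'],
})

toy['Surname'] = toy.Name.str.extract(r'([A-Za-z]+),', expand=False)
toy['SurTix'] = toy['Surname'] + toy['Ticket']

# keep=False marks every row of a duplicated key, not just the repeats
toy['IsFamily'] = toy.SurTix.duplicated(keep=False).astype(int)
print(toy[['Surname', 'Ticket', 'IsFamily']])
```

The two Abbotts are flagged as a family, while the two Alis are not because their tickets differ.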

Around 1/3 of the passengers are travelling with families.

Next, let’s find the families with children: list the unique ‘SurTix’ values that belong to a family and include at least one child (aged 16 or under).

```python
whole_data['Child'] = whole_data.Age.map(lambda x: 1 if x <=16 else 0)
UniqueFamilyTixWithChild = whole_data[(whole_data.IsFamily==1)&(whole_data.Child==1)]['SurTix'].unique()
len(UniqueFamilyTixWithChild)
```

Output:

```
66
```

There are 66 families which have 1 or more children.

Assign an ID to each family with children (all other passengers get 0).

```python
whole_data['FamilyId'] = 0
x = 1
for tix in UniqueFamilyTixWithChild:
    whole_data.loc[whole_data.SurTix==tix, ['FamilyId']] = x
    x += 1
```

Let’s look at the survival data of each family with children.

```python
# Unknown outcomes (test-set passengers) are marked with 9, for plotting only.
# A standalone Series is used so that no copy of Survived ends up among the features.
survived_demo = whole_data['Survived'].fillna(9)
pd.crosstab(whole_data.FamilyId, survived_demo).drop([0]).plot(kind='bar', stacked=True, color=['black','g','grey'])
plt.show()
```

The families usually either all survived (all green) or all perished (all black). This finding supports the idea of connected survival. For each of the families above, if at least one member survived, we assume the others survived too.

```python
whole_data['ConnectedSurvival'] = 0.5
Survived_by_FamilyId = whole_data.groupby('FamilyId').Survived.sum()
for i in range(1, len(UniqueFamilyTixWithChild)+1):
    if Survived_by_FamilyId[i] >= 1:
        whole_data.loc[whole_data.FamilyId==i, ['ConnectedSurvival']] = 1
    elif Survived_by_FamilyId[i] == 0:
        whole_data.loc[whole_data.FamilyId==i, ['ConnectedSurvival']] = 0

train_data = whole_data[:len(train_data)]
test_data = whole_data[len(train_data):]
sns.barplot(x='ConnectedSurvival', y='Survived', data=train_data)
plt.show()
```
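The rule is easier to follow on a tiny example. The sketch below uses made-up rows: FamilyId 0 means "not in a family with children", and NaN in Survived stands for a test-set passenger whose outcome is unknown.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): two families plus two passengers without a family ID
toy = pd.DataFrame({
    'FamilyId': [0, 1, 1, 2, 2, 0],
    'Survived': [1, 1, np.nan, 0, np.nan, np.nan],
})

# Default of 0.5 means "no family information"
toy['ConnectedSurvival'] = 0.5

# sum() skips NaN, so it counts the known survivors in each family
survivors = toy.groupby('FamilyId')['Survived'].sum()

for family_id, n_survivors in survivors.items():
    if family_id == 0:
        continue  # passengers outside families with children keep 0.5
    flag = 1 if n_survivors >= 1 else 0
    toy.loc[toy.FamilyId == family_id, 'ConnectedSurvival'] = flag

print(toy)
```

Note that a family with no known survivors gets 0 even when all of its outcomes are unknown, which mirrors the behaviour of the loop above.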

The probability of survival is much higher for passengers who are:

1. Travelling with family members
2. In a family with 1 or more children
3. In a family with 1 or more survivors

# 6. Feature Selection

```python
train_data.columns
```

Output:

```
Index(['Age', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId', 'Pclass',
       'Sex', 'SibSp', 'Survived', 'Ticket', 'Title', 'Sex_Code', 'AgeBin_5',
       'FareBin_5', 'AgeBin_Code_5', 'FareBin_Code_5', 'FamilySize', 'Alone',
       'Title_Code', 'Surname', 'TixPref', 'SurTix', 'IsFamily', 'Child',
       'FamilyId', 'ConnectedSurvival'],
      dtype='object')
```

First, drop the unused columns.

```python
X_train = train_data.drop(['Age', 'Embarked', 'Fare', 'Name', 'Parch', 'PassengerId', 'Sex', 'SibSp', 'Survived', 'Ticket', 'Title', 'AgeBin_5', 'FareBin_5', 'FamilySize', 'Surname', 'TixPref', 'SurTix', 'IsFamily', 'Child', 'FamilyId'], axis=1)
y_train = train_data['Survived']
```

Use RandomForestClassifier as the model.

```python
model = RandomForestClassifier(n_estimators=200, random_state=2)
```

Let’s look at the feature importance.

```python
model.fit(X_train,y_train)
importance = pd.DataFrame({'feature':X_train.columns, 'importance': np.round(model.feature_importances_,3)})
importance = importance.sort_values('importance', ascending=False).set_index('feature')
importance.plot(kind='bar', rot=0)
plt.show()
```

Choose the top 5 features for modelling (i.e. Title_Code, Sex_Code, ConnectedSurvival, Pclass and FareBin_Code_5). Keeping the number of features small helps to avoid over-fitting.

```python
final = ['Title_Code', 'Sex_Code', 'ConnectedSurvival', 'Pclass', 'FareBin_Code_5']
```
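For readers new to `feature_importances_`, here is a minimal sketch on synthetic data, where only the first of three random features determines the label. The data, seed and variable names are assumptions for illustration and have nothing to do with the Titanic features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: three uniform random features, label depends only on feature 0
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised so that they sum to 1
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
print(importances.round(3), ranking)
```

The informative feature comes out on top, which is the same reading I applied to the bar chart above.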

# 7. Model

Tune model parameters.

```python
grid_param = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5],
    'bootstrap': [True, False],
}
gd_sr = GridSearchCV(estimator=model, param_grid=grid_param, scoring='accuracy', cv=5, n_jobs=-1)
gd_sr.fit(X_train[final], y_train)
best_parameters = gd_sr.best_params_
print(best_parameters)
```

Output:

```
{'bootstrap': True, 'criterion': 'entropy', 'min_samples_leaf': 5, 'min_samples_split': 2, 'n_estimators': 300}
```
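It helps to know how much work a grid search is before starting it. The sketch below enumerates the same grid with the standard library and counts the model fits for 5-fold cross-validation; the variable names are my own.

```python
from itertools import product

grid_param = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5],
    'bootstrap': [True, False],
}

# Every combination of one value per parameter is one candidate model
combinations = [dict(zip(grid_param, values))
                for values in product(*grid_param.values())]

n_candidates = len(combinations)  # 3 * 2 * 3 * 2 * 2
n_fits = n_candidates * 5         # each candidate is fitted once per fold
print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV adds one more fit of the best candidate on the full training data.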

Set the model parameters after tuning.

```python
model = RandomForestClassifier(n_estimators=300, bootstrap=True, criterion='entropy', min_samples_leaf=5, min_samples_split=2, random_state=2)
```

Calculate the accuracy of prediction using 5-fold cross-validation.

```python
# The tuned estimator is named model (model1 was never defined)
all_accuracies = cross_val_score(estimator=model, X=X_train, y=y_train, cv=5)
print(all_accuracies)
print(all_accuracies.mean())
```

Output:

```
[0.86592179 0.84357542 0.83707865 0.80898876 0.88700565]
0.8485140544303522
```

Accuracy mean is 0.8485
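The mean alone hides how much the folds disagree. As a quick check, the five fold scores reported above can be summarised with the standard library:

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output above
fold_scores = [0.86592179, 0.84357542, 0.83707865, 0.80898876, 0.88700565]

cv_mean = mean(fold_scores)
cv_std = stdev(fold_scores)  # sample standard deviation across folds
cv_range = max(fold_scores) - min(fold_scores)
print(round(cv_mean, 4), round(cv_std, 4), round(cv_range, 4))
```

The gap between the best and worst fold is several percentage points, so small differences in the mean should not be over-interpreted.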

# 8. Submission

```python
X_test = test_data[final]
model.fit(X_train[final],y_train)
prediction = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': prediction.astype(int)})
output.to_csv('my_submission.csv', index=False)
```
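Before uploading, it can be worth checking the file against the expected format (a PassengerId column and a Survived column containing only 0 or 1). Below is a minimal sketch of such a check; the helper `submission_problems` and the three-row toy frame are hypothetical.

```python
import pandas as pd


def submission_problems(sub, expected_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(sub.columns) != ['PassengerId', 'Survived']:
        problems.append('columns must be PassengerId, Survived')
        return problems
    if sorted(sub.PassengerId) != sorted(expected_ids):
        problems.append('PassengerId values do not match the test set')
    if not sub.Survived.isin([0, 1]).all():
        problems.append('Survived must contain only 0 or 1')
    return problems


# Toy submission (made up) with three passengers
toy_ids = [892, 893, 894]
toy_sub = pd.DataFrame({'PassengerId': toy_ids, 'Survived': [0, 1, 0]})

ok = submission_problems(toy_sub, expected_ids=toy_ids)
bad = submission_problems(toy_sub.assign(Survived=[0, 2, 0]), expected_ids=toy_ids)
print(ok, bad)
```

For the real file, the call would be something like `submission_problems(output, test_data.PassengerId)`.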

Kaggle score is 0.82296 (Top 3%)

Thank You!

Written by

## Analytics Vidhya

#### Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
