Predicting Survivors in Titanic (Kaggle)

josephwidagdo
6 min read · Jan 14, 2024


Introduction

Like many others, I started my machine learning journey with the well-known Titanic challenge on Kaggle. As a beginner, I should clarify that this article does not intend to offer recommendations or best practices. Rather, it documents my learning progression, and I invite readers to contribute their feedback.

Exploratory Data Analysis

There are 891 rows in train.csv. The ‘Age’ column has 177 null entries, the ‘Cabin’ column has 687, and the ‘Embarked’ column has 2.

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

In the train data, 342 survived and the other 549 did not.
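
These figures come from simple aggregations over the train data; here is a minimal sketch, assuming train.csv is in the working directory:

import pandas as pd

train_data = pd.read_csv('train.csv')

# Survivor counts, and survival rates by class and by sex
print(train_data['Survived'].value_counts())
print(train_data.groupby('Pclass')['Survived'].mean())
print(train_data.groupby('Sex')['Survived'].mean())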

Class 3 dominates the train data but has the lowest survival rate; Class 1 has the highest, at 63%.

Although males make up 65% of the train data, more than two-thirds of the surviving passengers are female. The survival rate for females, at 75%, is significantly higher than the 18% for males.

The train data primarily consists of passengers aged between 20 and 40. Every age group has a survival rate below 50%, except children under 10, who are a little above 50%.
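
The age groups can be examined with binned survival rates; the sketch below assumes 10-year bins, which may differ from the exact bins used:

# Survival rate per 10-year age bin (the bin edges here are an assumption)
age_bins = pd.cut(train_data['Age'], bins=range(0, 81, 10))
print(train_data.groupby(age_bins, observed=False)['Survived'].mean())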

Most passengers (around 80%) travelled alone or with one family member. There is no clear pattern between survival rate and the number of family members travelling together on the Titanic.

More than 70% of the passengers in the train data embarked from Southampton. Notably, more than 50% of the passengers who embarked from Cherbourg survived.

More than 80% of the passengers in the train data paid less than $47. Despite being the largest fare group, it has the lowest survival rate, below 40%; every other fare group is above 50%. However, it is difficult to conclude that fare has a strong influence on survival, given the large differences in group sizes.

The Cabin column has many missing values. The cabin group is the first letter of the Cabin value. Most Class 2 and Class 3 passengers do not have their cabin recorded, so it is hard to tell whether cabin groups A, B, and C were dedicated to Class 1 passengers, or whether lower-class passengers occupied those cabins without being recorded.
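
A quick way to see how the recorded cabin groups relate to passenger class:

# Deck letter (first character of Cabin) vs. passenger class; rows with a
# missing Cabin are excluded from the cross-tabulation
cabin_group = train_data['Cabin'].str[0]
print(pd.crosstab(cabin_group, train_data['Pclass']))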

Handling Missing Data

‘Embarked’ column (2 missing entries)

These are the rows with a missing ‘Embarked’ value. They share the same Ticket, Fare, and Cabin values; however, they do not appear to be family, as both SibSp and Parch are zero.
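
They can be pulled up with a simple filter:

# Show the two rows whose 'Embarked' value is missing
print(train_data[train_data['Embarked'].isnull()])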

Since about 60% of Class 1 passengers in the train data embarked from Southampton, it is safest to assume these two passengers also embarked from Southampton.

# Count first-class passengers by port of embarkation
train_data_firstclass = train_data[train_data['Pclass'] == 1]
print(train_data_firstclass['Embarked'].value_counts())
# S    127
# C     85
# Q      2

# Since ~60% of first-class passengers embarked from S, it's safest to assume
# passengers 62 and 830 also embarked from S
train_data['Embarked'] = train_data['Embarked'].fillna('S')

‘Age’ column (177 missing entries)

One way to fill in missing Age values is to take the average age of other passengers with the same salutation. A new Salutation column is therefore extracted from the Name column.

# Inspired from https://towardsdatascience.com/a-beginners-guide-to-kaggle-s-titanic-problem-3193cb56f6ca
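# Extract the salutation (e.g. 'Mr', 'Mrs', 'Master') between the comma and the
# period in each name, then fill missing ages with the per-salutation mean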
train_data['Salutation'] = train_data.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())
train_data['Age'] = train_data.groupby('Salutation')['Age'].transform(lambda x: x.fillna(x.mean()))

‘Cabin’ column (687 missing entries)

Since there are too many missing values, it is wiser to create a new category indicating a missing value than to impute one.

train_data['Cabin'] = train_data['Cabin'].fillna('Na')

Feature Engineering

Drop columns: Name, Ticket, PassengerId

Drop three columns that do not seem to have any relationship with survival. (Cabin and Salutation are dropped as well, once the features derived from them have been extracted.)

Add columns: Family -> SibSp + Parch

Combine SibSp and Parch columns into a new column called Family.

Convert to numeric values: Sex, Embarked, CabinGroup

Convert the remaining non-numeric columns to numeric ones with get_dummies.

train_data_clean = train_data.drop(['Name', 'Ticket', 'PassengerId', 'Cabin', 'Salutation'], axis=1)
train_data_clean['Family'] = train_data_clean['SibSp'] + train_data_clean['Parch']

# Split 'Cabin' into a deck letter (CabinGroup) and a cabin number (CabinNumber)
train_data_clean['CabinGroup'] = train_data['Cabin'].str[0]
train_data_clean[['CabinNumber', 'dummy']] = train_data['Cabin'].str.split(" ", n=1, expand=True)
train_data_clean = train_data_clean.drop(['dummy'], axis=1)
train_data_clean['CabinNumber'] = train_data_clean['CabinNumber'].str[1:]
# Missing cabins were filled with 'Na', leaving 'a' after stripping the deck
# letter; single-letter cabins such as 'D' leave an empty string
train_data_clean['CabinNumber'] = train_data_clean['CabinNumber'].replace({'a': 0.0, '': 0.0}).astype(float)
train_data_clean['CabinGroup'] = train_data_clean['CabinGroup'].replace({'N': 'NaN'})
train_data_clean = pd.get_dummies(train_data_clean, drop_first=True)
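
A quick sanity check that the cleaned frame is fully numeric and free of missing values before training:

# Every column should now be numeric or boolean, with no nulls remaining
print(train_data_clean.dtypes)
print(train_data_clean.isnull().sum().sum())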

Learning

There are a number of well-known classification models. Here, five different models are trained with grid search and cross-validation, comparing each model’s best parameters and accuracy score on the train data. As RidgeClassifier yields the best result, it is chosen to predict the test data.
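
The grid-search code itself is not reproduced here, so below is a minimal sketch of the winning pipeline; it also defines the scaler and grid_model_ridge objects used in the next section. The alpha grid and the 5-fold setting are illustrative assumptions, not the notebook’s exact values.

from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X = train_data_clean.drop('Survived', axis=1)
y = train_data_clean['Survived']

# Scale the features; the fitted scaler is reused on the test data later
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)

# Grid search over the regularisation strength with cross-validation
# (illustrative grid; the notebook's exact parameters may differ)
grid_model_ridge = GridSearchCV(RidgeClassifier(),
                                param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},
                                cv=5, scoring='accuracy')
grid_model_ridge.fit(scaled_X, y)
print(grid_model_ridge.best_params_, grid_model_ridge.best_score_)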

Prediction on Test Data

First, we fill in the missing values in the test data with the same methods used for the train data; the test data additionally has one missing Fare value, which is filled with the mean fare of the matching passenger class and port of embarkation. Then, we perform the same feature engineering as on the train data.

# Fill in missing Age using the mean age per salutation computed on the train data
test_data['Salutation'] = test_data.Name.apply(lambda name: name.split(',')[1].split('.')[0].strip())
age_by_salutation = train_data.groupby('Salutation')['Age'].mean()
test_data['Age'] = test_data['Age'].fillna(test_data['Salutation'].map(age_by_salutation))
# Salutations unseen in the train data (e.g. 'Dona') fall back to the overall mean
test_data['Age'] = test_data['Age'].fillna(train_data['Age'].mean())

# Fill in the missing Fare with the mean fare of the matching
# (Pclass, Embarked) group in the train data
fare_by_group = train_data.groupby(['Pclass', 'Embarked'])['Fare'].mean()
missing_fare = test_data['Fare'].isnull()
test_data.loc[missing_fare, 'Fare'] = test_data[missing_fare].apply(
    lambda row: fare_by_group[row['Pclass'], row['Embarked']], axis=1)

# Fill in missing Cabin with the same 'Na' placeholder used for the train data
test_data['Cabin'] = test_data['Cabin'].fillna('Na')

test_data_clean = test_data.drop(['Name', 'PassengerId', 'Ticket', 'Cabin', 'Salutation'], axis=1)
test_data_clean['Family'] = test_data_clean['SibSp'] + test_data_clean['Parch']

# Split 'Cabin' into a deck letter and a cabin number, exactly as for the train data
test_data_clean['CabinGroup'] = test_data['Cabin'].str[0]
test_data_clean[['CabinNumber', 'dummy']] = test_data['Cabin'].str.split(" ", n=1, expand=True)
test_data_clean = test_data_clean.drop(['dummy'], axis=1)
test_data_clean['CabinNumber'] = test_data_clean['CabinNumber'].str[1:]
test_data_clean['CabinNumber'] = test_data_clean['CabinNumber'].replace({'a': 0.0, '': 0.0}).astype(float)
test_data_clean['CabinGroup'] = test_data_clean['CabinGroup'].replace({'N': 'NaN'})
test_data_clean = pd.get_dummies(test_data_clean, drop_first=True)

# Cabin group 'T' appears in the train data but not in the test data;
# add the missing dummy column so the feature sets match
test_data_clean['CabinGroup_T'] = False
# Make prediction with test data
scaled_test_data_clean = scaler.transform(test_data_clean)
predictions = grid_model_ridge.predict(scaled_test_data_clean)

This prediction yields a 77.5% accuracy score on the test data, noticeably lower than the 87.8% achieved on the train data, which suggests some overfitting.

# Submit
output = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Conclusion

There are still a number of potential improvements to this model, for example in feature engineering (columns to remove, add, or combine) and in the training setup (train-validation split size, number of folds, hyperparameters). Any feedback is appreciated.

Source Code

The complete code can be found in this GitHub repository under ‘Titanic Kaggle.ipynb’.
