Uncovering Titanic Survival Patterns: Data-Driven Insights for Passenger Survival

Shivang Kumar
Dec 15, 2023

The “unsinkable” Titanic sank in April 1912, and more than 1,500 people lost their lives in the icy Atlantic. Beyond the majestic halls, a hidden story lives within the data: a tale of survival told in numbers. Join us as we break the code and reveal the unexpected factors that drove passengers towards life or the frigid depths. Prepare to delve into the facts and rebuild the Titanic’s story, one life at a time.

Dataset Description

The dataset used in this analysis contains detailed profiles of the passengers aboard the Titanic and comes from the Awesome Public Datasets repository of open datasets. It has the following columns:
* PassengerId: A unique identifier assigned to each passenger, aiding in individual distinction.
* Survived: A crucial binary variable denoting survival (0 = No, 1 = Yes) during the tragedy.
* Pclass: The class of travel (1st, 2nd, or 3rd), which serves as a proxy for socio-economic status.
* Name: The names of passengers encompassing diverse backgrounds and identities.
* Sex: Gender of the passenger, offering insights into the demographic composition.
* Age: The age of passengers at the time of embarkation.
* SibSp & Parch: These variables signify family connections, detailing the number of siblings/spouses and parents/children aboard.
* Ticket & Fare: Ticket specifics and fare paid reflect varying economic capacities among passengers.
* Cabin: Cabin information showcasing accommodation details for passengers.
* Embarked: Port of embarkation, that is, where the passenger boarded the ship (C = Cherbourg, Q = Queenstown, S = Southampton).
In this case, our target variable is the Survived column from the dataset.

Data Cleaning using PredictEasy

We delete the PassengerId column because it is only a unique identifier and carries no information that would help the model. We then check the dataset for missing values and find them in the Age and Cabin columns.
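A minimal sketch of the same two steps outside the spreadsheet, in plain Python. The three sample rows are made up for illustration, and `drop_column` and `missing_counts` are hypothetical helper names, not PredictEasy features.

```python
import csv
import io

# Illustrative sample only; these rows are made up for the example.
SAMPLE_CSV = """PassengerId,Survived,Sex,Age,Cabin
1,0,male,22,
2,1,female,,C85
3,1,female,26,
"""

def drop_column(rows, name):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

def missing_counts(rows):
    """Count blank cells per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_blank = value is None or value.strip() == ""
            counts[column] = counts.get(column, 0) + (1 if is_blank else 0)
    return counts

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rows = drop_column(rows, "PassengerId")
print(missing_counts(rows))
```

On the sample above, only the Age and Cabin columns report blanks, which mirrors what we see in the full dataset.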

We then create a combined categorical label for gender and age group, marking blank ages as ‘Unknown’. This categorization helps us analyse and segment the data by age and gender demographics.

Excel formula to categorize
  • IF(ISBLANK(D2), …): This section checks if the ‘Age’ column (column D) is blank. If it’s blank, it assigns an ‘Unknown’ category based on gender.
  • IF(C2="male", …): If the ‘Age’ column is not blank, the formula branches on gender (‘male’ or ‘female’).
  • For males:
    It then categorizes based on age ranges:
    0–12 years old: “Male_Child”
    13–18 years old: “Male_Teen”
    19–30 years old: “Male_Young Adult”
    Other ages: “Male_Other”
  • For females:
    Similar categorization based on age ranges:
    0–12 years old: “Female_Child”
    13–18 years old: “Female_Teen”
    19–30 years old: “Female_Young Adult”
    Other ages: “Female_Other”
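The branching described above can be sketched as a small Python function. This is an illustration, not the Excel formula itself: the function name `sex_age_group` and the exact spelling of the ‘Unknown’ labels (`Male_Unknown`, `Female_Unknown`) are assumptions, and so is the handling of fractional ages between the stated ranges (for example, 12.5 falls into the teen group here).

```python
def sex_age_group(sex, age):
    """Return a combined label such as 'Male_Child'.

    `age` is a number, or None when the Age cell is blank.
    """
    prefix = "Male" if sex.strip().lower() == "male" else "Female"
    if age is None:
        # Assumed label format for blank ages.
        return f"{prefix}_Unknown"
    if age <= 12:
        return f"{prefix}_Child"
    if age <= 18:
        return f"{prefix}_Teen"
    if age <= 30:
        return f"{prefix}_Young Adult"
    return f"{prefix}_Other"

for sex, age in [("male", 10), ("female", 18), ("female", 25), ("male", None)]:
    print(sex, age, "->", sex_age_group(sex, age))
```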

After this, we convert these categorical gender and age-group labels into numbers using PredictEasy’s label-encoding feature.

Label Encode by PredictEasy

To do this, select the column you want to encode and then choose the output column (L2 in our sheet). If the column has a header, tick the header checkbox. Then click the Encode button, and PredictEasy encodes the values automatically.

Encoded values by PredictEasy
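For readers who want to see what label encoding does, here is a minimal sketch in plain Python. How PredictEasy assigns its codes is not documented in this post, so the scheme below (integer codes in sorted label order) is an assumption, and `label_encode` is a hypothetical helper name.

```python
def label_encode(values):
    """Map each distinct label to an integer code.

    Codes follow the sorted order of the labels; this ordering is an
    assumption and may differ from what PredictEasy produces.
    """
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

labels = ["Male_Child", "Female_Teen", "Male_Child", "Female_Other"]
codes, mapping = label_encode(labels)
print(codes)
print(mapping)
```

Keeping the mapping alongside the codes makes it possible to translate the numbers back into readable labels when interpreting the model.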

We treat the Cabin column in a similar way. Here we fill missing values with “Unknown” and, for all other rows, extract the first character of the ‘Cabin’ cell (column I in our sheet), which represents the deck.

Excel formula to extract cabin detail
  • IF(ISBLANK(I2), ...) checks if the cell in the 'Cabin' column (specifically I2 in this case) is blank.
  • If the cell is blank:
    "Unknown" is returned, implying no cabin information is available.
  • If the cell is not blank:
    LEFT(I2, 1) extracts the first character from the 'Cabin' cell (I2), which typically denotes the deck (e.g., A, B, C, etc.).
  • The LEFT function in Excel extracts a specified number of characters from the start of a text string.
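The same logic as the formula described above can be sketched in Python. The function name `cabin_deck` is hypothetical, and treating a whitespace-only cell as blank is an assumption that goes slightly beyond what ISBLANK checks in Excel.

```python
def cabin_deck(cabin):
    """Return the deck letter of a cabin value, or 'Unknown' when it is blank."""
    if cabin is None or cabin.strip() == "":
        return "Unknown"
    # Equivalent of LEFT(cell, 1): keep only the first character.
    return cabin.strip()[0]

for value in ["C85", "B57 B59", "", None]:
    print(repr(value), "->", cabin_deck(value))
```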

After this, we encode the new_cabin column with the same PredictEasy label-encoding feature.

Next, we encode the dataset’s ‘Ticket’ and ‘Embarked’ columns. The ‘Ticket’ column helped uncover hidden patterns: once the alphanumeric ticket values were converted into numbers, clusters of passengers sharing the same ticket number became visible.
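One way to make shared tickets explicit is to count how many passengers hold each ticket value. This is a sketch of that idea rather than a step we ran in PredictEasy; `ticket_group_sizes` is a hypothetical helper, and the ticket strings are made-up examples.

```python
from collections import Counter

def ticket_group_sizes(tickets):
    """For each passenger, return how many passengers share their ticket."""
    counts = Counter(tickets)
    return [counts[t] for t in tickets]

# Made-up ticket values for illustration.
tickets = ["A100", "B200", "A100", "C300", "A100"]
print(ticket_group_sizes(tickets))
```

A group size above 1 marks passengers who travelled on the same ticket, which is the kind of cluster the encoded column exposes.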

We also combine the ‘SibSp’ and ‘Parch’ columns into a single value that gives a comprehensive view of each passenger’s family size on the Titanic.
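A minimal sketch of this combination in Python, assuming family size is the plain sum of the two columns. Whether the passenger should also be counted (a common variant that adds 1) is not stated here, so it is exposed as an optional, hypothetical `include_self` parameter.

```python
def family_size(sibsp, parch, include_self=False):
    """Combine SibSp and Parch into one family-size value.

    include_self=True adds the passenger to the count; which variant
    to use is an assumption left to the analyst.
    """
    return sibsp + parch + (1 if include_self else 0)

print(family_size(1, 0))                     # one sibling or spouse aboard
print(family_size(1, 2, include_self=True))  # family of four, passenger included
```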

PredictEasy Analysis

After preprocessing comes the analysis. We follow the same approach as in the previous blog (Analyzing Direct Marketing Campaigns for Term Deposits in a Portuguese Banking Dataset) and use all the variables in the dataset to predict the target variable (Survived).

Feature Rank by PredictEasy
SHAP Plot by PredictEasy

These two graphs show how to improve the model by removing variables that contribute little: the ‘Embarked’ column, which has a low percentage in the feature rank, and the ‘Fare’ column, which has a low feature value in the SHAP plot. We also remove the ‘Name’ column because the analysis does not depend on individual names; it focuses on other factors such as demographics, family size, and ticket-related features. Removing ‘Name’ simplifies the dataset and cuts unnecessary information.

Fine-Tuning

After removing the unimportant features, we ran PredictEasy again, and this time we got the following:

Feature Rank by PredictEasy

We then removed the ‘Ticket’ column because of its low feature rank and ran PredictEasy once more. This gave us our final model, and its results are shown below:

PredictEasy Observations

Actionable Insights by PredictEasy

PredictEasy also reports the factors that affect survival in the dataset. From this, we learn that “sex_age_group” is the most crucial factor, which underlines the importance of prioritizing women and children in evacuation procedures (women and children may be physically more vulnerable than adult males because of differences in physical strength).

PredictEasy Recommendations

Actionable Insights by PredictEasy

Potential idea for future models:

  • Exploration of Name-Based Historical Insights: In this analysis, we did not use the ‘Name’ column to explore historical contexts or social connections. Future work could trace family ties, identify influential individuals, or examine social hierarchies aboard the Titanic for broader historical insights.

Further Reading

Decoding the Endgame: Navigating Tic-Tac-Toe’s Final Moves by Elsa
Predicting students’ dropout and academic success by Priya Shahari
