Exploring the Hidden Treasures of Titanic using Pandas and Seaborn
In the early 1900s, two of the most prominent shipping lines of the time, White Star and Cunard, were competing to transport wealthy travelers across the Atlantic Ocean. Cunard fired the first shot in the fight for market share by introducing two new ships, Lusitania and Mauretania, which later became known for crossing the Atlantic in record times. In response, White Star decided to build three ships that would be known among its customers for their comfort rather than their speed. This is a brilliant example of understanding your customers' needs and your competitors' weaknesses and catering to those specifics, rather than turning it into a pissing contest by trying to build faster ships to beat Cunard. Titanic was one of the three ships White Star built during this time to challenge Cunard’s market position.
On April 10, 1912, the Titanic set sail on its maiden voyage from Southampton, England, to New York, with two stops in between at Cherbourg, France, and Queenstown, Ireland. After sideswiping an iceberg off the coast of Newfoundland in the North Atlantic, the supposedly “unsinkable” Titanic sank to the bottom of the sea in the early hours of April 15, 1912.
The Kaggle Titanic training data set contains data on 891 passengers who took part in this magnificent yet daunting voyage. Below is a snapshot of the dataset with the initial variables that are provided at the passenger level. ‘SibSp’ gives the number of siblings or spouses each passenger had on board, and ‘Parch’ gives the number of parents or children. An interesting fact to keep in mind before going into feature engineering is that the data set ignores mistresses and fiancés, so in effect it is assumed that none of the passengers had them on board along with their partners. If you want to know more about the variables, explore the data dictionary provided by Kaggle.
Data Cleaning
Before diving into the data set, some precautionary data cleaning steps were taken to make it easier to work with at the EDA stage of the analysis. The cleaning mainly consisted of converting the abbreviations given for classes and ports of embarkation into words and converting all ages to integer type. The latter was done mainly because the data set contained fractional values for passengers under the age of 1.
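The cleaning steps described above could be sketched roughly as follows. This is only an illustration on a tiny stand-in table: the output column names 'Class' and 'Port' are made up here, and truncating ages with a floor (rather than rounding) is an assumption, since the exact conversion is not spelled out.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the Kaggle file; input column names follow the Kaggle data dictionary.
df = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Embarked": ["C", "S", "Q"],
    "Age": [38.0, 0.75, np.nan],
})

# Spell out the coded values so plots and tables are easier to read.
# "Class" and "Port" are hypothetical column names chosen for this sketch.
df["Class"] = df["Pclass"].map({1: "First", 2: "Second", 3: "Third"})
df["Port"] = df["Embarked"].map(
    {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
)

# Assumption: fractional ages are truncated, so infants under 1 become 0.
# The nullable Int64 dtype keeps missing ages as <NA> instead of forcing a fill value.
df["Age"] = np.floor(df["Age"]).astype("Int64")

print(df)
```

Using the nullable integer type is a design choice in this sketch; filling missing ages before converting would work just as well if a plain int column is preferred.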
Feature Engineering
Age Brackets
Age brackets were introduced into the dataset solely so that age could be explored as a categorical variable. Below is the breakdown of age by categories after the assignments.
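One way to build brackets like these is pandas' cut function. The bin edges and labels in this sketch are hypothetical, picked only to illustrate the idea; they are not the cut-offs used for the breakdown in this article.

```python
import pandas as pd

# A few made-up ages for illustration.
ages = pd.Series([0, 9, 15, 24, 40, 71], name="Age")

# Hypothetical bracket edges and labels (assumptions, not the article's values).
bins = [0, 13, 20, 36, 61, 200]
labels = ["Child", "Teen", "Young Adult", "Adult", "Older Adult"]

# right=False makes each interval closed on the left: [0, 13), [13, 20), ...
age_bracket = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count passengers per bracket, keeping the bracket order.
print(age_bracket.value_counts(sort=False))
```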
Cabin Type
A closer examination of the cabin IDs provided for each passenger reveals a pattern in which a letter precedes a number. This leads us to believe that there were multiple types of cabins on the ship and that each initial letter represents a different type of cabin. Below is the breakdown of cabins after assignment.
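Pulling the leading letter out of a cabin ID takes one line in pandas. A minimal sketch, assuming missing cabin IDs should be labelled with a placeholder ("Unknown" is a name made up for this example):

```python
import pandas as pd

# Sample cabin IDs in the style of the 'Cabin' column; None marks a missing ID.
cabins = pd.Series(["C85", None, "E46", "B57 B59 B63", "G6"], name="Cabin")

# The first character of the ID is taken as the cabin type.
# Assumption: passengers with no cabin ID are labelled "Unknown".
cabin_type = cabins.str[0].fillna("Unknown")

print(cabin_type.value_counts())
```

Passengers listed with several cabins get the letter of the first one in this sketch.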
Marital Status
Even though the data set doesn’t provide the marital status of each passenger, the titles provided in the name, combined with a few reasonable rules and assumptions, were helpful in assigning each passenger a binary (“Single” or “Married”) marital status. The rules and assumptions made in the process can be seen below. In addition to the titles given in the name, we also consider every passenger under the age of 18 to be not married (even in cases where the title said “Mr.” or “Mrs.”).
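To give a feel for how a title-based rule can be coded, here is a rough sketch. The set of "married" titles and the column names 'Title' and 'MaritalStatus' are assumptions made for this example. Titles outside the set, including "Mr", simply default to "Single" here, which the actual rules may well handle differently. The under-18 override is the one described above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Doe, Mrs. Jane",  # made-up row: a "Mrs." younger than 18
    ],
    "Age": [22, 38, 26, 16],
})

# Names look like "Surname, Title. Given names", so the title is the text
# between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical rule: only these titles count as married.
MARRIED_TITLES = {"Mrs", "Mme"}

# Anyone under 18 is treated as single regardless of title.
# A missing age fails the >= 18 check, so it also ends up "Single" here.
is_married = df["Title"].isin(MARRIED_TITLES) & (df["Age"] >= 18)
df["MaritalStatus"] = is_married.map({True: "Married", False: "Single"})

print(df[["Title", "Age", "MaritalStatus"]])
```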
Spouses & Siblings OB
As mentioned earlier, a passenger is assumed to have only one partner throughout the journey. This means that whenever a married passenger had a value greater than 1 under ‘SibSp’, the passenger had both a partner and one or more siblings on board. And if a single passenger had a value higher than 0 for ‘SibSp’, that value directly gives the number of siblings on board. Spouses OB is set to zero when ‘SibSp’ is zero, regardless of marital status. Siblings OB is set to zero when ‘SibSp’ is zero and also when a passenger is married and ‘SibSp’ is one.
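These rules translate into a couple of vectorised operations. The sketch below assumes a 'MaritalStatus' column already exists (a hypothetical name) and reads the rules as: a married passenger with at least one 'SibSp' has exactly one spouse on board, and whatever remains of 'SibSp' counts as siblings. The output column names are also made up.

```python
import pandas as pd

# Made-up rows covering each case described in the rules.
df = pd.DataFrame({
    "MaritalStatus": ["Married", "Married", "Married", "Single", "Single"],
    "SibSp": [0, 1, 3, 0, 2],
})

married = df["MaritalStatus"].eq("Married")

# A spouse is counted only for married passengers with SibSp of at least 1.
df["SpousesOB"] = (married & df["SibSp"].ge(1)).astype(int)

# Everything in SibSp that is not the spouse is a sibling.
df["SiblingsOB"] = df["SibSp"] - df["SpousesOB"]

print(df)
```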
Parents and Children OB
Compared with the assignment of spouses and siblings on board, assigning the number of parents and children on board was fairly simple thanks to two main assumptions. In the case of parents, we assumed that a married passenger wouldn’t bring their parents along on a cruise (only because it’s really uncool), so the number given in ‘Parch’ is simply the number of that passenger's children on board. A similar assumption was made in the case of children: single passengers were assumed to have no children on board, so for them ‘Parch’ gives the number of parents on board.
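Under those two assumptions the split of 'Parch' is mechanical. A minimal sketch, again assuming a hypothetical 'MaritalStatus' column and made-up output column names:

```python
import pandas as pd

# Made-up rows: one married passenger and two single passengers.
df = pd.DataFrame({
    "MaritalStatus": ["Married", "Single", "Single"],
    "Parch": [2, 1, 0],
})

married = df["MaritalStatus"].eq("Married")

# Married passengers: Parch counts children, so parents on board is 0.
# Single passengers: Parch counts parents, so children on board is 0.
df["ChildrenOB"] = df["Parch"].where(married, 0)
df["ParentsOB"] = df["Parch"].where(~married, 0)

print(df)
```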
After the above data cleaning and feature engineering, we are left with a comprehensive data set to work with in our exploratory data analysis. If you are keen to learn more about the feature engineering that was carried out above, check out this video of myself explaining the process.
Exploratory Data Analysis
Fare
When we observe the distribution of ticket fare across age, class, marital status, and port of embarkation, we see that in most cases female passengers were charged more than male passengers. For female passengers, the fare increases with age, whereas for male passengers a slight drop can be seen among teens and young adults. The fare is distributed across classes as expected, with first class passengers charged the highest, while little to no difference can be seen between married and single passengers. Port of embarkation mapped against fare suggests that passengers boarding in France were charged the most and those boarding in Ireland were charged the least.
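A comparison like this boils down to aggregating fare by category and sex. The sketch below uses made-up rows and the mean as the summary statistic, which is an assumption; a bar plot of the same grouping (for example with Seaborn) would show the same numbers graphically.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Kaggle data.
df = pd.DataFrame({
    "Sex": ["female", "male", "female", "male"],
    "Class": ["First", "First", "Third", "Third"],
    "Fare": [80.0, 60.0, 10.0, 8.0],
})

# Mean fare per class, with one column per sex for side-by-side comparison.
mean_fare = df.groupby(["Class", "Sex"])["Fare"].mean().unstack("Sex")

print(mean_fare)
```

Swapping "Class" for an age bracket, marital status, or port column gives the other breakdowns.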
Survival
The two graphs above show that a passenger's probability of survival is negatively correlated with age and positively correlated with fare. In simpler terms, if you were an old and poor passenger on the Titanic, your chances of getting into a lifeboat or holding on to a door for dear life were pretty slim.
Female passengers had a higher chance of survival than male passengers across all categories. The chance of survival increased with age for women, while among men, teens and older adults showed the lowest chance of survival. Across classes, passengers in first class showed the highest chance of survival. When chance of survival was mapped against marital status, married men scored the lowest and married women scored the highest.
The joint plots above show that passengers who came on board with a spouse had a higher chance of survival. The chance of survival increased with the number of siblings on board up to two and then dropped drastically. The survival rate also fell as the number of children or parents on board increased.
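The direction of these relationships can also be checked numerically. The sketch below uses made-up rows and Pearson correlation against the 0/1 'Survived' column; this is one possible check, not necessarily how the graphs were produced.

```python
import pandas as pd

# Made-up rows for illustration: older passengers paid less and did not survive.
df = pd.DataFrame({
    "Age": [5, 20, 35, 50, 65, 80],
    "Fare": [80.0, 60.0, 50.0, 30.0, 20.0, 10.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of each numeric feature with the binary outcome.
corr = df[["Age", "Fare"]].corrwith(df["Survived"])

print(corr)
```

On the real data the correlations would be far weaker than in this toy table; only the signs are the point here.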
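Because 'Survived' is coded as 0 or 1, the mean of that column within a group is the group's survival rate. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the Kaggle data.
df = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Mean of a 0/1 column is the share of ones, i.e. the survival rate per group.
survival_rate = df.groupby("Sex")["Survived"].mean()

print(survival_rate)
```

Grouping by two columns at once, such as sex and class, gives the cross-category comparisons discussed here.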
Now that we have explored the data after numerous steps of data cleaning and feature engineering, our next step would be to train a machine learning model on this data and predict a passenger’s chance of survival given new input data. Last but not least, here is a meme created using some of the insights gathered from the data set.