Titanic: Maritime Machine Learning
As a kid I was fascinated by the fateful maiden voyage of the RMS Titanic. I had the book by Dr. Ballard on his dive to the wreckage and made a 3-foot-long model. When the movie came out I was amazed by the accurate detail of the set as much as the story.
Fast forward to, gulp, over 19 years after that movie came out, and I found myself looking at a Data Science competition on Kaggle that uses machine learning to determine who survived and who perished.
For those who always flipped to the last page in the choose-your-own-adventure book, here is my completed submission including all the code.
The data set had numerous variables for each passenger. For some passengers, whether or not the person survived was recorded; for the rest it was missing, and the goal was to use machine learning to figure out those missing Survived values as accurately as possible. The data set included Name, Sex, Age, Class, Fare, Cabin, Ticket, Embarked, Parents and Siblings.
Just a couple of quick terms if you are new to all of this:
Data Science is another name for Data Analysis but for much larger sets of data. In this case we are using machine learning algorithms to determine missing information, but that is one of many tools that can be used.
Machine Learning is a type of Artificial Intelligence (AI) that can learn by itself with some supervision. The other types of AI include Deep Learning using Neural Nets, which currently requires extremely powerful computers to run. I was able to complete my machine learning submission entirely on my new laptop.
The programming language I used is called Python. It is a general programming language with libraries (think a bunch of code prebuilt for your use) for mathematics, data wrangling and visualization. I used a program called Jupyter, originally called IPython, that allows me to execute the code and add markup language to explain what I am doing in an easy format. The best way to install Python and Jupyter is to use Anaconda, which is a bundle of programs and libraries. I have also used R in RStudio in the past, but I wanted to learn Python for this project.
First one must import the comma-delimited file (.csv) into a pandas data frame, which is a table structured much like an Excel spreadsheet with columns and rows. Pandas is a great library for exploring the data, and a couple of simple commands like df.describe() and df.info() give an overview of what is in there and what is missing.
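As a minimal sketch of that first step, the snippet below loads a few made-up rows shaped like the Kaggle file and runs the two overview commands. The rows and the file name train.csv are assumptions for illustration, not the actual data.

```python
import io

import pandas as pd

# Made-up rows in the shape of the Kaggle file. The real file would be
# loaded with something like pd.read_csv("train.csv") (file name assumed).
csv_text = """PassengerId,Survived,Pclass,Name,Sex,Age,Fare,Embarked
1,0,3,"Doe, Mr. John",male,22,7.25,S
2,1,1,"Roe, Mrs. Jane",female,38,71.28,C
3,1,3,"Poe, Miss. Ann",female,,7.92,S
4,0,2,"Moe, Mr. Sam",male,35,13.00,
"""

df = pd.read_csv(io.StringIO(csv_text))

df.info()                  # column types and non-null counts per column
summary = df.describe()    # count, mean, std, min, quartiles, max (numeric columns)
missing = df.isna().sum()  # number of missing values per column

print(summary)
print(missing)
```

The isna().sum() line is an extra convenience: it shows directly which columns will need filling in.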
There were several variables with missing values, which had to be filled in before any machine learning could be applied. Missing values in Age, Cabin, Fare and Embarked were replaced with the mean (i.e. average) with a couple of simple lines.
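A rough sketch of what those couple of lines could look like is below, using made-up values. One assumption to flag: a mean only exists for numeric columns such as Age and Fare, so for a text column like Embarked this sketch substitutes the most frequent value instead, which may differ from what the submission does.

```python
import numpy as np
import pandas as pd

# Made-up values with a few gaps, for illustration only.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 30.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric columns: replace missing values with the column mean.
for col in ["Age", "Fare"]:
    df[col] = df[col].fillna(df[col].mean())

# Text columns have no mean; the most frequent value is one common
# stand-in (an assumption, not necessarily the author's choice).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df)
```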
Plotting the data is a good next step. In this case it is a scatter plot where blue dots (0) are people who perished and green dots (1) are those who survived:
As is well known, the call for women and children first had an obvious and direct effect on who survived:
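One simple way to put a number on that effect is the survival rate per group. The sketch below uses six made-up passengers, so the rates it prints say nothing about the real data; only the column names Sex and Survived come from the data set.

```python
import pandas as pd

# Six made-up passengers, for illustration only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Survived is 0/1, so its mean per group is the share that survived.
rate_by_sex = df.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```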
An important part of machine learning is Feature Engineering, basically expanding and modifying a data set so that it is easier for the machine learning algorithm to consume. Boolean values (i.e. true/false) are ideal, so part of the process is splitting a column like Class with three values (1, 2, 3) into three columns, one per class, each holding a boolean value (0, 1).
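The class split described above can be sketched with pandas' get_dummies. The column name Pclass and the four rows are assumptions for illustration.

```python
import pandas as pd

# Made-up class values; the column name Pclass is assumed.
df = pd.DataFrame({"Pclass": [1, 3, 2, 3]})

# One 0/1 column per class value: Pclass_1, Pclass_2, Pclass_3.
class_flags = pd.get_dummies(df["Pclass"], prefix="Pclass", dtype=int)

# Swap the original column for the three flag columns.
df = pd.concat([df.drop(columns="Pclass"), class_flags], axis=1)
print(df)
```

Each row ends up with exactly one 1 across the three new columns.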
Additionally, extracting titles like Mr. from the Name helps; as you can see, there is a large correlation with survival:
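A small sketch of pulling the title out of the name follows. It assumes names are written as "Surname, Title. Given names"; the names themselves are made up.

```python
import pandas as pd

# Made-up names in the assumed "Surname, Title. Given names" format.
df = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann", "Moe, Master. Tim"],
})

# Capture the text between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(df)
```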
Likewise, the ticket fares can tell a little bit as well. By separating the range into discrete buckets or bins, it becomes clearer that the more you paid, the more likely you were to survive.
After all of this work you need to look at the correlation between features: the closer to positive 1 or negative 1 (dark red or dark blue), the stronger the relationship. It is telling that this heat map is mostly lightly colored, which implies that correlation is pretty low and accuracy will be low:
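Binning can be sketched with pandas' cut. The bin edges, labels and rows below are all hypothetical, chosen only to show the mechanics.

```python
import pandas as pd

# Made-up fares and outcomes, for illustration only.
df = pd.DataFrame({
    "Fare": [7.25, 8.05, 13.0, 26.0, 71.28, 512.33],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# Hypothetical bin edges and labels.
bins = [0, 10, 30, 100, 600]
labels = ["low", "medium", "high", "very_high"]

# include_lowest=True keeps a fare of exactly 0 inside the first bin.
df["FareBin"] = pd.cut(df["Fare"], bins=bins, labels=labels, include_lowest=True)

# Share of survivors within each fare bin.
rate_by_bin = df.groupby("FareBin", observed=True)["Survived"].mean()
print(rate_by_bin)
```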
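The numbers behind such a heat map are a correlation matrix, which pandas can compute directly. The sketch below uses made-up rows and hypothetical column names.

```python
import pandas as pd

# Made-up, already-numeric features; column names are hypothetical.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex_female": [0, 1, 1, 0, 0, 1],
    "Pclass_1": [0, 1, 0, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
})

# Pearson correlation between every pair of columns, from -1 to 1.
corr = df.corr()

# How strongly each feature moves with Survived, strongest first.
survived_corr = corr["Survived"].drop("Survived").sort_values(key=abs, ascending=False)

print(corr.round(2))
print(survived_corr.round(2))
```

A plotting library can then color this matrix to produce the heat map.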
There are multiple types of machine learning models. In this case I used Random Forest, but SVC, Gradient Boosting, K Neighbors Classifier, Gaussian NB and Logistic Regression are all options. I will devote a separate post to unravelling each of these a bit further.
In order for machine learning to work, the data must be split into a training set and a test set. The training set has all the features, including the value we are trying to solve for (in this case whether they survived). This is what the machine learning algorithm will practice on to fine-tune its decision trees.
One of the dangers of machine learning is overfitting. When the model learns the training data too well, it ends up not generalized enough to accurately predict the missing values. It is like learning that 2+3=5 but never learning how addition works outside of that specific example.
Once you have fine-tuned your model, it is time to throw your test data at it and get your results. In my case I ended up with .08395 accuracy, far from perfect but not horrible given the data.
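A minimal sketch of the split-and-train step with scikit-learn follows. The data is randomly generated with a made-up survival rule, and the settings (a 25% hold-out, 100 trees) are assumptions rather than the ones used in the submission. Because the rows held out here still have known answers, the model can be scored on them.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Randomly generated stand-in data; not the Titanic data set.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Sex_female": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.uniform(1, 70, n),
})

# Made-up rule: women and first class survive, then 10% of labels are flipped.
y = ((X["Sex_female"] == 1) | (X["Pclass"] == 1)).astype(int)
flip = rng.random(n) < 0.1
y = y.where(~flip, 1 - y)

# Hold out a quarter of the labeled rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Share of held-out passengers predicted correctly.
accuracy = model.score(X_test, y_test)
print(round(accuracy, 3))
```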
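Overfitting can be spotted by comparing accuracy on the training rows with accuracy on held-out rows. The sketch below uses a single decision tree rather than a full Random Forest to keep the effect easy to see. The data is randomly generated with deliberately noisy labels, and max_depth=2 is an arbitrary illustrative limit.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Randomly generated stand-in data; not the Titanic data set.
rng = np.random.default_rng(1)
n = 300
sex_female = rng.integers(0, 2, n)
age = rng.uniform(1, 70, n)
X = np.column_stack([sex_female, age])

# Made-up rule: women survive, then 15% of labels are flipped as noise.
y = sex_female.copy()
noise = rng.random(n) < 0.15
y[noise] = 1 - y[noise]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# An unlimited tree can memorize every training row, noise included.
deep = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
# A depth-limited tree is forced to learn only the broad pattern.
shallow = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X_train, y_train)

for name, tree in [("deep", deep), ("shallow", shallow)]:
    print(
        name,
        "train:", round(tree.score(X_train, y_train), 3),
        "test:", round(tree.score(X_test, y_test), 3),
    )
```

A large gap between the train and test scores of the unlimited tree is the telltale sign: it has learned the specific examples rather than the rule.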
Again, here is my submission including all the code if you want to get deep in the weeds.
If you create a business, anyone will tell you that you will spend most of your time on the inglorious stuff like paperwork and taxes. The same is true in Data Science: most of the work is cleaning and feature engineering. Unsurprisingly, the better the data you put in, the more accurate the result you get out.
In truth, this exercise was also an incredible way to get a feeling for just how powerful AI can be. I felt like I was literally touching the tip of a massive iceberg whose implications are both exciting and terrifying at the same time. Given that I remember working on black and green screen computers as a kid, the speed of technology is awe-inspiring, and going from not knowing any Python to making this in three weeks shows how much more accessible it has become.
I am looking forward to the next set of data I am working on: Burlington VT Housing Data. Who knows what I might find…