A Short Practical Introduction to Machine Learning: Predicting Survival on the Titanic!

Thompson Go
Published in Analytics Vidhya · Oct 31, 2020 · 13 min read

Photo by Annie Spratt on Unsplash

Machine Learning can be intimidating. I know this firsthand. The first time I was given an assignment at work involving Machine Learning, it didn’t go well. I dove in without any background on the subject, and my code was all over the place. I thought Machine Learning was just about plugging data into a model and getting your results back. Yes, those steps are involved, but there is so much more to it.

In an effort to gain a background in Machine Learning, or at least some idea of how it is properly done, I’ve spent some time learning more about it. One of the challenges I set for myself was making my first submission to Kaggle. In this article, I will show you how I did it (you can also code along with me if you’d like), and in the process, if you have no idea what Machine Learning is yet, I hope it will introduce you to it. Don’t worry if you don’t get some of the concepts below right away; that’s normal. The point of this article is to give you a basic intro to Machine Learning.

What is Machine Learning?

First, let’s start with what Machine Learning is. Machine Learning is the technique of creating programs that learn from data and come up with their own algorithm (usually called a model) for solving a problem. It differs from usual programming, where you dictate a series of steps for a program to follow. Machine Learning is a more challenging feat, but done well, it can give you broad flexibility and your users a tailored experience.

What is Kaggle?

Kaggle is a great place for exploring Machine Learning. For those of you who have no idea what it is, it is an online community for Machine Learning and Data Science enthusiasts and practitioners alike. You can find thousands of datasets and problems posted on it to practice Machine Learning. They even have online courses you can go through to learn Python and Machine Learning. I highly recommend checking these out if you are thinking of going into Machine Learning.

Problem / Objective

The Titanic dataset is kind of like the Kaggle “Hello World!” project. It is not too complicated, which is perfect for beginners like me, and it is what Kaggle recommends you start with. The objective of the project is to create a model that can predict which passengers survived the Titanic shipwreck. This kind of problem is called “classification” because we will be categorizing whether a person survived or not, as opposed to calculating a numerical value, which is called “regression”. (Okay, I know we could just google who survived the Titanic, but we wouldn’t learn anything by doing that, so we won’t resort to it.)

https://www.kaggle.com/c/titanic

Data Collection

The first step in any Machine Learning project (after the problem is framed) is data collection. This is an integral part of any project; without data there would be nothing to analyze. Sometimes it is even considered the hardest part. Kaggle saves us time and effort by providing the data already. You can head over to their website to download and view the data for yourselves.

Data Exploration

After getting the data, we can start to explore it. Kaggle provides us two files, “train” and “test”, in CSV format. We can check the data out using any spreadsheet tool; I used LibreOffice Calc here. The column names may not seem intuitive when you first read them, so Kaggle provides a definition table for them. The names are kept short to make the columns easier to manipulate later on.

Kaggle Titanic Train Dataset
Kaggle Titanic Dataset Definition Table

Each record in the spreadsheet represents a person who was onboard the Titanic and whether or not that person survived. The difference between the two files is the “Survived” column: the “train” data shows us who survived while the “test” data doesn’t. This column is called the “label”, and it is what we will be predicting later on. The other columns are called “predictors”, and we will use these to make that prediction. Basically, our model will learn by looking for patterns in data with the “answers” on it (the train file), and it will use what it learns to try to guess the answers on data it hasn’t seen yet (the test file).

Now that we have an idea of the dataset, we can dive in further using Python and Jupyter Notebook. Jupyter Notebook provides a working environment for running Python scripts, and you can install it for free with Anaconda. When you open it for the first time, it will look like this.

Jupyter Notebook

We will first import the Pandas module and use it to load the data as a Python object that we can manipulate. These Python objects are called DataFrames.

import pandas as pd

titanic_train = pd.read_csv("train.csv")
titanic_test = pd.read_csv("test.csv")  # Set aside for now; we will come back to it after we have our final model

We can look at the first 2 records of our data by using the following command. It should look the same as in your spreadsheet.

titanic = titanic_train
titanic.head(2)

Output:

First 2 Rows of Dataset

There are several commands to help you dive into the data. Let us use the “info” command.

titanic.info()

Output:

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

From the table, we can see whether there are any missing values and the data type of each column. Notice that we have missing values in the “Age” and “Cabin” columns (and two missing in “Embarked”). We will need to do something about these later on when we prepare the data.
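
If you want an explicit count of the missing values per column, a quick check like this works too (my own addition, not part of the original walkthrough):

# Count missing values per column; "Age", "Cabin", and "Embarked"
# should show non-zero counts.
titanic.isnull().sum()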

If you look at the “Dtype” column, you will see three kinds of values: int64, float64, and object. The int64 columns hold integers and the float64 columns hold decimal numbers; both are numerical. The object columns hold text, which here represents categorical values. When we train our model, all our columns must contain numerical values for the model to understand them. There are several ways to convert categorical values to numerical values. Let’s try it out on the “Sex” column. Currently, its values are either “male” or “female”; we can represent them as 1’s and 0’s instead, which is called binary data. This time we will import another module, Scikit-learn (or sklearn), to do this. Type in the following commands.

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
titanic["Sex"] = lb.fit_transform(titanic["Sex"])

When you look back at your dataset, you should now see 1’s and 0’s instead.
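
As a quick sanity check (my own addition), you can peek at the first few values of the column:

# The "Sex" column should now hold 1's and 0's instead of strings.
titanic["Sex"].head()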

We don’t actually need to use all of the columns to build our model. We can remove some of them, like “PassengerId” and “Name”, because they won’t have any bearing on the model. Converting and removing data are best left to the data preparation part; for now, we are more concerned with studying the data to get an idea of how we will prepare it.

Moving on, we can generate some descriptive statistics using the “describe” command. It will show you the count, mean, standard deviation, minimum, maximum, and 25th, 50th, and 75th percentiles of each column. Generally, this shows you the range of your values and where most of the data falls.

titanic.describe()

Output:

Descriptive Overview of the Data

If you have a solid background in statistics, deriving information from the table is a piece of cake. I don’t, so I prefer to visualize the data to understand it. We can plot it using the “hist” command.

titanic.hist()

Output:

Histograms Plotted from the Data

Each numerical column is plotted as a histogram showing the distribution of its values. We can see from the image that most of our data is skewed toward one side of each graph, and that the range of values differs from column to column (e.g., the “Age” column has values from 0–80 while the “SibSp” column only has values from 0–8). We will need to transform the values of each column to a similar range, so that no column can over-influence the model later on when we train it. This is called “standardization”.
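
Concretely, standardization subtracts a column’s mean from each value and divides by the column’s standard deviation, leaving the column with a mean of roughly 0 and a standard deviation of roughly 1. Here is a quick hand-rolled illustration on the “Age” column (my own sketch; the StandardScaler we use later in the pipeline does this for us):

# Standardize "Age" by hand: subtract the mean, divide by the std.
age = titanic["Age"].dropna()
age_standardized = (age - age.mean()) / age.std()
age_standardized.describe()  # mean ~0, std ~1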

Lastly (in data exploration), let us check the correlation of our columns with one another and with our “label”. Correlation measures how strongly two variables are related. We can observe this by plotting a correlation matrix using the following commands. If you want to visualize it, you can also use a “heatmap”, as sketched after the output below.

corr = titanic.corr()
corr

Output:

Correlation Matrix of the Data
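
Here is a minimal sketch of that heatmap, assuming you have the seaborn and matplotlib packages installed (neither is used elsewhere in this article):

# Visualize the correlation matrix as a color-coded heatmap.
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()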

Looking at the table, we can see a visible relationship between the “Fare” and “Pclass” columns. This is expected: “Pclass” represents the person’s socio-economic status. If we check the histograms, most of “Pclass” falls on the value of 3, which means lower class, and most of “Fare” falls on the cheaper side. We can make the assumption that people with lower economic status tended to have cheaper tickets. Based on this assumption, we can actually remove either of them, because including both can be redundant for our model since they carry similar information. Sometimes simpler is better when it comes to training our model.

Data Preparation

Once we understand how we will approach the training data, we can start to prepare it, assuming any future data we get will look like our training data except for lacking the “Survived” column. We can automate the data preparation by building what is called a “pipeline”, so all the adjustments we derived from the data exploration part will be applied automatically to any future data we get.

When preparing for training, we need to set aside a subset of the train data to be used for testing later on. This is different from the test data provided by Kaggle, because we will actually use this subset to evaluate how our model performs. It will make much more sense later when we evaluate the model. For now, I will split my train data: I will set aside 10% as additional test data, and I want to make sure it has the same “Survived” ratio as the overall data.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
for train_index, test_index in split.split(titanic, titanic["Survived"]):
    strat_train_set = titanic.loc[train_index]
    strat_test_set = titanic.loc[test_index]

We also want to separate the label from the predictors on the train data because we will perform the data preparation on the predictors only.

titanic = strat_train_set.drop("Survived", axis=1)
titanic_label = strat_train_set["Survived"].copy()

Our numerical data will be processed differently from our categorical data. We can create a separate pipeline for both and merge the two pipelines after.

# numerical data
num_attribs = ["Age", "SibSp", "Parch", "Pclass"]
# categorical data
cat_attribs = ["Sex", "Embarked"]

I only included the columns I am interested in. I removed “PassengerId”, “Name”, and “Ticket” because this information appears irrelevant. I removed “Cabin” because it had too many missing values to be useful. I also removed “Fare” because I think “Pclass” will suffice.

We have two steps in processing our numerical data: first, fill in the missing values in the “Age” column, and next, standardize the range of the data. We can use SimpleImputer and StandardScaler from sklearn for these.

# num pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

For our categorical data, we have the “Sex” and “Embarked” columns. We know how to convert the “Sex” column into binary data, but how about the “Embarked” column? Well, if we look into it, it has 3 possible values, so we can create 3 separate binary columns for it. To do that, we will use OneHotEncoder. Before that, the “Embarked” column has missing values we need to fill; we will just fill them with the most frequent value.

from sklearn.preprocessing import OneHotEncoder

embarked_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot_encode", OneHotEncoder()),
])

Now, we can combine all our pipelines into one full pipeline. After running it on the data, you should see that the data has been transformed with all our adjustments.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("sex", OrdinalEncoder(), ["Sex"]),
    ("embarked", embarked_pipeline, ["Embarked"]),
])
titanic_prepared = full_pipeline.fit_transform(titanic)
titanic_prepared

Output:

Prepared Data

Training and Evaluating Models

After our data has been prepared, it is now a matter of training different learning algorithms and choosing the one that will become our model. This is the tricky part. Why do we have to train different algorithms; why can’t we just settle on one now? Well, different algorithms have different ways of doing things, and although they share the same goal, the math behind each serves a different purpose. As a beginner, you won’t immediately know which one will perform best on your data without trying each of them. I don’t think even professionals can know without trying different algorithms.

I will be trying four different algorithms: SGDClassifier, RandomForestClassifier, SVC, and KNeighborsClassifier. Each of these deserves its own article, but for now all you need to know is that each of them can perform classification. To evaluate performance, I will use a metric called “accuracy”, which is the number of correct predictions divided by the total number of predictions made. There are other metrics to choose from, but for now we can settle on “accuracy”.
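
To make the metric concrete, here is a toy illustration of accuracy on made-up labels (my own sketch, separate from the Titanic data):

from sklearn.metrics import accuracy_score

# 3 of the 4 predictions match the true labels, so accuracy = 0.75.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
accuracy_score(y_true, y_pred)  # 0.75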

One of the common problems in training models arises when the model performs really well on the data we trained it with and really badly on unseen data. This is called overfitting. The best way I can describe it is memorization vs. comprehension: the model memorized the answers but didn’t really understand them, so when it is given new data to work on, it gets lost easily. To help catch this, we will use a technique called “cross validation”. Cross validation trains and evaluates the model several times on the same dataset, each time holding out a different subset of the data for testing. This way, a model that has merely “memorized” the answers will be exposed by its poor scores on the held-out subsets. You will notice that each call below outputs 3 different scores; that is because each model is trained and evaluated 3 times.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier  # this import was missing
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

sgd_clf = SGDClassifier(random_state=42)
cross_val_score(sgd_clf, titanic_prepared, titanic_label,
                cv=3, scoring="accuracy")
# OUTPUT SCORES: [0.75655431, 0.73033708, 0.68539326]

forest_clf = RandomForestClassifier(random_state=42)
cross_val_score(forest_clf, titanic_prepared, titanic_label,
                cv=3, scoring="accuracy")
# OUTPUT SCORES: [0.80524345, 0.77153558, 0.79026217]

svm_clf = SVC(random_state=42)
cross_val_score(svm_clf, titanic_prepared, titanic_label,
                cv=3, scoring="accuracy")
# OUTPUT SCORES: [0.84269663, 0.81273408, 0.82397004]

knn_clf = KNeighborsClassifier()
cross_val_score(knn_clf, titanic_prepared, titanic_label,
                cv=3, scoring="accuracy")
# OUTPUT SCORES: [0.82771536, 0.79775281, 0.80524345]

Out of the four models, SVC performs the best, so we will choose it. But let’s see if we can improve it further. All of these models give us access to a set of “hyperparameters” we can adjust to control the learning process. Previously, we just used the defaults; let’s see if we can improve the score by tuning them. We will use GridSearchCV, which searches for the best hyperparameters for us.

from sklearn.model_selection import GridSearchCV

param_grid = [
    {"C": [1, 10, 100], "gamma": [1, 0.1, 0.001],
     "kernel": ["linear", "rbf"]}
]
svm_clf = SVC(random_state=42)
grid_search = GridSearchCV(svm_clf, param_grid, cv=3,
                           scoring="accuracy",
                           return_train_score=True, verbose=10)
grid_search.fit(titanic_prepared, titanic_label)
cross_val_score(grid_search.best_estimator_, titanic_prepared,
                titanic_label, cv=3, scoring="accuracy")
# OUTPUT SCORES: [0.84644195, 0.8164794 , 0.82397004]

Once it finishes searching, we can evaluate the best estimator again. The scores are nearly the same as before, which just means the default hyperparameters were already close to the best within the grid we gave it.
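
If you are curious which combination the search settled on, you can inspect it directly (an optional check, not in the original walkthrough):

# Show the hyperparameter combination that scored best during the search.
grid_search.best_params_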

One last thing before we classify Kaggle’s test data: let’s evaluate the model on unseen data. Remember the labeled data we set aside earlier? We will now use the model to classify it and compare the predictions with the actual labels.

from sklearn.metrics import accuracy_score

final_model = grid_search.best_estimator_
unseen = strat_test_set.drop("Survived", axis=1)
unseen_label = strat_test_set["Survived"].copy()
# Use transform (not fit_transform) so the pipeline applies the
# statistics it learned from the training data.
unseen_prepared = full_pipeline.transform(unseen)
predictions = final_model.predict(unseen_prepared)
accuracy_score(unseen_label, predictions)
# OUTPUT SCORE: 0.8

I got a score of 0.8, which is not bad, and it is close to the earlier cross-validation scores, so the model does not appear to be overfitting the data!

Submitting to Kaggle

We can now plug the test data from Kaggle into the model. We won’t be able to see our score in Jupyter Notebook, since this data has no labels. Let’s save the predictions as a CSV file and submit them on the Kaggle site.

test = titanic_test
# Apply the same binarizer we fit on the train data, then reuse the
# fitted pipeline with transform (not fit_transform) so the imputation
# and scaling statistics come from the training data.
test["Sex"] = lb.transform(test["Sex"])
test_prepared = full_pipeline.transform(test)
final_predictions = final_model.predict(test_prepared)
test["Survived"] = final_predictions
test[["PassengerId", "Survived"]].to_csv("submission.csv", index=False)

Once you upload your submission, it will show you your score. My final score was 0.78, which I think is not bad: a little lower than expected, but still close to our 0.8, and a good starting point. Kaggle allows you to submit entries multiple times to improve your score. I will leave that to you if you are interested in building further on the model presented here. But for now, if you’ve been following along in your own notebook, give yourself a high five, because you just made your first Kaggle submission!

There’s no one way to do machine learning. I know there’s a lot to take in here, but I hope it encourages you to pursue the subject further if it interests you. If you are thinking to yourself that this is too difficult for you, I’ve been there; all I can say is that a year ago I didn’t know any of this. With time and effort, you too can begin your Machine Learning journey!

Cheers!
