Building a Machine Learning Model Step By Step With the Titanic Dataset

Taha Bilal Uyar
Published in The Startup · Sep 24, 2020 · 12 min read


The sinking of the Titanic is one of the most unfortunate events in recent history. In this article, we create a machine learning model by using the survival data of this disaster.

Source: Wikimedia Commons

RMS Titanic sank on 15 April 1912 in the North Atlantic Ocean after striking an iceberg. There were 2,224 passengers and crew on board, and the disaster resulted in the deaths of more than 1,500 people.

In this article, I will analyze the factors that matter for the survival rate by using data visualization. After some feature engineering, I will build a machine learning model to predict which passengers survived.

My aim in this post is to show beginners like me how to handle a data set. For that reason, I share almost every input and output.

The data set comes from Kaggle's Titanic: Machine Learning from Disaster competition. You can find my original notebook, which includes all the code in this post, at this link.

Data set

import pandas as pd

train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
train_data.head()
train_data.shape
test_data.shape

We have two data sets: a train set and a test set. The train set has 11 features excluding the target column (Survived). The train and test sets have 891 and 418 rows respectively.

train_data.info()
test_data.info()

The info() output shows the dtype and the number of non-null values in each column. There are missing values in the Age, Cabin, and Embarked columns of the train data. In the test data, the Age, Fare, and Cabin columns have missing values. We will deal with them later.
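
As a quick alternative (not in the original notebook), we can also count the missing values per column directly with isnull().sum():

# Number of missing values in each column
train_data.isnull().sum()
test_data.isnull().sum()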

Data Exploration and Visualization

Let’s analyze features by using some plots.

Gender

import seaborn as sns
import matplotlib.pyplot as plt

train_data[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean()
sns.countplot(x="Survived", hue="Sex", data=train_data)

When we look at survival rates by gender, the results look striking. The survival rate for females is much higher than for males. According to this data, there is a strong correlation between the Survived and Sex columns.

Port of embarkation

train_data[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean()
sns.countplot(x="Survived", hue="Embarked", data=train_data)

The Embarked feature records the port of embarkation: C = Cherbourg, Q = Queenstown, and S = Southampton. According to the table of survival rates for this feature, passengers who embarked at Cherbourg have a higher survival rate.

Class of passengers

The Pclass column takes three values: 1 = Upper, 2 = Middle, 3 = Lower.

train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean()
sns.countplot(x="Survived", hue="Pclass", data=train_data)

According to the data set, passenger class looks important for survival: the majority of upper-class passengers survived, while a large share of 3rd-class passengers did not.

Ticket fare

sns.catplot(x="Survived",y="Fare",data=train_data, kind="boxen")
g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Fare', bins=20)

The Fare column is one of the numerical features. Looking at the histogram and boxen plots, we can make a couple of observations about the relationship between fares and survival.
The average fare of passengers who survived is much higher than that of passengers who did not. The histograms also show that most of the passengers who paid more than 100 survived.
This result is expected, because we can guess that there is a correlation between Pclass and Fare.
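
To put a number on the claim about average fares, we can reuse the same groupby pattern as before (a quick check, not shown in the original notebook):

# Mean fare for passengers who did not survive (0) and who survived (1)
train_data[['Fare', 'Survived']].groupby(['Survived'], as_index=False).mean()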

Age

g = sns.FacetGrid(train_data, col='Survived')
g.map(plt.hist, 'Age', bins=15)

When we plot a histogram for Age, another numerical feature, we get some useful hints. Children (Age < 10) have a high survival rate, and middle-aged passengers (25–35 years old) survived more often than young adults (15–25 years old).

Number of family members

The Parch and SibSp features describe family connections: Parch is the number of parents and children aboard, and SibSp is the number of siblings and spouses.

sns.countplot(train_data["Survived"], hue=train_data["Parch"])
sns.countplot(train_data["Survived"], hue=train_data["SibSp"])
train_data[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

It looks like travelling alone was not an advantage for survival.

Correlation between features and target

We can plot a graph to see all the pairwise relationships between the numerical features. There are two similar functions for this: seaborn's pairplot and pandas' scatter_matrix.

pd.plotting.scatter_matrix(train_data, figsize=(12,12));
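
The seaborn version would look roughly like this (a sketch: the column selection and the dropna() call are my own choices, not necessarily what the original notebook used):

# Pairwise scatter plots of the numerical columns, colored by survival
sns.pairplot(train_data[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]].dropna(), hue="Survived")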

We can also plot the data grouped by the Survived column for another quick look: the histograms for passengers who did not survive and passengers who survived are shown separately.

train_data.groupby(train_data["Survived"]).hist(figsize=(6,8))

In my opinion, the clearest way to see the correlations between features is a heatmap. Using pandas' corr() together with seaborn's heatmap(), we get a very clear picture of the correlations between the numerical features.

cor = train_data.corr()
sns.heatmap(cor, annot=True, fmt=".2f")

As we can see on the heatmap, there is a strong correlation between Survived and both Pclass and Fare, as mentioned before. However, the correlation between Pclass and Fare themselves is also very strong, so these two columns may carry largely the same information.
We know that the Sex feature is also strongly related to survival, but it does not appear in the heatmap because it is not numerical yet.

Feature Engineering and Modelling

We will start with a random forest classifier for this data set. Its nonlinear nature usually makes it a strong option for both classification and regression problems.
To begin with, we will build the model using only the numerical features.

X=train_data[["Pclass", "Age", "SibSp", "Parch", "Fare"]]
y=train_data["Survived"]

We split our data set into two parts, a train set and a test set. This is necessary to check the model's accuracy: with only one data set we could not detect a possible overfitting problem.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

After splitting the data into train and test sets, we create our model with sklearn's RandomForestClassifier(), fit it to the train data, and then score it on the test data.

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc.score(X_test,y_test)

This does not work as expected: fitting raises an error because we need to fill in the missing values before training the model.

train_data.info()
test_data.info()

As we can see, there are missing values in some features: Age, Cabin, Fare, and Embarked. First, we will get rid of the NaN values in the numerical features.

train_data["Age"].fillna(train_data["Age"].mean(), inplace=True)
test_data["Age"].fillna(test_data["Age"].mean(), inplace=True)
test_data["Fare"].fillna(test_data["Fare"].mean(), inplace=True)

The NaN values are replaced with the mean of each column using pandas' fillna method, in both the train and the test data.
Now we can train our model.

X=train_data[["Pclass", "Age", "SibSp", "Parch", "Fare"]]
y=train_data["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rfc = RandomForestClassifier(random_state=35)
rfc.fit(X_train, y_train)
print("test accuracy: ",rfc.score(X_test,y_test))

We reached about 68% accuracy using only the numerical features. To increase the accuracy, we can use the remaining features, whose data type is object; we already saw in the previous section that they are also correlated with survival.
We have to convert these string values to numbers. There are two main approaches. The first is one-hot encoding, which creates a column for each unique value and fills it with binary 0/1 codes. The second is label encoding, which replaces each unique string value with a unique number.
Both pandas and scikit-learn provide functions for this. One-hot encoding is available via pandas.get_dummies and sklearn.preprocessing.OneHotEncoder. For label encoding, we can use pandas.factorize and sklearn.preprocessing.LabelEncoder.
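
For comparison, a minimal one-hot encoding sketch with pandas.get_dummies might look like this (the column prefixes are just illustrative; the rest of the post sticks with label encoding via pd.factorize):

# One binary column per unique value of Sex and Embarked
pd.get_dummies(train_data[["Sex", "Embarked"]], prefix=["Sex", "Embarked"]).head()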

train_data["Sex_encoded"]=pd.factorize(train_data["Sex"])[0]
train_data["Embarked_encoded"]=pd.factorize(train_data["Embarked"])[0]
test_data["Sex_encoded"]=pd.factorize(test_data["Sex"])[0]
test_data["Embarked_encoded"]=pd.factorize(test_data["Embarked"])[0]

We do not need to fill the null values in the Embarked column of the train data, because pandas.factorize encodes null values separately.

train_data["Embarked"].unique()
train_data["Embarked_encoded"].unique()

As you can see above, the NaN values were encoded as -1.

Now we can train our model again with the new features.

X=train_data[["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_encoded", "Embarked_encoded"]]X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)rfc=RandomForestClassifier(random_state=35)
rfc.fit(X_train, y_train)
print("test accuracy: ",rfc.score(X_test,y_test))

As we can see, the accuracy increased by almost 10 percentage points.

We can add one more feature to our data set: the Cabin column. However, it does not look sensible to apply pandas.factorize to it directly, because there are too many distinct values. Instead, we will extract only the deck codes.
First we fill the null values, then we take the first letter of each entry using pandas' str.slice. This gives us the deck (cabin) codes.

train_data["Cabin"].unique()
train_data["Cabin"].fillna("N", inplace=True)
train_data['Cabin_code'] = train_data["Cabin"].str.slice(0,1)
train_data['Cabin_code'].unique()
test_data["Cabin"].fillna("N", inplace=True)
test_data['Cabin_code'] = test_data["Cabin"].str.slice(0,1)

We can now also use this feature to train our model.

X=train_data[["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex_encoded", "Embarked_encoded", "Cabin_code_encoded"]]X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)rfc=RandomForestClassifier(random_state=35)
rfc.fit(X_train, y_train)
print("test accuracy: ",rfc.score(X_test,y_test))

Adding this new feature to our X data set did not change the accuracy.

Random forest is a tree-based model, so it does not require feature scaling: tree splits depend only on the ordering of the feature values, not on their scale. We therefore expect the result to be the same after feature scaling, but we can try it and see.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_sc=scaler.fit_transform(X_train)
X_test_sc=scaler.transform(X_test)
rfc=RandomForestClassifier(random_state=35)
rfc.fit(X_train_sc, y_train)
print("train accuracy: ",rfc.score(X_train_sc, y_train))
print("test accuracy: ",rfc.score(X_test_sc,y_test))

As you can see, the test accuracy is the same before and after feature scaling (79.1%), so standardizing the features did not change the accuracy, as expected.

Overfitting

Looking at the train and test accuracies, we can see that the gap between them is not small. I would not call it outright overfitting, but we can still try to reduce the gap.

If we fit the model to each feature column separately, it can give us insight into which columns cause this slight overfitting.

rfc = RandomForestClassifier(random_state=35)
for x in X_train.columns:
    # Fit and score the model on one column at a time
    rfc.fit(X_train[[x]], y_train)
    print(x, "train accuracy: ", rfc.score(X_train[[x]], y_train)*100)
    print(x, "test accuracy: ", rfc.score(X_test[[x]], y_test)*100)

We can draw a few conclusions from these results. The Sex column is the most important feature for our model: by itself it reaches 79% test accuracy, and its train and test accuracies are almost identical, which is the ideal situation. While the gap between train and test accuracies is very small for some features, the numerical features Age and Fare show a larger gap of about 6–10%.
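
As a cross-check (not part of the original notebook), the fitted random forest also exposes feature_importances_, which gives a similar ranking in a single call:

# Relative importance of each feature in the full model
rfc = RandomForestClassifier(random_state=35)
rfc.fit(X_train, y_train)
for name, importance in zip(X_train.columns, rfc.feature_importances_):
    print(name, round(importance, 3))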

We can convert the Age and Fare features from continuous values into discrete, label-encoded bins to help the model generalize better.

Age

train_data["Age"].groupby(train_data["Survived"]).plot(kind="hist", bins=20, legend=True, alpha=0.5)
train_data.loc[train_data['Age'] <= 7.5, 'Age_encoded'] = 0
train_data.loc[(train_data['Age'] > 7.5) & (train_data['Age'] <= 15), 'Age_encoded'] = 1
train_data.loc[(train_data['Age'] > 15) & (train_data['Age'] <= 25), 'Age_encoded'] = 2
train_data.loc[(train_data['Age'] > 25) & (train_data['Age'] <= 30), 'Age_encoded'] = 3
train_data.loc[(train_data['Age'] > 30) & (train_data['Age'] <= 35), 'Age_encoded'] = 4
train_data.loc[(train_data['Age'] > 35) & (train_data['Age'] <= 50), 'Age_encoded'] = 5
train_data.loc[train_data['Age'] > 50, 'Age_encoded'] = 6
train_data["Age_encoded"].unique()
train_data[['Age_encoded', 'Survived']].groupby(['Age_encoded'], as_index=False).mean()
sns.countplot(train_data["Survived"], hue=train_data["Age_encoded"])

Fare

train_data["Fare"].groupby(train_data["Survived"]).plot(kind="hist", bins=20, legend=True, alpha=0.5)
train_data.loc[train_data['Fare'] <= 12.5, 'Fare_encoded'] = 0
train_data.loc[(train_data['Fare'] > 12.5) & (train_data['Fare'] <= 25), 'Fare_encoded'] = 1
train_data.loc[(train_data['Fare'] > 25) & (train_data['Fare'] <= 50), 'Fare_encoded'] = 2
train_data.loc[(train_data['Fare'] > 50) & (train_data['Fare'] <= 75), 'Fare_encoded'] = 3
train_data.loc[(train_data['Fare'] > 75) & (train_data['Fare'] <= 100), 'Fare_encoded'] = 4
train_data.loc[(train_data['Fare'] > 100) & (train_data['Fare'] <= 150), 'Fare_encoded'] = 5
train_data.loc[train_data['Fare'] > 150, 'Fare_encoded'] = 6
train_data["Fare_encoded"].unique()
train_data[['Fare_encoded', 'Survived']].groupby(['Fare_encoded'], as_index=False).mean()
sns.countplot(train_data["Survived"], hue=train_data["Fare_encoded"])

As you can see above, the Age and Fare columns were converted from continuous values to discrete ones. I tried to identify the critical points and cut the features at those points.

I did this manually, but pandas has functions for it: pandas.cut() and pandas.qcut(). They bin values by value ranges and by frequencies (quantiles) respectively, as sketched below.
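
A rough sketch of the same idea (the bin counts, the new column names, and the duplicates="drop" guard are my own choices here; the exact cut points will differ from the manual ones above):

# Equal-width bins for Age, quantile-based bins for Fare; labels=False returns integer codes
train_data["Age_binned"] = pd.cut(train_data["Age"], bins=7, labels=False)
train_data["Fare_binned"] = pd.qcut(train_data["Fare"], q=7, labels=False, duplicates="drop")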

Now, we can try our model with new features.

X=train_data[["Pclass", "Age_encoded", "SibSp", "Parch", "Fare_encoded", "Sex_encoded", "Embarked_encoded", "Cabin_code_encoded"]]X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)rfc=RandomForestClassifier(random_state=35)
rfc.fit(X_train, y_train)
print("train accuracy: ",rfc.score(X_train, y_train))
print("test accuracy: ",rfc.score(X_test,y_test))

Our previous train and test accuracies were 98.2% and 79.1% respectively. Now the train accuracy is 92.0% and the test accuracy is 78.7%.

Converting the features to discrete values narrowed the gap between the train and test accuracies; however, it did not increase the test accuracy.

Hyperparameter Tuning

Now we have the model and the final data set. Our model has many hyperparameters (n_estimators, criterion, max_depth, min_samples_leaf, etc.), and we need to find good values for them. Rather than trying values manually, we will use GridSearchCV and RandomizedSearchCV from sklearn and compare them.
GridSearchCV performs an exhaustive search over the specified parameter values: it tries every combination in the grid, which can take a long time if we have a large data set and many hyperparameters. RandomizedSearchCV, on the other hand, samples random combinations and usually finds good parameters in much less time.

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import time
rfc_parameters = {
    'n_estimators': [100, 200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [6, 8, 10],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 4, 6]
}

We specified candidate values for some of the important hyperparameters; the search methods will work over this grid.

Randomized Search

start_time = time.time()

rand_search= RandomizedSearchCV(rfc, rfc_parameters, cv=5)
rand_search.fit(X_train, y_train)
print(rand_search.best_params_)
print("best accuracy :",rand_search.best_score_)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Grid Search

start_time = time.time()

grid_search= GridSearchCV(rfc, rfc_parameters, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
print("best accuracy :",grid_search.best_score_)

end_time = time.time()
print("Total execution time: {} seconds".format(end_time - start_time))

Best parameters and best accuracy values were found by both randomized search and grid search. While the best cross-validation accuracy is around 83% for both, the cost of grid search is much higher: the total execution times differ by more than a factor of ten (randomized search: 20 seconds, grid search: 517 seconds). So using the randomized search algorithm is a good decision in terms of both accuracy and cost.

Our best accuracy value is 83.1%, which is not bad.
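
As a final check (not shown in the original post), we could also score the tuned model on the held-out test split using the search object's best_estimator_:

# Evaluate the best model found by the randomized search on the held-out split
best_rfc = rand_search.best_estimator_
print("test accuracy: ", best_rfc.score(X_test, y_test))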

Conclusion

In this post, I analyzed a data set from every angle and then created a machine learning model. First, the data was explored in detail with plots to understand it. Then, data cleaning and feature engineering were carried out step by step. A machine learning model was chosen, and its performance was checked at every step. Finally, hyperparameter tuning was done with both random search and grid search, and the two search methods were compared in terms of efficiency.
