Machine Learning Prediction Model - Why did Kate survive Leo in Titanic?

Han Man
Mar 8, 2017 · 8 min read
Spoiler alert: Jack doesn’t make it.

It all started so beautifully: frolicking in cars and painting nudes. But in the end, only Rose was able to survive the Titanic disaster. Why? It just so happens that Kaggle makes the real Titanic dataset readily available. Let's explore what conclusions I can draw about passenger survival from the data.

My goal was to:

1. Build a model to predict survival for passengers of the Titanic

2. Understand the top contributing factors to survival

Happier times.

Data Acquisition and Processing

The data set included key attributes about each passenger on the Titanic.

Titanic dataset

Notable columns:

Survived: 1 if the passenger survived, 0 otherwise

Pclass: passenger class (1 = first, 2 = second, 3 = third)

SibSp: number of siblings/spouses aboard the ship

Parch: number of parents/children aboard the ship

Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Description of the Titanic dataset

Viewing the characteristics of the dataset shown above, there are a total of 891 passengers. Most columns are fully populated; however, Cabin is severely underpopulated, and some entries contained multiple cabin numbers per row. To keep the calculations simple, I dropped this column. Fare class is also a good indicator of cabin location, which made the cabin variable largely redundant. For the missing values in Age, I imputed a value, created dummy columns for the categorical features, and standardized the continuous ones.
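For reference, loading the data looks something like this (a minimal sketch; the file name assumes Kaggle's train.csv, and only the Cabin drop is described above):

import pandas as pd
import numpy as np

# Kaggle's Titanic training set (path assumed)
dff = pd.read_csv('train.csv')
# Drop the sparse Cabin column as described above
dff = dff.drop('Cabin', axis=1)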

from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

impute = Imputer()          # fills missing values with the column mean by default
scaler = StandardScaler()   # rescales to zero mean and unit variance

# Impute and standardize the continuous features
numcols = ['Age', 'Fare']
for c in numcols:
    dff[c] = impute.fit_transform(pd.DataFrame(dff[c]))
    dff[c] = scaler.fit_transform(pd.DataFrame(dff[c]))

# One-hot encode the categorical features, dropping the first level of each
catcols = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
for c in catcols:
    dummies = pd.get_dummies(dff[c], drop_first=True, prefix=c)
    dff = pd.concat([dff, dummies], axis=1)

Modeling the Data

# Features: the dummy columns appended above, plus the scaled numeric columns
X = dff.iloc[:, 8:]
X['Age'] = dff['Age']
X['Fare'] = dff['Fare']
# Target: survival of each passenger
y = dff['Survived']
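Selecting the dummies by position assumes they were appended at the end of the frame; a more robust sketch (not how I did it above) picks them out by name:

# Alternative: select the engineered columns by prefix instead of position
feature_cols = [c for c in dff.columns
                if any(c.startswith(p + '_') for p in catcols)]
X = dff[feature_cols + ['Age', 'Fare']]
y = dff['Survived']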

Setting the target to be the survival of each passenger, and the rest of the data as features, I tried a few different classification models. For each model, I used grid search with cross-validation to optimize the hyperparameters, then fit the model and scored it against the true survival labels.

Logistic Regression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

LR = LogisticRegression()
# Search over regularization strength and penalty type
C_vals = [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ['l1', 'l2']
gs = GridSearchCV(LR, {'penalty': penalties, 'C': C_vals}, verbose=False, cv=5)
gs.fit(X, y)
gs.best_params_

# Rank the coefficients by absolute magnitude
coef = pd.DataFrame(gs.best_estimator_.coef_, columns=X.columns).transpose()
coef['abs'] = coef[0].apply(np.abs)
coef.sort_values('abs', ascending=False)
Top three beta parameters in the logistic regression model

The top two factors contributing to non-survival, according to the logistic regression model, are being male and being in third class. This makes sense in the context of an evacuation: "women and children first" was the mantra, and being in third class meant being among the last to get access to the lifeboats. Additionally, Jack was a male in third class... I think we are on to something here.

To evaluate the performance of the model, I plotted an ROC curve, which shows how the model responds as the classification threshold is shifted.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

def auc_plotting_function(rate1, rate2, rate1_name, rate2_name, curve_name):
    AUC = auc(rate1, rate2)
    plt.figure(figsize=[8, 6])
    plt.plot(rate1, rate2, label=curve_name + ' (area = %0.2f)' % AUC, linewidth=4)
    plt.plot([0, 1], [0, 1], 'k--', linewidth=4)  # diagonal = random guessing
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel(rate1_name, fontsize=18)
    plt.ylabel(rate2_name, fontsize=18)
    plt.title(curve_name + " for Survival Rate", fontsize=18)
    plt.legend(loc="lower right")
    plt.show()

# Plot the receiver operating characteristic curve
def plot_roc(y_true, y_score):
    fpr, tpr, _ = roc_curve(y_true, y_score)
    auc_plotting_function(fpr, tpr, 'False Positive Rate', 'True Positive Rate', 'ROC')

y_score = gs.best_estimator_.decision_function(X)
plot_roc(y, y_score)
ROC of the survival rate for the logistic regression model

The ROC depicts the relationship between the True Positive Rate and the False Positive Rate as the threshold is shifted. What this graph shows is that we can increase the True Positive Rate to about 0.8 without sacrificing much in False Positive Rate (~0.3), so the threshold can be tuned to hit that point. Depending on the goal of the model, and on whether the True Positive Rate or the False Positive Rate matters more, the threshold can be tuned appropriately. In a disaster scenario, contingency plans should be built with redundancy in place, so I would shift the model towards a higher True Positive Rate and plan for the conservative, worst-case scenario.
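As a sketch, shifting the threshold could look like this (the 0.35 cutoff is a hypothetical value chosen to favor the True Positive Rate):

# predict() classifies as survived when P(survived) >= 0.5 by default;
# lowering the cutoff trades more false positives for a higher TPR
proba = gs.best_estimator_.predict_proba(X)[:, 1]  # P(survived)
threshold = 0.35                                   # hypothetical cutoff
ypred_shifted = (proba >= threshold).astype(int)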

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

ypred = gs.best_estimator_.predict(X)
print(classification_report(y, ypred))
pd.DataFrame(confusion_matrix(y, ypred, labels=[0, 1]))

The classification report provides the precision and recall for the model. Precision captures the incidence of false positives: of the points we labeled positive, how many actually were? It is the ratio of true positives to total predicted positives. Recall captures the incidence of false negatives: of the positives in the sample, how many did we identify? It is the ratio of true positives identified to total positives in the sample. Typically, there is a tradeoff between these two metrics. When a model labels most data points as positive, it may pick out many of the positives in the sample (high recall) while also assigning too many positives (low precision).

Classification report and confusion matrix for the logistic regression model

Our average precision and recall are 81% and 82% respectively, which lets us compare this model's performance against the others. The confusion matrix shows the true labels on the y axis and the predicted labels on the x axis, so the number of incorrect predictions is (101+63)=164.

The f1-score is a good summary statistic that takes both recall and precision into account (it is their harmonic mean). The overall f1-score of this model is 81%.
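For concreteness, here is a short sketch of how these metrics fall out of the confusion matrix counts:

# Unpack counts for the 0/1 labels: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y, ypred, labels=[0, 1]).ravel()
precision = tp / float(tp + fp)   # of predicted survivors, fraction correct
recall = tp / float(tp + fn)      # of actual survivors, fraction identified
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean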

K Neighbors Classifier:

KN = KNeighborsClassifier()
# Search over the number of neighbors and the tree leaf size
n_neighbors = [3, 5, 7, 9]
leaf_size = [25, 30, 35, 40]
gsk = GridSearchCV(KN, {'n_neighbors': n_neighbors, 'leaf_size': leaf_size}, verbose=False, cv=5)
gsk.fit(X, y)
gsk.best_params_

ypred = gsk.best_estimator_.predict(X)
print(" -----Classification Report-----")
print(classification_report(y, ypred))
print("-Confusion Matrix-")
pd.DataFrame(confusion_matrix(y, ypred, labels=[0, 1]))
Classification report and confusion matrix for the K Neighbors Classifier model

Running the K Neighbors Classifier, the f1-score is 86%, which outperforms the logistic regression model. The total number of inaccurately classified points is (75+50)=125. The model tends to predict 0 for actual 1's (75 cases) more often than 1 for actual 0's (50 cases); in other words, it skews to the conservative side by predicting more non-survival cases. This would be useful for disaster management.

The downside of using a K Neighbors Classifier is that it exposes no information about which features matter most for its predictions. One generic workaround is sketched below; beyond that, let's try two additional (more advanced) models that provide feature importances natively.
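A rough sketch of permutation importance (not used in the analysis above): shuffle one feature at a time and measure how much the accuracy drops.

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    rng = np.random.RandomState(seed)
    baseline = model.score(X, y)
    drops = {}
    for col in X.columns:
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[col] = rng.permutation(X_perm[col].values)
            scores.append(baseline - model.score(X_perm, y))
        drops[col] = np.mean(scores)  # average accuracy drop when col is shuffled
    return pd.Series(drops).sort_values(ascending=False)

permutation_importance(gsk.best_estimator_, X, y)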

Random Forest Classifier:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
# Search over ensemble size, features per split, and tree depth
n_estimators = [15, 20, 25, 30, 35]
max_features = [0.2, 0.4, 0.6, 0.8, 0.99]
max_depth = [1, 2, 3, 5]
gsf = GridSearchCV(forest, {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth}, cv=5)
gsf.fit(X, y)
gsf.best_params_

ypred = gsf.best_estimator_.predict(X)
print(" -----Classification Report-----")
print(classification_report(y, ypred))
print("-Confusion Matrix-")
pd.DataFrame(confusion_matrix(y, ypred, labels=[0, 1]))
Classification report and confusion matrix for the Random Forest model

The random forest model has an f1-score of 84%, which slightly underperforms the KNN Classifier. This model also tends to over-classify non-survivors. We can view the feature importances to see which factors contribute the most to survival.

pd.DataFrame(gsf.best_estimator_.feature_importances_, index=X.columns, columns=["Feature Importance"]).sort_values("Feature Importance", ascending=False)
Top 5 feature importances for Random Forest Classifier.

Looking at the top contributing features, being male and being in third class again rank highest. Fare is also important, and it is closely related to fare class. Unfortunately, unlike logistic regression, Random Forest models give no indication of whether a feature pushes towards survival or non-survival, only its relative importance versus other features.

Gradient Boosting Classifier:

from sklearn.ensemble import GradientBoostingClassifier

Grad = GradientBoostingClassifier()
# Search over ensemble size, features per split, and tree depth
n_estimators = [15, 20, 25, 30, 35]
max_features = [0.001, 0.05, 0.1, 0.2, 0.4]
max_depth = [1, 2, 3, 5, 7, 9, 11]
gsg = GridSearchCV(Grad, {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth': max_depth}, cv=5)
gsg.fit(X, y)
gsg.best_params_

ypred = gsg.best_estimator_.predict(X)
print(classification_report(y, ypred))
pd.DataFrame(confusion_matrix(y, ypred, labels=[0, 1]))
Classification report and confusion matrix for the Gradient Boosting Classifier Model

The Gradient Boosting Classifier has an f1-score of 85%, which slightly underperforms the KNN Classifier but beats the Random Forest Classifier. This model, too, tends to over-classify non-survivors. The total number of incorrectly classified passengers (133) is very close to the KNN's (125); the difference is that the KNN's errors are more evenly split between false positives and false negatives. The choice of model is informed by the application: if we value a conservative model, this is the one; otherwise, the KNN is the more balanced choice. We can view the feature importances for this model as well.

pd.DataFrame(gsg.best_estimator_.feature_importances_, index=X.columns, columns=["Feature Importance"]).sort_values("Feature Importance", ascending=False)
Top 5 feature importances for Gradient Boosting Classifier.

Encouragingly, we see the same features appear: male, third class, fare, and age. Using these models, we have identified the important factors for survival and predicted survival with good accuracy.

Key Takeaways and Conclusion

We have identified a number of models that can use this Titanic data to predict passenger survival. Depending on the application, and on whether false positives are preferred over false negatives, a different classifier can be chosen. Overall, all of our models performed relatively well, with f1-scores ranging from 81% to 86%.

We were also able to better understand the top contributors to survival. Encouragingly, all of our models returned the same features as crucial to determining survival: males in third class who paid a lower fare were less likely to survive the disaster.

This quantitative data backs up what we've seen on the big screen. Jack didn't stand much of a chance after all: he was male and in a lower fare class.
