Why Rose survived the Titanic but Jack did not: an explanation given by SHAP

Meng Liao
5 min read · Sep 23, 2019

On April 10, 1912, the Titanic left Southampton, called at Cherbourg, France and Queenstown (now Cobh), Ireland, and headed for New York. Four days into the voyage, on April 14, the ship hit an iceberg at 23:40. Two and a half hours later, the whole ship sank. Of the 2,224 passengers and crew on board, more than 1,500 died, making it one of the deadliest maritime disasters in modern history.

In the film Titanic (1997), first-class passenger Rose, her fiancé and her mother embark at Southampton and begin their journey to New York. At the same time, a penniless young artist, Jack, wins a third-class ticket in a poker game and boards the ship at the last minute. At the end of this love story, Rose and her fiancé survive, but Jack doesn’t. Why did it turn out that way? Let’s explain it using SHAP.

SHAP (SHapley Additive exPlanations), proposed by Scott Lundberg and Su-In Lee, is a method that can explain the predictions of any machine learning model. At its core is the calculation of SHAP values. Simply put, the SHAP value of a feature is the contribution of that feature to one particular prediction.
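To make "the contribution of one feature to one prediction" concrete, here is a minimal sketch that computes exact Shapley values by brute force. The function `shapley_values`, the toy scoring function `toy_model` and its coefficients are all hypothetical and chosen only for illustration; they are not the article's XGBoost model. The sketch also simplifies by replacing "missing" features with a single baseline passenger, whereas SHAP implementations typically average over a background data set and use much faster algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to baseline.

    Features outside a coalition take their baseline value.
    The cost grows as 2**n, so this only works for a few features.
    """
    names = list(x)
    n = len(names)
    phi = {}
    for name in names:
        others = [f for f in names if f != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = {f: (x[f] if f in subset else baseline[f]) for f in names}
                with_feature = dict(without)
                with_feature[name] = x[name]
                # marginal contribution of `name` to this coalition
                total += weight * (predict(with_feature) - predict(without))
        phi[name] = total
    return phi


# Hypothetical toy model (assumption): a hand-written score with an interaction term.
def toy_model(p):
    score = 0.2
    score += 0.5 * p["is_female"]
    score += 0.2 * p["is_first_class"]
    score += 0.1 * p["is_female"] * p["is_first_class"]
    return score


passenger = {"is_female": 1, "is_first_class": 1}
baseline = {"is_female": 0, "is_first_class": 0}
phi = shapley_values(toy_model, passenger, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for this passenger and the prediction for the baseline, and the interaction term is split evenly between the two features.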

We use the Kaggle Titanic data set, which includes the survival outcomes of 891 passengers along with their personal information. After cleaning the data, we choose gender, class, fare, age, family size, special title in the name, and port of embarkation as our features. Then we train an XGBoost regression model to predict each passenger’s survival probability. The predictions lie between 0 and 1, where 0 means the passenger died and 1 means the passenger survived. Once training is finished, we use SHAP to explain the XGBoost model and its predictions.
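Two of the features above, family size and the title in the name, are not columns of the raw Kaggle data and have to be derived. The sketch below shows one common way to do this from the `Name`, `SibSp` and `Parch` columns. The helper names, the regular expression and the "+1 for the passenger" convention are assumptions for illustration; the linked notebook may derive these features differently.

```python
import re

# Kaggle names look like "Surname, Title. Given names"; capture the title part.
TITLE_RE = re.compile(r",\s*([^.,]+)\.")


def extract_title(name):
    """Return the title found in a passenger name, or "Unknown"."""
    match = TITLE_RE.search(name)
    return match.group(1).strip() if match else "Unknown"


def family_size(sibsp, parch):
    """Siblings/spouses plus parents/children, plus the passenger (assumed convention)."""
    return sibsp + parch + 1


# First row of the Kaggle training file: one sibling or spouse aboard, no parents or children.
title = extract_title("Braund, Mr. Owen Harris")
size = family_size(sibsp=1, parch=0)
print(title, size)
```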

All demo code is in this notebook: https://www.kaggle.com/meliao/shap-on-titanic-why-is-rose-alive-but-jack-not/notebook

SHAP value
SHAP value of every passenger and the feature importance determined by SHAP values.

In the above figure, each point represents a passenger, and the horizontal axis is the SHAP value (the contribution of the feature to the survival probability). The color of a point indicates where its feature value falls within the feature’s range. For example, in the age row the blue points represent children and the red points represent older passengers. The features are sorted from top to bottom in order of decreasing importance. Some results match intuition: females (red) have a survival advantage; the higher the class, the greater the survival advantage; and young people (blue) have a greater survival advantage. We also find some interesting phenomena: passengers with family members on board have a survival advantage over those travelling alone, but having too many family members is worse than travelling alone.
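The importance ordering can be derived from the SHAP values themselves. A minimal sketch, assuming importance is measured as the mean absolute SHAP value of a feature across passengers (a common choice, and an assumption about how the figure was produced). The function `rank_features` is hypothetical and the numbers in `toy_shap` are invented for illustration; they are not taken from the notebook.

```python
def rank_features(shap_rows):
    """Rank features by mean absolute SHAP value.

    shap_rows: list of dicts mapping feature name to SHAP value, one dict per passenger.
    Returns a list of (feature, importance) pairs, most important first.
    """
    features = list(shap_rows[0])
    importance = {
        f: sum(abs(row[f]) for row in shap_rows) / len(shap_rows)
        for f in features
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Invented SHAP values for three passengers (illustration only).
toy_shap = [
    {"Sex": 0.30, "Pclass": 0.10, "Age": -0.02},
    {"Sex": -0.15, "Pclass": -0.08, "Age": 0.05},
    {"Sex": -0.15, "Pclass": 0.12, "Age": 0.01},
]
ranking = rank_features(toy_shap)
for feature, score in ranking:
    print(feature, round(score, 3))
```

Taking absolute values matters here: a feature that strongly helps some passengers and strongly hurts others would otherwise average out to nearly zero and look unimportant.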

Then, we study the interaction between two features: sex and class.

Dependence plot (Pclass/Sex) of SHAP values.

From the above dependence plot, we see that from first class to third class the survival advantage weakens for both males (blue) and females (red), but the weakening is particularly evident for women. One possible reading is that, because the male survival rate is already low, the impact of class on male survival is less noticeable. The female survival rate is much higher, so whether a woman is in third class becomes an important condition for her survival. In other words, the survival bias due to class is more pronounced in females.

Next, we cluster the passengers by their SHAP values.

Upper: hierarchical clustering of the passengers’ SHAP values. Lower left: t-SNE of the passengers’ SHAP values; the orange points are the passengers between the orange lines in the upper figure. Lower right: scaled average feature values of the orange passengers and of the other passengers.

In the sorted hierarchical clustering figure, we find several special passengers between the two vertical orange lines. They have a low survival chance, yet their SHAP values are similar to those of passengers with a high survival chance. When we project these special passengers (orange points) onto a t-SNE plot of the SHAP values, we find that they also cluster apart from the others. We then average their feature values and compare them with the averages of the other passengers: this group consists of young third-class boys from very big families. So the prediction model thinks that if you are in this group, unfortunately, you have very little chance of surviving.

Finally, to demonstrate how SHAP explains the prediction for individual cases, we reconstruct the personal information of Rose, Jack and Rose’s fiancé based on the movie Titanic (1997).

Personal information about Rose, Jack and Rose’s fiancé
Survival chance prediction of Rose, Jack and Rose’s fiancé and SHAP explanations

The above figure shows the prediction results (output value) from the XGBoost model and the explanations from SHAP. The base value is the overall average survival chance, that is, the prediction we would make without knowing any feature values of a passenger. The red feature values push the prediction in the positive direction, and the blue ones push it in the opposite direction.
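The way such a plot is read can be written down in a few lines: the output value is the base value plus the sum of the SHAP values, with positive contributions drawn in red and negative ones in blue. The sketch below uses a hypothetical helper `read_force_plot` and invented numbers for an unnamed passenger; none of the values come from the article's figures.

```python
def read_force_plot(base_value, contributions):
    """Split SHAP contributions by sign and rebuild the model output.

    contributions: dict mapping a feature description to its SHAP value.
    Returns (output, pushes_up, pushes_down).
    """
    output = base_value + sum(contributions.values())
    pushes_up = {f: v for f, v in contributions.items() if v > 0}    # red in the plot
    pushes_down = {f: v for f, v in contributions.items() if v < 0}  # blue in the plot
    return output, pushes_up, pushes_down


# Invented values for illustration only (assumption, not from the notebook).
base_value = 0.4
contributions = {"Sex=female": 0.30, "Pclass=1": 0.15, "Age=40": -0.05}

output, pushes_up, pushes_down = read_force_plot(base_value, contributions)
print(round(output, 2), sorted(pushes_up), sorted(pushes_down))
```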

For Rose, the model predicts a survival chance of almost 100%. SHAP tells us that this result is based on the facts that she is a woman, travels in first class, paid a high fare and has several family members on board. Conversely, the XGBoost model makes an extremely negative prediction for Jack because he is male, in third class, and 20 years old. The predicted survival chance of the fiancé is 55%: although he is a 24-year-old male, both of which are disadvantages for survival, his first-class ticket, high fare and family members compensate for these negative effects.

In the end, we have a question: is there anything that could have helped Jack survive? According to the above study, a young age may increase the survival chance, so we suppose that Jack’s real age is only 16. Moreover, suppose he had better luck and won a 2nd class ticket at the beginning. We make a prediction with his new profile.

Survival prediction with Jack’s new profile.

We find that Jack’s survival chance increases to 18%. Although this is still a very low probability, the survival chance with the new profile is already 3 times as high as the original one.
