Titanic, the ML challenge: a model built with Scikit-learn Pipelines

Published in

Analytics Vidhya

6 min read, Apr 12, 2020

Growing up in India, Titanic was the first English movie I watched in a cinema, and it was an incredible experience. So it is fitting that I pick this popular dataset for my first ML challenge.

Watching the movie, I was really sad that Jack ends up dying in the sea while Rose survives and lives on. This has haunted me ever since, as I like happy endings!

Now, armed with the data, I want to see whether Jack’s fate was just dramatic bad luck, as in the movie, or whether the data suggest that the odds were always against him and his death was inevitable.

The objectives of this article are to:

1. Do some simple EDA to test our Jack survival hypothesis!

2. Build a simple and reusable machine learning workflow with Scikit-learn’s Pipeline architecture

3. Pick the best model (a random forest) and improve it further with hyperparameter tuning

The entire Jupyter notebook can be found here, and the model’s predictions rank in the top 10% of the Kaggle leaderboard!

The Jack Survival Hypothesis

After loading the data, I did some EDA on the Titanic survival chances. The first thing I built was a Seaborn count plot of survival counts by sex. Very clearly, more men than women died in the disaster.

I next checked the survival ratio by sex and, quite clearly, women on board had almost four times the survival chances of men! Things are looking bleak for Jack already.

Survival ratio by Sex

We also know that Jack was poor, which probably means that he was travelling in the lowest passenger class. In this dataset, that means a ‘Pclass’ of 3. I went on to check whether this meant anything in terms of survival and plotted the chart below.

And quite clearly, it does! The lower the class, the lower the survival chances!
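
Both comparisons above boil down to a grouped mean of the `Survived` column. Here is a minimal sketch of that calculation on a handful of made-up rows. These are not the Kaggle records; I only borrow the dataset's column names `Sex`, `Pclass` and `Survived`, and the frame name `toy` is my own. On the real training frame, the same two `groupby` calls should give the ratios behind the charts.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT the real Titanic records
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Pclass": [3, 3, 1, 1, 2, 2],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# Survived is coded 0/1, so its mean within a group is that group's survival rate
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Because the label is 0/1, no separate counting step is needed: the mean is the share of survivors in each group.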

So, all in all, I have to conclude that Jack was doomed the moment the Titanic hit the iceberg!

Feature Engineering with the Scikit-learn Pipeline architecture

In this part, we will build a feature engineering strategy with Scikit-learn’s Pipeline architecture. If you haven’t used Pipelines before, I would strongly recommend that you try them. Pipelines let you chain all the steps of a typical machine learning workflow one after another, and they can seriously help in writing compact code. I was inspired to use Pipelines after reading an interview with Scikit-learn core developer Andreas Müller, published here. These were his words:

For Scikit-learn, everybody should be using pipelines. If you aren’t using pipelines you’re probably doing it wrong

A great article which deals specifically with Scikit learn Pipelines can be found here

Now, if you look around other high-scoring Titanic ML scripts, you will see some incredible feature engineering and feature creation techniques. Compared to those, the steps here will be pretty simple. But unlike those methods, the architecture we build here can be reused across other classification problems.

The data preparation steps we need are:

(i) Imputing missing values in categorical and numerical columns

(ii) One-hot encoding categorical columns and

(iii) Scaling numerical columns

To do all of this in two statements, we define one pipeline for the numeric columns and one for the categorical columns.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scaler', StandardScaler())])
# Categorical columns: fill missing values with the most frequent category, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
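
To see what the numeric pipeline actually does, here is a minimal sketch on a made-up column with one missing value. The array `ages` and the name `numeric_demo` are my own illustration and not part of the notebook. It shows that the median learned during `fit` replaces the missing value before scaling happens.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric column with one missing value
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

numeric_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# fit_transform runs the steps in order: impute first, then scale
scaled = numeric_demo.fit_transform(ages)

# The imputer stores the per-column median it learned during fit
learned_median = numeric_demo.named_steps["imputer"].statistics_[0]
print(learned_median)
print(scaled.ravel())
```

The same fitted median and scaling parameters are reused when the pipeline later transforms unseen data, which is what makes the workflow reusable.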

Next, we will use the ColumnTransformer to apply all the feature engineering neatly in a single step. As you can see, we are nesting pipelines within a larger transformer here. Before building it, we will create two lists of the numeric and categorical columns using the pandas select_dtypes method.

from sklearn.compose import ColumnTransformer
# Column lists by dtype (categorical columns must already have the pandas 'category' dtype)
numeric_features = train_data3.select_dtypes(include=['int64', 'float64']).columns
categorical_features = train_data3.select_dtypes(include=['category']).columns
# Route each group of columns through its own pipeline
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
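
If you want to check what comes out of the ColumnTransformer, the sketch below runs the same construction on a tiny made-up frame. The frame `toy`, its values and the `_demo` names are assumptions for illustration; only the column names are borrowed from the Titanic data. Two numeric columns plus two categorical columns with two categories each should give six output columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame for illustration; the values are made up
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 58.0],
    "Fare": [7.25, 71.28, 8.05, 26.55],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})
# Cast the text columns to 'category' so select_dtypes can find them
toy[["Sex", "Embarked"]] = toy[["Sex", "Embarked"]].astype("category")

numeric_features = toy.select_dtypes(include=["int64", "float64"]).columns
categorical_features = toy.select_dtypes(include=["category"]).columns

numeric_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_demo = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor_demo = ColumnTransformer(transformers=[
    ("num", numeric_demo, numeric_features),
    ("cat", categorical_demo, categorical_features),
])

transformed = preprocessor_demo.fit_transform(toy)
# 2 scaled numeric columns + 2 one-hot columns for Sex + 2 for Embarked
print(transformed.shape)
```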

Building ML models with Scikit-learn pipelines

Now that we have set up all the feature engineering, it is time to test a few models. After applying a train-test split to the features and labels, we will set up a few classifiers to try.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Let us instantiate the individual classifiers
SEED = 123
lr = LogisticRegression(random_state=SEED, solver='liblinear')
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)
gaussian_nb = GaussianNB()
randomforest = RandomForestClassifier(random_state=SEED)
# Define a list called classifiers that contains (classifier_name, classifier) tuples
classifiers = [
    ('Logistic Regression', lr),
    ('K Nearest Neighbours', knn),
    ('Classification Tree', dt),
    ('Gaussian Naive Bayes', gaussian_nb),
    ('Random Forest', randomforest),
]

We will plug each of these models in as the second step of a new pipeline, fed by the preprocessor. To keep things compact, we will run every model in a loop and examine the results.

# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    # Create the full pipeline and fit it to the training set
    pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', clf)])
    pipe.fit(X_train, np.ravel(y_train))
    # Predict the labels of the test set
    y_pred = pipe.predict(X_test)
    # Evaluate mean accuracy with 5-fold cross-validation on the full data
    cv_scores = cross_val_score(pipe, X, np.ravel(y), cv=5).mean()
    # Print the mean cross-validated accuracy for each classifier
    print('{:s} : {:.3f}'.format(clf_name, cv_scores))
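
One detail worth spelling out is that the whole pipeline, not just the classifier, goes into `cross_val_score`. That way the preprocessing is refit on the training part of every fold and never sees the held-out part. Below is a minimal sketch of the same loop on synthetic data, keeping the scores in a dictionary so the models can be ranked. The data from `make_classification`, the two-model shortlist and the `_demo` names are stand-ins of mine, not the Titanic setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; these are not the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=123)

candidates = {
    "Logistic Regression": LogisticRegression(solver="liblinear", random_state=123),
    "K Nearest Neighbours": KNeighborsClassifier(),
}

results = {}
for name, clf in candidates.items():
    # Scaling lives inside the pipeline, so it is refit on each training fold
    pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", clf)])
    results[name] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

# Rank the models from best to worst mean accuracy
for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print("{:s} : {:.3f}".format(name, score))
```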

And here are the cross-validated accuracy scores we obtained:

Hyperparameter tuning of the best models

Although KNN was our best model out of the box, I found that the random forest improves considerably once we tune its hyperparameters.

I will run a randomized search below. I would recommend it over an exhaustive grid search any day, as grid search tries every combination and can be really wasteful and time-consuming.

# import randomized search for hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
rf_pipe = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', randomforest)])
# Set the important parameters for the random forest
# (the classifier__ prefix routes each one to the pipeline's 'classifier' step)
# n_estimators: number of trees; more trees give a more stable ensemble at a higher training cost
# max_features: fraction of features considered at each split; lowering it below 1 makes the trees less correlated
# max_depth: limited to restrict overfitting and to let the model finish quickly
# n_jobs=-1 uses all CPU cores (set on the randomized search)
# random_state makes the results repeatable
rf_param_grid = {
    'classifier__n_estimators': np.arange(100, 1000, 100),
    'classifier__max_features': np.arange(0.05, 1.05, .05),
    'classifier__max_depth': np.arange(4, 8, 1),
}
grid_mse_rf = RandomizedSearchCV(estimator=rf_pipe, param_distributions=rf_param_grid, n_iter=25, cv=4, verbose=1, n_jobs=-1, random_state=123)
# Fit the randomized search to the training data
grid_mse_rf.fit(X_train, np.ravel(y_train))
# Print the best parameters and the best mean cross-validated accuracy
# (best_score_ is already an accuracy for a classifier, so no square root is taken)
print("Best parameters found: ", grid_mse_rf.best_params_)
print("Highest score: ", grid_mse_rf.best_score_)

Our random search fit 100 random forest models (25 sampled hyperparameter combinations × 4 cross-validation folds) and selected the best hyperparameters. A word of caution on the score: best_score_ is already the mean cross-validated accuracy, so it should be reported as is. The 92% I first saw was the square root of that value, which corresponds to an accuracy of about 85% (0.92² ≈ 0.85). That still looks like a gain over the untuned random forest, just a smaller one than the 15% jump the square-rooted number implied.
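
To put some numbers behind my preference for randomized search, here is a small sketch that counts how many model fits an exhaustive grid search would need over the same value ranges. It uses `ParameterGrid` from Scikit-learn only to enumerate the combinations; the variable names `cv_folds`, `grid_fits` and `random_fits` are my own.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same value ranges as the search space used for the random forest
rf_param_grid = {
    "classifier__n_estimators": np.arange(100, 1000, 100),
    "classifier__max_features": np.arange(0.05, 1.05, 0.05),
    "classifier__max_depth": np.arange(4, 8, 1),
}
cv_folds = 4
n_iter = 25

# ParameterGrid enumerates every combination an exhaustive grid search would try
n_combinations = len(ParameterGrid(rf_param_grid))
grid_fits = n_combinations * cv_folds
random_fits = n_iter * cv_folds

print("Exhaustive grid:", n_combinations, "candidates,", grid_fits, "fits")
print("Randomized search:", n_iter, "candidates,", random_fits, "fits")
```

The trade-off is that a randomized search only samples the space, so it may miss the single best combination. In my experience it usually lands close enough at a fraction of the cost.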

Making Predictions on the test data

And before we finish, let us make predictions on the test data and see one final strength of the pipeline architecture: we can make the predictions in one line of code! There is no need to examine and preprocess individual columns of the test data.

# we have already loaded the test data into the test_data dataframe
# let us drop the columns we dropped from the train data, i.e. Name, Ticket and Cabin
test_data3 = test_data.drop(columns=['Name', 'Ticket', 'Cabin'])
# time to make predictions (the search object predicts with the best pipeline, refit on the training data)
pred = grid_mse_rf.predict(test_data3)
# now let us put this into the Kaggle submission format
rf_output = pd.DataFrame({'PassengerId': test_data.index, 'Survived': pred})
rf_output
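
The notebook stops at displaying `rf_output`. To upload it to Kaggle, the frame still has to be written out as a CSV with a header row and the two columns `PassengerId` and `Survived`. Here is a minimal sketch, assuming made-up ids and predictions in place of the real ones, that writes to an in-memory buffer so the result can be inspected.

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for test_data.index and pred
rf_output_demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# index=False keeps pandas from writing the row index as an extra column
buffer = io.StringIO()
rf_output_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file path instead of the buffer (for example a file name of your choice such as `submission.csv`) writes the same content to disk.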

This model will fetch you a top 10% spot in the Kaggle rankings, which is not bad considering we have avoided much of the heavy lifting by using Pipelines. I hope you find this article useful and are encouraged to explore more datasets with Scikit-learn’s pipeline architecture.