Machine Learning with Python: EDA, cleaning, feature engineering and ensemble model assessment

Andrea Castiglioni · Published in Analytics Vidhya · 9 min read · May 23, 2020

The Titanic dataset is a good playground for practicing the key skills of data science. Here I want to show a complete tutorial on exploratory data analysis, data cleaning, feature engineering and model selection with Python, using pandas, seaborn, matplotlib and finally sklearn. We will then approach hyperparameter tuning with GridSearchCV and finally stack our models with an ensemble.


The full code for this article is available on GitHub.

1. Start importing libraries

Useful libraries

OK, the first step is of course to import all the libraries we will need along the way: we will use pandas and seaborn to handle, clean and visualize our dataset. “re” will be useful to handle strings, and LabelEncoder will be used to encode categorical variables into numeric ones.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
import re

2. Exploratory data analysis

We import our dataset, downloaded from Kaggle:

df = pd.read_csv('train.csv')
df.head()
Head of the Titanic dataset.

So the columns report the PassengerId, the passenger's class, name, sex and age, whether they travelled with a sibling/spouse (SibSp) or a parent/child (Parch), the fare paid, the Cabin and finally the port where the passenger embarked.

We check the dtypes and non-null counts of our columns with the info method, and we see that Cabin is practically always missing while Age has some missing values. Also note that Embarked has two missing values.
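A minimal way to reproduce this check (assuming the DataFrame is named df as above):

df.info()
# explicit count of missing values per column
df.isnull().sum().sort_values(ascending=False)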

Before fixing these problems we want to do some exploratory data analysis. Let's check if we can find correlations between the variables we have and the output (survived yes/no). We can get a first glimpse of our data by plotting the correlation heatmap with seaborn.

The default method of correlation with .corr() is Pearson, but we can also use the Spearman correlation coefficient. The difference between the two is highlighted below, where we have a non-linear but monotonic relationship between the independent and dependent variable and the Spearman correlation is 1.

Difference between spearman and pearson correlations.
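As a quick illustrative sketch of this difference (not from the original notebook): a monotonic but non-linear relationship gives a Spearman coefficient of exactly 1, while Pearson stays below 1.

x = pd.Series(np.arange(1, 20))
y = x ** 3  # monotonic but non-linear
print('Pearson :', x.corr(y, method='pearson'))
print('Spearman:', x.corr(y, method='spearman'))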

We can use both of them with the .corr() method of a pandas DataFrame; the default is Pearson.
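A rough sketch of how the two heatmaps below can be drawn (numeric_only keeps only the numeric columns, which recent pandas versions require to be explicit about):

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
sns.heatmap(df.corr(method='pearson', numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm', ax=ax1)
ax1.set_title('Pearson')
sns.heatmap(df.corr(method='spearman', numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm', ax=ax2)
ax2.set_title('Spearman')
plt.show()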

Correlation heatmaps with the Spearman and Pearson methods, visualized with seaborn for the Titanic dataset.

From our heatmaps we can see that Survived has a negative correlation with Pclass and a positive one with Fare (both indicate that the more you paid, the more likely you were to survive). However, the correlation with Age, SibSp and Parch is almost zero. Since we have not encoded our variables yet, we don't have the correlation with Sex and Embarked.

We can draw nice-looking graphs with matplotlib and explore more variables like Sex and Pclass.

survivors_data = df[df.Survived==True]
non_survivors_data = df[df.Survived==False]
# count passengers of each sex for each survival status
survivors_gender = survivors_data.groupby(['Sex']).size().values
non_survivors_gender = non_survivors_data.groupby(['Sex']).size().values
# calculate totals for percentages
totals = survivors_gender + non_survivors_gender
# simple helper: element-wise share of the total
def calculate_percentage(values, totals):
    return values / totals
# percentage of the total for each survival status
data1_percentages = calculate_percentage(survivors_gender, totals)*100
data2_percentages = calculate_percentage(non_survivors_gender, totals)*100
gender_categories = ['Female', 'Male']
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))
# plot chart for count of survivors by gender
ax1.bar(range(len(survivors_gender)), survivors_gender, label='Survivors', alpha=0.5, color='g')
ax1.bar(range(len(non_survivors_gender)), non_survivors_gender, bottom=survivors_gender, label='Non-Survivors', alpha=0.5, color='r')
plt.sca(ax1)
plt.xticks([0, 1], gender_categories)
ax1.set_ylabel("Count")
ax1.set_xlabel("")
ax1.set_title("Count of survivors by gender", fontsize=14)
plt.legend(loc='upper left')
# plot chart for percentage of survivors by gender
ax2.bar(range(len(data1_percentages)), data1_percentages, alpha=0.5, color='g')
ax2.bar(range(len(data2_percentages)), data2_percentages, bottom=data1_percentages, alpha=0.5, color='r')
plt.sca(ax2)
plt.xticks([0, 1], gender_categories)
ax2.set_ylabel("Percentage")
ax2.set_xlabel("")
ax2.set_title("% of survivors by gender", fontsize=14)
plt.show()

And the same code applies to every other variable:

So we have a first idea of who most likely survived: females and upper-class passengers. With seaborn we can group together the information on Sex, Age and Pclass:
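One possible way to combine those three variables in a single figure (a sketch; the chart in the article may have been produced differently):

sns.catplot(x='Pclass', y='Age', hue='Sex', col='Survived',
            data=df, kind='violin', split=True)
plt.show()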

3. Data Cleaning

Let’s do some cleaning. We start with Age: since we saw that class and sex might impact our model, we will impute the mean of each (Pclass, Sex) group instead of the simple overall mean or median:

df['Filled_age'] = df['Age'].fillna(df.groupby(['Pclass','Sex'])['Age'].transform('mean'))

As for the Cabin, we will just keep what we have for now, retaining the deck by filling missing values with a placeholder string and then taking the first letter of the Cabin string. The less common decks are mapped to “other”.

df['Cabin'].fillna('Missing',inplace=True)
df['Deck'] = df.Cabin.apply(lambda x: x[0])
deck_mapping= {'M':'M','C':'C','B':'B','D':'D','E':'E','A':'other','F':'other','G':'other','T':'other'}
df['Deck'] = df['Deck'].map(deck_mapping)

As for the port of embarkation, we fill the missing values with the most common value.

df['Embarked'].fillna('S',inplace=True)
First part of the cleaned dataset.

4. Feature engineering

We get dummy encodings for Sex, Deck and Embarked; we also drop the original columns as well as PassengerId and Ticket:

dummies = pd.get_dummies(df[['Sex','Embarked','Deck']], drop_first=True)
df2 = pd.concat([df, dummies], axis=1).drop(['Sex','Embarked','Deck'], axis=1).drop(['PassengerId','Age','Cabin','Ticket'], axis=1)

Notice we passed “drop_first=True” to the get_dummies function. This drops one column for each category, since it is redundant (for this dataset, if Sex_male = 0 then the passenger is female).

Dataset with dummy variables.

Next we want to address the Name, Fare, SibSp and Parch variables. We recognize the possibility that families had different survival chances, and that people travelling alone had different chances as well:

df2['FamilySize'] = df2['Parch'] + df2['SibSp'] + 1
df2['IsAlone'] = 1  # initialize to yes/1 (travelling alone)
df2.loc[df2['FamilySize'] > 1, 'IsAlone'] = 0  # update to no/0 if family size is greater than 1

Next we extract the Title (‘Mr’, ‘Miss’, etc.) and the Surname from the Name. We also count the number of passengers sharing a surname, which should roughly match FamilySize; for that reason we also normalize the fare by the size of the family group:

df2['Surname'] = df2['Name'].str.split(',').str[0]
df2['Title'] = df2['Name'].str.split(',').str[-1].str.split('.').str[0].str.strip()  # strip the leading space left after the comma
df2['Surname_counts'] = df2.groupby('Surname')['Name'].transform('count')
df2['Surname_Family_counts'] = df2.groupby(['Surname','Parch','SibSp'])['Name'].transform('count')
df2['Norm_Fare'] = df2['Fare']/df2['Surname_Family_counts']

Next we bin some variables: for the Title we keep only the most frequent values, for the Fare we use pandas' quantile-cut method, and we also cut Age into 5 bins:

top5 = df2.Title.value_counts().nlargest(5).index
df2['Less_Title'] = df2.Title.where(df2.Title.isin(top5), other='Other')
df2['FareBin'] = pd.qcut(df2['Norm_Fare'], 4)
df2['AgeBin'] = pd.cut(df2['Filled_age'].astype(int), 5)
label = LabelEncoder()

df2['Title_Code'] = label.fit_transform(df2['Title'])
df2['AgeBin_Code'] = label.fit_transform(df2['AgeBin'])
df2['FareBin_Code'] = label.fit_transform(df2['FareBin'])
# Ticket2 and Ticket_num come from additional ticket features created in the full notebook on GitHub
df2.drop(['Name','Surname','Title','Ticket2','Filled_age','Ticket_num','Fare','FareBin','AgeBin','Less_Title'], axis=1, inplace=True)

Our dataset is now complete:

We can check again the correlations to see if we get some more insight.

Now we can see that Pclass, Sex_male and FareBin_Code are the most useful predictors.

5. Model selection

We import a bunch of modules!
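For reference, the imports used in the rest of the article are roughly the following (a sketch; the notebook on GitHub has the exact list):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import roc_auc_score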

To have something to compare our results against, we start with the simplest possible model: we keep it simple and pick just 4 variables, split our dataset into training and test sets, and then make predictions.
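A minimal sketch of such a baseline with logistic regression (the four columns below are an assumption for illustration; the notebook has the actual choice):

# assumed feature subset, for illustration only
baseline_cols = ['Pclass', 'Sex_male', 'FamilySize', 'FareBin_Code']
Xb, yb = df2[baseline_cols], df2['Survived']
Xb_train, Xb_test, yb_train, yb_test = train_test_split(Xb, yb, test_size=0.30, random_state=42)
baseline = LogisticRegression().fit(Xb_train, yb_train)
print('Baseline accuracy:', baseline.score(Xb_test, yb_test))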

81% accuracy is not bad, but we can improve. Let's check whether some more complicated models can perform better. We will use a DecisionTreeClassifier, a RandomForestClassifier, Bagging, AdaBoost and GradientBoosting classifiers, and a KNN.

lr = LogisticRegression()
dtree = DecisionTreeClassifier(random_state=42,max_depth=3,min_samples_leaf=1)
rfc = RandomForestClassifier(random_state=42,max_depth=3)
bc = BaggingClassifier(base_estimator=lr, n_estimators=300, n_jobs=-1)
knn = KNeighborsClassifier(n_neighbors=5)
ada = AdaBoostClassifier(base_estimator=dtree,n_estimators=300,random_state=42)
bg = GradientBoostingClassifier(random_state=42, max_depth=2, subsample=0.8,n_estimators=300)
estimators = [('dtree', dtree),('bc', bc), ('rfc', rfc), ('ada', ada),('bg',bg),
('lr', lr), ('knn',knn)]

We now compare each model; to do so, we organize our results in a DataFrame:

results = []
for name, model in estimators:
    model.fit(X_train_std, y_train)
    preds = model.predict(X_test_std)
    rocauc = metrics.roc_auc_score(y_test, preds)
    f1 = metrics.f1_score(y_test, preds)
    prec = metrics.precision_score(y_test, preds)
    rec = metrics.recall_score(y_test, preds)
    accu = metrics.accuracy_score(y_test, preds)

    try:
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
        scores_mean = scores.mean()
        scores_std = scores.std()
        predict_prob = model.predict_proba(X_test_std)[:, 1]
        roc_auc_model = roc_auc_score(y_test, predict_prob)

    except:
        scores_mean = np.nan
        scores_std = np.nan
        print('model', name, 'has no predict_proba')
        roc_auc_model = np.nan

    data = [name, accu, rocauc, roc_auc_model, f1, prec, rec, scores_mean, scores_std]
    results.append(data)
mod_res = pd.DataFrame(results, columns=['model','accuracy','rocauc','roc_auc_prob','f1','prec',
                                         'rec','scores_mean','scores_std']).sort_values(by='scores_mean', ascending=False)
Model scoring for the first model.

Tough game! RandomForest, DecisionTree, LogisticRegression and BaggingClassifier all have almost the same accuracy. RFC showed the highest precision, while LR and BC showed higher recall.

We can also check the feature importances of the RFC, for example. We recover some knowledge we already had: Sex was the most important feature, followed by Pclass and FamilySize.
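A quick sketch of how to inspect those importances (assuming rfc was fitted on X_train_std as above and X_train still holds the column names):

importances = pd.Series(rfc.feature_importances_, index=X_train.columns).sort_values()
importances.plot(kind='barh', figsize=(8, 6), title='RFC feature importances')
plt.show()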

Full Variables

Let’s check model performance with all variables:

target = 'Survived'
X = df2.drop('Survived', axis=1)
y = df2['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

We then use the same code as above with those results:

So DecisionTree and RandomForest are performing better, even though the accuracy is still around 82%. From the feature importances plot we see that Title_Code, together with FamilySize, Norm_Fare and Deck_M, all played a role in the model. How can we tune the models?

6. HyperParameters Tuning

GridSearchCV library

We will use GridSearchCV to tune our models' parameters. The default value of refit in GridSearchCV is True, so it will refit our models with the best parameters found.

GridSearchCV(
estimator,
param_grid,
scoring=None,
n_jobs=None,
iid='deprecated',
refit=True,
cv=None,
verbose=0,
pre_dispatch='2*n_jobs',
error_score=nan,
return_train_score=False,
)

Below is the code, split for each model. What we will do is create a grid of parameters for each model and let GridSearchCV find the best parameters with 6-fold cross-validation and scoring = ‘accuracy’.

cv = 6
scoring = 'accuracy'

print('knn')
param_grid = {'n_neighbors': np.arange(2, 30, 2),
              'p': [1, 2, 3]}
gs = GridSearchCV(knn, param_grid, scoring=scoring, n_jobs=-1, cv=cv, verbose=1)
gs.fit(X_train, y_train)
print('best score:', gs.best_score_, ' best params', gs.best_params_, ' best test score', gs.score(X_test, y_test))

print('LR')
param_grid = {'penalty': ['l1', 'l2', 'elasticnet'],
              'C': [1, 2, 2.2, 2.5],
              'class_weight': [None, 'balanced']}
gs = GridSearchCV(lr, param_grid, scoring=scoring, n_jobs=-1, cv=cv, verbose=1)
gs.fit(X_train, y_train)
print('best score:', gs.best_score_, ' best params', gs.best_params_, ' best test score', gs.score(X_test, y_test))

print('DTREE')
param_grid = {'max_depth': [3, 4, 5, 6, 7],
              'class_weight': [None, 'balanced'],
              'criterion': ['gini'],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [0.005, 0.01],
              'ccp_alpha': [0, 0.005, 0.007]}
gs = GridSearchCV(dtree, param_grid, scoring=scoring, n_jobs=-1, cv=cv, verbose=1)
gs.fit(X_train, y_train)
print('best score:', gs.best_score_, ' best params', gs.best_params_, ' best test score', gs.score(X_test, y_test))

print('RFC')
param_grid = {'max_depth': [5, 6, 7],
              'class_weight': [None],
              'criterion': ['entropy'],
              'min_samples_split': [2, 4],
              'min_samples_leaf': [0.01, 0.1],
              'ccp_alpha': [0, 0.005],
              'n_estimators': [450, 500, 550]}
gs = GridSearchCV(rfc, param_grid, scoring=scoring, n_jobs=-1, cv=cv, verbose=1)
gs.fit(X_train, y_train)
print('best score:', gs.best_score_, ' best params', gs.best_params_, ' best test score', gs.score(X_test, y_test))

print('BG')
param_grid = {'max_depth': [3, 5, 6, 7],
              'n_estimators': [75, 150, 350],
              'subsample': [0.5, 0.65, 0.8]}
gs = GridSearchCV(bg, param_grid, scoring=scoring, n_jobs=-1, cv=cv, verbose=1)
gs.fit(X_train, y_train)
print('best score:', gs.best_score_, ' best params', gs.best_params_, ' best test score', gs.score(X_test, y_test))

It will take a while. The output will be something like this:

The score is:

We reached 82% with the decision tree classifier, followed by the random forest and bagging classifiers. Looking at the average CV scores, the bagging classifier seems the best.

Ensemble

What we can do now is group together the best classifiers and see if we can improve our scores. We create three ensembles of models. The first will use voting=’hard’, the second and third will use voting=’soft’. In the third model we set the weights of each model, giving a slight advantage to one of them.

v_ests = [('rfc', rfc), ('bg', bg), ('lr', lr), ('dtree', dtree)]
eclf = VotingClassifier(estimators=v_ests, voting='hard')
eclfsoft = VotingClassifier(estimators=v_ests, voting='soft')
v_ests_w = [('rfc', rfc), ('bg', bg), ('dtree', dtree)]
eclfsoft_w = VotingClassifier(estimators=v_ests_w, voting='soft', weights=[1, 2, 1])
estimators = [('Hard', eclf), ('Soft', eclfsoft), ('Soft_w', eclfsoft_w)]
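The three ensembles can then be scored with the same comparison loop used earlier; a minimal sketch:

for name, model in estimators:
    model.fit(X_train_std, y_train)
    cv_scores = cross_val_score(model, X, y, scoring='accuracy', cv=5)
    print(name, 'test accuracy:', model.score(X_test_std, y_test),
          'CV mean:', round(cv_scores.mean(), 3), '+/-', round(cv_scores.std(), 3))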

From the table below we see that, looking at the mean CV scores, we could do a little better than before: we reached 83%, which is not bad.


Conclusions

This tutorial covered most aspects of a typical workflow. It is far from perfect, and the intention is just to review the steps needed to clean a dataset, create new features, assess their relevance and implement and tune a model.
