# Titanic Challenge — Machine Learning from Disaster — Part 2

## Part 2 — Predictive Model Building

Jan 14 · 11 min read

# II — Feature engineering

In the previous part, we flirted with the data and spotted some interesting correlations.

In this part, we’ll see how to process and transform these variables in such a way that the data becomes manageable by a machine learning algorithm.

We’ll also create, or “engineer” additional features that will be useful in building the model.

We’ll see along the way how to process text variables like the passenger names and integrate this information into our model.

We will break our code into separate functions for clarity.

One trick when starting a machine learning problem is to append the test set to the training set so they can be processed together.

We’ll engineer new features using the train set to prevent information leakage. Then we’ll add these variables to the test set.

Let’s load the train and test sets and append them together.
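A minimal sketch of this step (the helper name `get_combined_data` is an assumption, and tiny stand-in frames replace the real `pd.read_csv("train.csv")` / `pd.read_csv("test.csv")` calls):

```python
import pandas as pd

def get_combined_data(train, test):
    # Set the target aside, drop it from the train set, then stack train on test
    targets = train["Survived"]
    combined = pd.concat([train.drop("Survived", axis=1), test],
                         ignore_index=True)
    return combined, targets

# Tiny stand-in frames; in the notebook these come from
# pd.read_csv("train.csv") and pd.read_csv("test.csv")
train = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]})
test = pd.DataFrame({"Age": [34.0]})
combined, targets = get_combined_data(train, test)
print(combined.shape)  # (3, 1)
```

Keeping the targets aside before concatenating avoids accidentally using the label as a feature.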

Let’s have a look at the shape:

train and test sets are combined.

You may notice that the total number of rows (1309) is the exact sum of the number of rows in the train set (891) and the test set (418).

When looking at the passenger names one could wonder how to process them to extract useful information.

If you look closely at these first examples:

• Braund, Mr. Owen Harris
• Heikkinen, Miss. Laina
• Oliva y Ocana, Dona. Fermina
• Peter, Master. Michael J

You will notice that each name has a title in it! It can be a simple Miss. or Mrs., but it can sometimes be something more sophisticated like Master, Sir or Dona. In that case, we can introduce additional information about social status by simply parsing the name, extracting the title, and converting it to a categorical variable.

Let’s see how we’ll do that in the function below.

Let’s first see what the different titles are in the train set.

```python
print(titles)
# set(['Sir', 'Major', 'the Countess', 'Don', 'Mlle', 'Capt', 'Dr', 'Lady',
#      'Rev', 'Mrs', 'Jonkheer', 'Master', 'Ms', 'Mr', 'Mme', 'Miss', 'Col'])
```
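The extraction itself can be sketched as follows, illustrated on the example names above; the parsing rule is that the title sits between the comma and the next period:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
])
# The title sits between the comma and the following period
titles = set(names.map(lambda n: n.split(",")[1].split(".")[0].strip()))
print(titles)  # {'Mr', 'Miss', 'Dona'} (in some order)
```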

This function parses the names and extracts the titles. Then, it maps the titles to categories of titles. We selected:

• Officer
• Royalty
• Mr
• Mrs
• Miss
• Master

Let’s run it!

Let’s check if the titles have been filled correctly.

There is indeed a NaN value in row 1305. In fact, the corresponding name is Oliva y Ocana, Dona. Fermina.

This title was not encountered in the training dataset.

Perfect. Now we have an additional column called Title that contains the information.

# Processing the ages

We have seen in the first part that the Age variable was missing 177 values. This is a large number (~13% of the dataset). Simply replacing them with the mean or the median age might not be the best solution since the age may differ by groups and categories of passengers.

To understand why, let’s group our dataset by Sex, Title, and passenger class, and compute the median age for each subset.

To avoid data leakage from the test set, we fill in the missing ages in the train set using values computed from the train set itself, and we fill in the ages in the test set using values computed from the train set as well.
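The grouped-median table can be computed as follows (a tiny stand-in frame replaces the real train set; its values are chosen to reproduce the two examples given further down):

```python
import pandas as pd

# Stand-in for the train set; real values come from the Kaggle data
train = pd.DataFrame({
    "Sex":    ["female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 3],
    "Title":  ["Royalty", "Royalty", "Mr", "Mr", "Mr"],
    "Age":    [40, 41, 24, 26, 28],
})
# Median age per (Sex, Pclass, Title) group, computed on the train set only
grouped_median_train = (train.groupby(["Sex", "Pclass", "Title"])["Age"]
                        .median()
                        .reset_index())
print(grouped_median_train)
```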

Number of missing ages in the train set

Number of missing ages in the test set

This data frame will help us impute missing age values based on different criteria.

Look at the median age column and see how this value can be different based on the Sex, Pclass and Title put together.

For example:

• If the passenger is female, from Pclass 1, and from royalty the median age is 40.5.
• If the passenger is male, from Pclass 3, with a Mr title, the median age is 26.

Let’s create a function that fills in the missing ages in the combined dataset based on these different attributes.
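A sketch of that function (the names `fill_age` and `process_age` are assumptions; the grouped-median frame here is a tiny stand-in computed from the train set):

```python
import pandas as pd

def fill_age(row, grouped_median):
    # Median age of this passenger's (Sex, Pclass, Title) group in the train set
    cond = ((grouped_median["Sex"] == row["Sex"])
            & (grouped_median["Pclass"] == row["Pclass"])
            & (grouped_median["Title"] == row["Title"]))
    return grouped_median.loc[cond, "Age"].values[0]

def process_age(df, grouped_median):
    df = df.copy()
    # Only rows with a missing Age get the group median; known ages are kept
    df["Age"] = df.apply(
        lambda r: fill_age(r, grouped_median) if pd.isna(r["Age"]) else r["Age"],
        axis=1)
    return df

grouped_median = pd.DataFrame({"Sex": ["male"], "Pclass": [3],
                               "Title": ["Mr"], "Age": [26.0]})
df = pd.DataFrame({"Sex": ["male", "male"], "Pclass": [3, 3],
                   "Title": ["Mr", "Mr"], "Age": [float("nan"), 40.0]})
df = process_age(df, grouped_median)
print(df["Age"].tolist())  # [26.0, 40.0]
```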

Perfect. The missing ages have been replaced.

However, we notice a missing value in Fare, two missing values in Embarked and a lot of missing values in Cabin. We’ll come back to these variables later.

Let’s now process the names.

This function drops the Name column, since we created a Title column and won’t be using the raw names anymore.

Then we encode the title values using a dummy encoding.

As you can see :

• there is no longer a name feature.
• new variables (Title_X) appeared. These features are binary.
• For example, if Title_Mr = 1, the corresponding Title is Mr.
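Sketched with pandas (`process_names` is an assumed name; `pd.get_dummies` does the dummy encoding):

```python
import pandas as pd

def process_names(df):
    df = df.copy()
    # Title now carries the useful information, so Name can be dropped
    df = df.drop("Name", axis=1)
    # One binary Title_X column per title category
    titles_dummies = pd.get_dummies(df["Title"], prefix="Title")
    df = pd.concat([df, titles_dummies], axis=1).drop("Title", axis=1)
    return df

df = process_names(pd.DataFrame({"Name": ["Braund, Mr. Owen Harris"],
                                 "Title": ["Mr"]}))
print(df.columns.tolist())  # ['Title_Mr']
```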

# Processing Fare

Let’s impute the missing fare value with the average fare computed on the train set.

This function simply replaces one missing Fare value by the mean.
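A minimal sketch (`process_fares` is an assumed name, and the mean fare of 32.2 is illustrative; the real value is computed from the train set):

```python
import pandas as pd

def process_fares(df, train_mean_fare):
    df = df.copy()
    # Replace the single missing Fare with the mean computed on the train set
    df["Fare"] = df["Fare"].fillna(train_mean_fare)
    return df

df = process_fares(pd.DataFrame({"Fare": [7.25, None]}), train_mean_fare=32.2)
print(df["Fare"].tolist())  # [7.25, 32.2]
```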

# Processing Embarked

This function replaces the two missing values of Embarked with the most frequent Embarked value.
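A sketch of this step (`process_embarked` is an assumed name; the stand-in series mimics the train column, whose most frequent value in the real data is 'S'):

```python
import pandas as pd

def process_embarked(df, train_embarked):
    df = df.copy()
    # Most frequent port in the train set ('S' in the Titanic data)
    most_frequent = train_embarked.mode()[0]
    df["Embarked"] = df["Embarked"].fillna(most_frequent)
    return df

train_embarked = pd.Series(["S", "S", "C", "Q"])  # stand-in for the train column
df = process_embarked(pd.DataFrame({"Embarked": ["C", None]}), train_embarked)
print(df["Embarked"].tolist())  # ['C', 'S']
```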

# Processing Cabin

We don’t have any cabin letter in the test set that is not present in the train set.

This function replaces NaN values with U (for Unknown). It then maps each Cabin value to its first letter. Then it encodes the cabin values using dummy encoding again.
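Sketched as follows (`process_cabin` is an assumed name):

```python
import pandas as pd

def process_cabin(df):
    df = df.copy()
    # Unknown cabins get the placeholder 'U', then keep the deck letter only
    df["Cabin"] = df["Cabin"].fillna("U").map(lambda c: c[0])
    cabin_dummies = pd.get_dummies(df["Cabin"], prefix="Cabin")
    df = pd.concat([df, cabin_dummies], axis=1).drop("Cabin", axis=1)
    return df

df = process_cabin(pd.DataFrame({"Cabin": ["C85", None]}))
print(df.columns.tolist())  # ['Cabin_C', 'Cabin_U']
```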

Ok, no missing values now.

# Processing Sex

This function maps the string values male and female to 1 and 0 respectively.
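A one-line mapping does the job (`process_sex` is an assumed name):

```python
import pandas as pd

def process_sex(df):
    df = df.copy()
    # male -> 1, female -> 0
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})
    return df

df = process_sex(pd.DataFrame({"Sex": ["male", "female"]}))
print(df["Sex"].tolist())  # [1, 0]
```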

# Processing Pclass

This function encodes the values of Pclass (1,2,3) using a dummy encoding.
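Sketched with `pd.get_dummies` (`process_pclass` is an assumed name):

```python
import pandas as pd

def process_pclass(df):
    df = df.copy()
    # One binary Pclass_X column per class
    pclass_dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")
    df = pd.concat([df, pclass_dummies], axis=1).drop("Pclass", axis=1)
    return df

df = process_pclass(pd.DataFrame({"Pclass": [1, 3]}))
print(df.columns.tolist())  # ['Pclass_1', 'Pclass_3']
```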

# Processing Ticket

Let’s first see what different ticket prefixes we have in our dataset.
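A common way to extract the prefix (the helper name `clean_ticket` and the `'XXX'` placeholder for purely numeric tickets are assumptions):

```python
def clean_ticket(ticket):
    # Strip '.' and '/', keep the non-numeric prefix, or 'XXX' if purely numeric
    ticket = ticket.replace(".", "").replace("/", "")
    parts = [p for p in ticket.split() if not p.isdigit()]
    return parts[0] if parts else "XXX"

print(clean_ticket("STON/O2. 3101282"))  # STONO2
print(clean_ticket("113803"))            # XXX
```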

# Processing Family

This part includes creating new variables based on the size of the family (the size is, by the way, another variable we create).

This creation of new variables is done under a realistic assumption: Large families are grouped together, hence they are more likely to get rescued than people traveling alone.

This function introduces 4 new features:

• FamilySize: the total number of relatives, including the passenger (him/her)self.
• Singleton: a boolean variable describing families of size = 1
• SmallFamily: a boolean variable describing families of 2 <= size <= 4
• LargeFamily: a boolean variable describing families of 5 <= size
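These four features can be sketched as follows (`process_family` is an assumed name; FamilySize is SibSp + Parch + 1 for the passenger):

```python
import pandas as pd

def process_family(df):
    df = df.copy()
    # SibSp + Parch + the passenger him/herself
    df["FamilySize"] = df["Parch"] + df["SibSp"] + 1
    df["Singleton"] = (df["FamilySize"] == 1).astype(int)
    df["SmallFamily"] = df["FamilySize"].between(2, 4).astype(int)
    df["LargeFamily"] = (df["FamilySize"] >= 5).astype(int)
    return df

df = process_family(pd.DataFrame({"Parch": [0, 1, 4], "SibSp": [0, 2, 1]}))
print(df["FamilySize"].tolist())  # [1, 4, 6]
```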

We end up with a total of 67 features.

# III — Modeling

In this part, we use our knowledge of the passengers based on the features we created and then build a statistical model. You can think of this model as a box that crunches the information of any new passenger and decides whether or not they survive.

There is a wide variety of models to use, from logistic regression to decision trees and more sophisticated ones such as random forests and gradient boosted trees.

We’ll be using Random Forests. They have proven highly effective in Kaggle competitions.

Back to our problem, we now have to:

1. Break the combined dataset into a train set and a test set.
2. Use the train set to build a predictive model.
3. Evaluate the model using the train set.
4. Test the model using the test set and generate an output file for the submission.

Keep in mind that we’ll have to reiterate on 2. and 3. until an acceptable evaluation score is achieved.

Let’s start by importing useful libraries.

To evaluate our model, we’ll use 5-fold cross-validation with accuracy, since that’s the metric the competition uses on the leaderboard.

To do that, we’ll define a small scoring function.
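A sketch of such a scoring function (`compute_score` is an assumed name; iris stands in for the Titanic train set so the example runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compute_score(clf, X, y, scoring="accuracy"):
    # Mean score over 5 cross-validation folds
    return cross_val_score(clf, X, y, cv=5, scoring=scoring).mean()

# Demonstrated on iris as a stand-in for the Titanic train set
X, y = load_iris(return_X_y=True)
score = compute_score(RandomForestClassifier(n_estimators=10, random_state=0),
                      X, y)
print(round(score, 3))
```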

Recovering the train set and the test set from the combined dataset is an easy task.
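It amounts to slicing the combined frame at the original train size (`recover_train_test_target` is an assumed name; the real cut point is 891 rows):

```python
import pandas as pd

def recover_train_test_target(combined, n_train, targets):
    # The first n_train rows of `combined` are the original train set
    # (n_train is 891 for the Kaggle data)
    return combined.iloc[:n_train], combined.iloc[n_train:], targets

combined = pd.DataFrame({"Age": [22.0, 38.0, 34.0]})
train, test, targets = recover_train_test_target(combined, 2, pd.Series([0, 1]))
print(train.shape, test.shape)  # (2, 1) (1, 1)
```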

# Feature selection

We’ve come up with more than 30 features so far. This number is quite large.

When feature engineering is done, we usually tend to decrease the dimensionality by selecting the "right" number of features that capture the essentials.

In fact, feature selection comes with many benefits:

• It decreases redundancy among the data
• It speeds up the training process
• It reduces overfitting

Tree-based estimators can be used to compute feature importance, which in turn can be used to discard irrelevant features.

Let’s have a look at the importance of each feature.
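The pattern looks like this (iris stands in for the engineered train set; in the notebook, the classifier is fit on the Titanic features and the frame lists Title_Mr, Age, Fare, etc.):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Iris stands in for the engineered Titanic train set
X, y = load_iris(return_X_y=True, as_frame=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Rank features by the impurity-based importance the forest computed
features = (pd.DataFrame({"feature": X.columns,
                          "importance": clf.feature_importances_})
            .sort_values("importance", ascending=False))
print(features)
```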

As you may notice, there is great importance linked to Title_Mr, Age, Fare, and Sex.

There is also a surprisingly high importance attached to PassengerId.

Let’s now transform our train set and test set in a more compact dataset.
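One way to do this is scikit-learn's `SelectFromModel`, which keeps only the features whose importance exceeds a threshold (the mean importance by default); iris again stands in for the real data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Keep only the features whose importance is above the mean importance
model = SelectFromModel(clf, prefit=True)
X_reduced = model.transform(X)
print(X.shape, "->", X_reduced.shape)
```

The same `model.transform` call is applied to both the train set and the test set so they keep identical columns.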

Yay! Now we’re down to a lot fewer features.

We’ll see if we’ll use the reduced or the full version of the train set.

# Hyperparameters tuning

As mentioned at the beginning of the Modeling part, we will be using a Random Forest model. It may not be the best model for this task, but we’ll show how to tune it. The same workflow can be applied to different models.

Random Forests are quite handy. They do, however, come with some parameters to tweak in order to get an optimal model for the prediction task.

Additionally, we’ll use the full train set.
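The tuning can be sketched with `GridSearchCV` (the parameter grid below is illustrative, not the article's exact grid; iris stands in for the train set so the example runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)  # stand-in for the engineered train set

# Illustrative grid; the real search would cover more values
parameter_grid = {
    "max_depth": [4, 6, 8],
    "n_estimators": [10, 50],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 10],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=parameter_grid,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
grid_search.fit(X, y)
print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```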

Now that the model is built by scanning several combinations of the hyperparameters, we can generate an output file to submit on Kaggle.
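Generating the submission amounts to predicting on the processed test set and writing a two-column CSV (sketched with a tiny stand-in model and data; the file name `gridsearch_rf.csv` is illustrative, and Kaggle's test PassengerIds start at 892):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-ins; in the notebook these are grid_search.best_estimator_
# and the processed Kaggle test set
clf = RandomForestClassifier(n_estimators=10, random_state=0)
X_train = pd.DataFrame({"Sex": [0, 1, 0, 1], "Age": [22, 30, 40, 8]})
clf.fit(X_train, [1, 0, 1, 0])

X_test = pd.DataFrame({"Sex": [0, 1], "Age": [25, 50]})
passenger_ids = [892, 893]  # the Kaggle test set starts at PassengerId 892

# Kaggle expects exactly two columns: PassengerId and Survived
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": clf.predict(X_test).astype(int),
})
submission.to_csv("gridsearch_rf.csv", index=False)
print(submission.shape)  # (2, 2)
```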

# IV — Conclusion

In this article, we explored an interesting dataset brought to us by Kaggle.

We went through the basic bricks of a data science pipeline:

• Data exploration and visualization: an initial step to formulating hypotheses
• Data cleaning
• Feature engineering
• Feature selection
• Hyperparameters tuning
• Submission
• Blending

Here is what I suggest for the next steps:

• Dig more in the data and eventually build new features.
• Try different models: logistic regressions, Gradient Boosted trees, XGboost, …
• Try ensemble learning techniques (stacking)
• Run auto-ML frameworks

I would be more than happy if you could find out a way to improve my solution. This could make me update the article and definitely give you credit for that. So feel free to post a comment.

Written by

## Towards AI

#### Towards AI is the world’s fastest-growing AI community for learning, programming, building and implementing AI.
