How To Score ~80% Accuracy in Kaggle’s Spaceship Titanic Competition

A step-by-step guide to preparing and submitting a .csv file of predictions to Kaggle's new Spaceship Titanic competition.

Zaynab Awofeso
CodeX
13 min read · Jun 13, 2022


Image from Unsplash

Introduction

Kaggle recently launched a fun competition called Spaceship Titanic. It is designed to be an update of the popular Titanic competition, which helps people new to data science learn the basics of machine learning, get acquainted with Kaggle's platform, and meet others in the community. This article is a beginner-friendly analysis of the Spaceship Titanic Kaggle competition. It covers the steps to obtain meaningful insights from the data and to predict the "ground truth" for the test set with an accuracy of ~80% using RandomForestClassifier.

Index

  1. Problem definition and metrics
  2. About the data
  3. Exploratory Data Analysis
  4. Data Cleaning and preprocessing
  5. Feature Extraction and Feature Selection
  6. Baseline Model Performance and Model Building
  7. Submission and Feature Importance

1. Problem definition and metrics

First, we have to understand the problem. It's the year 2912 and the interstellar passenger liner Spaceship Titanic has collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension! To help rescue crews retrieve the lost passengers, we are challenged to use records recovered from the spaceship's damaged computer system to predict which passengers were transported to another dimension.

This is a binary classification problem: for each passenger, we have to predict whether or not they were transported to an alternate dimension. We will use accuracy as the metric to evaluate our results.

2. About the data

We will be using 3 CSV files:

  • train file (spaceship_titanic_train.csv) — contains personal records of the passengers that will be used to build the machine learning model.
  • test file (spaceship_titanic_test.csv) — contains personal records for the remaining one-third (~4300) of the passengers, but not the target variable (i.e. the value of Transported for the passengers). It will be used to see how well our model performs on unseen data.
  • sample submission file (sample_submission.csv) — contains the format in which we have to submit our predictions.

We will be using python for this problem. You can download the dataset from Kaggle here.

Import required libraries

Reading Data

Let’s make a copy of the train and test data so that even if we make any changes to these datasets it would not affect the original datasets.
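The copy step can be sketched as below. The DataFrame here is a tiny synthetic stand-in for the real file (an assumption for illustration); in the notebook you would load the data with `pd.read_csv` instead.

```python
import pandas as pd

# Tiny synthetic stand-in for the real file; in the notebook this would be
# train_data = pd.read_csv("spaceship_titanic_train.csv"), and likewise for test.
train_data = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01"],
    "Age": [39.0, 24.0],
    "Transported": [False, True],
})

# Work on a copy so later cleaning steps never touch the original dataset.
train = train_data.copy()
train["Age"] = train["Age"] + 1  # example change applied to the copy only
```

`DataFrame.copy()` returns a deep copy by default, so mutating `train` leaves `train_data` untouched.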

We will look at the structure of the train and test dataset next. We will first check the features present, then we will look at their data types.

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
'Name', 'Transported'],
dtype='object')

We have 13 independent variables and 1 target variable (Transported) in the training dataset. Let’s also look at the columns of the test dataset.

Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
'Name'],
dtype='object')

The test dataset has the same features as the training dataset except Transported, which we will predict using the model built from the training data.

Given below is the description for each variable.

  • PassengerId — A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
  • HomePlanet — The planet the passenger departed from, typically their planet of permanent residence.
  • CryoSleep — Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
  • Cabin — The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
  • Destination — The planet the passenger will be debarking to.
  • Age — The age of the passenger.
  • VIP — Whether the passenger has paid for special VIP service during the voyage.
  • RoomService, FoodCourt, ShoppingMall, Spa, VRDeck — Amount the passenger has billed at each of the Spaceship Titanic’s many luxury amenities.
  • Name — The first and last names of the passenger.
  • Transported — Whether the passenger was transported to another dimension. This is the target, the column we are trying to predict.

Let’s print data types for each variable of the training dataset.

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

We can see there are three formats of data types in the training dataset:

  • object (categorical variables) — The categorical variables in the training dataset are PassengerId, HomePlanet, CryoSleep, Cabin, Destination, VIP and Name
  • float64 (numerical variables with decimal values) — The numerical variables are Age, RoomService, FoodCourt, ShoppingMall, Spa and VRDeck
  • bool (Boolean variables, i.e. variables with one of two possible values, True or False) — The Boolean variable is Transported

Let’s look at the shape of our train and test dataset.

The shape of the train dataset is:  (8693, 14)
The shape of the test dataset is: (4277, 13)

We have 8693 rows and 14 columns in the training dataset and 4277 rows and 13 columns in the test dataset.

3. Exploratory Data Analysis

Univariate Analysis

Univariate analysis is the simplest form of data analysis, where we examine each variable individually to understand the distribution of its values.

Target Variable

We will first look at the target variable i.e. Transported. Since it is a categorical variable, let us look at its percentage distribution and bar plot.
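The distribution shown below can be produced with `value_counts`; this is a minimal sketch using a small stand-in Series rather than the full column.

```python
import pandas as pd

# Illustrative stand-in for train["Transported"].
transported = pd.Series([True, False, True, False])

# Percentage distribution of the target; the output below comes from the
# same call on the full training set.
dist = transported.value_counts(normalize=True)
print(dist)
# dist.plot(kind="bar")  # bar plot of the same distribution
```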

True     0.503624
False    0.496376
Name: Transported, dtype: float64

Out of 8693 passengers in the train dataset, 4378 (about 50%) were Transported to another dimension.

Let’s visualize the Independent categorical features next.

Independent Variable (Categorical)

It can be inferred from the bar plots above that:

  • About 50% of the passengers in the training set departed from Earth
  • About 30% of the passengers in the training set were in CryoSleep (i.e. confined to their cabins)
  • About 69% of the passengers in the training set were going to TRAPPIST-1e
  • Less than 1% of the passengers in the training set paid for VIP services

The cabin column takes the form deck/num/side. So, let’s extract and visualize the CabinDeck and CabinSide features.
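One way to extract those features (a sketch on a small stand-in frame, not necessarily the author's exact code) is to split the Cabin string on `/`:

```python
import pandas as pd

# Cabin takes the form deck/num/side, so splitting on "/" recovers the parts.
train = pd.DataFrame({"Cabin": ["B/0/P", "F/3/S", None]})

parts = train["Cabin"].str.split("/", expand=True)
train["CabinDeck"] = parts[0]   # deck letter
train["CabinSide"] = parts[2]   # P (Port) or S (Starboard)
print(train[["CabinDeck", "CabinSide"]])
```

Missing cabins simply stay missing: `str.split` propagates NaN into every expanded column.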

We can infer from the plot above that:

  • About 60% of the passengers in the training set were on decks F and G
  • There is not much difference between the percentage of passengers on Cabin side S and side P

We have seen the categorical variables. Now let’s visualize the numerical variables.

Age

There are outliers in the Age variable and the distribution is fairly normal.

Room Service

We can see that most of the data in the distribution of RoomService are towards the left, which means it is not normally distributed, and there are a lot of outliers. We will try to make it normal later.

Spa

Spa has a similar distribution to RoomService: it contains a lot of outliers and is not normally distributed.

RoomService, FoodCourt, ShoppingMall, Spa and VRDeck all record the amount the passenger billed at the Spaceship Titanic's luxury amenities, so let's see whether VRDeck, FoodCourt and ShoppingMall have a similar distribution.

VRDeck

FoodCourt

ShoppingMall

We can see that VRDeck, FoodCourt and ShoppingMall have a similar distribution: none of them is normally distributed, and they all have outliers.

Bivariate Analysis

After looking at every variable individually, we will explore them again to see their relationship with the target variable. First, we will find the relationship between the categorical variables and the target variable.

To do this, we will first create a dataframe to store the number of passengers transported, and the percentage transported, for each categorical variable.
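A minimal sketch of that summary table, using a four-row stand-in for the training set (the column names here are an assumption):

```python
import pandas as pd

# Four illustrative rows standing in for the full training set.
train = pd.DataFrame({
    "HomePlanet": ["Europa", "Europa", "Earth", "Earth"],
    "Transported": [True, True, True, False],
})

# Count and percentage of transported passengers per category value.
summary = train.groupby("HomePlanet")["Transported"].agg(
    Transported="sum", PercentTransported="mean"
)
summary["PercentTransported"] *= 100
print(summary)
```

Summing a Boolean column counts the `True` values, and its mean is the transported fraction, which makes this a one-liner per categorical variable.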

Now, let’s see how the categorical variables relate to transported.

We can infer that:

  • About 64% of the Passengers from Europa were Transported
  • About 78% of the Passengers in CryoSleep were transported
  • The proportion of Passengers debarking to 55 Cancri e transported to another dimension is greater compared to those debarking to PSO J318.5–22 and TRAPPIST-1e
  • About 38% of the Passengers that paid for special VIP services were transported

Next, let's look at how the CabinDeck and CabinSide columns relate to Transported. We will follow the same steps as above.

  • Cabin Deck B and C have the highest percentage of passengers transported
  • The proportion of Passengers in Cabin Side S transported to another dimension is greater compared to those in Cabin Side P

The PassengerId column takes the form gggg_pp, where gggg indicates the group the passenger is travelling with and pp their number within the group. We want to know how the number of people in a group relates to whether they were transported, so we will extract a PassengerGroup feature from the PassengerId column, compute the number of people in each group, and then visualize how it relates to the Transported feature.
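The group features can be sketched like this (a minimal stand-in frame; the derived column names are assumptions):

```python
import pandas as pd

train = pd.DataFrame({"PassengerId": ["0001_01", "0002_01", "0002_02"]})

# gggg_pp -> the gggg prefix identifies the travel group.
train["PassengerGroup"] = train["PassengerId"].str.split("_").str[0]

# Number of passengers sharing each group.
train["GroupSize"] = train.groupby("PassengerGroup")["PassengerGroup"].transform("count")

# A passenger in a group of one is travelling alone.
train["IsAlone"] = (train["GroupSize"] == 1).map({True: "Alone", False: "Not Alone"})
print(train)
```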

There is no clear pattern between the number of people in a group and whether they were transported. So, we will instead look at whether travelling alone affects the outcome.

It seems more passengers that were not alone were transported to another dimension compared to passengers that were alone.

The Name column contains the first and last names of the passenger. So, let's extract each passenger's family name (last name) to see whether family size affects the chance of being transported.

The percentage of smaller families transported is higher than that of larger families. One possible explanation is that smaller families are wealthier. Let's see how family size relates to spending.

To do this we will add up the amounts each passenger billed at each of the Spaceship Titanic's luxury amenities, then plot the total against FamilySizeCat.
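The total-spend column can be computed as a row-wise sum over the five amenity columns; a sketch on a two-row stand-in frame (the TotalSpendings name is an assumption):

```python
import pandas as pd

amenities = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
train = pd.DataFrame({
    "RoomService": [0.0, 100.0],
    "FoodCourt": [0.0, 50.0],
    "ShoppingMall": [0.0, 0.0],
    "Spa": [0.0, 25.0],
    "VRDeck": [0.0, 0.0],
})

# Total amount billed across all five luxury amenities.
train["TotalSpendings"] = train[amenities].sum(axis=1)
print(train["TotalSpendings"])
```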

Our hypothesis seems to be correct. It seems passengers with a smaller family size are wealthier.

Now let’s visualize numerical independent variables with respect to the target variable.

It looks like a higher percentage of passengers between the ages of 0 and about 4 were transported compared to older passengers. We will create a new column, AgeCat, to confirm that more younger passengers were transported than older ones.
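Age binning can be done with `pd.cut`; the exact bin edges below are an assumption chosen to match the 0–4 and 5–12 ranges discussed here, not necessarily the author's.

```python
import pandas as pd

train = pd.DataFrame({"Age": [2, 8, 30, 70]})

# Bin ages into categories; these particular bin edges are an assumption
# chosen to match the 0-4 and 5-12 ranges discussed above.
train["AgeCat"] = pd.cut(
    train["Age"],
    bins=[0, 4, 12, 18, 60, 100],
    labels=["0-4", "5-12", "13-18", "19-60", "60+"],
    include_lowest=True,
)
print(train)
```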

We can infer from the plot above that:

  • about 74% of passengers within the Age range of 0–4 were transported
  • about 60% of passengers within the Age range of 5–12 were transported

Now, let's do the same for the remaining numerical independent variables.

Observations:

  • The amounts billed by transported passengers are concentrated near zero.
  • VRDeck, Spa and RoomService appear to have a similar distribution, while ShoppingMall and FoodCourt appear to have a similar distribution.

We have seen how family size affects expenditure. Now let's see how being in CryoSleep relates to expenditure.

It can be seen from the plot above that passengers in CryoSleep have 0 expenditure. Now let’s see how VIP status affects expenditure.

It can be seen that passengers with VIP status have a higher expenditure compared to passengers without it.

Let’s also see how the age category relates to total spending of a passenger.

From the plot above it can be inferred that:

  • Passengers within the age range of 0–12 had zero expenditure
  • Expenditure increases with age

4. Data Cleaning and Preprocessing

After exploring the variables in our data, we can now impute the missing values and treat the outliers.

First, let’s drop the columns we created for the exploratory data analysis.

We will combine the train and test data to make cleaning and preprocessing easier.
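A sketch of the combine step on minimal stand-in frames; recording the train length up front (the `ntrain` name is an assumption) is what later lets us split the frames apart again:

```python
import pandas as pd

# Minimal stand-ins for the two frames.
train = pd.DataFrame({"Age": [39.0], "Transported": [True]})
test = pd.DataFrame({"Age": [24.0]})

ntrain = len(train)  # remember where to split the combined frame later
# The target is dropped first, since the test set has no Transported column.
data = pd.concat([train.drop("Transported", axis=1), test], ignore_index=True)
print(data.shape)
```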

Let’s look at the shape of our new dataset.

(12970, 13)

The dataset has 12970 rows and 13 columns. Let’s look at the percentage of each variable missing in our dataset.

PassengerId     0.000
HomePlanet      2.221
CryoSleep       2.390
Cabin           2.305
Destination     2.113
Age             2.082
VIP             2.282
RoomService     2.028
FoodCourt       2.228
ShoppingMall    2.359
Spa             2.190
VRDeck          2.066
Name            2.267
dtype: float64

There are missing values in every column except PassengerId, but no column is missing more than about 2.4% of its values, so imputation is reasonable. We will treat the missing values in the categorical columns first by imputing with the mode.
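Mode imputation can be sketched as follows (a two-column stand-in frame; the real loop runs over all the categorical columns listed below):

```python
import pandas as pd

data = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", None, "Mars"],
    "VIP": [False, None, False, False],
})

# Fill each categorical column's gaps with its most frequent value (mode).
for col in ["HomePlanet", "VIP"]:
    data[col] = data[col].fillna(data[col].mode()[0])
print(data)
```

`mode()` returns a Series (there can be ties), so `[0]` picks the first most frequent value.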

['PassengerId',
'HomePlanet',
'CryoSleep',
'Cabin',
'Destination',
'VIP',
'Name']

Now let’s find a way to fill the missing values in the Numerical features.

['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

We will start with Age. While performing EDA, we saw that RoomService, FoodCourt, ShoppingMall, Spa and VRDeck all total 0 if a passenger is under 13 or in CryoSleep, so let's create a function to handle that.
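A minimal sketch of such a rule-based fill (the function name and the single-column `amenities` list are assumptions for illustration; the real data has all five amenity columns):

```python
import numpy as np
import pandas as pd

amenities = ["RoomService"]  # the real data has all five amenity columns
data = pd.DataFrame({
    "Age": [5.0, 30.0, 40.0],
    "CryoSleep": [False, True, False],
    "RoomService": [np.nan, np.nan, 100.0],
})

def fill_amenities(df):
    # Passengers under 13 or in CryoSleep billed nothing, so their missing
    # amenity values can safely be set to 0.
    zero_mask = (df["Age"] < 13) | df["CryoSleep"]
    for col in amenities:
        df.loc[zero_mask & df[col].isna(), col] = 0.0
    return df

data = fill_amenities(data)
print(data)
```

Only the gaps covered by the rule are touched; any remaining gaps are then left for the mean imputation described next.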

Now, let's fill the remaining missing values using the mean.

PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
dtype: int64

As we can see, all the missing values have been filled in the dataset.

Outlier Treatment

As we saw earlier in our univariate analysis, RoomService, FoodCourt, ShoppingMall, Spa and VRDeck contain outliers, and we have to treat them because outliers distort the distribution of our data. To do this we will clip outliers at the 99th percentile.
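The clipping step can be sketched as below on a single stand-in column; in the real notebook the loop would cover all five amenity columns:

```python
import pandas as pd

data = pd.DataFrame({"Spa": [0.0, 10.0, 20.0, 30.0, 5000.0]})

# Cap each spending column at its 99th percentile to tame extreme outliers.
for col in ["Spa"]:  # the real loop covers all five amenity columns
    cap = data[col].quantile(0.99)
    data[col] = data[col].clip(upper=cap)
print(data["Spa"].max())
```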

Our dataset is now clean!

5. Feature Extraction and Feature Selection

Based on our EDA let’s create a function to create new features that might affect the target variable.

Now let's drop the variables we used to create these features, since they are no longer relevant; this removes noise from the dataset.

(12970, 15)

Now, we will convert our categorical data into model-understandable numerical data.
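One common way to do this conversion, assuming `pd.get_dummies` (the article's exact encoder is not shown), is one-hot encoding:

```python
import pandas as pd

data = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Mars"],
    "Age": [39.0, 24.0, 58.0],
})

# One-hot encode the categorical columns so the model sees only numbers.
data = pd.get_dummies(data, columns=["HomePlanet"])
print(data.columns.tolist())
```

This produces one indicator column per category, which matches the `HomePlanet_Earth`-style feature names listed further below.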

Next we will split the data back to get the train and test data.

Let’s print the shape of the train and test data to be sure we split the data right.

(8693, 23)
(4277, 23)

6. Baseline Model Performance and Model Building

It is time to prepare the data for feeding into the models.

Feature selection always plays a key role in model building. We will perform a χ² (chi-squared) test to select the 22 best features, as follows.
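A minimal sketch of χ² selection with scikit-learn's `SelectKBest` on a tiny synthetic matrix (the article keeps the best 22 of its engineered features; `k=2` here is just for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Tiny illustrative matrix; chi2 requires non-negative features.
X = np.array([[1, 0, 5], [0, 1, 1], [1, 0, 6], [0, 1, 0]])
y = np.array([1, 0, 1, 0])

# Keep the k features with the highest chi-squared score against the target.
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)
```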

Index(['Age', 'CabinDeck', 'DeckPosition', 'Regular', 'Luxury',
'TotalSpendings', 'DeckAverageSpent', 'NoRelatives', 'FamilySizeCat',
'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
'CryoSleep_False', 'CryoSleep_True', 'Destination_55 Cancri e',
'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True', 'CabinSide_P',
'CabinSide_S', 'IsAlone_Alone', 'IsAlone_Not Alone'],
dtype='object')

Next, for our model building we will use Random Forest, a tree-ensemble algorithm, and try to improve its accuracy.

We will use the cross-validation score to estimate the accuracy of our baseline model.

0.7889098998887654
0.01911345656998776

We got a mean accuracy of 78.9%. Now we will try to improve it by tuning the model's hyperparameters. We will use grid search, which selects the best combination from a grid of candidate hyperparameter values.

We will tune the max_depth and n_estimators parameters. max_depth sets the maximum depth of each tree and n_estimators sets the number of trees in the random forest.
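The search can be sketched with scikit-learn's `GridSearchCV`; synthetic data and a deliberately small grid stand in here for the real features and the article's wider search (which settled on max_depth=11 and n_estimators=101):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the prepared training features.
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Search a small illustrative grid over tree depth and forest size.
param_grid = {"max_depth": [3, 5], "n_estimators": [10, 50]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="accuracy",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

`GridSearchCV` cross-validates every combination in the grid and exposes the winner via `best_params_` and `best_estimator_`.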

RandomForestClassifier(max_depth=11, n_estimators=101, random_state=1)

So, the optimized value for the max_depth variable is 11 and for n_estimators is 101. Now let’s build the final model using these optimized values.

RandomForestClassifier(max_depth=11, n_estimators=101, random_state=1)

Now, let’s view the new accuracy score of our model with optimized parameters to confirm it improved.

0.8047907728163567
0.018872624931449773

The model now has a mean accuracy of 80.5% which is an improvement. It’s time to make predictions for the test dataset using our selected features.

7. Submission and Feature Importance

Before we make our submission, let’s import the sample submission file to see the format our submission should have.

As we can see, we only need PassengerId and Transported for the final submission. To build it we will use the test set's PassengerId column and our predictions. Remember we need to convert 0 to False and 1 to True.
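Assembling the submission file can be sketched as below; the ids and predictions here are hypothetical placeholders for the real test set and model output:

```python
import pandas as pd

# Hypothetical ids and 0/1 predictions for two test passengers.
test_ids = ["0013_01", "0018_01"]
preds = [1, 0]

# Kaggle expects a PassengerId column plus a boolean Transported column.
submission = pd.DataFrame({
    "PassengerId": test_ids,
    "Transported": [bool(p) for p in preds],
})
submission.to_csv("submission.csv", index=False)
print(submission)
```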

Feature importance allows you to understand the relationship between the features and the target variable. Let us plot the feature importances to understand what features are most important and what features are irrelevant for the model.

We can see from the plot above that Luxury is the most important feature, followed by TotalSpendings and Regular. So, feature engineering helped us predict the target variable.

I hope you enjoyed reading. You can find my code on GitHub.
