Machine Learning and Data Analysis with Python, Titanic Dataset: Part 3, Submit to Kaggle

Quinn Wang
Published in Analytics Vidhya
4 min read · Feb 22, 2020

Last time we cleaned up our training data and built a baseline model. In this part of the series we are going to apply the same preprocessing to the test data and submit our predictions through the Kaggle kernel. As usual, a link to a video version is at the bottom.

Let's get started!

Let's first take a look at the submission file format again:

Sample submission file from Kaggle

Then do the same preprocessing as we did for the training set:

Preprocess data on the test set
  • Read the test data as a Pandas dataframe and store it in a variable called test_df.
  • Fill in the missing Age entries for the test set. Since there is more data in the training set, we get a more representative mean by setting each missing Age entry to the mean age of passengers of the same gender in the training dataframe. Note that one of the conditions we use on the training dataframe df is df["Sex"] == 0 instead of df["Sex"] == "male". This is because the Sex column in the training set has already been converted to integer representations.
  • Then drop the 3 columns Name, Ticket, and Cabin that we are not using at the moment.
  • Convert the string values in the Sex column to 0’s and 1’s.
  • Call pd.get_dummies on test_df to get one-hot encodings of the Embarked feature.
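The steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact notebook code: the two small dataframes are made-up stand-ins for the Kaggle files (in practice they would come from pd.read_csv on your own file paths), and the 0 = male, 1 = female encoding is taken from the df["Sex"] == 0 condition mentioned above.

```python
import pandas as pd

# Made-up stand-in for the already-preprocessed training frame from Part 2.
df = pd.DataFrame({
    "Sex": [0, 1, 0, 1],            # integer-encoded: 0 = male, 1 = female
    "Age": [22.0, 38.0, 30.0, 26.0],
})

# Made-up stand-in for the raw Kaggle test file.
test_df = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, None, None],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.8, 7.0, 9.7],
    "Cabin": [None, None, None],
    "Embarked": ["Q", "S", "C"],
})

# Mean age per gender, computed on the training set.
male_mean = df.loc[df["Sex"] == 0, "Age"].mean()
female_mean = df.loc[df["Sex"] == 1, "Age"].mean()

# Fill missing test ages with the training mean of the same gender.
# Sex is still a string in test_df at this point.
is_male = test_df["Sex"] == "male"
age_missing = test_df["Age"].isna()
test_df.loc[is_male & age_missing, "Age"] = male_mean
test_df.loc[~is_male & age_missing, "Age"] = female_mean

# Drop unused columns, encode Sex, one-hot encode Embarked.
test_df = test_df.drop(columns=["Name", "Ticket", "Cabin"])
test_df["Sex"] = test_df["Sex"].map({"male": 0, "female": 1})
test_df = pd.get_dummies(test_df, columns=["Embarked"])
```

Passing columns=["Embarked"] is just a way of being explicit about which column gets encoded. Once Embarked is the only string column left, calling pd.get_dummies(test_df) gives the same result.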

We have now performed the exact same preprocessing steps as we did with the training set, but:

Unexpected NaN

.isna().any() returns one boolean per column, indicating whether that column contains any NaN entry. Using this result to index the dataframe’s columns gives us a list of all columns that contain missing values. Running this on test_df shows that there are missing values in the Fare column.
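As a small illustration of this check, here is a sketch on a made-up three-row frame (the values are invented; only the column names follow the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up miniature of the preprocessed test set, with one missing Fare.
test_df = pd.DataFrame({
    "Pclass": [3, 3, 2],
    "Age": [34.5, 47.0, 62.0],
    "Fare": [7.83, np.nan, 9.69],
})

# One boolean per column: True if the column contains at least one NaN.
has_nan = test_df.isna().any()

# Use the boolean mask to pick out the affected column names.
cols_with_nan = test_df.columns[has_nan].tolist()
print(cols_with_nan)

# Look at the rows that are actually affected.
missing_rows = test_df[test_df["Fare"].isna()]
print(missing_rows)
```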

But hold on…

Didn’t we just clean up the test data?

This is an example of how the test data can look different from the training data. Just because we didn’t see a passenger with a missing Fare entry in the training set doesn’t mean we can safely assume this feature is filled in for all the data we will see later.

Passenger missing Fare entry

Luckily there is only 1 passenger with a missing Fare entry, which suggests that this is a rare event and explains why it didn’t occur in the training set. If the proportion of missing Fare entries were higher, we would have to examine why this feature seems to follow a different distribution in the test set.

The passenger with the missing Fare traveled in Pclass 3, so we can fill in the missing value with the average Fare of all Pclass 3 passengers. This is the same idea as when we filled in the missing Age entries, except that we are assuming Age correlates more with Sex, and Fare with Pclass.

Filling in the missing Fare column
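A minimal sketch of this fill, on made-up numbers. One assumption to flag: the text does not say whether the Pclass 3 average is taken from the training set or the test set. The sketch uses the training frame df, in keeping with how the Age entries were filled.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the training and test frames.
df = pd.DataFrame({"Pclass": [3, 3, 1], "Fare": [8.0, 12.0, 80.0]})
test_df = pd.DataFrame({"Pclass": [3, 1], "Fare": [np.nan, 60.0]})

# Average fare of Pclass 3 passengers (assumption: taken from the training set).
pclass3_mean = df.loc[df["Pclass"] == 3, "Fare"].mean()

# Fill only the rows that are both missing a Fare and in Pclass 3.
to_fill = test_df["Fare"].isna() & (test_df["Pclass"] == 3)
test_df.loc[to_fill, "Fare"] = pclass3_mean

# Confirm that nothing is missing any more.
print(test_df.isna().any())
```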

Now we are ready to make predictions!

Predicting on the Kaggle test set

Make sure the dataframe we want to make predictions on has the same columns, in the same order, as the dataframe we used to train the model.
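One way to enforce this is to reuse the training frame's column list when selecting from the test frame. The sketch below uses hypothetical frames: X_train stands for whatever feature dataframe the model was fitted on in Part 2, and whether PassengerId belongs to those features depends on how that model was trained (here it is assumed not to).

```python
import pandas as pd

# Hypothetical training features (one row is enough to show the idea).
X_train = pd.DataFrame({"Pclass": [3], "Sex": [0], "Age": [22.0]})

# Hypothetical test frame: same features in a different order, plus PassengerId.
test_df = pd.DataFrame({
    "PassengerId": [892],
    "Age": [34.5],
    "Sex": [0],
    "Pclass": [3],
})

feature_cols = list(X_train.columns)

# Fail early if the test frame lacks a column the model was trained on.
missing_cols = [c for c in feature_cols if c not in test_df.columns]
if missing_cols:
    raise ValueError(f"test_df is missing columns: {missing_cols}")

# Selecting with the training column list fixes both the set and the order.
X_test = test_df[feature_cols]
print(list(X_test.columns))
```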

The order of these predictions maps to the order of rows in the dataframe. So if we add another column, call it Survived, and set it to the prediction values, we will have mapped our predictions to the PassengerId as required.

Map predictions and save
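The mapping step might look like the sketch below. The predictions array is a made-up stand-in for whatever the trained model's predict call returns, since the model itself is not reproduced here.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the preprocessed test frame.
test_df = pd.DataFrame({"PassengerId": [892, 893, 894], "Pclass": [3, 3, 2]})

# Stand-in for the model output: one prediction per row, in row order.
predictions = np.array([0, 1, 0])

# Row i of test_df receives prediction i, which ties it to that PassengerId.
test_df["Survived"] = predictions

# Keep only the two columns the submission needs.
submission = test_df[["PassengerId", "Survived"]]
print(submission)
```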

We want to save the predictions in a .csv file by using the Pandas method .to_csv({file path}). Make sure to set the index parameter to False. Otherwise the row index is written as an extra unnamed first column, which shows up as Unnamed: 0 when the file is read back, and Kaggle will reject the submission.
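To see what the index parameter changes, the sketch below writes a made-up two-row submission to an in-memory string both ways and reads each back. The file name in the final comment is only an example.

```python
import io
import pandas as pd

# Made-up two-row submission frame.
submission = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

# With no path argument, to_csv returns the CSV text as a string.
csv_with_index = submission.to_csv()
csv_without_index = submission.to_csv(index=False)

# Read both back to compare the columns a reader of the file would see.
cols_with_index = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(csv_without_index)).columns.tolist()

print(cols_with_index)     # includes the extra "Unnamed: 0" column
print(cols_without_index)  # only PassengerId and Survived

# Writing the real file would then be, for example:
# submission.to_csv("submission.csv", index=False)
```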

Simply upload this file from your local computer to the Kaggle submission page and wait for results!

Here’s the same content in a video:

Video tutorial of this article

Quinn Wang
Analytics Vidhya

Data analyst with an interest in machine learning. Passionate about understanding the theoretical backings of ML algorithms.