Stones for Stepping: Navigating the Kaggle Submission Process
I remember my first Kaggle competition like it was yesterday. A warm laptop glowing in front of me, a cup of coffee by my side, and a good understanding of Python’s Scikit-learn library. I felt pretty confident that my path to Kaggle greatness was going to be easier than a walk in the park! Ok, that’s an exaggeration, but I didn’t realize there were a couple of nuances to submitting predictions to Kaggle that would slow me down for a minute or two. Thus, this blog post is all about navigating the Kaggle submission process. Before I begin, I will point you to my GitHub, where I have created a series of three Jupyter Notebooks covering a step-by-step breakdown of how to explore, clean, analyze, visualize, and use machine learning to predict survival for Titanic passengers. The notebooks cover the workflow evolution that occurs as you advance from data analysis to data science, and they also offer a more in-depth visual walkthrough of how to submit results to Kaggle. I will cover that progression in a future blog post, but for now I just want to clearly cover the Kaggle submission process.
Train.csv and Test.csv == Two Peas in a Pod
Once you have created a profile on Kaggle.com and navigated to a competition on the competitions page, there are two datasets you will need to download, labeled train.csv and test.csv. It is important to be aware that these two files are actually two pieces of the same original dataset. To prove this, I have taken a screenshot of the last five rows of train.csv and the first five rows of test.csv. If you pay close attention to the PassengerId column, you will notice that the last PassengerId in train.csv is 891 and the first PassengerId in test.csv is 892.
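You can verify this continuity yourself with pandas. The sketch below uses a tiny made-up dataset to stand in for the real files (so it runs anywhere); with the actual downloads you would simply call `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` and compare the PassengerId ranges the same way.

```python
import pandas as pd

# A tiny stand-in for the full Titanic dataset: ten passengers split the
# same way Kaggle splits the real files (train keeps the label, test drops it).
full = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Survived": [0, 1, 0, 1, 1, 0, 0, 1, 0, 1],
})
train = full[full["PassengerId"] <= 7]                      # plays the role of train.csv
test = full[full["PassengerId"] > 7].drop(columns="Survived")  # plays the role of test.csv

# The id ranges abut exactly: the last train id + 1 equals the first test id,
# just as 891 and 892 do in the real files.
print(train["PassengerId"].max(), test["PassengerId"].min())  # 7 8
```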
She Labels Me, She Labels Me Not
The second important aspect to be aware of is that train.csv contains a label column (‘Survived’) while test.csv does not. This is because when building a supervised machine learning model, you use the features of your dataset to predict your label. The idea behind Kaggle is that, all else being equal, you will build and test a model on the training dataset, pass the test data into that model, and submit the predictions from the test data to Kaggle. Kaggle holds back the labels for the test data, and once you submit your predictions, it will score the accuracy of your model and rank you in relation to your competitors.
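In code, that workflow looks something like the following minimal sketch. A tiny made-up dataset and an arbitrary two-feature logistic regression stand in for the real files and whatever model you actually build; the point is only the shape of the loop: fit on labeled training data, predict on the unlabeled test data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in data: train has the 'Survived' label, test does not.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Fare": [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Survived": [0, 1, 1, 1, 0, 0],
})
test = pd.DataFrame({"Pclass": [3, 1], "Fare": [7.90, 60.00]})

features = ["Pclass", "Fare"]
model = LogisticRegression()
model.fit(train[features], train["Survived"])  # train.csv supplies the label...
predictions = model.predict(test[features])    # ...test.csv does not, so we predict it
```

These `predictions` are what you will format and upload to Kaggle for scoring.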
In order to upload the predictions from your test data to Kaggle, you will need to format the data frame so that one column holds your predictions and another holds the associated index.
The first step in creating this data frame is to add a label column called ‘Survived’ and assign it the predictions from the test data. Next, convert the associated index (‘PassengerId’) back to a column. In my example above, you may have noticed that I used the index_col parameter to read in the PassengerId column as the index of my data frame. The reason for this was to create a clearer visual example. However, since I want to upload my results, I need to convert it back to a column. Finally, I slice out the predictions column (‘Survived’) and its associated index (‘PassengerId’) to export to a csv file.
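Those steps can be sketched in a few lines. Here a tiny stand-in frame and a short list of fake predictions replace the real `test` data frame (read with `index_col='PassengerId'`) and the output of `model.predict()`:

```python
import pandas as pd

# Stand-in for test.csv read via pd.read_csv("test.csv", index_col="PassengerId").
test = pd.DataFrame(
    {"Pclass": [3, 1, 2]},
    index=pd.Index([892, 893, 894], name="PassengerId"),
)
predictions = [0, 1, 0]  # stand-in for model.predict(...) output

test["Survived"] = predictions                   # 1. add the label column
test = test.reset_index()                        # 2. turn the PassengerId index back into a column
submission = test[["PassengerId", "Survived"]]   # 3. slice out the two needed columns
submission.to_csv("results.csv", index=False)    # 4. export without the auto-generated index
```

Passing `index=False` to `to_csv` matters: otherwise pandas writes an extra unnamed index column, which Kaggle will reject as a malformed submission.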
Now that the predictions are in a data frame, I simply upload the results.csv file to Kaggle, and that’s it! In my next blog post I will take you through the workflow of progressing from Data Analysis to Data Science, continuing to use this particular GitHub repo as my template. Stay safe and Kaggle on!