How I got a score of 82.3% and ended up being in top 3% of Kaggle’s Titanic Dataset
As far as my story goes, I am not a professional data scientist, but I am continuously striving to become one. Luckily, with Python as my primary weapon I have an advantage in the field of data science and machine learning, as the language has vast library and framework support to back me up. I also read books on the subject; my favourites are “Introduction to Machine Learning with Python: A Guide for Data Scientists” and “Hands-On Machine Learning with Scikit-Learn and TensorFlow”.
But this alone was not enough. After surfing through various blogs, reading several sites and discussing with friends, I found that to become an expert data scientist I definitely needed to up the ante: take part in competitions, build an online presence, and so on. Then I came across Kaggle. Just as HackerRank is for general algorithmic contests, Kaggle is built specifically for machine learning problems. I had to try it. It hosts a variety of competitions, and the famous “Titanic” problem is what welcomes you on signing up to the portal. What next? I downloaded the training data and set up my machine with all the libraries I would ever need to solve it. I even initialised an empty repository to save myself hassle afterwards. The only part remaining was to process the data and train a model. “Should be simple; how tough could it get?”, I asked myself with a grin on my face.
Hurriedly, I parsed the data from the downloaded CSV file, fed it to a Decision Tree model, predicted the survivability of the test passengers and uploaded the results. I got 64% and was in the bottom 7% of the leaderboard. Yes, you read that right: the bottom 7%!
Here is my original, first version of the code:
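That first version amounted to little more than this sketch. To be clear, this is a hedged reconstruction from the description above, not the exact gist, and a tiny made-up DataFrame stands in for Kaggle’s train.csv so it runs on its own: a decision tree fed a couple of raw numeric columns, with no cleaning at all.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# In the real run this came from Kaggle's files:
# train = pd.read_csv("train.csv")
# Tiny synthetic stand-in with the same column names:
train = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 13.00, 8.05],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Only columns that happen to have no missing values -- no cleaning, no thought.
features = ["Pclass", "Fare"]
model = DecisionTreeClassifier(random_state=0)
model.fit(train[features], train["Survived"])

# Predict "survivability" (on the same frame here, just for illustration;
# the real attempt predicted on test.csv and uploaded the output).
preds = model.predict(train[features])
```

With this little effort, a score near the bottom of the leaderboard is about what you should expect.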
The result crushed my ego, and it taught me that real-world problems can’t be solved in 5 lines of code. I say this in the context of one of my earlier blogs, “Simple Machine Learning Model in Python in 5 lines of code” :D
I sat back, revisited and read more chapters from the books I mentioned earlier, going through the part on building a complete machine learning model end to end thoroughly. The lesson: this is not about feeding garbage to a model. The data needs to be as clean as possible, and that cleanliness is directly reflected in the model’s performance.
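In hindsight, the fix starts with cleaning. Here is a hedged sketch of the kind of preprocessing those chapters describe (not my exact notebook; a small made-up frame again stands in for Kaggle’s train.csv): impute missing numeric values with the median, missing categoricals with the mode, and encode text columns as numbers before anything goes near a model.

```python
import pandas as pd

# Stand-in for Kaggle's train.csv, with the usual missing values.
df = pd.DataFrame({
    "Age":      [22.0, None, 26.0, None, 35.0],
    "Sex":      ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# Numeric column: fill gaps with the median of the observed values.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill gaps with the most frequent value.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Encode a text column as numbers so the model can consume it.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
```

Only after steps like these does it make sense to hand the frame to a classifier.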
The analysis starts now…
Since I used Jupyter Notebook for the analysis part, please go to my kaggle-titanic project on GitHub for the detailed analysis.
I also built a hobby project to brush up my skills in Python and machine learning. Currently hosted here, it can run and save some machine learning models in the cloud, though I still have to improve it.