Titanic: Machine Learning from Disaster | Kaggle
After flirting with the idea of getting into Machine Learning for far too long, I finally took my first successful step in the field. Having been through bits of Andrew Ng’s course and a lot of articles, I realized that to learn machine learning, I needed to focus on one thing: PRACTICE.
I already knew that Kaggle is to Machine Learning what TopCoder, Codeforces, SPOJ and CodeChef are to competitive programming, so it had to be the platform I practiced on; and with scikit-learn being “the” Machine Learning library, using Python was a no-brainer. So, I started with the Titanic problem on Kaggle, which is widely considered the first problem to tackle when learning ML.
The problem requires only basic coding and ML knowledge and is a straightforward binary classification task. Given a set of data about the passengers onboard the Titanic, we are supposed to create a model which predicts survival for each passenger. The dataset contains information about 891 passengers as training data and requires survival predictions for the 418 passengers in the test data.
Based on my understanding after reading some blogs (especially Machine Learning Mastery and Analytics Vidhya), I divided my approach into the following steps:
- Understanding the problem (The available data is pretty clear and so it was easy for this problem)
- Feature Extraction (This was surprisingly the most important and the most interesting part. Reading other people’s kernels helped me a lot: plotting and checking the importance of each feature, extracting new features from individual features or combinations of them, converting categorical features to dummy features, etc.)
- Trying different algorithms (I tried Logistic Regression, Random Forest, SVM and k-Nearest Neighbours, then tried combining their predictions for better accuracy)
- Evaluating each algorithm and choosing the one that gives the best results
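The steps above can be sketched in scikit-learn. This is a minimal illustration, not my actual notebook: the column names mirror the Titanic CSV, but the rows here are made up, and the hyperparameters are arbitrary defaults.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the Titanic training data (real set has 891 rows).
train = pd.DataFrame({
    "Pclass":   [1, 3, 3, 2, 1, 3, 2, 1, 3, 2] * 3,
    "Sex":      ["female", "male", "female", "male", "female",
                 "male", "female", "male", "male", "female"] * 3,
    "Fare":     [71.3, 7.9, 8.1, 13.0, 53.1, 8.7, 21.0, 30.5, 7.8, 26.0] * 3,
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0, 0, 1] * 3,
})

# Convert categorical features to dummy (one-hot) columns.
X = pd.get_dummies(train.drop(columns="Survived"), columns=["Sex"])
y = train["Survived"]

# Try each candidate algorithm and score it with cross-validation.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf":     RandomForestClassifier(n_estimators=50, random_state=0),
    "svm":    SVC(),
    "knn":    KNeighborsClassifier(n_neighbors=3),
}
scores = {name: cross_val_score(m, X, y, cv=3).mean()
          for name, m in models.items()}

# Pick whichever model scored best on average.
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On the real data you would read `train.csv` with `pd.read_csv`, engineer more features (titles from names, family size, etc.), and then fit the winning model on the full training set before predicting on the test set.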
Finally, after trying out different algorithms and different combinations of them, I got an accuracy of 0.79904 on the test set (top 21%).
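One simple way to combine predictions from several classifiers, as mentioned above, is majority voting. The sketch below is a generic example on synthetic data, not the exact ensemble behind the 0.79904 score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Each base model casts one vote per passenger; the majority wins.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
score = cross_val_score(ensemble, X, y, cv=5).mean()
print(round(score, 3))
```

Because the models make different kinds of errors, the vote often scores slightly better than any single member; `voting="soft"` averages predicted probabilities instead, when every base model supports `predict_proba`.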
I started this problem with only a very basic idea of binary classification, but while solving it I consulted Google for different classification algorithms, touched upon quite a few, and implemented them using scikit-learn’s awesome library. Now I am at a stage where these algorithms are no longer Greek to me, and I am more confident about the journey ahead.
My next step is to get a clearer picture of the different binary classification models, after which I’ll try for a higher score by making better-informed decisions. I will move on to a different problem after that.