What can we learn from the most popular Kaggle competition?
Hi, welcome to our first-ever Kaggle day series!
In this article, we would like to share key takeaways from working on “Titanic — Machine Learning from Disaster”, the most popular Kaggle competition.
Before we continue, if you would like to know more about data analytics and data science, feel free to follow Cultigo’s social media pages. We share the basics of data analytics and data science, and a lot more content is coming in the future.
How did we start working on it?
Who here doesn’t know the infamous Titanic shipwreck?
Could we predict who survived the accident just by looking at the initial passenger data, such as sex, passenger class, and age? Before we answer that question, it is better to take a look at how the Titanic shipwreck happened.
- Titanic had three passenger classes: first class, second class, and third class.
- The rooms of these three classes were located in different parts of the ship.
- The third class had higher casualties compared to the other two.
Because water flooded the bulkheads, making the ship sink and split in two, the location of each passenger class became critical in determining who survived. This means we can create a hypothesis: “passengers in the classes near the bulkheads have a higher death rate”.
After researching and reading external references beyond the data source we got from Kaggle, and without any coding or data analytics work, we already knew that passenger class would be a critical feature for our machine learning model.
First takeaway: Do not start by coding; start by asking questions and researching outside your data to create a hypothesis.
Could we validate our hypothesis using data?
The first thing we did in this Titanic case was handle missing data. There are two basic ways to do it:
- Drop the missing data
- Fill it with the average (if the column is numeric)
You can choose the first way if you have a lot of data and don’t mind losing some of it. But in real cases data is valuable, so you need to choose how you handle missing data wisely.
Here is what we did when preprocessing the Titanic data:
- Drop several columns: ‘Cabin’, ‘Name’, and ‘Ticket’
- Drop NaN rows
- Fill missing value with average
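A minimal sketch of these steps in pandas (assuming the standard Kaggle train.csv; which columns to drop, which to fill, and the order are our own choices):
import pandas as pd

df = pd.read_csv('train.csv')  # the Kaggle Titanic training set
df = df.drop(columns=['Cabin', 'Name', 'Ticket'])  # drop the text-heavy columns
df['Age'] = df['Age'].fillna(df['Age'].mean())  # fill missing ages with the average
df = df.dropna()  # drop any remaining rows with missing values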
Then you can use the two functions below to check whether your preprocessing went well.
df.describe()  # summary statistics for the numeric columns
df.info()  # non-null counts and data types per column
A very simple checklist:
- Make sure there are no null values
- The number of rows should be equal across all columns
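The first item can be checked with one line (a small sketch):
df.isnull().sum()  # should show zero for every column after preprocessing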
The other important thing to check after handling the missing data is whether your dataset is imbalanced. Why do we want to handle imbalanced data? The simple answer is that we don’t want our machine learning model to be biased toward the majority value in the data, so that the accuracy it reports is actually the “right” accuracy.
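For example, a quick way to inspect the balance of the target (assuming the Kaggle column name 'Survived'):
df['Survived'].value_counts(normalize=True)  # proportion of each class; roughly 62% died vs 38% survived in the original training set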
After preprocessing, we visualize the data. But in this case, because we had already researched the Titanic first and found a very good infographic during that research, we only visualized the correlation matrix (a statistical technique that shows how variables are related) between the data columns.
import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.corr(numeric_only=True)  # correlate the numeric columns only
sns.heatmap(correlation_matrix, annot=True)  # annotate each cell with its coefficient
plt.show()
What we can learn is that some columns have a high correlation with each other, but most importantly, “Pclass” and “Survived” are correlated, which aligns with our hypothesis “passengers in the classes near the bulkheads have a higher death rate”, or more generally, “passenger class is related to the death rate”.
Second takeaway: Use Exploratory Data Analysis to validate your hypothesis, not only to create cool and beautiful graphs.
Model Prediction
We have done the analysis and have the correlations between the columns in our dataset. We have also preprocessed the data so it’s cleaner and ready to be consumed by a machine learning model.
But the question is:
How do we choose which algorithm is suitable for our use case?
Because machine learning models are so easy to use and build nowadays, choosing among them can be confusing. The easiest way may be to look at one of the cool cheat sheets that ML enthusiasts have built on the internet.
For us, just looking at a cheat sheet was not an option, because we actually wanted to test whether those cheat sheets are correct and learn something from the process.
When we were working on this problem, we tried several algorithms: SVM, Decision Tree, SGD, ANN, and CNN (a technical article about them is coming soon).
What we learned is that instead of reaching for the shiniest and latest algorithm, we should first try to fit the data with the simplest algorithm, and do that well.
For example, the accuracy of the SGD model with normalized features (76%) is on par with the CNN and ANN models (78%), but we only got 42% accuracy when we trained the SGD model without normalization. This result shows that a simple step like normalization can have a large impact on model performance.
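A sketch of that comparison with scikit-learn (the feature/label split below is hypothetical and assumes the numeric columns of the preprocessed frame; the exact scores will vary with preprocessing and random state):
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical split: use the numeric columns as features, 'Survived' as the label
X = df.select_dtypes('number').drop(columns=['Survived'])
y = df['Survived']

# SGD is gradient-based, so it is sensitive to feature scale
raw = SGDClassifier(random_state=42)
scaled = make_pipeline(StandardScaler(), SGDClassifier(random_state=42))

print(cross_val_score(raw, X, y, cv=5).mean())  # low without normalization
print(cross_val_score(scaled, X, y, cv=5).mean())  # much better with normalization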
Third takeaway: Start from the simplest model and do it perfectly. After that, you can scale up your model and introduce more complex algorithms.
Conclusions
“Titanic — Machine Learning from Disaster” is a simple challenge on Kaggle that anyone can participate in and learn from. But sometimes we forget about the process of learning and building intuition, because we are always focused on creating models with high accuracy and a low error rate. Here are the three takeaways we learned from it:
- Do not start by coding; start by asking questions and researching outside your data to create a hypothesis.
- Data preparation is the most important part; know which features you want to include in the training process.
- Start from the simplest model and do it perfectly. After that, you can scale up your model and introduce more complex algorithms. Even a simple step like normalizing the features matters a lot.
Feel free to share your own takeaways! And if you like these topics, don’t forget to take a look at Cultigo’s social media pages.