How to start to win a Kaggle competition: Expert Kaggler tips from Kaggle Days Meetup Bangalore (14–15 Sept’19)

Vinay Kumar

Published in

Data Science Network (DSNet)

13 min read · Sep 16, 2019

Day 1 (14 Sept’19)

Hi Data Innovators,

So, you were not in Bangalore, or for some reason you could not make it to the Kaggle Days meetup? Attended but didn’t take notes? Don’t worry: we have prepared a write-up that summarises the wisdom shared on Day 1 as well as Day 2. We have also covered the majority of the Q&As. We hope you enjoy reading it and find it insightful and helpful for getting started with Kaggle.

We (Vitika Jain and I) have roughly noted our key takeaways from Kaggle Days Meetup Bangalore. We have kept the article as concise and crisp as possible.

Day 1 Session 1:

Kaggle Grandmaster: Rohan Rao: Data Scientist at H2O.ai

Talk: “On-Kaggle Vs Off-Kaggle”. About: The differences, positives, and negatives of Data Science On and off Kaggle, and why both should be balanced.

“Kaggle is my dream second job but comes at a sacrifice” Rohan Rao (Competitions Grandmaster) (https://www.kaggle.com/rohanrao)

The talk included a brief side-by-side comparison of what happens, and what is required, when you are on Kaggle versus off Kaggle, i.e. on the job or in industry.

Final thoughts by Rohan: We should go on and off Kaggle to balance our learning and become really great data scientists. Both have their own advantages and possible limitations. Kaggling takes a lot of determination, passion, time and effort. It is definitely something that needs to be prioritised over, say, a weekend party for something good to come of it.

QA Session for the first Talk:

Participant Question (PQ): How do you become productive in the early stages of Kaggling?

Speaker Answer (SA): Start from a public script, explore it, and understand why things work the way they do. Team up with someone new, someone who thinks differently, so that you learn from a blend of different thought processes.

PQ: How do you apply machine learning on Kaggle? Do you start with a traditional algorithm or take a stab at new algorithms first? What is your thought process when you approach a Kaggle problem?

SA: On Kaggle you can do all sorts of mix and match. Ensemble algorithms. Ease of data access, stability of data and algorithms, interpretability of results and productionisation are very important in industry, but they matter much less for succeeding on Kaggle. Do lots of feature engineering. Just think of different ways to optimise for the evaluation metric (KPI). Yeah, that’s the Kaggle way ;)
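One way to picture the "mix and match" advice is a weighted blend of two models’ predictions, with the weight chosen by the competition metric on a validation set. The sketch below is only an illustration: the targets, the predictions and the choice of mean absolute error as the metric are made-up assumptions, not something shown in the talk.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def blend(pred_a, pred_b, weight_a):
    """Weighted average of two prediction lists (weight_a goes to pred_a)."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def best_blend_weight(y_true, pred_a, pred_b, steps=10):
    """Try weights 0.0, 0.1, ..., 1.0 and keep the one with the lowest error."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda w: mean_absolute_error(y_true, blend(pred_a, pred_b, w)),
    )


# Hypothetical validation targets and predictions from two different models.
y_valid = [3.0, 5.0, 2.0, 8.0]
model_a = [2.0, 4.0, 1.0, 7.0]  # consistently predicts too low
model_b = [4.0, 6.0, 3.0, 9.0]  # consistently predicts too high

weight = best_blend_weight(y_valid, model_a, model_b)
blended = blend(model_a, model_b, weight)
print(weight, mean_absolute_error(y_valid, blended))
```

Here the two models err in opposite directions, so the blend beats either model alone. On real data the gain is usually much smaller, and the weight should be chosen on data the models were not trained on.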

PQ: How do you pick a competition in Kaggle, any tips?

SA: Pick a competition with lots of participants, so that the forum discussion is richer. Pick smaller datasets for easier EDA. Once you pick a competition, stick with it from start to end; if it is an upcoming one, joining from the start helps you understand the complete context. There are also a few knowledge and data-exploration competitions; participate in those as well.

Are you willing to sacrifice a few things in life to do Kaggle? If you want to learn more, follow the speaker and reach out to him directly.

Day 1 Session 2:

Kaggle Expert: Aakash Nain: Research Engineer at Ola.

Talk Theme: “Is model.fit() enough?”

The presentation started with the usual data cycle, which includes the following:

Data Collection: A multistep process that involves collection and annotation. Annotation is very expensive and requires trained annotators and, in some cases, subject matter experts too.

Options for data annotation: crowdsourcing marketplaces like Amazon Mechanical Turk, or in-house annotators (a costly affair).

Data Cleaning: Garbage in, garbage out (GIGO). Who is accountable for human error in annotations?

Data collection and data labelling are the two most expensive tasks in the data science lifecycle. If we are not involved in them, we may end up with a very limited understanding of the model we create.

Modelling: Choosing the right architecture is again a time-consuming process. Architecting is expensive, hardware-dependent and iterative, and hardware dependency also limits your choices. On Kaggle you don’t need to create a deployment environment: it’s all in the cloud. Just use kernels to write and run code.

Challenges in the real world: Test data changes very often, causing data leakage. We need to capture rare and missed events. Validation requires human intervention.

Common failure: Pre-trained models, such as pre-trained word embeddings, often don’t get the job done for many use-cases.

The worst part of neural networks (NNs) is that they fail silently. While working on a particular use-case, we hardly ever get to know about these silent failures. The Kaggle forum offers the opportunity to see what is possible and to assess the reasons a particular NN failed. Debugging an NN is also a tough problem; the Kaggle forum has great articles on this.

Sometimes the problem is simple, but Kaggle amateurs end up applying a state-of-the-art architecture like ResNet-152 to a problem that a shallow network could have solved. So the suggestion is to start with a simpler solution and grow the complexity iteratively to get better results.

“Do not use a cannon to kill a mosquito” Confucius

Final thoughts by Aakash Nain

Deep learning is cool, but do we need it for our use-case? Don’t apply deep learning if simpler algorithms or solutions will work for you. Deep learning might offer a very accurate solution, but can we explain to a stakeholder how the super-complex model we have built arrives at its output? In other words, take care of the explainability of the model you have built. Accept a model only if you can answer “why did the model behave that way?”

So don’t be like the third guy quoting “Not in the training data”.

QA Session for Second Talk:

PQ: How do we go about creating a proper validation set?

SA: Look at competitions like the Mercedes-Benz one. Read more in the Kaggle forum. Build a strong validation set for better results.
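The speaker did not spell out a method, so the following is our own assumption about one common ingredient of a trustworthy validation set: keeping the class proportions of the hold-out set the same as in the training data (a stratified split). The labels are hypothetical, and the function name `stratified_holdout` is ours.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, valid_fraction=0.25, seed=0):
    """Return (train_idx, valid_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one example of every class in the validation set.
        n_valid = max(1, round(len(indices) * valid_fraction))
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Hypothetical imbalanced labels: 12 negatives and 4 positives.
labels = [0] * 12 + [1] * 4
train_idx, valid_idx = stratified_holdout(labels)

print(len(train_idx), len(valid_idx))
print(sum(labels[i] for i in valid_idx), "positive(s) in validation")
```

A stratified split is only one option. For time-dependent data, for example, splitting by date is often closer to how the test set was built.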

PQ: How to debug a NN?

SA: Read Andrej Karpathy’s blog. Go step by step.

PQ: How useful are pre-trained models?

SA: Start from avenues that have already been explored. Try to use the pre-trained model’s wisdom and knowledge. This can be task-dependent, though, and you might need to start from scratch if a pre-trained model doesn’t work for your use-case at all.

PQ: How did you jump 600 ranks in the diabetic-retinopathy-detection competition?

SA: We used a trick. We used the old dataset to generate bias into the new dataset. Phew, it worked for us.

Hope this was insightful for you. See you as a Kaggle Master soon. Go ahead, all the best.

Day 2 (15 Sept’19)

Day 2 Session 1:

Kaggle x2 Master: Mohammad Shahebaz, Data Analyst at Societe Generale

Talk Theme: Feature Engineering To Crack Top 1% Private LB on Kaggle

About: Have you ever wondered why the features you make end up overfitting, or don’t give a significant jump on Kaggle’s leaderboard? Is the private leaderboard a challenging ladder to climb? Mohammad Shahebaz shared his experiences with feature engineering, along with insights and recipes for approaching your next Kaggle competition.

He started with the story of how he reached a good position on Kaggle in 8 months. He also talked about his blog, where he shares his recipe for going from a novice to one of the youngest Kaggle Competitions Masters and landing a job at a Fortune 500 company.

What is the Private Leaderboard: It is evaluated on a hidden portion of the test data, and we get to see the result only at the end of the competition. The public leaderboard, by contrast, is computed on a different portion of the test data and is visible throughout the competition. Read more about Private vs Public leaderboard in Kaggle here.

Let’s see what his Recipe contains:

Recipe 1: He recommends spending an appreciable amount of time on understanding the business problem and the business use-case, and on doing EDA to gain a detailed understanding of the data and the problem. Remember that everything in data science starts with a question, followed by steps that roughly include data collection, data cleaning, exploration, analysis, modelling and prediction, among others.

Recipe 2: Understand the problem to be solved first, and then think about the features you would need or expect in the dataset to solve it. To reiterate: don’t look at the dataset first, because that is likely to bias you towards the features that are already there and away from the ones you would have thought of had you not seen the dataset. So imagine your own feature set. Think out of the box and then check the box (the dataset) 😆. Are some of the features you imagined or expected not provided in the dataset? Well, my friend, you have already identified new features. Good work on feature engineering!

Recipe 3: Explore the possibility of using external data. Do data enrichment if it is allowed and possible. Check which metric you are optimising for (for example accuracy or AUC) and plan your quest accordingly.
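To see why the choice of metric matters, here is a small sketch of our own (not from the slides) in which the same predictions look mediocre under accuracy and perfect under AUC. The labels and scores are made up, and the pairwise AUC computation is a simple textbook version rather than an optimised one.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise version is O(n^2), which is fine
    for a small illustration but slow on large datasets.
    """
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores: the ranking is perfect, but every score
# sits below the 0.5 threshold, so both positives are classified as negative.
y_true = [0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40]

print("accuracy:", accuracy(y_true, scores))
print("AUC:", roc_auc(y_true, scores))
```

AUC only cares about how the examples are ranked, while accuracy depends on where the threshold falls. A model tuned for one can therefore look quite different under the other.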

Recipe 4: Oh I see, you won the competition and want to claim the prize money? Wait a sec. First, you should be able to reproduce what you did to win the competition. So make sure you document before you jump into coding.

The best way to ensure this is to maintain a colour-coded changelog. Different colours may represent different kinds of action: for example, yellow may represent EDA, green a feature addition, and so on.

Enriched changelog :
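A plain-text analogue of such a changelog might look like the sketch below. It is our own illustration, not the speaker’s format: category tags stand in for the colours, and the field names, notes and score are all hypothetical.

```python
import csv
import io

# Category tags stand in for the colours mentioned above (hypothetical names).
CATEGORIES = {"EDA", "FEATURE", "MODEL", "BUGFIX"}


def add_entry(log, category, note, cv_score=None):
    """Append one change to the log, numbering entries in order."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    log.append({"step": len(log) + 1, "category": category,
                "note": note, "cv_score": cv_score})


def to_csv(log):
    """Render the log as CSV text so it can be saved next to the code."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["step", "category", "note", "cv_score"])
    writer.writeheader()
    writer.writerows(log)
    return buffer.getvalue()


changelog = []
add_entry(changelog, "EDA", "checked target distribution")
add_entry(changelog, "FEATURE", "added day-of-week feature", cv_score=0.81)
print(to_csv(changelog))
```

Recording a validation score next to each change makes it easier to trace which step actually helped when you later need to reproduce the result.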

Recipe 5: A simple baseline: Always establish baselines for different algorithms/models. Try simpler ones first, then slowly increase complexity.

Recipe 6: Friends and Foes: Document the problems and steps involved. Solve one at a time. Track your progress.

Recipe 7: When Columns cheat you: Sometimes the data is very wide and anonymised. Do proper EDA to understand it. Can you detect any anonymised data? Can you reverse engineer it?

Recipe 8: You might need to do some reverse engineering to work out what a column actually is and where it comes from. He suggested not looking at any public kernel before doing your own EDA. Do your own EDA to understand data types, distributions, outliers, etc. Check whether there is any imbalance in the dataset.

Recipe 9: Plot to visualise and learn more about the dataset. Visualisation will help you do better feature engineering and reveal mysteries in the dataset. Unlock those mysteries to move forward in your quest.
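As a minimal sketch of what "simplest first" can mean, the baseline below ignores the features entirely and always predicts the most common training label. Any real model should beat this number before it earns more complexity. The data is hypothetical and the helper names are ours.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common_label


def label_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels; the features are ignored by this baseline on purpose.
train_labels = ["no", "no", "no", "yes", "no", "yes"]
valid_features = [[1.0], [2.0], [3.0], [4.0]]
valid_labels = ["no", "yes", "no", "no"]

predict = majority_class_baseline(train_labels)
baseline_score = label_accuracy(
    valid_labels, [predict(x) for x in valid_features])
print("baseline accuracy:", baseline_score)
```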
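The imbalance check, for instance, takes only a few lines. This is a sketch under our own assumptions: a hypothetical binary target column and helper names of our choosing.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the dataset, largest class first."""
    counts = Counter(labels)
    total = len(labels)
    return [(label, count / total) for label, count in counts.most_common()]


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Hypothetical target column: 18 negatives and 2 positives.
target = [0] * 18 + [1] * 2

for label, share in class_balance(target):
    print(f"class {label}: {share:.0%}")
print("imbalance ratio:", imbalance_ratio(target))
```

A high ratio is a hint that plain accuracy may be misleading and that the validation split should preserve the class proportions.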

A single good feature is better than 100 trained ensembles. It also keeps your model, and thus your experimentation, faster.

He emphasized how he won a few competitions with only simple features.

Recipe 10: The Do’s & Don’ts: Self-explanatory in the image below.

Recipe 11: No, that’s not Curriculum Vitae, it’s Cross-Validation. 😜
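For readers new to the idea, here is a bare-bones k-fold sketch of our own (the slide’s exact content is not reproduced here). Every sample is used for validation exactly once, and the fold scores are averaged. It does not shuffle the data, which you would normally do first unless the order matters, for example with time series.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each sample is validated once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


def mean_fold_score(score_fold, n_samples, n_folds):
    """Average a user-supplied scoring function over all folds."""
    scores = [score_fold(train, valid)
              for train, valid in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, valid_idx in folds:
    print("validate on", valid_idx)
```

In practice `score_fold` would train a model on the `train` indices and return its metric on the `valid` indices. It is left as a parameter here so that the sketch stays independent of any particular library.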

Recipe 12: Follow good practices for code maintenance and execution speed up.

Sorry, but we have no idea where Recipes 13 and 14 went. Maybe the slides covering them were not numbered.

Recipe 15: Yeah, that is what the quest is all about. Keep learning 👍.

Learn ♻️ Attempt and finally, you start to win.

Yep, focus on learning, and once you have learned enough you will eventually win. Contact the Kaggle x2 Master himself to learn more.

Day 2 Session 2:

Kaggle x3 Expert: Sanyam Bhutani: Data Science Engineer at Swiftace

Talk: How to track ML Experiments Effectively

The usual pipeline for a machine learning experiment is very different from software engineering. This talk highlighted how to track experiments, and their iterative nature, effectively inside a Jupyter notebook. Sanyam also explained how to apply these ideas to Kaggle competitions and make them work within data science teams.

Well, here is the surprise. We believe Sanyam’s slides are self-explanatory, and adding our own made-up descriptions would add no value beyond making you read the same text twice. So feel free to go through the slides attached below, which are in the appropriate order.

Sanyam added that if you are looking to start contributing to an open-source project, Jovian may be a great starting point. Contact Sanyam Bhutani directly to find out how you can contribute, or if you have any questions about his slides.

Day 2 Session 3:

Speaker: Usha Rengaraju

Talk: Demystifying SVM

Usha explained how SVMs can be used for classification problems where the data is not linearly separable.

She also explained the following on the whiteboard:

The concept of a hyperplane in 2D and 3D, the maximal margin classifier, the soft margin classifier, and kernels.
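To make the kernel idea a little more concrete, here is a toy sketch of our own, not taken from the whiteboard. XOR-style points cannot be separated by a straight line in 2D, but after adding the product of the two coordinates as a third axis, a flat hyperplane in 3D separates them. Note the assumptions: the feature map is written out explicitly (a kernel achieves the same effect implicitly through inner products), and the hyperplane is hand-picked rather than learned by an SVM solver.

```python
def feature_map(x1, x2):
    """Lift a 2D point into 3D by adding the product x1 * x2 as a new axis."""
    return (x1, x2, x1 * x2)


def decision_value(point, weights, bias):
    """Signed score w . phi(x) + b; its sign gives the side of the hyperplane."""
    lifted = feature_map(*point)
    return sum(w * z for w, z in zip(weights, lifted)) + bias


def classify(point, weights, bias):
    """Predict +1 or -1 from the sign of the decision value."""
    return 1 if decision_value(point, weights, bias) > 0 else -1


# XOR-style data: no straight line in 2D separates the two classes.
points = [(0, 0), (1, 1), (0, 1), (1, 0)]
labels = [-1, -1, 1, 1]

# Hand-picked hyperplane in the lifted 3D space (not learned by an SVM solver).
weights = (1.0, 1.0, -2.0)
bias = -0.5

predictions = [classify(p, weights, bias) for p in points]
print(predictions)
```

With these weights all four points lie at the same distance from the hyperplane, which is the kind of balanced separation a maximal margin classifier looks for.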

The event had two parallel & very interesting workshops in the second half. Titles are as follows:

Workshop, Hackathon: Getting started with Kaggle Competition

Workshop: “Understanding your WhatsApp chat data”

Participants gained real insights from this meetup, learned how to get started on Kaggle, and were highly appreciative of the untiring efforts put in by Data Science Network. Thank you, DSNet, for organising this amazing meetup.

Thank you for your patient reading. We hope by now you are excited enough to start participating in Kaggle competitions, if you have not started already. We welcome constructive feedback, or any feedback that can improve your takeaways from our upcoming articles. Please let us know at vinay2k2@gmail.com. Feel free to connect with or follow us on LinkedIn:

VINAY KUMAR — Clinical Data Scientist: Healthcare — SyTrue | LinkedIn


Vitika Jain — Team Lead — SnapBizz Cloudtech Pvt Ltd | LinkedIn
