Should you Kaggle?

Aakash Nain
Imploding Gradients
8 min read · Sep 5, 2018

“Another competition launched on Kaggle”, “It’s a competition platform”, “It’s a data science platform”, “Don’t start with Kaggle problems if you are a beginner”, “Kaggle is totally different from real-world scenarios”, etc. These are some of the statements you have probably seen on Twitter or LinkedIn, or heard from someone within your company.

In this post, I am going to address a lot of these claims and try to clear up some doubts and myths about Kaggle. Before diving deep into the topic, let me introduce myself. I am a Data Scientist specializing in Computer Vision with Deep Learning. I love working on complex problems, especially where messy data is involved. In short: Python for breakfast, Deep Learning for lunch, and Machine Learning for dinner.

This post is divided into three parts:

  • A brief introduction to Kaggle and some of its (damn!!) cool features
  • Is Kaggling a good thing or a bad thing? Why/Why not?
  • Where to start?

What is Kaggle?

Most people know of Kaggle as a Machine Learning competitions platform. But is that all it is? No, it is more than just a competition platform. Kaggle is an ecosystem for doing and sharing Data Science. In the early days, Kaggle's focus was on being the best Machine Learning competitions platform, but over the past few years it has evolved immensely. So what has changed? A lot of things.

To give you an overview, I will mention some of the best things you will find on Kaggle. They will help you understand why it may be the perfect ecosystem you are looking for.

1. Kernels

This is by far the best feature of the platform.

What are Kaggle Kernels?

Kaggle Kernels is a cloud computational environment that enables reproducible and collaborative analysis.

Put simply, a kernel is essentially a Jupyter notebook, a script (in Python or R), or an R Markdown document running in a Kaggle Docker container that comes with almost everything pre-installed for you.

Just fire up a kernel and you are ready for Exploratory Data Analysis or building Machine Learning models right away. Some things you might not know about Kernels:

  • Computation: For running your kernel, you get access to 4 CPUs, 16 GB of RAM, 5 GB of disk space, 6 hours of continuous execution time, internet access, and a K80 GPU instance. Yes, you heard me right. 6 hours of free GPU!
  • Data Sources: In a kernel, you can add data from Competitions or from Datasets available on Kaggle, or you can even upload your own data and start working on it. Both your kernels and your datasets can be made private or public, as you wish.
  • Public Kernels: What differentiates Kaggle from other platforms is people sharing their knowledge through code by making their kernels public. If you are new to Kaggle, this is the best place to start. Just select a problem, read the public kernels, and you will learn more than you expected in a single go. Kernels can also be forked, and you can modify the fork however you like. Open source code always saves you time.
  • Weekly Kernels Award: Who does not like awards? And that too in $! Every week, the Kaggle team selects the best kernel as the Kernel of the Week, and its author gets a $500 reward. I was so happy when I won it for the first time. The award recognizes the value you create when you write a good public kernel.

Maybe you are a student with no personal rig to experiment on, or maybe you are a professional like me who still lacks personal computational resources. The cloud is always expensive, too expensive to be precise.

What to do? The answer is simple: Kaggle Kernels. For more information about kernels, check the official documentation. To see how you can work on your own project on Kaggle, check out this awesome tutorial.

2. Datasets

Developing and experimenting with different machine learning models requires data. If you are doing Deep Learning, then you know that DNNs are always data-hungry.

What if I tell you that there is one place where you can find almost any kind of data you are looking for? Yes, the Datasets feature on Kaggle is what you need. Whether it is CSV/tabular data, image data, or NLP and speech datasets, everything is here. Whenever I want to try something new, I choose a suitable dataset, fire up a kernel, and get going. No time wasted at all!

If you have some data that you want to upload, you can do that as well. Supported data formats include CSV, JSON, BigQuery, archives, etc. If you want to know more about Kaggle Datasets, then take a look at this fantastic documentation.
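Once a dataset is attached to a kernel, the first few cells are usually the same: load a CSV with pandas and take a quick look before any modeling. Here is a minimal sketch of that workflow; the tiny in-memory CSV and its column names are made up so the snippet runs anywhere, whereas in a real kernel you would point `pd.read_csv` at the attached data files instead.

```python
import io
import pandas as pd

# A tiny, made-up CSV standing in for an attached Kaggle dataset.
csv_text = """PassengerId,Age,Fare,Survived
1,22,7.25,0
2,38,71.28,1
3,26,7.92,1
"""

# In a kernel you would read a file from the attached dataset;
# here we read from memory so the sketch is self-contained.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)               # rows and columns: (3, 4)
print(df.dtypes)              # quick type check before feature engineering
print(df["Survived"].mean())  # class balance at a glance
```

That is genuinely all it takes to go from "fire up a kernel" to exploring the data.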

3. Competitions

Most of you are probably aware of this: Kaggle is the best platform when it comes to Machine Learning competitions. At any point in time, there are at least 4–5 active competitions on Kaggle. The diversity of the competitions helps you develop skills in numerous fields of Machine Learning. The number of participants is growing each year. The competitions are tough but provide a steep learning curve. To check which competitions are running right now, you can visit this link. Let’s move to the second aspect now.

Kaggling: Good or bad?

If you ask people on Twitter, Hacker News, or Reddit, you will always get mixed opinions. Some people praise it a lot, while others troll Kagglers. I consider this normal behavior on these networks. I will try to list the main points here.

Good

  1. Steep learning curve: Participating on Kaggle provides a much wider range of learning than you will find anywhere else online. People come up with great ideas and share them in public kernels. Not all ideas are shared, of course, but 10 public notebooks will teach you so many new things that you will start doubting your skills for a second.
  2. Variety in problem sets: You will easily find data related to almost every field of data science and machine learning on Kaggle. One single place to get your hands dirty in every area. For example, I am a computer vision guy, but I also get to work on speech, Natural Language Processing, time series, tabular data, etc.
  3. An awesome community: The true power of Kaggle comes from its community. The people there are really helpful and try to help you in every aspect without judging you for a second. That explains why people never hesitate to ask even very basic questions on the Kaggle discussion forums. The discussions themselves are another source of learning there.

Bad

  1. Addiction: Yes, you heard me right. Once you start doing it, it becomes an addiction, like a drug. You strive to compete more and more and to explore more ideas. You might end up spending 300 hours a month on it.
  2. On Kaggle, the dataset is usually cleaned up before being put up for a competition. You mostly find everything arranged neatly: you just look at the data, do your pre-processing, feature engineering, and modeling, and score on the leaderboard. This is good, apart from the fact that it abstracts away the process of creating, gathering, and cleaning data. When you are working in a production environment, creating, gathering, and cleaning data are three of the most important skills you should possess (and they are also the most time-consuming tasks!). After all, data is the true value!
  3. Too many ensembles! When you compete, you hardly ever win without heavy ensembling. Ensembling is good, but only up to a point, especially in production. You will see people stacking hundreds of models in a single competition. This is because after a certain point the score differences are tiny (like 0.00001), and people stack as many models as they can to improve their score by even a single decimal place. In production, a difference of 0.1–0.2% hardly matters and is insignificant (beware, exceptions always exist).
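To see why ensembling helps at all, consider a toy sketch (all numbers here are synthetic, made up purely for illustration): three base models that each predict the truth plus their own independent noise. Averaging their predictions cancels part of that noise, which is exactly the effect competitors chase, just at a far smaller scale than hundred-model stacks.

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.normal(size=1000)  # synthetic "ground truth" on a validation set

# Three hypothetical base models: truth plus independent noise of std 0.5.
preds = [y_true + rng.normal(scale=0.5, size=1000) for _ in range(3)]

def rmse(y_hat):
    return float(np.sqrt(np.mean((y_hat - y_true) ** 2)))

single_rmse = rmse(preds[0])               # roughly the noise level, ~0.5
blend_rmse = rmse(np.mean(preds, axis=0))  # lower: independent errors average out

print(f"single model RMSE: {single_rmse:.3f}")
print(f"3-model blend RMSE: {blend_rmse:.3f}")
```

With fully independent errors the blend's RMSE drops by roughly a factor of sqrt(3). In a real competition the base models are correlated, so the gains shrink fast, which is why the hundredth model only buys that last decimal place.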

That’s it. Nothing more, nothing less. Before moving to the next part, I would love to tell you how Kaggle has helped me. My work involves a lot of machine learning, and I am among those who focus heavily on production. I started in this field in Jan 2016.

TensorFlow had just been open-sourced, and all of a sudden there were too many courses, with everyone trying to teach how to do machine learning. I completed Andrew Ng’s Machine Learning course, which gave me good theoretical knowledge to start with. That was my final semester in college, and I had no idea where to practice all the things I had learned.

I signed up on Kaggle at that time, but it was only in late July 2016, when a competition launched by Red Hat was running, that I became active on Kaggle. That was the first time I heard of xgboost (eXtreme Gradient Boosting). I had no idea how to preprocess data, when to apply which algorithm, and so on. That’s when I met people like Mikel (aka anokas) and Laurae. I started reading kernels and forums and asking people silly questions, but every bit of it was worth it.

To date, 60% of my learning has come from Kaggle. The sheer number of ideas I have seen people put in there is not something I could have come up with all by myself. I have not won any competition though, mainly because:

  • I do not have personal hardware to experiment on, and the cloud is too expensive.
  • I do not get much time to work on these problems after my job, as I love reading a lot of research papers, trying out different things, and always learning something new.

Am I suggesting that you should be Kaggling? Yes, I am. If you have even half an hour to spend on learning something new, spend it on Kaggle. It may seem overwhelming at first, but once you are in, you will love it. Now you must be wondering how and where to start.

Where to start?

  1. Do not jump straight into a competition. Start with a small dataset first. Visit Kaggle Learn first. Join a Slack, for example our KaggleNoobs Slack. It is one of the best Slacks out there.
  2. Go to the Kernels page. Look at the public kernels for the datasets or competitions you are interested in. Read the kernels thoroughly.
  3. Pick a small dataset and write your first kernel. It is better to write everything from scratch the first time. You can code while looking at others’ code, but type it by hand rather than just copy-pasting.
  4. Join a competition and start working on it. There is only one thing to remember: code every day, learn every day.
  5. Don’t be shy or afraid to ask questions. If you are afraid to ask just because you think someone will consider you naive or dumb, you will never learn.

P.S.: I hope you enjoyed reading this post. The name of our Slack, KaggleNoobs, is not really accurate, as everyone from noobs to Experts, Masters, and Grandmasters is there. The community is amazing, and I am blessed to be a part of it.
