A few months ago, Data Analysis Techniques to Win Kaggle was published. As far as I know, this is the first commercial book about Kaggle in Japan and probably in the world.
The authors of this book are top Kagglers, including Grandmasters. I recently found the time to finish reading it, so here is a brief review.
What is Kaggle?
Before starting, let me briefly introduce Kaggle.
Kaggle is a competition platform for data science and machine learning. It is similar to Topcoder and AtCoder, but each competition runs for a few months, much longer than on those platforms, and pure computer science problems are not the focus.
When you win a competition, you can get prize money, an expensive GPU, or just honor, depending on the rules of each competition.
There are similar competition platforms such as SIGNATE. The unique features of Kaggle are Kernels and Discussion. A Kernel is a runtime environment that anyone can use for computation. Discussion is a bulletin board for sharing insights and pitfalls.
Although prizes can be a reason to join competitions, the techniques you can learn are another important motivation.
Who should read this book?
First of all, the main focus of this book is table data competitions. Each Kaggle competition has its own type of task, such as computer vision, audio processing, NLP, or table data. If your interest isn’t table data, it might be better to start with other resources.
However, this book also describes how to tackle a competition, which is general knowledge. In addition, table data techniques can be used for post-processing in any task.
One of the authors wrote an English version of the table of contents, so you can check the details there.
- Chapter I: What Is a Data Analysis Competition?
What Kaggle is and how data analysis competitions work.
- Chapter II: Tasks and Metrics
How to design metrics. Sometimes a different metric works better than the direct metric.
- Chapter III: Feature Engineering
Encoding is the key for categorical data. Time series data needs special handling.
- Chapter IV: Modeling
How GBDT works. Shallow neural networks have shone in some competitions.
- Chapter V: Validation
The difference between hold-out and k-fold. Time series data needs special handling.
- Chapter VI: Model Tuning
Useful libraries for hyperparameter tuning.
- Chapter VII: Ensemble
The finishing techniques for improving the score by a small fraction.
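To make the validation chapter's topic a little more concrete, here is a minimal sketch of the three splitting strategies in plain Python. This is my own illustration, not code from the book or its GitHub repository. The function names (`hold_out`, `k_fold`, `time_series_split`) are hypothetical, and the sketch assumes the rows are already sorted by time.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (train indices, validation indices)


def hold_out(n_samples: int, valid_ratio: float = 0.25) -> Split:
    """Single split: the last `valid_ratio` share of rows is held out."""
    n_valid = max(1, int(n_samples * valid_ratio))
    indices = list(range(n_samples))
    return indices[:-n_valid], indices[-n_valid:]


def k_fold(n_samples: int, n_splits: int = 4) -> Iterator[Split]:
    """Every row is used for validation exactly once across the folds."""
    indices = list(range(n_samples))
    for fold in range(n_splits):
        valid = indices[fold::n_splits]  # every n_splits-th row, offset by fold
        valid_set = set(valid)
        train = [i for i in indices if i not in valid_set]
        yield train, valid


def time_series_split(n_samples: int, n_splits: int = 3) -> Iterator[Split]:
    """Expanding window: train only on rows before the validation block."""
    block = n_samples // (n_splits + 1)
    for fold in range(1, n_splits + 1):
        train = list(range(block * fold))
        valid = list(range(block * fold, block * (fold + 1)))
        yield train, valid


if __name__ == "__main__":
    print("hold-out:", hold_out(8))
    for train, valid in k_fold(8, n_splits=4):
        print("k-fold:", train, valid)
    for train, valid in time_series_split(8, n_splits=3):
        print("time series:", train, valid)
```

Hold-out evaluates the model once, while k-fold uses every row for validation at the cost of training several models. The time series variant never trains on rows that come after the validation block, which is the kind of special handling the chapter summary refers to. In practice you would probably use a library such as scikit-learn rather than hand-written splitters.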
What I could learn
Generally, an effective way to improve your score is to try past solutions. There are many useful resources such as Kaggle Past Solutions, wikis, the community Slack, and blogs. However, this information tends to be organized by competition.
The book’s information is organized by component, such as loss design and feature creation. That makes it easy to compare techniques and useful as a dictionary. Each insight comes from the actual experience of top Kagglers, so it is more practical than theory alone. The code is available on GitHub.
Besides, Kaggle rules are described well. For example, I didn’t know about the submission limit after a team merge.
What is missing in this book?
As mentioned above, this book’s focus is table data.
Also, some sections feature XGBoost in particular. As one of the authors mentioned, LightGBM is the more recent trend. I don’t mind, but this could be revised in a future edition.
The pages on neural networks are limited too. This could be improved.
Now I am fully prepared. All I need to do is join a competition.