First-time Competitor to Kaggle Grandmaster Within a Year | A Winner’s Interview with Limerobot
Join us in congratulating Sanghoon Kim aka Limerobot on his third place finish in Booz Allen Hamilton’s 2019 Data Science Bowl.
The Data Science Bowl, presented by Booz Allen Hamilton and Kaggle, is the world’s largest data science competition focused on social good. Each year, this competition gives data scientists a chance to use their passion to change the world. Over the last four years, more than 50,000 competitors have made over 114,000 submissions to improve everything from lung cancer and heart disease detection to ocean health. This year, competitors were challenged to identify the factors that matter most in predicting player capability in an educational kids’ game by PBS. For more information on the Data Science Bowl, please visit DataScienceBowl.com
Let’s meet Sanghoon!
Sanghoon, what should we know about you?
Sanghoon: Well, I currently work as a data scientist at eBay Korea. I’m very interested in computer vision and natural language processing. My main interest these days has been to exceed the performance of LightGBM and XGBoost with deep neural networks on most tabular data. In particular, I enjoy focusing less on feature engineering and more on model architecture design.
Fascinating. What were you doing prior to your current role that prepared you for this competition?
Sanghoon: I’ve been working in computer vision (especially face recognition) and natural language processing for about 10 years. I also majored in electronics, so I learned calculus, probability and statistics, and linear algebra in my undergraduate course. Although I don’t really remember if I retained anything 😉. In the past five years, I‘ve been dealing with e-commerce data that consists of images, text, and tabular data.
On the Kaggle-front, I participated in my first competition in February 2019 and here I am!
So, you became a Grandmaster in about a year’s time?
S: Yeah, I guess so!
For this particular challenge, did you have much prior domain knowledge or expertise?
S: Two things really helped:
First, my experience with feature engineering to use tabular data as input to Deep Neural Networks (DNNs) was really helpful.
Second, my experience dealing with Transformer models in the Predicting Molecular Properties competition.
Thinking back to a year ago when you joined Kaggle… how did you first get started?
S: Working in the e-commerce field, you’re exposed to a lot of tabular data. However, I was mostly working with computer vision and natural language processing and was not familiar with how to deal with tabular data. I decided to compete in Kaggle because there were a lot of competitions using tabular data, and I could learn how to work with it.
Safe to say you learned!
And what made you decide to enter the Data Science Bowl?
S: To be quite frank, the prize money had the biggest impact on my participation. 😊
(The Data Science Bowl offered a $160,000 total prize pool!)
Let’s get technical
Did any past research or previous competitions inform your approach?
S: The Transformer is a model that has been used very successfully in natural language processing. In particular, the Transformer-based BERT is the latest technology in NLP.
Inspired by this, people have recently been trying to apply the Transformer in other fields. The top three teams of the recent Predicting Molecular Properties competition all used Transformers.
What preprocessing and feature engineering did you do?
S: The figure above shows the log of one user (installation_id) in the app. Log data for a total of 17,000 users are provided for training.
The objective of this competition is to look at a user’s past records and predict the value of this user’s accuracy_group.
Aggregation by game_session.
I treated the log data as sequence data because it was recorded in chronological order. Models for dealing with sequence data include the LSTM and the Transformer, which are used successfully in NLP. However, you cannot use arbitrarily long sequences because of the model’s performance and resource constraints.
Therefore, the log data per user (installation_id) had to be reduced, since it was at times close to 58,000 rows long.
The code below is an example of the code used for aggregation:
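The original post shared the aggregation code as an image, which is not reproduced here. The sketch below is a minimal pandas version of the idea, using hypothetical column names modeled on the DSB data schema (`installation_id`, `game_session`, `event_count`, `game_time`); the exact columns and statistics the author aggregated may differ.

```python
import pandas as pd

# Hypothetical toy log: one user (installation_id) with three game_sessions.
logs = pd.DataFrame({
    "installation_id": ["a1"] * 6,
    "game_session": ["s1", "s1", "s1", "s2", "s2", "s3"],
    "event_count": [1, 2, 3, 1, 2, 1],
    "game_time": [0, 500, 900, 0, 700, 0],
})

# Collapse each game_session to a single row: the sequence length per
# installation_id drops from the number of events to the number of sessions.
agg = (
    logs.groupby(["installation_id", "game_session"], sort=False)
        .agg(num_events=("event_count", "max"),
             total_time=("game_time", "max"))
        .reset_index()
)
print(agg)  # 3 rows: one per game_session instead of one per event
```

Named aggregation (`num_events=("event_count", "max")`) keeps the output columns readable; any per-session statistics (counts, means, last values) can be added the same way.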
As shown in the figure below, each game_session is reduced to a single row, which dramatically reduces the sequence length of one installation_id.
What supervised learning methods did you use?
S: The Transformer model
For more information on the Transformer model, refer to the “Attention Is All You Need” paper or a well-organized blog on the Internet.
I only want to introduce the features of the Transformer model required in this competition.
The Transformer model has been used successfully in the Natural Language Processing (NLP) field. One of its important features is being able to encode a continuous sequence like [A, B, C, …, Z] into one vector.
Note that in NLP, the whole [A, B, C, …, Z] sequence can be considered to correspond to one sentence, and each alphabet corresponds to each word of a sentence.
The Transformer (TR) can be stacked in multiple layers to encode more abstract information. The figure below shows an example of adding only one layer.
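The encode-a-sequence-into-one-vector idea above can be sketched with PyTorch's built-in encoder. This is an illustrative example only; the dimensions, pooling choice (mean over the sequence), and layer count are assumptions, not the competition settings.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (not the author's actual settings).
d_model, n_layers, seq_len = 64, 2, 10

# Stacked Transformer encoder: more layers encode more abstract information.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

x = torch.randn(1, seq_len, d_model)  # [batch, sequence items, embedding]
out = encoder(x)                      # [1, seq_len, d_model]
vector = out.mean(dim=1)              # the whole sequence as one vector
print(vector.shape)                   # torch.Size([1, 64])
```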
Transformer applied at the 2019 DSB
The input of the Transformer in NLP is a sentence consisting of several words.
Similarly, the input of the Transformer for the DSB can be considered an installation_id consisting of multiple game_sessions.
- sentence = [word 1, word 2, word 3, …, word N]
- installation_id = [game_session 1, game_session 2, …, game_session N]
The figure below shows a block of a Transformer model that receives an installation_id, compresses the information, delivers it to the Regression Layer, and predicts the accuracy_group in the Regression Layer.
Creating an embedding from game_session
There are two types of tabular data: categorical and continuous. The processing method varies depending on the type of column of the tabular data.
If the column is a categorical type: Embed each column using an embedding layer and concatenate all of the results to obtain the cate_emb vector. Since the dimension of the concatenated cate_emb vector is large, a module made with a linear layer can be used for dimension reduction, as shown below.
If the column is a continuous type: The cont_emb vector is obtained by applying a linear layer directly; the following module is used.
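The original modules were shown as images. The sketch below reimplements the two paths just described in PyTorch; the column cardinalities, embedding sizes, and the LayerNorm/ReLU choices are illustrative assumptions, not the author's exact modules.

```python
import torch
import torch.nn as nn

cate_cardinalities = [4, 10]  # hypothetical categorical columns' vocab sizes
n_cont = 3                    # number of continuous columns (assumed)
emb_dim, hidden = 16, 32      # illustrative dimensions

# Categorical path: embed each column, concatenate, then reduce the
# dimension with a linear layer -> cate_emb.
cate_embeddings = nn.ModuleList(
    [nn.Embedding(card, emb_dim) for card in cate_cardinalities])
cate_proj = nn.Sequential(
    nn.Linear(emb_dim * len(cate_cardinalities), hidden),
    nn.LayerNorm(hidden), nn.ReLU())

# Continuous path: project the raw values directly -> cont_emb.
cont_proj = nn.Sequential(
    nn.Linear(n_cont, hidden),
    nn.LayerNorm(hidden), nn.ReLU())

batch, seq_len = 2, 5  # [installation_ids, game_sessions per id]
cate_x = [torch.randint(0, c, (batch, seq_len)) for c in cate_cardinalities]
cont_x = torch.randn(batch, seq_len, n_cont)

cate_emb = cate_proj(
    torch.cat([emb(x) for emb, x in zip(cate_embeddings, cate_x)], dim=-1))
cont_emb = cont_proj(cont_x)
print(cate_emb.shape, cont_emb.shape)  # both [2, 5, 32]
```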
Creating a joint embedding
Concatenate the cate_emb vector and cont_emb.
Obtain the sequence_output by inputting seq_emb as obtained previously into self.encoder, an instance of the Transformer model as shown in the figure above. Then you can obtain pred_y, the prediction of accuracy_group, through self.reg_layer.
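The full forward pass described above (seq_emb → self.encoder → sequence_output → self.reg_layer → pred_y) can be sketched as follows. This is a minimal stand-in for the disclosed code, with assumed dimensions and mean-pooling over game_sessions; see the author's shared code for the real implementation.

```python
import torch
import torch.nn as nn

class DSBRegressor(nn.Module):
    """Sketch of the block: installation_id in, accuracy_group out."""
    def __init__(self, emb_dim=32, n_layers=2):
        super().__init__()
        d_model = emb_dim * 2  # cate_emb and cont_emb concatenated
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.reg_layer = nn.Linear(d_model, 1)

    def forward(self, cate_emb, cont_emb):
        # Joint embedding of the game_session sequence.
        seq_emb = torch.cat([cate_emb, cont_emb], dim=-1)
        sequence_output = self.encoder(seq_emb)
        # Pool over the game_session axis, then predict accuracy_group.
        pred_y = self.reg_layer(sequence_output.mean(dim=1))
        return pred_y.squeeze(-1)

model = DSBRegressor()
cate_emb = torch.randn(2, 5, 32)  # [installation_ids, game_sessions, dim]
cont_emb = torch.randn(2, 5, 32)
pred_y = model(cate_emb, cont_emb)
print(pred_y.shape)  # torch.Size([2]): one prediction per installation_id
```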
For more information, please refer to this disclosed code.
How did you spend your time on this competition?
S: One-quarter of the time was invested in feature engineering, half of the time in model architecture design, and another quarter of the time in tuning model parameters.
What does your hardware setup look like?
- CPU: 3 x AMD® Ryzen 9 3900X (3 PCs)
- GPU: 5 x NVIDIA RTX2080Ti 11G (2 GPUs in 1 PC)
- RAM: 64GB
- The above is just my PC spec. In fact, a GTX 1080 is enough for training.
Words of wisdom
What have you taken away from this competition?
S: Most of the participants in the competition appeared to use tree-based models. Meanwhile, I demonstrated that using neural networks alone could take me to the top.
In particular, I was pleased with being able to refine my skills in embedding categorical and continuous data in this competition.
Looking back, what would you do differently now?
S: I regret that I wasn’t able to use the game time interval, more specifically the time interval between each game_session, as a feature. It was a feature used by another competitor, and it looks quite useful.
Do you have any advice for those just getting started in data science?
S: Kaggle has a lot of quality resources. It’s always very useful to view the notebook that received the most votes on the notebook tab. Also, the methodology obtained from Kaggle is very practical, so it is applicable even at work!
Thanks for that.
And thanks so much for taking the time to share your wisdom with us!
If you liked this interview, show Sanghoon some 👏👏👏!
Take a look at the most recent competitions at: kaggle.com/competitions