Friends from Uni team up & give back to early education — making everyone a winner | A Kaggle Winner’s Interview

Kaggle Team
Feb 26 · 4 min read

Congratulations to the winning duo of the 2019 Data Science Bowl, ‘Zr’ and Ouyang Xuan (Shawn), who took first place and split $100,000!


The Data Science Bowl, presented by Booz Allen Hamilton and Kaggle, is the world’s largest data science competition focused on social good. Each year, this competition gives data scientists a chance to use their passion to change the world. Over the last four years, more than 50,000 competitors have made over 114,000 submissions, improving everything from lung cancer and heart disease detection to ocean health. This year, competitors were challenged to identify the factors that matter most for predicting player capability in an educational kids’ game from PBS. For more information on the Data Science Bowl, please visit DataScienceBowl.com.

Photo by Sven Brandsma on Unsplash

Now, let’s hear from our first-place winners…

So, tell us about yourselves!

Zr: I graduated from Zhejiang University and am now working as an NLP engineer. I love mathematics and statistical machine learning. 😄

Worth noting that you’re also a Competitions Grandmaster?

Zr: Yes.

Shawn: I also graduated from Zhejiang University, China, in 2019, where Zr and I were classmates. Now, I work on semantic computing in natural language processing.

Did you have any prior experience that you feel helped you win this competition?

Zr: Before this competition, we had participated in several Kaggle competitions. We took part in “Home Credit Default Risk” and “Avito Demand Prediction Challenge” together. We learned a lot from Kaggle — about data analysis, feature engineering, model ensemble, and so on. These prior experiences helped us succeed in this competition.

What made you decide to enter this competition?

Shawn: We noticed the Data Science Bowl as soon as it launched. We thought it looked both challenging and fun, but more importantly, we agreed it would feel very meaningful to help childhood education through data science.

The competition provided a large amount of gameplay log data, so there was plenty of room to practice our feature engineering skills. Really, we hoped to contribute to education through our talents, so choosing this competition was a no-brainer.

Let’s get technical.

Did any past research or previous competitions inform your approach?

Shawn: Yeah, several previous competitions helped us this time around. The idea of metric optimization was inspired by a kernel shared in “PetFinder.my Adoption Prediction,” and the feature selection method was inspired by Olive’s Notebook in “Home Credit Default Risk.” All of those previous competitions were great learning material for us.
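
For readers curious what that metric optimization typically looks like: the PetFinder-style approach is a threshold “rounder” — train a regressor, then search for the cut-points that map its continuous outputs onto the 0–3 accuracy groups so that quadratic weighted kappa (QWK) is maximized. The sketch below is illustrative only; the starting thresholds and variable names are our assumptions, not the winners’ code.

```python
# Minimal sketch of PetFinder-style metric optimization: tune the cut-points
# that round regression outputs into ordinal classes so that quadratic
# weighted kappa is maximized. Starting thresholds are illustrative only.
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

def to_classes(preds, thresholds):
    """Bucket continuous predictions into ordinal classes 0..len(thresholds)."""
    return np.digitize(preds, bins=sorted(thresholds))

def negative_qwk(thresholds, preds, y_true):
    """Loss for the optimizer: negative quadratic weighted kappa."""
    y_pred = to_classes(preds, thresholds)
    return -cohen_kappa_score(y_true, y_pred, weights="quadratic")

def fit_thresholds(preds, y_true, init=(0.5, 1.5, 2.5)):
    """Search for QWK-maximizing cut-points with Nelder-Mead."""
    result = minimize(negative_qwk, x0=np.array(init),
                      args=(preds, y_true), method="Nelder-Mead")
    return sorted(result.x)

# Usage, given out-of-fold regression predictions `oof_preds` and labels `y`:
# best_cuts = fit_thresholds(oof_preds, y)
# test_classes = to_classes(test_preds, best_cuts)
```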

What can you tell us about any preprocessing or feature engineering?

Zr: We generated statistical features over previous sessions across different time windows, e.g. the last 5/12/48 hours, and from the last assessment to the current assessment. For each session, we also generated features describing the kids’ behavior, such as the session’s accuracy, how long it took them to finish each event, whether they skipped the video, and so on.
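
As a rough illustration of the time-window idea (not the winners’ actual pipeline), session-level aggregates for one assessment might be computed like this, with column names such as installation_id, timestamp, accuracy, and duration standing in for the real log fields:

```python
# Rough sketch: for each assessment, aggregate statistics over that player's
# previous sessions within the last 5, 12, and 48 hours. Column names here
# are illustrative placeholders, not the winners' exact schema.
import pandas as pd

def window_features(sessions: pd.DataFrame, assessment_time: pd.Timestamp,
                    installation_id: str) -> dict:
    feats = {}
    # Only sessions by this player that happened before the current assessment.
    history = sessions[(sessions["installation_id"] == installation_id)
                       & (sessions["timestamp"] < assessment_time)]
    for hours in (5, 12, 48):
        window = history[history["timestamp"]
                         >= assessment_time - pd.Timedelta(hours=hours)]
        feats[f"last_{hours}h_sessions"] = len(window)
        feats[f"last_{hours}h_mean_accuracy"] = window["accuracy"].mean()
        feats[f"last_{hours}h_total_duration"] = window["duration"].sum()
    return feats
```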

How about supervised learning methods?

Zr: The most critical model we used is LightGBM, which is a gradient boosting framework that uses tree-based learning algorithms. It’s super-fast and accurate.
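
A bare-bones version of that setup might look like the following. The hyperparameters are placeholders rather than the winning configuration, and the grouped cross-validation (keeping each player’s sessions in a single fold) is a common choice for this kind of data rather than a confirmed detail of their solution.

```python
# Minimal LightGBM sketch in the spirit described above; parameters are
# placeholders, not the winning configuration.
import lightgbm as lgb
from sklearn.model_selection import GroupKFold

params = dict(
    objective="regression",    # regress, then round to accuracy groups
    learning_rate=0.03,
    num_leaves=31,
    subsample=0.8,
    subsample_freq=1,
    colsample_bytree=0.8,
    n_estimators=2000,
)

def train_oof(X, y, groups, n_splits=5):
    """Train per-fold models; `groups` keeps each player in one fold to avoid leakage."""
    oof = []
    for tr, va in GroupKFold(n_splits).split(X, y, groups):
        model = lgb.LGBMRegressor(**params)
        model.fit(X.iloc[tr], y.iloc[tr],
                  eval_set=[(X.iloc[va], y.iloc[va])],
                  callbacks=[lgb.early_stopping(100, verbose=False)])
        oof.append((va, model.predict(X.iloc[va])))
    return oof
```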

What was your most valuable insight into the data?

Shawn: There was a lot of information in the gameplay data. We quickly identified that correct attempts, actions taken after system feedback, and the time spent on each event reflect kids’ play strategies and understanding.

Which tools did you use?

Shawn: We mainly used Python, LightGBM, and CatBoost. Oh, and we used Jupyter as our IDE; it’s really convenient.

How did you spend and split your time?

Zr: We spent most of the time on feature engineering. Proportion? Maybe 80% of the time.

We’re curious about your hardware setup. What can you share?

Zr: We used a server with 24 cores and 128 GB of memory.

What was the run time for both training and prediction of your winning solution?

Zr: It was about 2–3 hours for our winning solution. For a simpler model (using 500 features and scoring 0.569 on the private leaderboard), the running time dropped to 8 minutes.

Wow.

It sounds like you’ve got this dialed-in! Anything you’d change, though?

Shawn: Looking back, we could have spent more effort on neural networks and on ensembling them with the tree models to get a better score.

Helpful! Okay, last question. Do you have any advice for those just getting started in data science?

Both: Join a Kaggle competition and enjoy it!

Aw, shucks! (We recommend getting started with this Titanic tutorial, by the way!)

If you liked this interview, and want Zr and Shawn to know it, give this article some 👏👏👏!
