Overfitting the Leaderboard in Ernst & Young Data Science Competition 2019

And subsequently losing 8000 USD + a ticket to New York.

Ilham Firdausi Putra
HMIF ITB Tech
7 min read · Jul 15, 2019

--

A bittersweet journey at the latest Ernst & Young NextWave Data Science Competition 2019.

The image above depicts the dramatic free fall of our rank on the global leaderboard. A lot was at stake here. We had sacrificed 40 days pursuing 8,000 USD and a ticket to New York, the first-place prize for the global leaderboard. To make things even more anticlimactic, we had dominated the public leaderboard ever since the competition began. From two weeks before the deadline, our position seemed set in stone at 1st place, until the bitter truth arrived: the private leaderboard.

Fortunately for us, while we dropped to 9th place on the global leaderboard, we were still 1st on the Indonesian local leaderboard. We were rewarded with Rp 25.000.000,- (~1,800 USD) and a paid internship offer. Thank you, EY!

So, how did we end up there? Join me on the journey Christian Wibisono and I took through the latest Ernst & Young NextWave Data Science Competition 2019.

Don’t count your chickens before they hatch: the higher your expectations, the greater your disappointment.

The Competition

tl;dr: An awesome international data science competition held by Ernst & Young, pitting undergraduates and postgraduates from all over the world against each other for 40 days. Given anonymized geolocation records, predict whether a device will be within the city center or not.

The Ernst & Young NextWave Data Science Challenge 2019 focuses on how data can help the next smart cities thrive, and boost the mobility of the future. Global urbanization is on the rise, with more than 50% of the world’s population living in cities. According to the UN, that number will reach 60% by 2030 — that’s nearly 1.5 billion more than in 2010.

While this trend creates excellent opportunities for cities, it also challenges governments to upgrade infrastructure, alleviate congestion, and address pollution. Electric and autonomous vehicles, along with the explosion of the ride-sharing economy, are helping to address these challenges. At the same time, they disrupt mobility and demand innovative solutions.

In parallel, public authorities have more information than ever on how citizens move around in the city. However, a gap exists between having this data and using it to improve the user travel experience for citizens. Forward-looking authorities have a chance to innovate infrastructure to make their city a better place to live in a better working world.

Here was our chance to narrow that gap. As challenge participants, we could download a dataset with a vast number of anonymized geolocation records from the US city of Atlanta, Georgia, collected in October 2018. Our task was to produce a model that helps authorities understand the journeys of citizens as they move through the city during the day. Hopefully, our work could inspire solutions that help city authorities anticipate disruptions, make real-time decisions, design new services, and reshape infrastructure.

Data illustration. Source: Challenge Manual

We were given access to a data file `data_train.csv` containing anonymized geolocation data from multiple mobile devices in the City of Atlanta over 11 working days in October 2018. Every device ID represents a one-day journey, which consists of multiple trajectories. A trajectory is defined as the straight-line route of a moving person, with an entry and an exit point.

Ultimately, the authorities want to know how many people were in the city center between 15:00 and 16:00. The test dataset contained devices whose trajectories after 15:00 had been removed, all but one: after 15:00, each device had one last trajectory with (1) its entry location and (2) entry and exit times between 15:00 and 16:00. The exit location, however, had been removed. Simply put, our task was to predict where this last exit point was and whether the device was within the city center or not; the target variable is the latter.
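Concretely, the label reduces to a point-in-rectangle test on the predicted exit point. Here is a minimal sketch; the boundary coordinates below are placeholders, not the official values from the challenge manual:

```python
def in_city_center(x, y, x_min, x_max, y_min, y_max):
    """Return 1 if the exit point (x, y) lies inside the city-center box."""
    return int(x_min <= x <= x_max and y_min <= y <= y_max)

# Placeholder bounds -- the real ones are defined in the challenge manual.
CITY_CENTER = dict(x_min=3_750_000.0, x_max=3_770_000.0,
                   y_min=-19_270_000.0, y_max=-19_210_000.0)

print(in_city_center(3_760_000.0, -19_240_000.0, **CITY_CENTER))  # 1
```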

For 40 days, each team, consisting of up to two undergraduate or postgraduate students from anywhere in the world, could make up to five submissions per day, each containing a device ID and the predicted target. Submissions were automatically evaluated with the F1-score between the predicted and observed targets. The test data was split into a public and a private part: during the challenge, the leaderboard showed scores on the public part only, and in the final leaderboard, only each team’s last submission would be considered.
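For a local sanity check, the metric is easy to reproduce with scikit-learn before uploading a submission (the labels here are made up):

```python
from sklearn.metrics import f1_score

# Dummy labels: 1 = device ends up inside the city center, 0 = outside.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(f1_score(y_true, y_pred))  # ~0.857, harmonic mean of precision and recall
```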

The global leaderboard hours before the deadline.

What Went Well?

In retrospect, these are the things that allowed us to perform quite well.

Collaboration

Winners are those who went through *more iterations* of the “loop of progress” — going from an idea, to its implementation, to actionable results. So the winning teams are simply those able to run through this loop *faster*. — François Chollet, creator of Keras.

The above quote resonates strongly with us. By the end of the competition, we had made 144 submissions and missed only a couple of days, even though the timeline heavily overlapped with our final semester exams and assignments. Each day we would plan how many of the five available daily submissions each of us would take. For every submission, we logged details such as the data version, the list of features, the classifier, the cross-validation score, and the leaderboard result in Quip. We also structured our repository and code on GitHub so that components could easily be swapped out. By teaming up, we managed to try far more thoughts and ideas.

Kaggle kernel

One of the most significant constraints when competing in a data science competition is computing power. In this competition, the data was close to 200 MB initially and over 400 MB after feature engineering. That amount of data was just way too much for my poor laptop; it would quickly overheat and scream in pain.

Me a̶b̶u̶s̶i̶n̶g̶ utilizing Kaggle kernel to its fullest.

Fortunately, Kaggle, an online community for data scientists and machine learning competitions, provides an awesome feature called Kernels, FOR FREE. A Kaggle kernel is basically your Jupyter Notebook or script running in the cloud. It has excellent specs and allows a script to run for up to 9 hours.

Simply upload your dataset and script to Kaggle, and you can run up to 8 experiments simultaneously! Having this FREE (did I forget to mention that?) computing power really helped us, as a single experiment could take more than 2 hours from start to finish.

Feature engineering

This one is critical, as it directly determines how well our machine learning model will perform. Given that the challenge was to predict whether a device was within the city center or not, we figured distance was going to play a significant role. So, one set of features we computed was the following (a code sketch follows the list):

1. First, we calculated the duration of each trajectory.
2. Then, we computed the distance traveled for each non-final trajectory.
3. Given both the total duration and the total distance traveled over the non-final trajectories, we could compute the average velocity of each device.
4. After that, we computed the distance needed to exit (or enter) the city center from the final trajectory’s entry point.
5. Finally, we combined these into a single feature, roughly the distance the device could cover at its past average velocity during the final trajectory, relative to the distance from step 4, indicating how likely the device could exit (or enter, depending on the value given) the city center.
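Here is a rough pandas sketch of steps 1 to 3 and 5. The column names (`device_id`, `x_entry`, `y_entry`, `x_exit`, `y_exit`, `time_entry`, `time_exit`) are my assumptions about the data layout, and step 4 is omitted because it depends on how the city-center boundary is defined:

```python
import numpy as np
import pandas as pd

# Assumed layout: one row per trajectory, ordered in time within each device.
df = pd.read_csv("data_train.csv", parse_dates=["time_entry", "time_exit"])

# Step 1: duration of each trajectory, in seconds.
df["duration"] = (df["time_exit"] - df["time_entry"]).dt.total_seconds()

# Step 2: straight-line distance traveled per trajectory
# (undefined for the final test trajectory, whose exit point is removed).
df["distance"] = np.sqrt((df["x_exit"] - df["x_entry"]) ** 2
                         + (df["y_exit"] - df["y_entry"]) ** 2)

# Mark the last trajectory of each device.
df["is_final"] = df.groupby("device_id").cumcount(ascending=False) == 0

# Step 3: average velocity per device over its non-final trajectories.
past = df[~df["is_final"]].groupby("device_id")
velocity = past["distance"].sum() / past["duration"].sum()

# Step 5: distance the device could plausibly cover during the final
# trajectory; dividing by the step-4 distance to the boundary would give
# the exit/enter feasibility ratio.
final = df[df["is_final"]].set_index("device_id")
reachable = velocity * final["duration"]
```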

This set of features ended up as our most impactful, based on SHAP. Another noteworthy one was a set of features indicating a device’s tendency to stay inside or outside the city center.

SHAP top five features.
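A plot like that takes only a few lines with the shap library. A minimal sketch, with a stand-in model and dataset in place of our own classifier and engineered features:

```python
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in model and data; in the competition these were our own
# classifier and feature matrix.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Bar chart of mean |SHAP value| per feature: the "top five features" view.
shap.summary_plot(shap_values, X, plot_type="bar", max_display=5)
```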

What Went Wrong?

It was the hallmark of an inept data scientist, stemming from a complete lack of experience, and the main ingredient of a classic shake-up between public and private leaderboard scores:

We did not trust our cross-validation score

— and the result was catastrophic. We had overfitted the public leaderboard.

It’s easy to say that you believe in something until it doesn’t align with what you see. In our case, we believed in our cross-validation score as long as it was in line with our leaderboard score, until it wasn’t. At that point, any experienced competition data scientist or Kaggler would know that such a divergence is perfectly normal and to be expected. Our lack of experience caused us to falter and eventually lose our position.

The public leaderboard was calculated on only a fraction of the total test data; after the competition, we learned the split was roughly 1/3 public and 2/3 private. In hindsight, given that our implementation was sound, we should have always trusted our local cross-validation to guide our way. This mistake likely caused us to drift away from a better solution and to fail to choose our best one as the final submission.
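The remedy is mostly discipline: score every experiment with the same local cross-validation folds and let that number, not the public leaderboard, pick the final submission. A minimal sketch with scikit-learn, using stand-ins for our model and data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data and model; in the competition these were our engineered
# features and classifier.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# Fix the folds once so scores stay comparable across experiments.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"CV F1: {scores.mean():.4f} +/- {scores.std():.4f}")
# Pick the final submission by this number, not by the public leaderboard.
```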

Epilogue

It wasn’t all that grim. It was a fantastic competition, and we learned a lot along the way. In the end, we walked away with a handful of cash and a whole lot of experience.

I’m eternally grateful to everyone who made this possible. Thank you, Ernst & Young, for organizing this event. And finally, thank you for reading. I hope this story is useful to you.
