Kernels, Clicks and Boosted Trees: Highlights from the 1st Google Analytics Kaggle Competition
It was no doubt an interesting encounter, with Google Analytics meeting:
- Kaggle (the machine learning competition platform)
- RStudio (the cash prize sponsor)
- and BigQuery (the data host).
The purpose? To organise a machine learning competition, the first of its kind to use Google Analytics data as its raw material.
The event (in its first phase, i.e. before the redesign due to the data leakage, more on this later) attracted more than 3,500 teams, making it one of the most popular competitions hosted in the platform’s eight-year history.
This article is a summary of what happened, coupled with a few thoughts on particular aspects that I found of interest, starting with the Kaggle experience itself, followed by the highlights of this particular competition and closing with the takeaways.
The Kaggle experience
The previous time I took part in a competition was back in 2014, as part of the Analytics Edge, a popular MOOC at the time. I was somehow left with the impression of Kaggle being a playground mainly for people wishing to learn data science.
This time around, judging by the high quality of discussions in the forum and the scripts shared in the public kernels, I found that rather the opposite is true (side note: these days the majority of Kagglers are industry professionals such as data scientists and analytics consultants according to a recent Kaggle survey).
Beyond that, and perhaps more importantly, Kaggle itself has evolved substantially as a platform. Some recent product developments, discussed below, have opened up promising new use cases.
Cloud data science… in a kernel
Something that I was happy to discover was the ease with which one can do data science in the cloud these days, pretty much without having to install or configure anything at all.
All it takes is a one-click launch of a so-called Kaggle kernel, where it is possible to write and execute R or Python code. Kaggle formally defines a kernel as a cloud computational environment for reproducible and collaborative analysis combining input, code, and output.
The most interesting part here is that the use of kernels is *not* restricted to Kaggle competitions. One can freely use the Kaggle infrastructure to work on a private project, alone or with collaborators, using private datasets while benefiting from dedicated cloud resources: storage, RAM as well as CPUs/GPUs.
Caution, risk of Kaggle addiction ahead
Kaggle has done a very good job of gamifying the platform by introducing various statuses of recognition and experience levels, ranging from novice to grandmaster. These are based not only on competition performance but also on participation and contribution in discussions, sharing of kernels or commenting on others’ work. In practice, the more involved you are, the more recognition you are going to receive. This can help to unlock useful platform features or simply attract more visibility to your code and forum posts.
Overall I found that taking part was a fresh experience, which I can wholeheartedly recommend. Just a word of warning: Kaggling is a time-demanding activity, so be prepared for that.
So, what was the competition about?
The provided dataset from Google Analytics contained rich historical session information from users navigating the Google Merchandise Store website, aka the GStore (yes, this is the secret of how Google actually makes its money).
The target was to predict future revenue for a given set of users for a defined time frame as accurately as possible.
Given the way the problem was framed and the defined target metric:
- a practical first step was to predict which of the users from the dataset were going to convert
- then figure out an estimate for the amount spent.
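The two steps above can be sketched as a simple two-stage estimate: the expected revenue for a user is the probability of converting multiplied by the expected spend if they do convert. The users, probabilities and amounts below are made up for illustration, and the function names are hypothetical. The scoring function assumes a root mean squared error on log(1 + revenue) per user as the metric; that is an assumption here, and the competition page has the exact definition.

```python
import math

def expected_revenue(p_convert, spend_if_convert):
    """Two-stage estimate: probability of converting times spend given conversion."""
    return p_convert * spend_if_convert

def rmse_log1p(actual, predicted):
    """RMSE between log(1 + actual) and log(1 + predicted) revenue per user."""
    squared_errors = [
        (math.log1p(a) - math.log1p(p)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical users: (conversion probability, spend if they convert, actual revenue)
users = [
    (0.01, 50.0, 0.0),
    (0.02, 80.0, 0.0),
    (0.60, 120.0, 100.0),
]

predictions = [expected_revenue(p, s) for p, s, _ in users]
actuals = [revenue for _, _, revenue in users]

score_model = rmse_log1p(actuals, predictions)
score_zeros = rmse_log1p(actuals, [0.0] * len(users))  # baseline: predict no revenue
print(f"two-stage: {score_model:.3f}  all-zeros baseline: {score_zeros:.3f}")
```

On this toy data the two-stage estimate scores better than predicting zero revenue for everyone, which is the natural baseline when so few users convert.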
Not surprisingly for an ecommerce website, the conversion rate of the Google store is in the area of 1–2%. You can therefore think of this competition as the classic needle-in-a-haystack problem.
For those familiar with the Google Analytics platform, this competition was in many ways like trying to reverse-engineer the recently launched conversion probability report, which does exactly that, i.e. provides the likelihood of future conversion for users based on their historical data.
Blood, sweat and some fears
As mentioned earlier, the reception of the competition was exceptionally good, considering the high level of participation and the initial excitement. But it turned out not to be a walk in the park, either for the participants or for the organisers.
Right after the initial launch there were a couple of rounds of issues raised by the teams regarding the consistency of the data and the metric definitions provided by the organisers. Shortly after, news broke that a data leakage had taken place.
Data leakage can take several forms, but in this instance it was linked to the publicly available Google Demo account (this alternative source of truth for the GStore traffic left the door open to mining the answers).
There was a pause and it took a few weeks for Kaggle to relaunch the competition, with an updated dataset and prediction targets.
The result? The new dataset was much richer, offering fine-grained hit-level data and thus opening up new opportunities for analysis and model building.
The challenge: it was a multi-gigabyte dataset, beyond what a laptop’s RAM can reasonably be expected to hold. This made any type of manipulation, let alone modeling on the data in its entirety, a challenging task.
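One common way around a memory limit like this is to stream the file and keep only a running aggregate per user, rather than loading everything at once. The sketch below does so with Python's standard csv module. The column names fullVisitorId and transactionRevenue, the file name sessions.csv and the tiny in-memory sample are assumptions for illustration, not the actual competition schema.

```python
import csv
import io
from collections import defaultdict

def revenue_per_user(rows):
    """Accumulate total revenue per user, holding one row in memory at a time."""
    totals = defaultdict(float)
    for row in rows:
        # An empty revenue field is treated as zero (a non-converting session).
        totals[row["fullVisitorId"]] += float(row["transactionRevenue"] or 0)
    return dict(totals)

# Stand-in for a multi-GB file; in practice this would be something like
# open("sessions.csv", newline="") with a hypothetical file name.
sample = io.StringIO(
    "fullVisitorId,transactionRevenue\n"
    "A,\n"
    "B,25.5\n"
    "A,10\n"
    "B,4.5\n"
)

totals = revenue_per_user(csv.DictReader(sample))
print(totals)
```

Memory use then grows with the number of distinct users rather than with the number of hits or sessions.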
The new competition setup also involved some aspects that left many Kagglers wondering whether it was going to lead to a less objective evaluation methodology, giving random chance some room to play a role in the results.
The competition redesign led quite a few teams to leave and never return. The final number of submissions was only a fraction of the pre-leakage number. On the positive side, all these changes challenged those Kagglers who stayed on to get creative and figure out ways around those obstacles.
(Ironically, Kaggle is often called out for providing data that’s “too nice and tidy”, and therefore atypical of real-life cases where data scientists notoriously spend much of their time just trying to clean the data and bring it into the right shape.)
The modeling part
Activity in the forums tends to slow down as the competition is about to reach the finishing line. However, there is renewed interest after the submissions deadline. That’s when winners return to the forum to discuss their strategies and tactics, their winning models and methods. In this case, the final standings for this competition won’t be known until February.
Considering the way the problem was framed and its domain, there is no shortage of methods that could be deployed to tackle it: from classic machine learning models to Markov chains, survival analysis or, as is often the case for winning submissions, combinations of more than one model put to work together for improved predictive power.
In practice, what dominated the public kernels space were boosted tree algorithms, often coming with fancy names such as XGBoost, LightGBM (GBM standing for Gradient Boosting Machine) and CatBoost. They are known for their effectiveness, particularly on structured data problems (I personally worked with a tree-based algorithm too, the random forest).
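To give a feel for the idea behind boosted trees, here is a toy sketch of gradient boosting for squared error: each round fits a one-split tree (a "stump") to the residuals of the current prediction and adds a damped version of it to the ensemble. This is a from-scratch illustration on made-up data, not how XGBoost, LightGBM or CatBoost are implemented, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that minimises squared error."""
    best = None
    # Skipping the largest value keeps both sides of every split non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Each new stump is fitted to what the ensemble so far still gets wrong."""
    base = sum(ys) / len(ys)  # start from the mean of the target
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up data: x could be pageviews in a session, y the revenue that followed.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 20.0, 20.0]

model = fit_boosted_stumps(xs, ys)
mse_before = mse(ys, [sum(ys) / len(ys)] * len(ys))
mse_after = mse(ys, [predict(model, x) for x in xs])
print(f"MSE predicting the mean: {mse_before:.2f}, after boosting: {mse_after:.4f}")
```

The production libraries add much more on top of this (deeper trees, regularisation, handling of categorical features, speed optimisations), but the residual-fitting loop is the common core.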
Recurrent Neural Nets
What captured my interest the most, however, even though there were just a handful of public kernels around it, was the potential of deep neural nets, which could be a natural fit for modeling this competition’s data, even if this is not the most obvious solution.
Why deep learning for a structured problem like this?
Deep neural nets are typically used for image classification, text and speech recognition as well as other domains where the data tends to come in unstructured formats.
Thanks to this competition I discovered a not much talked about area where they can also be applied: digital, marketing and customer analytics. This is thanks to special neural net structures called Recurrent Neural Nets. RNNs, for short, operate on sequences of time-stamped events of varying length, retaining memory of what happened in the past. Sounds familiar?
Those events could be interactions of users with a website. Much as in natural language processing, sessions can be thought of as human sentences. Modeling them with RNNs can help to predict future user behaviour. This is not unlike the way Gmail these days predicts how we intend to finish our sentence given what we’ve said already.
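To make the analogy concrete, here is a minimal sketch of how a plain (Elman-style) RNN would read a session one event at a time, carrying a hidden state forward so that sessions of different lengths all end up as a vector of the same size. The event vocabulary, the tiny hidden size and the random, untrained weights are all assumptions for illustration. A real model would learn the weights and would typically be built with a deep learning library, often using LSTM or GRU cells.

```python
import math
import random

EVENTS = ["pageview", "add_to_cart", "checkout", "purchase"]  # hypothetical vocabulary
HIDDEN_SIZE = 3

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible

def random_matrix(n_rows, n_cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_cols)] for _ in range(n_rows)]

W_input = random_matrix(HIDDEN_SIZE, len(EVENTS))   # event -> hidden
W_hidden = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)  # previous hidden -> hidden

def one_hot(event):
    return [1.0 if e == event else 0.0 for e in EVENTS]

def rnn_step(hidden, event):
    """One recurrence: h_t = tanh(W_input . x_t + W_hidden . h_(t-1))."""
    x = one_hot(event)
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_input[i], x))
            + sum(w * hi for w, hi in zip(W_hidden[i], hidden))
        )
        for i in range(HIDDEN_SIZE)
    ]

def encode_session(events):
    """Run the RNN over a session and return the final hidden state."""
    hidden = [0.0] * HIDDEN_SIZE
    for event in events:
        hidden = rnn_step(hidden, event)
    return hidden

short_session = ["pageview", "pageview"]
long_session = ["pageview", "add_to_cart", "checkout", "purchase"]

short_state = encode_session(short_session)
long_state = encode_session(long_session)
print(short_state)
print(long_state)
```

In a conversion model, the final hidden state would then be passed to an output layer that predicts, for instance, the probability of a purchase in a future session.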
Taking part in the first Kaggle competition with Google Analytics data was great fun. It helped me learn several new things.
Here are my 3 main takeaways:
- Taking part in a competition is a great way to stay up to date with new developments and techniques in the field (such as boosted trees and RNNs) while sharing code and ideas with a diverse group of data science enthusiasts.
- Kaggle kernels (especially now with the new policy for private use) are a great way to get a flavour of what it’s like to work on a data science project purely in the cloud, either privately or collaboratively.
- Google Analytics is often thought of as a tool for reporting, business intelligence and visualisation. This competition is evidence that it could also become part of a data science/machine learning workflow.
If you took part in the competition, what do you think? Is there any important aspect that I missed? If you are a user of Google Analytics, how does the idea of predictive modeling for this type of GA data sound to you?
I am an independent consultant in marketing analytics and data science, helping conversion-driven digital businesses to make informed marketing decisions. I share my stories about digital, marketing and data analytics, often combined, on my blog and via Twitter and LinkedIn.
Alex Papageorgiou — marketing analytics consultant, ex-Googler - alex-papageo.com | LinkedIn