Lessons learned from driving (March) mad 🏀 📈 🐍

Joshua Görner
6 min read · Mar 17, 2018


TL;DR: Kaggle is fun, embrace Bayesian thinking, use Docker, code on GitHub

In this post I’d like to share the main lessons I’ve learned from participating in this year’s March Madness competition on Kaggle:

Google Cloud and NCAA® have teamed up to bring you this year’s version of the Kaggle machine learning competition. Another year, another chance to anticipate the upsets, call the probabilities, and put your bracketology skills to the leaderboard test. Kagglers will join the millions of fans who attempt to forecast the outcomes of March Madness® during this year’s NCAA Division I Men’s and Women’s Basketball Championships. But unlike most fans, you will pick your bracket using a combination of NCAA’s historical data and your computing power, while the ground truth unfolds on national television.

Lesson 1 — Embrace the Bayesian way 🎲

If you take a look at the Kaggle Kernels you will see a lot of “Swiss army knife” solutions. Ask yourself if one of the following code snippets looks familiar to you :-)
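Something along these lines, a minimal sketch with synthetic placeholder data standing in for the actual competition features:

```python
# The classic "read data, fit a model, predict probabilities" recipe
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the usual seed / ranking / efficiency features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
pred = model.predict_proba(X_test)[:, 1]
print(f"log loss: {log_loss(y_test, pred):.4f}")
```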

Don’t get me wrong. Those approaches and frameworks are awesome and most of the time they do the job. But it was my recent read of The Signal & the Noise that gave me the urge to try some “new” approaches to solving data science problems — Bayesian approaches, to be specific.

What are “Bayesian approaches”?

An explanation of Bayesian thinking in layman’s terms boils down to three steps (sketched in code right after the list):

  1. Have an initial gut feeling about a situation (a.k.a. prior belief)
  2. Make some observations
  3. Update your initial gut feeling based on the observations (a.k.a. posterior belief)
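In code, those three steps boil down to a couple of lines. The following toy sketch uses invented numbers (the prior, the likelihoods and the observation are all made up) just to make the mechanics concrete:

```python
# 1. Prior belief: a 70% gut feeling that the favourite wins the matchup
prior_win = 0.70

# 2. Observation: the favourite led comfortably in the head-to-head stats.
#    How likely is that under each hypothesis? (likelihoods are invented)
likelihood_if_stronger = 0.60   # P(observation | favourite really is stronger)
likelihood_if_weaker = 0.25     # P(observation | favourite is not stronger)

# 3. Posterior belief via Bayes' rule
evidence = (prior_win * likelihood_if_stronger
            + (1 - prior_win) * likelihood_if_weaker)
posterior_win = prior_win * likelihood_if_stronger / evidence
print(f"updated belief: {posterior_win:.2f}")   # ~0.85
```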

The idea was to start by estimating a winning probability for each matchup in the first round. After inferring the most likely winner, that information gets incorporated into the next matchup, and so on until the final game.

Due to the lack of time I was not able to go all the way down the Bayes route — however, I was able to lay the foundation for a model that is simple to update after each match using beta distributions.
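To give you an idea of the direction, here is a minimal sketch of such a beta-based matchup estimate. The team records and the Monte Carlo comparison are illustrative assumptions, not the exact model from the repository:

```python
import numpy as np
from scipy import stats

# Prior belief about each team's strength, derived from its season record
# (the win/loss counts below are made up for illustration)
alpha_a, beta_a = 25 + 1, 8 + 1    # team A: 25 wins, 8 losses
alpha_b, beta_b = 18 + 1, 14 + 1   # team B: 18 wins, 14 losses

# Estimate the probability that team A is the stronger team by sampling
np.random.seed(2018)
samples_a = stats.beta.rvs(alpha_a, beta_a, size=100_000)
samples_b = stats.beta.rvs(alpha_b, beta_b, size=100_000)
print(f"P(team A beats team B) ~ {np.mean(samples_a > samples_b):.2f}")

# After the game is played, the update is a one-liner:
# a win increments alpha, a loss increments beta
alpha_a += 1   # suppose team A actually won
```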

What are beta distributions?

Beta distributions offer the following advantages:

  • They are easy to update after observing new information
  • Their parameters are easily transferable to real world situations
  • They offer a probability distribution instead of just a point estimate and therefore capture the uncertainty (also a concept in Bayesian thinking)

In contrast to the normal distribution, the beta distribution is not that widely known if you are not in a statistical field of study. Its probability density function can look a bit intimidating in the beginning, but once you know how to interpret its parameters it becomes a handy tool.

In a nutshell the beta distribution takes two parameters: α and β. One way to understand those variables is to

  • see α as the number of successes
  • see β as the number of losses

The probability density function gives you a probability distribution over the probability of success. Here is an easy example to grasp the main concept:

Imagine you are the coach of a basketball team and you have to draft a new player onto your team. There are two promising candidates, and you would like to evaluate their performance based on their ability to score three-pointers.

Player A throws 100 balls and scores 60 times, resulting in 60% accuracy.

Player B throws 2 balls and scores 2 times, resulting in 100% accuracy.

If you went solely by accuracy you would take Player B, but since Player B only threw twice, how trustworthy is his 100% accuracy score? The beta distribution captures exactly that kind of uncertainty, as the following figure shows:

As the sharp peak indicates, you can be quite certain that Player A has a ~60% chance of scoring a three-pointer. In contrast, the probability curve of Player B is rather flat, expressing the uncertainty about whether the 100% score is luck or skill.
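The figure is easy to reproduce with scipy. A quick sketch, assuming a flat Beta(1, 1) prior on top of the observed shots:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

x = np.linspace(0, 1, 500)

# Flat Beta(1, 1) prior updated with the observed three-point attempts
player_a = stats.beta.pdf(x, 1 + 60, 1 + 40)   # 60 hits, 40 misses
player_b = stats.beta.pdf(x, 1 + 2, 1 + 0)     # 2 hits, 0 misses

plt.plot(x, player_a, label="Player A (60 / 100)")
plt.plot(x, player_b, label="Player B (2 / 2)")
plt.xlabel("three-point success probability")
plt.ylabel("density")
plt.legend()
plt.show()
```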

If you are interested in more technical details on beta distributions (esp. in Python), I encourage you to have a look at this Beta Distribution cheatsheet or one of the Notebooks that covers the odd derivation.

How well did those statistical approaches perform?

After the first games of March Madness, this approach currently sits in the top 2% of the Kaggle leaderboard, which doesn’t seem too bad. To be honest, I am quite sceptical whether this performance will be sustained throughout the tournament, but for now there is not too much to complain about. The next step would be to incorporate not only wins & losses but also other features as random variables and see how those distributions can be blended.

Lesson 2 — Docker is dope 🐳

Wait… I’m a data scientist — not a software engineer!

Why is Docker, a containerisation tool, mentioned in a post that covers data science approaches and analytical models? Well, both disciplines have (at least) one major concept in common: delivering high quality.

As one aspect of quality, the term “science” implies the idea of experiment reproducibility. Over time the Jupyter Notebooks will grow, and keeping track of every step becomes harder and harder — especially if you are working in a team with different setups.

How to make experiments portable & reproducible?

Kudos to Peter Bull & Isaac Slavitt, whose 2016 talk “Data science is software” proposes a lean, flexible and extensible setup for such projects (take the time to watch that video — it will definitely enrich your mindset).

Inspired by that idea, there are only two commands I have to run in order to fetch the data & derive all necessary features: git clone and make data.

Well, that was not the complete truth. Since Bull’s & Slavitt’s talk nicely covers the setup inside of Jupyter, I wanted to extend the concept and make the whole infrastructure as portable as possible. This is where Docker comes into play.

Orchestrating the application components with docker-compose just adds two additional commands to go from scratch to the full setup including derived data:

git clone → docker-compose build → docker-compose up → make data

What is the overall architecture?

The resulting setup is quite simple as shown below:

The Jupyter container is the main entry point to fetch data, derive new features and ultimately build the models. The Postgres container works as a storage backend and also as a “broker” between Jupyter and the Superset container, which can be seen as the visualisation component (and is simply awesome, as the video snippet below shows).

Apache Superset rocks when building easy-to-use and interactive dashboards 💪

How can I start using this architecture?

Here is a docker template of this composed setup that you might want to bootstrap your next project with. If you would like to dig deeper into the general question of how containerisation can improve your analytical workflow, I would highly recommend going through Docker for Data Science. It is easy to understand, covers the main concepts and is focused on hands-on content.
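To give you a rough idea of what such a composed setup contains, here is a stripped-down sketch of a possible docker-compose.yml (image tags, ports and credentials are placeholders; the actual template is more complete):

```yaml
# docker-compose.yml (simplified sketch)
version: "3"

services:
  jupyter:                         # main entry point: fetch data, build features & models
    image: jupyter/scipy-notebook
    ports:
      - "8888:8888"
    volumes:
      - ./:/home/jovyan/work
    depends_on:
      - postgres

  postgres:                        # storage backend and "broker" between the containers
    image: postgres:10
    environment:
      POSTGRES_USER: madness
      POSTGRES_PASSWORD: madness
      POSTGRES_DB: madness

  superset:                        # visualisation component / dashboards
    image: amancevice/superset
    ports:
      - "8088:8088"
    depends_on:
      - postgres
```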

I hope you enjoyed reading. Maybe this post offered you a bit of food for thought, and one or the other resource will come in handy for you. You might want to take a look at the GitHub repository covering all the concepts mentioned in this post. For my part, I am keen to see how far the Bayesian approach works out in this year’s March Madness.

