France’s Largest Data Science Challenge Ever Shows the Unreasonable Effectiveness of Open Data

The headline of this post is an homage to the famous paper “The Unreasonable Effectiveness of Data”, itself an homage to the famous paper “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”.


Playground

Datascience.net is the first French-speaking data science platform, launched one year ago by a group of data specialists. A bit like Kaggle, it bridges the gap between organizations with complex data-centric problems and the best data scientists willing to solve them.

Datascience.net has already hosted several challenges: predicting whether a patent will be granted, or guessing the selling price of a second-hand car, are examples of problems previously tackled by French data geeks.

Challenge

The latest challenge published on the French data science site was organized by SNCF, the French National Railway Company. The goal? To predict the number of passengers using 105 commuter train stations in Paris and its suburbs.

This is a crucial problem. A predictive model that answers this question can help the national company decide where infrastructure investments are required, and which ones, based on data-driven facts. It can also reveal inconsistencies in the station network that have never been pointed out before. After all,

(big) data is nothing more than a tool for capturing reality —in some clearer and more accurate ways than we have been able to do in the past (ref.)

Battle

The challenge rules were deliberately very permissive: to build their machine learning models, participants were allowed to collect data from every imaginable open data source. And that is where the magic of the story lies.

Indeed, more than 400 data hackers took up the challenge, almost twice the usual turnout, and used their creativity to crack the problem. Machine learning relies on so-called features: here, characteristics directly or indirectly linked to each train station. The more features you have, the more subtle clues your algorithm can pick up on the way to an accurate predictive model.

Now let’s illustrate the winners’ imagination with a selection of the best open data sources they mined:

Some classic ones, like:

Crazy open data sources used by the winners: Foursquare check-ins

Some unexpected ones, like:

Some hard-to-get ones, like:

  • number of station platforms, scraped from Wikipedia
  • size of the station’s Wikipedia page
  • number of parking spaces, scraped from the official station pages

Some crazy ones, like:

  • number of Foursquare check-ins
  • number of Google results for the search “name of the station”
  • indicator of whether or not the train station has an English Wikipedia page

Some re-created ones, like:

  • distance between the train station and the center of Paris
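
To give an idea of what a “re-created” feature looks like, here is a minimal sketch of the distance-to-center computation using the haversine formula. This is not the winners’ actual code: the reference point chosen for the center of Paris, the function names and the sample coordinates are all assumptions made for illustration.

```python
import math

# Assumed reference point for "the center of Paris" (approximate
# coordinates of the Notre-Dame area, chosen for illustration only).
PARIS_CENTER = (48.8534, 2.3488)
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_paris_center(lat, lon):
    """Feature value: distance in km from a station to the assumed center."""
    return haversine_km(lat, lon, *PARIS_CENTER)


# Hypothetical station located west of Paris (made-up coordinates).
print(round(distance_to_paris_center(48.80, 2.13), 1))
```

The station coordinates themselves would have to come from an open dataset; once they are available, one extra column like this costs only a few lines.
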

Dataviz by @mattsco, showing the geolocated performance of the model (made with the Dataiku Data Science Studio)

Epilogue

By smartly using all these datasets (finding the data is one thing, knowing how to select and use it is another), the best data gurus achieved a near-perfect prediction model. And it could probably be improved even further by merging the solutions of all the contestants.
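
How could solutions be merged? The post does not say, but one common approach is simply to average the predictions of several models. The sketch below assumes that approach; the function names and all the numbers are made up for illustration, and the gain only shows up when the models do not make the same mistakes.

```python
def blend(predictions, weights=None):
    """Weighted average of several models' predictions for the same stations."""
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models  # plain average by default
    n_items = len(predictions[0])
    if any(len(p) != n_items for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_items)
    ]


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up passenger counts for three stations (illustration only).
actual = [1000, 2500, 400]
model_a = [1100, 2400, 450]
model_b = [950, 2650, 380]

blended = blend([model_a, model_b])
for name, preds in [("A", model_a), ("B", model_b), ("blend", blended)]:
    print(name, mean_absolute_error(actual, preds))
```

In this toy case the two models err in opposite directions, so their errors partly cancel out in the blend. With strongly correlated models, averaging would help much less.
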

Now, for the French Railway Company, after this major step, it’s anything but a done deal. Based on these discoveries and this asset, what is the next best action to take? How can the value of this work be captured? Sometimes (often?),

The problem is not “which model do we choose” but “what action do we take” (ref.)