France's Largest Data Science Challenge Ever Shows The Unreasonable Effectiveness Of Open Data
The headline of this post is an homage to the famous paper “The Unreasonable Effectiveness of Data”, itself an homage to the famous paper “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”.
Playground
Datascience.net is the first French-speaking data science platform, launched one year ago by a pool of data specialists. A bit like Kaggle, it bridges the gap between organizations with complex data-centric problems and the best data scientists willing to solve them.
Several challenges have already been hosted by Datascience.net: predicting whether a patent will be granted, or guessing the selling price of a second-hand car, are some examples of problems previously tackled by the French data geeks.
Challenge
The latest challenge published on the French data science site was organized by SNCF, the French National Railway Company. The goal? To predict the number of passengers using 105 commuter train stations in Paris and its suburbs.
This is a crucial problem. A predictive model that answers this question can help the national company decide, based on data-driven facts, where infrastructure investments are required and which ones, or uncover inconsistencies in the station network that have never been pointed out before. After all,
(big) data is nothing more than a tool for capturing reality — in some clearer and more accurate ways than we have been able to do in the past (ref.)
Battle
The challenge rules were deliberately very permissive: to build the machine learning model, participants were allowed to collect data from every imaginable open data source. And that is where the magic of the story lies.
Indeed, more than 400 data hackers took up the challenge, almost twice the usual audience, and used their creativity to crack the problem. Machine learning relies on so-called features: here, characteristics directly or indirectly linked to each train station. The more relevant features you have, the more tiny clues your algorithm can capture on the way to an accurate predictive model.
Now let’s illustrate the imagination of the winners by showing a blend of the best open data sources they mined:
Some classic ones, like:
- population per city
- train timetables
- type and number of facilities per station (elevators, escalators, …)
Some unexpected ones, like:
- student commuting flows, between home and place of study
- worker commuting flows, between home and workplace
- the detailed results of the last presidential election
- tourism accommodation capacity
Some hard-to-grab ones, like:
- number of station platforms, scraped from Wikipedia
- size of each station’s Wikipedia page
- number of parking spaces, scraped from the official station pages
Some crazy ones, like:
- number of Foursquare check-ins
- number of Google results for the search “name of the station”
- indicator revealing whether or not the train station has an English Wikipedia page
Some computed ones, like:
- distance between the train station and the center of Paris
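To give an idea of how a computed feature like this last one can be built, here is a minimal sketch in Python using the haversine (great-circle) formula. The post does not say how the contestants actually computed the distance, nor which point they took as the center of Paris, so the reference coordinates, the function names and the sample station below are assumptions made for illustration only.

```python
import math

# Assumed reference point for "the center of Paris" (roughly Notre-Dame).
# The challenge write-up does not specify which point was used.
PARIS_CENTER = (48.8534, 2.3488)
EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def distance_to_paris_center(station_lat, station_lon):
    """Feature value: distance in kilometers from a station to the assumed center."""
    return haversine_km(station_lat, station_lon, *PARIS_CENTER)


# Hypothetical station located west of Paris (illustrative coordinates).
print(round(distance_to_paris_center(48.80, 2.13), 1), "km")
```

One number per station is all a model needs from this feature; the formula ignores the actual rail network, so it is only a rough proxy for how far out a station sits.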
Epilogue
By smartly using all these datasets (finding the data is one thing, knowing how to select and use it is another), the best data gurus achieved a near-perfect prediction model. And it could probably be improved even further by merging the solutions of all the contestants.
Now, for the French Railway Company, after this major step, the job is anything but done. Based on these discoveries, and this asset, what is the next best action to take? How can the value of this work be captured? Sometimes (often?),
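How could several solutions be merged? The post does not say, so the sketch below assumes the simplest possible approach: averaging the predictions of each model, station by station. The passenger counts are invented for the example, and the error measure (root mean squared error) is also an assumption, since the challenge metric is not given here.

```python
from statistics import mean


def blend(predictions_per_model):
    """Average the predictions of several models, station by station."""
    return [mean(values) for values in zip(*predictions_per_model)]


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    return mean((p - a) ** 2 for p, a in zip(predicted, actual)) ** 0.5


# Invented numbers for three stations, for illustration only.
actual = [100, 200, 300]
model_a = [120, 190, 310]
model_b = [90, 220, 280]

blended = blend([model_a, model_b])
print("model A:", rmse(model_a, actual))
print("model B:", rmse(model_b, actual))
print("blend  :", rmse(blended, actual))
```

In this toy case the two models err in opposite directions, so their average lands closer to the truth than either one alone. That is not guaranteed in general: blending helps most when the models make different mistakes.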
The problem is not “which model do we choose” but “what action do we take” (ref.)
For fresh data stories, you can follow me on Twitter: @chris_bour