pip install Data Science

Or why data scientists might be trained and assessed on the wrong skill set

This is my first Medium story, 100% inspired by my own (approx. 10-year-long) experience as a data miner, analyst, scientist, product manager, architect... So feel free to reach out if you think I missed something!

20 years ago, the combination of the Internet, mobile devices, the digitalization of services and falling storage costs led companies to start storing huge amounts of data. Some of them, the “pure players” born with and for the Internet, quickly realized that this data could have huge value (e.g. for brands, when considering Google search requests). Their profits quickly accumulated, which led “real world companies” (retailers, banks, car manufacturers, newspapers, etc.) to think they were also sitting on a new gold mine… Data could easily improve their products, customer experiences or internal processes, maybe all three simultaneously! That is how, 5 to 10 years ago, Data Labs, Data teams and other Data departments emerged, supposedly to pick all this new low-hanging fruit, and started to hire the first Data Scientists.

Since then, the demand for Data Scientists, described as “the sexiest job of the 21st century” by the Harvard Business Review, has not decreased… And here is what we can observe today (sorry in advance for my “French-only” figures):

  • tons of open positions waiting for Data Scientists (1,600 in France according to a quick search on indeed.fr, slightly fewer than for bakers or butchers, and approx. 0.35% of all open positions)
  • tons of data science courses, both online (you will find more than 800 related courses on Coursera, compared with 200 lessons to learn English) and offline (45 training courses at Paris-Saclay University alone)
leave the search bar blank… and the most appreciated courses on Coursera will appear
  • tons of students claiming to be data scientists: I received more than 150 applications for this internship…

So a new industry is born, leading to new growth and new jobs, all is well in the best of all worlds… Well, globally speaking, YES for sure. But for months I have had a bad feeling when meeting or interviewing junior Data Scientists, as if some of them had misunderstood what their future job would be. That feeling might be marginal, specific to me and to an old-fashioned way of thinking, but I am not sure.

Let me try to explain. There are different phases in a data project:

1/ Define the problem to solve: find a potential business opportunity, turn it into equations

2/ Gather the data and refine it: collect the data, understand it, study correlations, deal with bad quality

3/ Calibrate a machine learning model: in other words, build the best model (chosen among a vast collection: decision trees, gradient boosting, random forests, …) fitting the data

4/ Analyze the results, propose improvements (and go back to 2/… many times)

5/ Answer the question: is there a business opportunity here?

6/ If YES, industrialize the approach (automate the data pipeline, plug model outputs into existing processes / IT)

My main concern is that most people (I would say two thirds of candidates) consider Data Science as the art of mastering step 3/, to the detriment of the other phases (at least 2/, 4/ and 5/; 1/ and 6/ admittedly require some practice). And this is risky both for them and for companies.

Risky for Data Scientists. Indeed, step 3/ is being more and more automated, and mastering it will soon be useless. For instance, a Random Forest, which is a pretty complicated mathematical object (as described e.g. in this article), can be built in fewer than 10 lines of code. How? Simply install one of the existing open source data science libraries (e.g. scikit-learn) by typing the famous “pip install my_library”, and use the dedicated function.

building a random forest with scikit-learn
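To give an idea of how little code this takes, here is a minimal sketch, assuming scikit-learn is installed. The dataset (scikit-learn's bundled breast cancer data, used as a stand-in for real business data), the train/test split and the parameter values are illustrative choices, not a recommendation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy dataset shipped with scikit-learn (assumption: a stand-in for your own data)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The whole "step 3/": build the forest and fit it to the training data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on data the model has never seen
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Getting a number out is the easy part; whether that number means anything for the business is what the other steps are about.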

Well, you could argue that there are many eligible models (i.e. not only Random Forest), that each model comes with its own set of parameters that have to be optimized, or that people are now building combinations of models, neural networks… Well, OK, but it is just a matter of time… In fact, many libraries already exist to do part of the work mentioned above. For example, Uber very recently released “Ludwig, a Code-Free Deep Learning Toolbox”. You can even find point-and-click interfaces offered by some software companies (such as Dataiku) to do the job!

Dataiku Automated Machine Learning offer
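Even without a dedicated AutoML product, choosing among several models and tuning their parameters can already be delegated to a library. The sketch below uses scikit-learn's `GridSearchCV` for this, under a few assumptions: the three candidate models, the small parameter grids and the toy dataset are arbitrary illustrative choices, and this is not how Ludwig or Dataiku work internally.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A one-step pipeline whose "model" step is swapped by the search itself
pipeline = Pipeline([("model", DecisionTreeClassifier(random_state=0))])

# Each dict is one family of models with its own parameters to try
param_grid = [
    {
        "model": [DecisionTreeClassifier(random_state=0)],
        "model__max_depth": [3, 5, None],
    },
    {
        "model": [RandomForestClassifier(random_state=0)],
        "model__n_estimators": [50, 100],
    },
    {
        "model": [GradientBoostingClassifier(random_state=0)],
        "model__learning_rate": [0.05, 0.1],
    },
]

# Try every candidate with 3-fold cross-validation and keep the best one
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best_model_name = type(search.best_estimator_.named_steps["model"]).__name__
print(best_model_name, f"cv accuracy: {search.best_score_:.3f}")
```

The search picks a winner on its own, but it will not tell you whether accuracy was the right thing to optimize in the first place.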

Risky for companies. Hiring people who perform poorly at every step except 3/ brings all the drawbacks inherent to using black boxes: no one knows if it really works, why it works, how it compares to the current processes, what the inherent risks and limits are... In the best case, the Data Scientist fails to convince his/her colleagues to go further; in the worst case, the model goes into production… and fails.

What to do now? Here are my two cents.

As a recruiter, I not only check Kaggle results and the ability to code in Python but also try to assess

  • “business intuitions” (If I ask you to model such an event, which data would you need? Which features do you think should be the most important? etc.),
  • “modelling intuitions” (What happens if we remove this feature? What if we switch from this model to that one?),
  • curiosity (Is there something about our business you did not understand?),
  • and rigor.

As a big company, I would also open my data scientist positions to business / data analysts… Some of them might be good at all steps except step 3/.

As a future data scientist, I would spend time

  • reviewing my statistics classes
  • trying to understand how models work, mathematically speaking
  • self-criticizing my models (what if I add / remove this feature, change this hyperparameter, remove those rows from my data set…)
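The self-criticism in the last bullet can be practiced with very little code. Here is a minimal sketch, assuming scikit-learn and NumPy are installed. The dataset, the choice of a Random Forest, the trivial baseline and the decision to test only the first five features are all illustrative assumptions; `cv_accuracy` and `make_model` are hypothetical helper names introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
X, y = data.data, data.target
feature_names = [str(name) for name in data.feature_names]


def cv_accuracy(estimator, features):
    """Mean 5-fold cross-validated accuracy of `estimator` on `features`."""
    return cross_val_score(estimator, features, y, cv=5).mean()


def make_model():
    """Fresh, identically configured model for every experiment."""
    return RandomForestClassifier(n_estimators=50, random_state=0)


# Question 1: does the model beat "always predict the most frequent class"?
baseline_score = cv_accuracy(DummyClassifier(strategy="most_frequent"), X)
full_score = cv_accuracy(make_model(), X)
print(f"baseline: {baseline_score:.3f}  full model: {full_score:.3f}")

# Question 2: what happens to the score if I remove this feature?
score_drop = {}
for index, name in enumerate(feature_names[:5]):
    reduced = np.delete(X, index, axis=1)  # same data without column `index`
    score_drop[name] = full_score - cv_accuracy(make_model(), reduced)

for name, drop in sorted(score_drop.items(), key=lambda item: -item[1]):
    print(f"without '{name}': accuracy change {-drop:+.3f}")
```

If removing a feature you believed essential changes nothing, or if the model barely beats the baseline, that is exactly the kind of result worth investigating before talking to the business.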

And once this is done, I would send my resume to Paylead :-). Well, that’s the end for today. I hope you enjoyed reading, kind regards

Alexis