I’ve been trying to reason about some of the things I’ve been observing in the data science world over the last couple of years:
- An explosion in courses, textbooks, talks, and blog posts on “Advanced Methods” (e.g. deep learning, XGBoost, nonparametric empirical Bayes, elastic net regression).
- Interviewing many candidates with expertise in the above who lacked the analytical reasoning required to make good business or product decisions, with some unable to work through foundational examples of coin tosses and dice rolls. (I know some folks dislike this line of interview question, but these toy scenarios form the basis of many real-world use cases, from hospitalization risk to fraud prevention. For this reason it’s hard for me to trust those who can’t work through them.)
- The bulk of the value generation I’ve seen came from a unique and creative insight around the treatment of the data. The specific method used afterwards (whether an ML algorithm or a statistical test, depending on the use case) contributes at the margins. (This isn’t universally true: with a true expert, a particularly well-matched problem, and just the right data, an algorithm can produce magic. But this situation is rarer than we might imagine.)
Here’s what I came to: Methodologies are the vanity metrics of Data Science.
What do I mean by this? Vanity metrics are by now a well-known phenomenon in the startup world: they are metrics that make you feel good about yourself but don’t actually reflect the underlying health of your business. Page Views and User Sign Ups are canonical examples. It feels nice when they go up, but do more page views or more sign ups directly translate into revenue? Usually not. There are more actionable things you can measure that are much more indicative of a healthy, growing business, such as measures around the funnel (e.g. conversion rate) and user engagement (e.g. monthly active users). The key takeaway is that if you run your business optimizing for vanity metrics, you will fail.
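To make the gap concrete, here is a minimal sketch in Python. The monthly figures are invented purely for illustration, and `conversion_rate` is a hypothetical helper rather than anything from a real analytics tool. It shows how page views can climb month over month while the share of views that end in a purchase falls.

```python
# Hypothetical monthly figures, invented purely for illustration.
months = [
    {"month": "Jan", "page_views": 10_000, "purchases": 200},
    {"month": "Feb", "page_views": 20_000, "purchases": 210},
    {"month": "Mar", "page_views": 40_000, "purchases": 220},
]


def conversion_rate(purchases: int, page_views: int) -> float:
    """Fraction of page views that end in a purchase (0.0 if there were no views)."""
    return purchases / page_views if page_views else 0.0


# One conversion rate per month, in the same order as `months`.
rates = [conversion_rate(m["purchases"], m["page_views"]) for m in months]

for m, r in zip(months, rates):
    print(f'{m["month"]}: {m["page_views"]:>6} views, conversion {r:.2%}')
```

In this made-up scenario the vanity metric quadruples while purchases barely move, so anyone watching only page views would badly misread how the business is doing.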
I think methodologies play the same role for a data scientist. When you work on learning new methods (Now I know Random Forest! Now I know K-L Divergence! Now I know Deep Learning!) it feels good. You’re exercising your brain, you know something you didn’t before, and it’s easy to think you’re progressing. But methods don’t in and of themselves drive value. As a data scientist, your first (and only) job is to drive value, and there is really only one way to do that: solving valuable problems. This feels to me like the root of the mismatch. Methods are tools with which to solve problems that generate value; they are not ends in and of themselves:
- Amazon doesn’t use ML and statistical techniques to solve the problem of “finding you similar items you might be interested in”. It uses them to solve the problem of “getting you to buy more things on Amazon once you’re here”.
- Similarly, LinkedIn and Facebook don’t use advanced methodologies to solve the problem of “finding you people you may know” or “showing you interesting and relevant content”. Rather, they use them to solve the problem of “getting you to spend more time in our app, especially if you’re new”.
- And Google hasn’t been developing new techniques to solve the problem of “semantically organizing the web” or “categorizing the world’s information” or even of “finding you the most relevant search results”. Rather, it does so to solve Google’s problem of “having you execute all of your searches on Google so we can show you ads (and track you across the web, to better target you with our ads)”.
The common thread here is that each company was using methods to solve a problem (or problems) of high value. In some cases the method they chose was the best or only option; in others, different approaches may have worked just as well. And the company, from Google to Facebook to the corner store down the street, would have used any of them. It is agnostic to the solution implemented as long as the problem gets solved.
None of this is to say that you should not put effort into developing technically. Understanding methodologies can open your mind to what is possible. But it is to say that to become a great Data Scientist you need to be able to identify and solve problems of tangible value to other people. Technical skills and advanced methodologies are among the tools you should bring to bear on this, but if that’s where you’re spending all of your time, it will be hard to find success.
Find helpful resources (and more) at worldlybayes.com.