The term data scientist has been used lately to describe a wide variety of skills & roles. In this post I will focus on a particular flavor of data scientist. I will talk about the qualities needed to be a good data scientist-engineer who ships relevance products to users. Some examples of relevance products are:
- Search engines like Google, Bing, Foursquare, Yelp.
- Recommendation systems like Netflix movie recommendations, Amazon “what to buy” recommendations or Twitter “who to follow”.
- Smart news feeds like Facebook or LinkedIn.
These folks need to be strong at data science and engineering to be successful. In fact, relevance engineer might actually be a better term to describe these data scientists*.
Relevance engineers have a common set of skills that they draw upon to get their jobs done. The list below doesn’t include some of the known, obvious skills. You obviously need to be smart. You obviously need to have (or be able to learn quickly) the required “book” knowledge.
But beyond that, there are a bunch of not-so-obvious skills that you can’t learn from a book. Here are some of those, in no particular order:
- You need to enjoy an iterative process of development. If you want to build a relevance-based software feature**, you need to be able to build a version 0.1 quickly using a very simple model, then improve it with each successive iteration.
- You also need to have a good intuition for when to stop. By definition, relevance features are never done. You can always improve the accuracy a little more. But at some point, the effort you put in exceeds the value you derive from it. You need to be able to identify that point.
- You should be comfortable with failure. A lot of your models & experiments will fail. And that’s ok.
- You should be driven by curiosity. The best people are the ones who are genuinely curious about the world around them.
- You need to have a good data intuition. You should be good at identifying patterns in the data. Being able to create quick data visualizations (using R, Python, Matlab or Excel etc.) helps.
- You need to have a good sense of metrics and be metrics-driven. You should be able to establish metrics that define the success or failure of your feature. You should feel comfortable with blind experiments and terms like precision, recall, accuracy, ROC, conversion rates and NDCG.
- Metrics are great at giving a high level view of how your feature is doing. But at the same time, you should never stop directly looking at individual examples. Manually looking at your biggest wins and your biggest losses (e.g. worst performing queries for a search engine) teaches you a lot about your feature that raw metrics don’t.
- You should be able to develop a generalized approach to fixing the bugs/misclassifications in your models. Fixing individual bugs will only let you reach a local maximum. More often than not, individual fixes will also make your models more complex and harder to work with. Gathering all the issues together and identifying common themes will let you focus on the highest-impact issues to fix in your next round of iteration.
- When developing your models, you should be able to put yourself in the minds of the users of your product. It’s easy to build something that’s good enough for you. But your current & future users matter way more than you do. There are also a lot of biases that affect individual decision making, and you should try to be aware of and account for as many of them as possible.
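To make the points about metrics and about looking at your worst examples a bit more concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: the function names are hypothetical, the relevance judgments are made-up toy data, and the NDCG here uses linear gains and computes the ideal ordering from the returned results only, which is a simplification of what a production evaluation would do.

```python
import math


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant items."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2(position + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(gains, k):
    """NDCG@k, where `gains` are graded relevance labels in ranked order.

    Assumption: the ideal ranking is a re-sort of the same returned results.
    """
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0


def worst_queries(judged, k, n):
    """Return the n queries with the lowest NDCG@k, worst first."""
    scored = sorted((ndcg_at_k(gains, k), query) for query, gains in judged.items())
    return [query for _, query in scored[:n]]


# Toy judgments (hypothetical): query -> graded relevance of each result, in ranked order.
judged = {
    "pizza near me": [3, 2, 0],
    "coffee": [0, 0, 3],
    "late night ramen": [0, 0, 0],
}

if __name__ == "__main__":
    print(precision_recall(["a", "b", "c", "d"], {"a", "c", "e"}))
    for query in worst_queries(judged, k=3, n=2):
        print(query, ndcg_at_k(judged[query], 3))
```

The aggregate numbers tell you how the feature is doing overall, while the output of `worst_queries` is the short list you would actually open up and read by hand.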
This list is by no means exhaustive, but it does capture some of the qualities of the smartest folks I have worked with. Happy to hear what you think.
*Some companies use the term data engineer, but that can mean other things too (like data-infrastructure engineers). I hope the term relevance engineer catches on, as it is a more accurate description of this role that is becoming increasingly common in the software world.