Interview Questions for Data Scientist Positions

There are loads of books on “cracking” the programming interview, and every computer scientist or software engineer has spent some time hunting down and trying to solve interesting interview problems. But the typical interview problems are not any good for assessing the aptitude of a data scientist. I've personally seen brilliant programmers and software engineers struggle for years with wrapping their minds around machine learning concepts and statistical analysis techniques. It is clear then that the job interview for a data scientist needs to have questions and problems specifically designed to gauge these abilities.

These are some questions I came up with when I was asked to conduct interviews for “Research Engineer” positions. Please feel free to give feedback and send your own questions to augment this list.

Modeling:

What is the simplest possible classification model you can learn from data?

I've seen time and again that some ML practitioners are used to sophisticated algorithms (e.g. SVMs, Gradient Boosted Trees, etc.) and have a very tenuous grasp of simpler modeling techniques. I believe this is a critical blind spot. Simple modeling techniques serve as good, solid baselines, are less prone to overfitting, and are easier to implement at large scale in online environments. The simplest classification model that can be learned from data is a threshold on a single feature. The next step up in complexity is a linear model linking the target variable to multiple predictors, or a single decision tree. A candidate should be able to write the algorithm to fit any of these models in 10 minutes or so.
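To make the "threshold on a single feature" idea concrete, here is a minimal Python sketch of how such a model could be fit by brute force. The names fit_threshold and predict, and the choice of training accuracy as the criterion, are my own assumptions for illustration, not a prescribed answer.

```python
from typing import List, Tuple


def fit_threshold(xs: List[float], ys: List[int]) -> Tuple[float, int, float]:
    """Fit a one-feature threshold classifier on binary labels (0/1).

    Returns (threshold, polarity, training_accuracy). The model predicts 1
    when polarity * (x - threshold) > 0, and 0 otherwise.
    """
    values = sorted(set(xs))
    # Candidate thresholds: one below the minimum (a constant classifier)
    # plus the midpoints between consecutive distinct feature values.
    candidates = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]

    best = (candidates[0], 1, -1.0)
    for t in candidates:
        for polarity in (1, -1):
            correct = sum(int(polarity * (x - t) > 0) == y for x, y in zip(xs, ys))
            accuracy = correct / len(xs)
            if accuracy > best[2]:
                best = (t, polarity, accuracy)
    return best


def predict(x: float, threshold: float, polarity: int) -> int:
    """Apply the fitted threshold rule to a single feature value."""
    return int(polarity * (x - threshold) > 0)
```

For example, fit_threshold([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]) picks a threshold between 3 and 10 with polarity 1. This quadratic-time search is fine for a whiteboard; sorting once and sweeping the threshold would be the natural next refinement.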

What are your favorite Machine Learning algorithms and why?

This is an inherently biased question, since every machine learning practitioner has his own set of favorite algorithms, and if the candidate’s picks match the interviewer’s, the candidate will definitely win the interviewer’s sympathy. But the goal of the question is really the “why” part. Whatever the candidate’s favorite algorithms are, he should be able to justify his choices convincingly. This question also allows the candidate to show actual passion and enthusiasm for the field, something I believe is crucial for a successful data scientist.

Why is feature selection an important step in modeling and what’s your favorite method of doing it?

This is kind of a trick question (at least coming from me), since I don’t really believe that feature selection is all that important. Not in most cases anyway. But it’s treated heavily in the literature, and I would love to see that the candidate is not just doing things a certain way because that’s how other people usually do them. Anyway, even if the candidate does believe in the importance of feature selection, the way he would go about it and whether he understands its costs would tell a lot about his caliber.

How do you go about tuning algorithm-specific hyper-parameters?

What I’m looking for here is basically any method smarter than a mindless grid search.
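One commonly cited alternative to a grid is random search, which samples each hyper-parameter independently instead of walking a fixed lattice. It is my pick for illustration, not the only acceptable answer. The sketch below assumes every hyper-parameter is positive and sampled log-uniformly, and that lower objective values are better. The name random_search and the shape of the search space are hypothetical.

```python
import math
import random
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def random_search(
    objective: Callable[[Params], float],
    space: Dict[str, Tuple[float, float]],
    n_trials: int = 50,
    seed: int = 0,
) -> Tuple[Params, float]:
    """Minimize `objective` by sampling each hyper-parameter log-uniformly.

    `space` maps a hyper-parameter name to its (low, high) bounds, both > 0.
    Returns the best parameter set found and its objective value.
    """
    rng = random.Random(seed)
    best_params: Params = {}
    best_score = math.inf
    for _ in range(n_trials):
        # Sample uniformly in log space so every order of magnitude
        # between low and high is equally likely.
        params = {
            name: math.exp(rng.uniform(math.log(low), math.log(high)))
            for name, (low, high) in space.items()
        }
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice the objective would be a validation loss from training the model with the sampled settings. A candidate might go further and mention Bayesian optimization or successive halving, but even this much shows they are thinking about how the search budget is spent.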

How do you know that your model is over-fitting and what do you do about it?

Simple. Straightforward. Still an essential question.

Metrics and experimentation:

You inherited a patch of land from your uncle. In the first year under your management, the land's yield drops to half of what it was the prior year. You investigate and find out that your uncle had a secret recipe that he didn’t pass on. There are three possible types of seeds, four types of fertilizers, and two types of pesticide. How would you go about re-discovering your late uncle’s formula?

Well, … randomized experiments with small land patches assigned randomly to treatments are a good start, including treatments that lack pesticide and fertilizer, then assessing main effects and interactions, getting confidence intervals, and possibly comparing finalist treatments in a subsequent round (depending on the statistical significance of the results), … something along these lines.
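As a rough sketch of the randomization step, the snippet below enumerates every seed/fertilizer/pesticide combination, including "none" levels for fertilizer and pesticide, and deals shuffled plots out to them. The function name assign_plots, the level labels, and the balanced round-robin design are assumptions made for this example.

```python
import itertools
import random
from typing import Dict, List, Sequence, Tuple

Treatment = Tuple[str, str, str]


def assign_plots(
    n_plots: int,
    seeds: Sequence[str],
    fertilizers: Sequence[str],
    pesticides: Sequence[str],
    rng_seed: int = 0,
) -> Dict[int, Treatment]:
    """Randomly assign plot ids 0..n_plots-1 to factorial treatments.

    Plots are shuffled and then dealt out round-robin, so each treatment
    gets the same number of plots whenever n_plots is a multiple of the
    number of treatments.
    """
    treatments: List[Treatment] = list(
        itertools.product(seeds, fertilizers, pesticides)
    )
    rng = random.Random(rng_seed)
    plots = list(range(n_plots))
    rng.shuffle(plots)  # the randomization guards against soil and location effects
    return {plot: treatments[i % len(treatments)] for i, plot in enumerate(plots)}


# Levels from the question, plus control levels with no fertilizer or pesticide.
SEEDS = ["seed_A", "seed_B", "seed_C"]
FERTILIZERS = ["none", "fert_1", "fert_2", "fert_3", "fert_4"]
PESTICIDES = ["none", "pest_1", "pest_2"]

plan = assign_plots(90, SEEDS, FERTILIZERS, PESTICIDES)
```

With the control levels included there are 3 x 5 x 3 = 45 treatments, so 90 plots give two replicates each. Main effects and interactions would then be estimated from the per-plot yields once the season is over.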

What kind of metrics would you track for your music streaming website?

There is no single good answer to this question, of course, but I’d be looking to assess the candidate’s grasp of metrics and their importance: the fact that most metrics have blind spots, how to combine several metrics into one “success” metric and the drawbacks of doing that, why it might be a good idea to change that metric every now and then, and so on and so forth.

If you were training a classifier, which metrics would you use for model selection and why?

How many times have I seen slides filled with precision/recall numbers that were completely useless for comparing models?! For this question I expect either a metric that compares classifier efficacy across the whole score range, like the area under the ROC curve, or at least a comparison of recall at a preset precision point, or something equally sensible.
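Both metrics mentioned above are short enough to write from scratch. The sketch below computes the area under the ROC curve as the fraction of positive/negative pairs that are ranked correctly (ties count as half), and the best recall reachable while precision stays at or above a preset level. The function names are mine, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a random positive example outscores a
    random negative one, with ties counted as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall_at_precision(
    scores: Sequence[float], labels: Sequence[int], min_precision: float
) -> float:
    """Highest recall over all score thresholds with precision >= min_precision."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_recall = 0.0
    true_pos = 0
    for i, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Tied scores cannot be separated by a threshold, so only
        # evaluate once the whole tie group has been included.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_pos / i >= min_precision:
            best_recall = max(best_recall, true_pos / total_pos)
    return best_recall
```

Comparing two models then means comparing one number each on the same held-out set, rather than precision/recall pairs taken at arbitrary and possibly different thresholds.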

You get a weekly spam message predicting the outcome of one football game each week. The spammer claims he has insider information and will let you in on it for a significant fee. You ignore it of course, but you keep getting the weekly message, and it keeps guessing the game outcome correctly for 10 weeks in a row. Should you pay him? What’s going on here?

This list is by no means exhaustive. In fact, I left whole areas and skills totally uncovered (especially where I believe the typical programming interview covers them). So I’d love to hear some suggestions to expand this list and make it more rounded.

(read the second part of this article)