Hiring data scientists (part 1): what to look for in a candidate
The technical and business skills that are critical to being a data scientist
There is a shocking lack of material written on how to hire good data scientists. There have been plenty of articles telling you, dear manager, how to hire good employees. They range from fluffy recruitment pieces urging you to hire people with character (whatever that means), to software development hiring guides which are closer but don’t hit the mark. The field of data science has its own problems, like working with messy data and presenting non-technical insights drawn from deeply technical problems. I am going to shine a light on how I hire good data scientists within my own company, from what makes the ideal candidate to how I interview and run case studies.
Before jumping into my methods, it’s important to understand what I am looking for. To do that, you have to know something about who I am and what my team does. I’m the Director of Analytics at Lenati, a marketing and sales strategy consulting firm. We help clients acquire customers, improve loyalty programs, and handle other marketing and sales related tasks. While most of the people at Lenati are MBA-holding business people, my team contains data scientists who use client data to shape our solutions. So for example, if Lenati is working with Hilton to revise their customer-based strategy, my team will be involved in taking Hilton’s data and using it to figure out how their customers have historically behaved.
Our projects are generally 3–6 months long and our product is a strategy based on data-driven insights. Because of this, we rarely deliver code or a machine learning model. Instead these are just tools we use to develop the strategy we present to the client. So, like other data science jobs:
- We are constantly in communication with our business stakeholders.
- We do analysis at many levels, from exploratory data analysis to complex simulation.
- The recipient of our work often has little to no understanding of data science.
Unlike other data science jobs, our work is usually a one-time affair: it’s rare that we need to refit our models or make our code long-lasting. We also work with many different companies, so we have to be prepared to receive data in all sorts of formats and with limited explanation of what it means.
The ideal candidate: skills
The ideal candidate has skills in each of three broad buckets: math/statistics, databases/programming, and business. My hiring process is designed around probing the candidate in each area to see where they fall. Here is a diagram I made of it, which is extremely derivative of Drew Conway’s Venn diagram:
Math and statistics
I’m looking for their background in math and stats to show me that they understand the concepts needed to do data science work. This includes basic statistics (example: linear regression, what it is and when it works well) and model building (example: training versus test data, different training approaches like cross validation, what “boosting” even means). If the candidate is sufficiently experienced in this area, they should be able to list the different models they’ve used and the different types of problems they’ve worked on.
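To make the model-building concepts concrete, here is a minimal sketch of fitting a linear regression and estimating its held-out error with k-fold cross validation. It uses only the Python standard library; the data is synthetic and the function names (`fit_line`, `cross_validated_error`) are made up for illustration, not taken from any particular library or from my team's code.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mean_squared_error(model, xs, ys):
    """Average squared gap between predictions and actual values."""
    intercept, slope = model
    return mean((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def cross_validated_error(xs, ys, k=5, seed=0):
    """Fit on k-1 folds, score on the held-out fold, and average the scores."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors.append(mean_squared_error(
            model, [xs[i] for i in test_fold], [ys[i] for i in test_fold]))
    return mean(errors)

# Synthetic data, invented for this example: y = 2x + 1 plus random noise.
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.0) for x in xs]

intercept, slope = fit_line(xs, ys)
cv_error = cross_validated_error(xs, ys, k=5)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, cross-validated MSE={cv_error:.2f}")
```

The point of the held-out folds is that the error is always measured on data the model did not see while being fit, which is the distinction between training and test data.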
Though someone with a degree in statistics or data science should pass this portion of the assessment with flying colors, someone with a related degree (math, computer science, or economics) may not have the machine learning component of this skill. If they don’t have machine learning knowledge, then having a related degree suggests they can pick up the skill on the job. One of my best hires had his Master’s degree in fishing science. During his degree he had to do mathematical modeling, so he picked up R and *boom!* six months later he was a pretty good data scientist (who I then hired).
Databases and programming
In its most basic form, data science is the art of taking existing data and processing it in a meaningful fashion. That means you need to be able to (1) pull the data from a source and (2) process it into an insight. The ideal candidate should have the technical knowledge to do both of these steps. Counter to intuition, the tools to get data and process it are not the same.
To pull the data, the candidate must understand relational databases. Since data is stored in relational databases, and these are queried using SQL, the candidate must know SQL. Sometimes a candidate doesn’t know SQL but conceptually knows how to join tables and aggregate them. If this is the case they can pick up SQL on the job. If they have experience in storing data in other ways like NoSQL then that’s a plus, but not something I expect. Only knowing how to read data from flat files isn’t enough.
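As a rough illustration of what "join tables and aggregate them" means in practice, the sketch below builds a tiny in-memory SQLite database and runs one query against it. The tables, columns, and rows (`customers`, `stays`, `tier`, `nights`) are hypothetical and invented for this example; they are not a real client schema.

```python
import sqlite3

# An in-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, tier TEXT);
    CREATE TABLE stays (stay_id INTEGER PRIMARY KEY, customer_id INTEGER, nights INTEGER);
    INSERT INTO customers VALUES (1, 'gold'), (2, 'silver'), (3, 'gold');
    INSERT INTO stays VALUES (1, 1, 2), (2, 1, 3), (3, 2, 1), (4, 3, 4);
""")

# Join each stay to its customer, then aggregate by loyalty tier.
query = """
    SELECT c.tier, COUNT(*) AS stay_count, SUM(s.nights) AS total_nights
    FROM stays AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    GROUP BY c.tier
    ORDER BY c.tier
"""
rows = conn.execute(query).fetchall()
for tier, stay_count, total_nights in rows:
    print(tier, stay_count, total_nights)
conn.close()
```

A candidate who can explain why the join comes before the grouping here has the conceptual understanding I am looking for, even if the syntax is new to them.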
Once you have data, it needs to be used. That may mean the complex act of building a machine learning model, but it certainly always means creating a visualization of the material. The candidate needs to be able to do this, which requires writing code. If a candidate knows R, Python, or MATLAB they are good to go on day one. If they know a language more common in software development like Java then they can easily pick up a statistics-focused language. There are GUI-based tools to do this work, but if the candidate has only used GUIs then their skill set is too limited to do the wide array of work I expect from a data scientist.
If they only have experience in Excel then they do not satisfy this requirement. While Excel can create visualizations, that’s pretty much all it can do and it doesn’t even do that quickly. Having only used Excel shows the candidate hasn’t thought about what can be done outside of Excel — or worse, they decided they’d rather stick to only using what they already know.
Someone with a computer science degree satisfies this by definition. A data analyst often only works in Excel and hasn’t connected directly to their data. Someone in business intelligence may be able to query data, but they lack the programming tools to meaningfully manipulate data. Therefore most data analysts and people from business intelligence do not satisfy this criterion.
Business
Let’s face it: the ability to understand a business environment and work within it is just as important an ability as statistics and programming. The whole idea of data science is using technical skills to create real world, practical insights, so you need to be able to understand how the real world works and what insights people need. The candidate needs to be able to:
- understand a problem a person or department is having within a company,
- translate that into a problem that data science can solve,
- solve it (using their math/stats or databases/programming skills), and
- convert that solution into an insight that someone who doesn’t know anything about data science can use.
75% of those steps are centered around business fundamentals. For example, if a company is having trouble with their promotional emails, the candidate may consider segmenting the email recipients using a clustering algorithm and giving each segment a personalized email. If their approach involved using a k-means algorithm with 7 clusters, the candidate should be able to explain why they chose 7 to someone who doesn’t know what a clustering algorithm is. Depending on their level, they should also understand how often to check in with a client or project stakeholder, be able to ask the data owner questions if they don’t understand what’s in the data, and make a well-formatted PowerPoint. Someone with several years of experience working at a company usually satisfies this requirement.
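One common way to justify a cluster count, though certainly not the only one, is to run k-means for several values of k and look for the point where adding clusters stops reducing the within-cluster spread by much (often called the elbow heuristic). The sketch below is a simplified one-dimensional version under stated assumptions: the `kmeans_1d` function is hypothetical, the starting centers are chosen deterministically rather than at random, and the `monthly_opens` numbers are made up.

```python
def kmeans_1d(values, k, max_iterations=100):
    """Simplified 1-D k-means; returns the centers and the within-cluster sum of squares."""
    data = sorted(values)
    # Deterministic start (an assumption of this sketch): spread centers across the sorted data.
    centers = [float(data[(2 * i + 1) * len(data) // (2 * k)]) for i in range(k)]
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it in place if the cluster is empty.
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    spread = sum(min((v - c) ** 2 for c in centers) for v in data)
    return centers, spread

# Made-up data: emails opened per month by nine recipients.
monthly_opens = [1, 2, 3, 10, 11, 12, 20, 21, 22]

spread_by_k = {k: kmeans_1d(monthly_opens, k)[1] for k in range(1, 6)}
for k, spread in spread_by_k.items():
    print(k, round(spread, 1))
```

With this toy data the spread drops sharply up to three clusters and only slightly after that. The plain-language version for a stakeholder is "there are three clearly different kinds of recipients, and splitting them further doesn't tell us much more."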
The ideal candidate: character
To steal from Joel’s software development hiring guide, I look for someone who is:
- Smart, and
- Gets things done.
I take “smart” to mean “has the ability to learn new things.” I want to see some evidence that when they run into a situation where they don’t know what to do, they can learn and figure it out: they have a drive for problem resolution. That could mean learning a new language, a new modeling technique, or a new business process. Lots of people spend their whole careers doing only what they already know and avoiding having to learn a new process. These people tend to make poor data scientists, since the whole field is about using data to learn.
I think of “getting things done” as a general desire and capability to create a solution. Data science is filled with places a person can get stuck: there are countless ways to cut a data set, hundreds of different machine learning models each with different parameters to be adjusted, and plenty of ways to report out on results. Someone who gets things done is able to cut through the different options and select one that works, and then actually implement it.
So that’s the perfect data science candidate. They have skills in math and statistics so they know how to work with numbers and build models. They know databases and programming so they can take actual data and actually do something with it. They understand how businesses work well enough to find a problem, create a data science solution for it, and then convince non-data scientists that it’s a good one. They need to be smart and get things done. However, someone with all of these qualifications doesn’t just waltz into my office asking for a job opportunity with a salary I can afford. So what do I do given the reality that I can’t hire the perfect person? That will be covered in the next parts of this series.
If you want a ton of ways to help grow a career in data science, check out the book Emily Robinson and I wrote: Build a Career in Data Science. We walk you through getting the skills you need to be a data scientist, finding your first job, then rising to senior levels.