The 5 Characteristics of a Great Data Scientist

Exceptional data scientists share several key traits that are often overlooked. Focus on improving these to become truly great.

Elad Cohen
Riskified Tech
8 min read · May 3, 2021

--

Image by Riskified (used with permission)

As in every profession, there are good data scientists — those who meet expectations, deliver results, and push the business forward — and there are great data scientists. In this article, we’ll cover the most important attributes and characteristics that make these data scientists so great at their job. A deep understanding of the theory, strong engineering skills, and familiarity with up-to-date technologies are the cornerstones of a solid data scientist.

In my opinion, those skills are mandatory prerequisites, but they aren’t enough to make someone truly great. At the same time, the skills discussed below aren’t enough on their own if you don’t have the technical chops and knowledge to excel in the position — excellence requires both. While you can gain the technical skills through years of experience and dedicated learning, some of the skills listed below can be harder to master — especially if you aren’t aware of them and actively working to develop them.

1. Curious to a fault

The tell-tale sign of a good researcher is their curiosity. Just about any subject can interest you, and you get genuinely excited when exploring a new dataset for the first time — seeing the features, uncovering relationships, and making sense of the data. Without an external mechanism (sprint, boss, advisor) to limit the scope of the research, many researchers and data scientists would explore the data, investigate outliers, check where and why their models fail, and continue improving performance well past the point of diminishing returns. A good data scientist can greatly benefit from an outside party checking in and objectively questioning whether they’ve already hit that point (the 80/20 rule).

Great data scientists check their work often — running sanity checks or other tests that can tie what they’re working on with known properties. For example, if you’re exploring a unique population and see unusual metrics, you might want to run your script on the general population and compare against ground-truth values to weed out possible bugs. Similarly, when you’re checking for relationships between features, you should first ask yourself what relationship you expect to find. Should it be monotonically increasing? Parabolic with a maximum? If so, where would you expect that maximum? Only then check whether the relationships agree with your intuition and explore them in detail if they don’t. Being extremely curious gives you a fighting chance of uncovering Simpson’s paradox when it pops up.
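To make Simpson’s paradox concrete, here’s a minimal Python sketch. The numbers follow the classic kidney-stone treatment example and are purely illustrative: within every subgroup, treatment A has the higher success rate, yet the aggregated totals favor treatment B.

```python
# Hypothetical (successes, total) counts per subgroup and treatment,
# borrowed from the classic kidney-stone illustration of Simpson's paradox.
groups = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within every subgroup, treatment A wins...
for name, arms in groups.items():
    assert rate(*arms["A"]) > rate(*arms["B"]), name

# ...yet aggregated over subgroups, treatment B wins.
agg = {t: [sum(x) for x in zip(*(groups[g][t] for g in groups))]
       for t in ("A", "B")}
assert rate(*agg["B"]) > rate(*agg["A"])
print("A per-group:", {g: round(rate(*arms["A"]), 3) for g, arms in groups.items()})
print("Aggregated:  A =", round(rate(*agg["A"]), 3), " B =", round(rate(*agg["B"]), 3))
```

The reversal happens because the subgroup sizes are lopsided — exactly the kind of thing a quick “does this agree with my intuition?” check will surface.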

As a great data scientist, you’re curious and might quickly dive into a new dataset, but you’ll always keep in mind what you know and what you don’t. If it’s a new dataset you haven’t worked with before, you won’t make assumptions about the features based on their names, and you’ll take everything with a grain of salt. As the Zen of Python goes, “in the face of ambiguity, refuse the temptation to guess.” You’ll ask others around you questions, and if you do make assumptions about data — you’ll validate them.

Even though you broke into data science years ago, you’re still curious about new algorithms and technologies and continue to learn independently (case in point — you’re reading this blog right now). You enjoy the tremendous breakthroughs the field is currently experiencing.

Finally, you’re curious about the domain you operate in — you want to gain a stronger understanding of the relationships and effects and how they play out in the data. At Riskified, we offer a set of structured courses in our internal Fraud Academy and encourage data scientists to invest their time to study fraud patterns. This same curiosity is what enables a great data scientist to switch to a new domain and quickly dive in and learn.

2. Aware of the degrees of freedom

A great data scientist always remembers the tradeoff between bias and variance. You’ve committed the sin of overfitting enough times in the past to be acutely aware of the degrees of freedom being tested. You know how many observations are in your dataset and how imbalanced it is (if it’s a classification problem). You’re aware of your algorithm’s relative complexity — how many degrees of freedom are being tested and how that compares to the number of observations. If you’re testing multiple models or running hyperparameter tuning, you track how that impacts the degrees of freedom being used.

You have a strong intuition for linear algebra and know when you might be facing a problem. You’ve logged all the models you’ve tried so far — even the unsuccessful ones count against you when it comes to p-hacking. Even though you’re optimizing on the training set and there’s no leakage between the training and test sets, you know that comparing multiple models on the test set will end up overfitting to it.

In a past interview, I spoke with a candidate who built multiple training sets (each containing several months of data) and used the next month’s data as the test set. He was meticulous about avoiding data leakage between training and test. The data was extremely imbalanced, and the test set only held a few dozen positive cases (vs. hundreds of thousands of negative cases). When he compared several models, the winner was decided based on test-set performance. However, this could easily have been due to random chance, with one model happening to outperform on a handful of observations in a way that didn’t necessarily generalize to the following month. As a great data scientist, you’re very suspicious of point estimates, and you aim to provide confidence intervals. You always keep in mind the multiple hypotheses tested in the process when calculating these confidence intervals.
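A small simulation (all numbers hypothetical) makes the point: if 20 equally good models are scored on a test set with only 40 positives, the “winner” almost always looks better than any of them truly is, and a Wilson confidence interval shows just how wide the uncertainty around that winner remains.

```python
import math
import random

random.seed(7)

n_pos = 40          # only a few dozen positive cases in the test set
true_recall = 0.70  # every candidate model is, in truth, equally good
n_models = 20

# Score identical models on the same small test set and pick the "winner".
scores = [sum(random.random() < true_recall for _ in range(n_pos)) / n_pos
          for _ in range(n_models)]
best = max(scores)
assert best >= true_recall  # selection bias inflates the winner's score

# A 95% Wilson interval around the winner makes the uncertainty explicit
# (and it doesn't even account for the 20-way selection!).
z = 1.96
denom = 1 + z**2 / n_pos
center = (best + z**2 / (2 * n_pos)) / denom
half = z * math.sqrt(best * (1 - best) / n_pos + z**2 / (4 * n_pos**2)) / denom
print(f"winner: {best:.2f}, 95% CI: [{center - half:.2f}, {center + half:.2f}]")
```

With an interval this wide, “model 3 beat model 7 by two percentage points” is noise, not signal — which is exactly why the candidate’s comparison wasn’t trustworthy.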

3. Knows their environment — both technical and business

A good data scientist focuses on their research. A great data scientist knows the context of their work — both the business side and the technical (production implementation) constraints. You know what the most important KPIs for the business are and what your stakeholder really cares about. You can suggest directions that aren’t within the original scope but could provide added value to the business. Additionally, you understand how your output is going to be deployed in production and the constraints and limitations imposed on your model as a result. You won’t spend too much time on an overly complex solution that can’t be implemented (though you might test it anyway — if its value turns out to be dramatically higher, that could justify investing in a more capable production environment). You’ll start by developing a baseline/benchmark model before working on ‘the real thing.’ By understanding the business environment and knowing what’s needed, every once in a while you may be pleasantly surprised that the baseline model is good enough and you can move on to the next project quickly.
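A baseline can be as simple as a majority-class predictor. Here’s a minimal, stdlib-only sketch — the labels and counts are made up, loosely echoing an imbalanced fraud setting:

```python
from collections import Counter

# Majority-class baseline: always predict the most common training label.
def majority_baseline(train_labels):
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: majority  # ignores the features entirely

train = ["legit"] * 95 + ["fraud"] * 5   # hypothetical imbalanced labels
predict = majority_baseline(train)

test = ["legit"] * 93 + ["fraud"] * 7
accuracy = sum(predict(None) == y for y in test) / len(test)
print(f"baseline accuracy: {accuracy:.2f}")  # 0.93
assert accuracy == 0.93
```

That 0.93 accuracy costs nothing to achieve, which is exactly why it’s the bar any “real” model must clear — and why raw accuracy is a poor metric on imbalanced data.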

You care about making an impact — not just running high-quality research, but research that delivers value. In many cases, this isn’t under your control — the research might not pan out, or the requirements for the model might not be feasible in production. As a good data scientist, there will be many cases where you aren’t able to make an impact, even though you did a great job (cheer up — hopefully, you’ll get a break next time). A great data scientist can slightly improve the odds by limiting time spent on directions that can’t be implemented, and will cut their losses if a certain research direction isn’t showing enough progress.

A good data scientist knows all the best practices. A great data scientist knows that these best practices are guidelines rather than hard truths. You know when to take on more technical debt (keeping in mind how your team will pay it off later), and you have the integrity to stand up and prevent the business from taking expensive shortcuts. You’re flexible and can adapt to the circumstances, whether it’s a small startup or a large corporation, from an applied-research position to advancing the state of the art.

4. Yearns for feedback and is self-critical of their work

A good data scientist is open to feedback from others. A great data scientist actively looks for feedback. As in most professions, most of our development comes from practicing our skill and receiving feedback from more experienced people. Great professionals know that this is a natural way to progress and embrace it. You don’t get offended, nor do you become defensive — you are sincerely interested in hearing what others think. You can take an objective view of your own work and won’t treat it as your baby. If you don’t receive enough feedback or have your work reviewed carefully, you’ll proactively request this from your peers. You’ll present your work internally when it’s relatively complete — late enough that it has sufficient meat and the end-to-end process has been finished, but early enough that the feedback can still make a significant impact on your research. You understand that the more pairs of eyes that go over your research and code, the better the output will become. You won’t necessarily accept every idea or concern raised, but you’ll give each one careful, respectful consideration.

Gathering feedback from peers is vital to succeed | Picture by Riskified (used with permission)

Similarly, you are self-critical of your own work. You know that there’s a good chance you’ve made mistakes (everyone does). You could have made assumptions others won’t agree with, and there might be better methods or techniques that you didn’t try. You are acutely aware of the caveats behind the project and how they might impact the results. Being your own biggest critic helps you preempt criticism from others. Because you are very upfront with your assumptions and caveats, you can mitigate others’ concerns about overlooked issues. Time and time again you refrain from being overconfident, so when you do demonstrate confidence in your work, your colleagues and stakeholders trust the results.

5. Communicative and collaborative

A good data scientist works very well with other data scientists — they can explain their process, the model, the error function, and all the hyperparameters carefully, focusing on being accurate in their explanations. As a great data scientist, you know how to communicate just as effectively with non-technical peers — adapting your language and terminology to strike a better balance between explainability and accuracy. That means you’ll dial back the nuanced technical points and simplify the concepts so that stakeholders with various levels of machine learning knowledge understand your point. You know that constantly dropping complicated-sounding concepts and terms (or heaven forbid — abbreviations) isn’t so much a sign of knowledge as a sign that the speaker isn’t able to distill the core concept and explain it in simpler terms.

A great data scientist knows how to collaborate with other data scientists just as well as with developers, analysts, engineers, and every other relevant party. You recognize the tradeoff between each person working on what they know best (highest efficiency) and going the extra mile to ‘do what it takes’ (fastest time to market). You collaborate effectively and will adapt your solution when it can help facilitate a quicker implementation. Being highly collaborative also means you’re open to others and personable. You treat others (especially less experienced colleagues) with the utmost respect and regard. You’re genuinely fun to work with, and colleagues flock to work with and learn from you.

You pass it on and you’re an asset to the organization. Come work with me.

I’d love to hear your thoughts — are there other critical characteristics of a great data scientist I didn’t touch on?
