De-waffling Data Science 2: What does a waffle-free Data Scientist look like?

Ben Houghton
Data & Waffles
Jan 31, 2020 · 6 min read

In my first Medium post last year, I argued that Data Science can be given a unified definition: it involves using the ‘scientific method’ to analyse data and generate insight. A natural question I have been asked frequently since posting that article is ‘what does a data scientist look like in this context?’ My answer is simple (and lazy): a data scientist is someone who does Data Science as defined above. But this doesn’t help when writing job profiles, interviewing candidates, or separating your data scientists from other data practitioners. So, in this article, I will discuss some of the core skillsets a data scientist needs.

Before discussing individual skills, it’s worth noting that the skillset a data scientist needs will depend heavily on the organisation; in particular, a data scientist in an established data science team will have a much more prescriptive skillset than one in a fledgling team. In the former case, bringing in data scientists who can use the tools already available and in use within the team is essential if they are to deliver from day one. When I hire for my team, the priority is always finding data scientists who can use our tool stack (Python, Spark, AWS), with bonus points for candidates who can push us in different directions. For a data scientist in a newly formed team, the priority may instead be setting the tooling and use-case direction, which allows much more flexibility in the initial skillset.

Machine Learning

It’s worth tackling machine learning first, as it often takes pride of place in data scientist job profiles. Although the definition of data science doesn’t explicitly mention machine learning, the chances are it will be an essential skill.

In the world of Machine Learning, the scientific method distills to building features based on the hypotheses you generate and using machine learning as a means of testing them. In fact, Machine Learning is one of the most efficient and effective ways of employing the scientific method when you have hugely complex data, so machine learning must be a core skill in a Data Scientist’s profile. I talk in detail about the importance of the scientific method in the world of machine learning in a separate article.
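
As a rough sketch of this feature-as-hypothesis workflow (the dataset, feature names and model below are entirely made up for illustration, not a prescribed approach), you can compare a model’s out-of-sample performance with and without the hypothesis-driven feature:

```python
# Illustrative sketch: using an ML model as a way of testing a feature-level hypothesis.
# The data, feature and model here are placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "account_age_days": rng.integers(1, 3_000, n),
    "monthly_spend": rng.gamma(2.0, 50.0, n),
})

# Hypothesis: customers who spend a lot relative to account age behave differently.
df["spend_per_day"] = df["monthly_spend"] / df["account_age_days"]

# Synthetic target that (noisily) depends on the hypothesised feature.
y = (df["spend_per_day"] + rng.normal(0, 0.1, n) > df["spend_per_day"].median()).astype(int)

baseline_cols = ["account_age_days", "monthly_spend"]
model = GradientBoostingClassifier(random_state=0)

baseline = cross_val_score(model, df[baseline_cols], y, cv=5).mean()
with_feature = cross_val_score(model, df[baseline_cols + ["spend_per_day"]], y, cv=5).mean()

print(f"Baseline CV accuracy:          {baseline:.3f}")
print(f"With hypothesis-driven feature: {with_feature:.3f}")
# A consistent out-of-sample improvement is evidence in favour of the hypothesis
# that motivated the feature; no improvement is evidence against it.
```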

A common Data Science interview might involve reciting lengthy explanations of how particular algorithms work. This has its place, for sure: optimising a machine learning algorithm and understanding your results is significantly easier when you understand what is happening under the hood. I would argue, however, that when most Data Scientists are optimising their models in practice, they will generally have Google to hand. In my view, checking that a Data Scientist understands the more general mathematics behind Machine Learning (e.g. the basics of linear algebra, calculus and probability theory) is more important than delving into specific algorithms, provided you are confident that the Data Scientist can research a particular method at the appropriate depth when needed.

So machine learning will indeed be a necessary tool, but there are many other equally (or even more) important skills which a Data Scientist may require…

Statistics

An understanding of statistics is, in my view, the core skillset of a data scientist, and indeed anyone whose job it is to glean insight from data. The ability to rigorously model your dataset and fully understand its structure can be the most valuable piece of artillery in Data Science projects.

Not every data scientist needs a full theoretical underpinning of every element of statistics (although this certainly helps), but a solid knowledge of the fundamentals is hugely important. For example, here are some areas where a fundamental statistical skillset matters:

  1. Hypothesis testing: implied to be a core skill by the definition I gave earlier, and most naturally a statistical exercise.
  2. The exploratory data analysis process: often the most important and longest part of a Data Science project. Knowing how to do this rigorously, with carefully applied assumptions, adds a lot of value.
  3. The feature engineering / data curation process: deeply understanding the structure of your data allows you to build a concise, meaningful feature set free of undesirable properties such as serial correlation and inconsistent data.
  4. Analysing regression models, or any other machine learning models, and understanding whether they have been successful. For example, working out whether your ML accuracy could in fact be high due to chance (see the sketch after this list).
  5. Statistical skill will also help you build simpler models in situations which look like they require huge complexity. Time series modelling is a great example: the world of statistics provides a variety of methods for modelling complex time series, some of which have proved more accurate than ML approaches in some experiments.
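
For point 4 in particular, one simple statistical check is a permutation test: refit the model on shuffled labels and see how often chance alone matches your observed accuracy. A minimal sketch using scikit-learn, with purely synthetic data:

```python
# Sketch of a permutation test: could the model's accuracy be high by chance?
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Synthetic, illustrative dataset.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3, random_state=0)

score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1_000), X, y,
    cv=5, n_permutations=200, random_state=0,
)

print(f"Observed CV accuracy:          {score:.3f}")
print(f"Mean accuracy on shuffled labels: {perm_scores.mean():.3f}")
print(f"p-value: {p_value:.3f}")
# A small p-value suggests the observed accuracy is unlikely to be a fluke
# of the label distribution; a large one suggests the model may be no better than chance.
```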

Programming

Programming is often a core element of a Data Scientist’s resume, and there is good reason for this: you cannot build machine learning algorithms, or even run statistical models, without the ability to manipulate data confidently, and this can only be done through programming. This poses two questions: what level of programming expertise is required, and in which programming languages?

The question of programming languages is quite straightforward, as most established data science teams will have a whitelist of languages they use, likely to include Python or R. Incoming data scientists need to understand these in order to collaborate.

It is tougher to say what level of programming expertise is required. At its core, data science requires sufficient skill to manipulate data and generate hypothesis-driven insights. However, this is no small ask: the menu of techniques which data scientists may need to employ ranges from standard SQL-style manipulation to hard-core algorithm development, which often comes with speed and memory usage requirements that the programmer needs to take into account. In particular, when working with “Big Data” or “unstructured data” (text, images, audio etc.), simply running a few SQL queries against a database won’t hit the mark.
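
As one small illustration of the memory point, a file that does not fit in memory already forces a different style of code than a one-line read. A hedged sketch, assuming a hypothetical large CSV at data/transactions.csv with an amount column:

```python
# Illustrative only: aggregating a CSV too large to load in one go.
# The file path and column name are hypothetical.
import pandas as pd

total, count = 0.0, 0
# Reading in chunks keeps memory usage bounded regardless of file size.
for chunk in pd.read_csv("data/transactions.csv", chunksize=100_000):
    total += chunk["amount"].sum()
    count += len(chunk)

print(f"Mean transaction amount: {total / count:.2f}")
```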

It is sometimes argued that data engineers should do this style of work; however, data scientists will ultimately be the ones using the data for model development, so they should certainly take a lead role in it.

In addition, there are situations where a data scientist may need to do non-data-science work, either because they understand the model best or because there is no-one else with a better-suited skillset. These include tasks such as dev-ops, model deployment, platform engineering and even software development and integration. This is particularly the case in small organisations, where a data scientist might need a broader computer science understanding.

Platforms

This is increasingly asked for in Data Science job specs, and rightly so: if a Data Scientist is going to do most of their work on a Data Science platform (be that a cloud-based tool or on-premise), they should know how to use it. These platforms often come with their own limitations and quirks, which can be surprisingly difficult to get used to.

I would argue, though, that a Data Scientist who is adept in the programming and statistics concepts listed above will be able to adapt to a new working environment relatively quickly, so looking for this in a job interview may be less worthwhile.

Business Domain Knowledge

Data Scientists can easily be perceived to be working in silos, separated from the rest of the organisation they work in. This can be very harmful when running projects, particularly when the scientific method is a central part. Hypotheses are often best created by, or in close collaboration with, the business stakeholders and the users of the eventual end product. A data scientist with strong domain knowledge can play an active part in this hypothesis generation process and will be able to tailor the solution to the problem at hand.

In summary, the range of skills required to do even the simple-sounding task of hypothesis-driven model development is huge, so it is certainly justifiable that Data Scientists are often labelled all-purpose unicorns. However, a data scientist doesn’t need to be (and obviously can’t be) omniscient: when hiring your all-purpose unicorn, some prioritisation of skillsets will need to take place, and this can be done by relating them to your team’s purpose, size and ambitions.

Ben Houghton is Principal Data Scientist for Analytical Innovation at Quantexa.