What makes a great data scientist?

David Kell
Gyana Limited
May 19, 2017

The “data scientist” is ranked as the best job in the world. And we know why. Almost every large company will need a data strategy to survive the next ten years. They are competing for an elite pool of talent.

How did it come to this? And is there a solution in sight?

How scientists decided the fate of businesses.

Data scientists are scientists who do science with a company’s data. And just like all scientists, they are a combination of two parts.

The first part of the scientist is intuitive. Almost all scientific discoveries are driven by soft qualities like discipline, creativity, perseverance, vision, and gut instinct. Most scientists are driven by hunches. This is the skill businesses already have. They have domain experts on every aspect of the business, with killer gut instincts. They are called strategists, operations leads, marketing heads.

The second part is technical. You prove the hunch with the data.

A company may have this to an extent, but scientists (typically) go much deeper. They have the quantitative skills to analyse large data-sets and find the patterns in that data. We are talking the human genome, the LHC at CERN or the brain. These skills were born in academia. In a typical company, nobody has those skills.

That’s why businesses need to hire a data scientist.

And they cannot just hire someone who has the second part, like a statistician. Unless they understand the mission of the business, and they understand what really matters to the success of the business, and they have the right intuitions about what to look for, they will not make any headway. Like in “real” science, data science is driven by intuition, or the first part. That’s more important than the second part. Businesses survived long before data scientists came around. Even without “big data” methods, they had incredible insight into their consumers and their market.

That’s why a data scientist is such an elite profession. They need to have the intuition and domain expertise (first part), but also the technical skills to analyse data (second part). Left brain and right brain. And good communication skills wouldn’t go amiss. In truth, this is very demanding and it is rare to find a data scientist who scores very high on all of these. It’s typically a trade off.

This leaves businesses in a strange place. They have very talented people with intuitions. But those people cannot access the data directly to see if their intuitions are true. They need to explain them to the data scientist, who translates them into technical analysis, translates the results back into natural language, and then explains them to the original person.

That’s quite wasteful. It does make you wonder…

What is so special about these technical skills?

Data scientists are elite talent because they bring the technical skills to business intelligence. Once you understand that, you really want to question what is so important about these skills (and whether we can get around them).

Broadly speaking, a good data scientist needs two technical skills.

  1. “data munging”: gathering all the data from your organisation into one structured source.
  2. “data science”: finding statistical insights in the data, and potentially building predictive algorithms with techniques like machine learning.

Which do you think is easier to automate?

Surprisingly, not the data munging part: it is the data science that is easier to automate. For many standard data science problems, we already have systems that reliably beat data scientists, like auto-sklearn. Out-of-the-box solutions for recommender systems and time series analysis exist and are sold commercially.

The real struggle for many organisations is getting the data in a form that allows data science to happen. It is well known that data scientists typically spend more than 50% of their time on this “data munging” process. They complain about it (a lot), and the perception is that it’s a waste of a business’s precious resource.

I disagree: it’s probably the most valuable thing they do. There is a reason we cannot automate this process away. It’s not just because there are too many data formats, too many errors and data quality issues. It is because there is no information on what the data actually means. You need to have domain knowledge (i.e. what the data represents) to know whether data has quality issues, and to know how to integrate it into your system, and to work out how to combine multiple data sets. You need to be a human being.

Data munging is data structuring. Structure describes how your data is connected together. Data with structure has meaning. In technical terms, this meaning is called “ontology”. It connects the meaningless numbers with real things in your organisation: your customers, your KPIs, your SKUs. There’s a strong parallel with what happens in software — any good software engineer will tell you that great software is built from great data structures, and when you get the data structure of your system just right, the program practically writes itself.
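
To make that concrete, here is a minimal sketch in Python of what structuring involves. The two source formats, the field names, and the rule that the till export stores amounts in pence are all invented for illustration. The point is that each mapping encodes a piece of domain knowledge that cannot be read off the numbers alone.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sale:
    """One structured record: the shared meaning behind both sources."""
    customer_id: str
    sku: str
    amount_gbp: float
    sold_on: date


def from_till_export(row: dict) -> Sale:
    # Hypothetical till export: amounts in pence, dates as DD/MM/YYYY.
    day, month, year = (int(part) for part in row["date"].split("/"))
    return Sale(
        customer_id=row["cust"].strip().upper(),
        sku=row["item"],
        amount_gbp=int(row["pence"]) / 100,
        sold_on=date(year, month, day),
    )


def from_web_orders(row: dict) -> Sale:
    # Hypothetical web shop export: amounts in pounds, ISO dates.
    return Sale(
        customer_id=row["customer_id"].strip().upper(),
        sku=row["sku"],
        amount_gbp=float(row["total"]),
        sold_on=date.fromisoformat(row["ordered_at"]),
    )


# Made-up rows from the two sources, describing the same customer.
till_rows = [{"cust": " c001", "item": "SKU-7", "pence": "1250", "date": "03/05/2017"}]
web_rows = [{"customer_id": "C001", "sku": "SKU-9", "total": "8.50", "ordered_at": "2017-05-04"}]

sales = [from_till_export(r) for r in till_rows] + [from_web_orders(r) for r in web_rows]
```

Once both sources land in the same `Sale` structure, the two purchases can be recognised as belonging to one customer, which neither raw export could tell you on its own.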

There is a new job for data structuring. It is called “data engineer”. They are similar to software engineers, but their skills specifically focus on the big data technologies (Hadoop, Spark) and data structures to efficiently combine them.

For legacy enterprise systems with big data (think telecom, credit cards), data engineers can build data processing pipelines to structure the data in a format ready for data science and machine learning.

Modern tech infrastructures are designed from the ground up to support this kind of analysis. They can even leverage out of the box solutions where the data is structured by default, and can plug and play automatically into external systems.

Life beyond the data scientist.

The data scientist is the person who can combine three quite difficult skills:

  1. Intuition driven understanding of the business
  2. Data analysis: using techniques to find patterns in data
  3. Data structuring: turning the scattered enterprise data into a data science ready resource

This is a challenging set of skills, and it’s little wonder that demand for it far outstrips supply. Moreover, it seems odd to combine three very different skills in one role. A business would never combine its sales head, tech lead and strategy head in one role. So why these three?

I don’t think this will last. As the role of data science in organisations becomes clear, the data scientist role will fragment into these three clear functions. Each of these will either be done by a dedicated professional, or it will be automated. It depends on the specifics of the company (the more unique your business, the less likely you can automate it — and none of this really applies to tech companies). This will alleviate the data scientist shortage problems.

There are many market verticals with thousands of companies with similar data needs. Take bricks and mortar retail. You have your POS data. Each sale happens in a store, and at a time. Each sale references a customer ID (e.g. a credit card), and that will link sales together. Each sale includes the IDs of products, and these can be linked to categories and sub-categories. These fundamentals don’t change between the hundreds of thousands of small and large retail chains across the globe. Why should every one of these chains and stores hire data scientists to do the same job (structuring their POS and sales data)? Especially when they all use a shared set of POS systems and applications, which come from barely 100–1000 different providers.

You could imagine a plug and play system for all these providers, to structure the data. (That’s skill 3).
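
The shared structure such a system would produce is small. Here is a rough sketch in Python; the class names, field names and sample records are assumptions made up for this example, not taken from any real POS system.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Product:
    product_id: str
    category: str
    sub_category: str


@dataclass(frozen=True)
class Sale:
    store_id: str        # each sale happens in a store...
    sold_at: datetime    # ...and at a time
    customer_id: str     # e.g. a token standing in for a credit card
    product_ids: tuple   # IDs of the products in the basket


def link_by_customer(sales):
    """Group sales by customer ID, which is what links them together."""
    linked = defaultdict(list)
    for sale in sales:
        linked[sale.customer_id].append(sale)
    return dict(linked)


# Made-up example data.
catalogue = {
    "P1": Product("P1", "Dairy", "Milk"),
    "P2": Product("P2", "Household", "Cleaning"),
}
sales = [
    Sale("store-1", datetime(2017, 5, 1, 9, 30), "card-123", ("P1",)),
    Sale("store-2", datetime(2017, 5, 3, 17, 5), "card-123", ("P1", "P2")),
    Sale("store-1", datetime(2017, 5, 3, 12, 0), "card-456", ("P2",)),
]

history = link_by_customer(sales)
```

Nothing in this structure is specific to one retailer, which is why it is plausible to fill it in automatically from a known POS provider's export.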

And you could imagine that the analyses also come out of the box (since they won’t differ between retail businesses). (Skill 2).
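
As a sketch of what “out of the box” could mean here: once sales sit in a shared structure, a standard analysis such as revenue per category is a few lines that need not change between retailers. The record layout and the figures below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical structured sale lines: (product_id, quantity, unit_price)
sale_lines = [
    ("P1", 2, 3.00),
    ("P2", 1, 10.00),
    ("P3", 4, 1.50),
]

# Hypothetical product catalogue: product_id -> category
category_of = {"P1": "Dairy", "P2": "Household", "P3": "Dairy"}


def revenue_by_category(lines, category_of):
    """Sum quantity * unit_price for each product category."""
    totals = defaultdict(float)
    for product_id, quantity, unit_price in lines:
        totals[category_of[product_id]] += quantity * unit_price
    return dict(totals)


report = revenue_by_category(sale_lines, category_of)
```

The function knows nothing about the particular shop. Deciding what the resulting numbers mean for the business is the part that still needs a person.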

And then hand this over to the real domain expert. Your customer insights analyst, your business consultants, your decision makers. (Skill 1)

The data scientist role will survive (though many will become data engineers, as the need for that role emerges). But in business cases where the data analysis is standard (the interpretation is the creative part), tools will emerge that allow existing non-technical employees to keep working with their excellent domain knowledge while leveraging the power of their data.

I once heard that people in the 1980s thought that we would need to train millions of software engineers to meet business demand.

What happened? We wrote software that abstracted the user away from writing the software themselves. A prime example is Microsoft Excel, but there are others.

The same thing is happening now.

David Kell
Gyana Limited

Building the future at @gyaanaa. If we connected all the data in the world, we’d know how things really happened.