Data Scientists. Post 1 of 3.

What is one?

A new breed

Data scientists didn’t exist when I was at university. Statisticians, mathematicians, information scientists, computer scientists… they all did. But not data scientists. William S Cleveland (yes, the same chap who proved that pie charts are rubbish) is largely credited with first using the term to describe an independent discipline, in a 2001 paper titled ‘Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics’. However, the idea of data science as a career didn’t really take off until 2011 (at least according to Google Trends and Indeed job trends). When I finished my undergraduate degree in 2005, it certainly wasn’t top of my list of potential careers. (I say list; it was more of a receipt from Topshop with “earn some money to pay off crippling student debt” written on the back of it.)

So, what is a data scientist? Do you need one? Where do you get one and what do you do with one?

It’s a matter of definition…

There are probably as many definitions of what a Data Scientist is as there are Data Scientists. Some think that the discipline is intrinsically linked to big data; others position it as primarily concerned with algorithms; and some see it as a new label for the old practice of statistical data analysis.

For me, the data scientist is the modern-day Renaissance woman (or man): an individual with a diverse and rounded skill set that includes mathematics and statistics; computer programming; data munging; and business acumen. The data scientist is the Swiss army knife of team members, who can combine their broad range of technical skills with their understanding of the business in a creative and innovative way to solve real business problems. Their backgrounds are diverse and include physical sciences, mathematics, statistics, economics and computer science, amongst others. But what they all have in common is a genuine passion for extracting insight and meaning from data; a thirst for new and innovative technology; a talent for quickly picking up new tools and techniques; and a penchant for problem solving.

Isn’t that just a data analyst?

How does this differ from a data analyst? I believe it’s a spectrum that is defined by structure. At one end of the spectrum, data analysts are more familiar and comfortable with structured data, structured problems and a structured approach. At the other end of the spectrum, data scientists are able to work with a wider variety of data, including unstructured sources; are more comfortable with poorly structured problems; and are able to tackle them with novel approaches and techniques. Of course, being a spectrum, there are no clearly defined boundaries between the two disciplines, but rather overlaps, grey areas and ambiguities. (Such is life.)

Office politics

Consider a typical day at the office. For a data analyst this might involve transforming some data in a relational database using SQL to create a set of new tables; joining and interrogating these new datasets to answer specific questions; and then building a BI application to surface the results of this analysis. A data scientist, on the other hand, is more likely to spend his or her day writing some Python code to scrape unstructured web data; developing a bespoke sentiment analysis algorithm in R to assess how positive or negative elements of the text are; and then feeding this sentiment score into a machine learning model.
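
To make the sentiment step a little more concrete, here is a minimal sketch of a lexicon-based scorer. It is written in Python rather than R purely for illustration, and it is nowhere near a bespoke production algorithm. The word lists, the sample reviews and the function name sentiment_score are all made up for this example.

```python
import re

# Hypothetical word lists, for illustration only; a real lexicon would be far larger.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "slow"}


def sentiment_score(text: str) -> float:
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    matched = pos + neg
    # Text with no lexicon words at all is treated as neutral.
    return 0.0 if matched == 0 else (pos - neg) / matched


# Made-up snippets standing in for scraped web text.
reviews = [
    "Great product, I love it",
    "Terrible support and slow delivery",
    "It arrived on Tuesday",
]

# Each score could then become one feature column in a downstream model.
scores = [sentiment_score(review) for review in reviews]
```

A score near 1 means mostly positive words, a score near -1 mostly negative, and 0 means either balanced or no matches. Counting words from fixed lists ignores negation and sarcasm, which is exactly the sort of gap a bespoke algorithm would need to close.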

Sack the data analysts!

So does this mean that we no longer need data analysts? Can we sack them all and replace them with data scientists? Absolutely not. Despite the statistics about the volume of and value in unstructured data, the fact remains that for the vast majority of organisations, core business data resides in relational databases. Understanding, managing, manipulating and extracting value from this data is key to helping the business operate effectively and make high-value decisions. In addition, many of the problems that organisations need to address on a daily basis are structured and well defined, and call for a data analyst to apply and refine a known approach rather than a data scientist to spend time creating a new one.

Beware the data divas

Data science as a career path has been much hyped over the last few years. In 2012 the Harvard Business Review touted data science as the sexiest job of the twenty-first century, and many have expressed concern over the apparent shortage of data science skills, including the Guardian, which claimed that data scientists are as rare as unicorns. In addition, much has been made of the high salaries that data scientists can command.

Somewhat predictably, when you describe a job as really cool, well paid and in demand, the result is that a large number of people who don’t really have the requisite skills or experience to brand themselves as data scientists suddenly start changing their job titles from ‘data analyst’ or ‘statistician’ to (you guessed it) data scientist. Along with the change in job title comes the expectation of earning the salary and acclaim that the hype has led them to believe they deserve. I have seen a growing number of people identifying themselves as data scientists because they want to be one, not because they actually are one. This takes one of two forms: solid data analysts misrepresenting themselves as data scientists (possibly because they wrongly believe that data analysis is no longer valued); or ropey data analysts who lack core skills in data management and modelling learning some very basic Python and believing that they are the next Jeff Hammerbacher.

Having discussed what a data scientist is, the second post will explore whether businesses actually need one. Thanks for reading!