Where do Data Scientists Come From?
Our previous article in this series on Data Science Titles made the case that there’s no such thing as a data scientist — instead, the phrase “data scientist” has come to represent a number of distinct roles. So in addition to their different skills and job duties, we’d like to know who data scientists are and what backgrounds they come from.
In this article we dig into the resume data of practicing data scientists, and discover that data scientists come from a wide variety of fields of study, levels of education, and prior jobs. We also explore what this data can tell us about the similarities and differences in the roles of data scientists, analysts, engineers, and software and machine learning engineers.
Who are data scientists?
If you ask every data scientist around you what they did before data science, each is likely to give you a different answer. Many come from master’s and PhD programs, in fields ranging from astrophysics to zoology. Others come from the many new data science graduate programs that universities now offer. Still others come from other technology roles, such as software engineering or data analysis.
At Indeed, we help people get jobs. One way we do this is by letting jobseekers submit resumes so employers can find a perfect match. There are tens of thousands of resumes in our dataset from current and former data scientists. We can use this resume data to gain some insight into where data scientists come from.
Does Educational Background Matter?
Highest degree achieved
First, we looked at the highest degree achieved by those who hold the title of “data scientist” or a related title¹.
We’ve chosen the job titles of data engineer, data analyst, software engineer, machine learning engineer, and data scientist², as these reflect some of the distinct roles we found in our previous articles.
Data scientists have the highest average education level of any of the job titles we examined.
- Data scientists have more PhDs than any of the other job titles. However, a PhD is not required for becoming a data scientist; only 20% of data scientists have them.
- Advanced degrees (MA or PhD) are held by 75% of data scientists.
- Fewer than 5% of data scientists have only a high school diploma or associate’s degree.
Machine Learning, Data, and Software Engineers
Software and data engineers have more bachelor’s degrees than advanced degrees, while machine learning engineers are more likely to hold advanced degrees.
- Machine learning engineers have a similar distribution of education levels to data scientists, but are about 30% less likely to hold a PhD. These results seem roughly in line with a similar study by Stitch Data.
- Engineering-focused roles tend to favor bachelor’s degrees, with some master’s degrees but very few (<5%) PhDs.
- One in four data engineers has a high school diploma or associate’s degree as their highest level of education.
Data analysts have a very different distribution of degrees than data scientists, and more closely resemble software engineers in their levels of academic achievement³.
- Data scientists have PhDs at almost 10 times the rate of data analysts, and are twice as likely to hold a graduate degree.
- As we’ll see later, this may be due in part to an emerging pattern of software engineers transitioning into data analysis.
- This could also mean that employers are treating PhDs as relevant work experience and may see data scientists as holding more senior roles. Or perhaps the training one receives in a master’s or PhD program uniquely prepares individuals for research-oriented data science work.
Field of study
Looking at the distribution of fields of study between job titles reveals some intriguing results.
The “data scientist” job title exhibits the most diversity in field of study of any of the titles we looked at, and no one field seems to dominate. We can quantify this diversity by calculating the Gini impurity of the field-of-study distribution for each job title.
Gini Impurity (Larger means more diverse fields of study)
- Data Scientist — 85%
- Machine Learning Engineer — 73%
- Software Engineer — 53%
- Data Analyst — 78%
- Data Engineer — 79%
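As a rough illustration of the metric above: Gini impurity is one minus the sum of the squared shares of each category, so it is 0 when everyone shares one field of study and approaches 1 as the fields spread out evenly. The sketch below is not the pipeline used for this article; the function name and the sample fields of study are made up for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the share of each distinct label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical fields of study for two job titles (not real resume data).
concentrated = ["computer science"] * 8 + ["mathematics"] * 2
spread = ["computer science", "mathematics", "physics", "economics", "biology"] * 2

print(f"concentrated: {gini_impurity(concentrated):.2f}")
print(f"spread:       {gini_impurity(spread):.2f}")
```

The title whose fields are spread evenly scores higher than the one dominated by a single field, which is the sense in which a larger value means more diverse fields of study.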
Data scientists clearly have the most diverse fields of study among the job titles we’ve looked at, while software engineers have the least diverse educational backgrounds. While the social sciences are somewhat under-represented in the data science population, they still make up about 5% of data scientists. Data science majors make up a slightly larger portion of data scientists (9%), which is somewhat surprising given how new most university data science programs are.
Machine Learning Engineers
Our data also shows a pronounced distinction between data scientists and machine learning engineers. Over 60% of machine learning engineers come from a computer science or engineering background, and are almost twice as likely to be from these backgrounds as someone holding the title of “data scientist.” There were effectively no social scientists with the title of “machine learning engineer” in our sample.
Software engineers are — unsurprisingly — even more heavily focused on computer science and engineering majors. It’s been proposed that machine learning engineers are a merger between software engineers and data scientists. Our data appears to support this assertion.
Like data scientists, data analysts seem to come from a diverse educational background. They differ from data scientists in that they are more often business, economics, and social science majors, and less often have mathematics, statistics, and natural science degrees. It’s also interesting to note that those with data science degrees represent more of the data scientist population than the analyst population.
Data engineers show a field of study distribution that is somewhere between data scientists and machine learning engineers. However, as noted above, many data engineers don’t have any degree beyond a high school diploma!
Which jobs do data scientists hold prior to data science?
Unsurprisingly, many individuals (approximately 25% of our sample) held the same title in their previous role as in their current one.
This is especially true of software engineers, who are very likely (71%) to have held a software engineering role previously. This is probably due to the relative maturity of the field of software engineering as opposed to data science, which didn’t even have its own title until fairly recently.
“Academic” here means actually being employed by a university, or as a researcher in an academic environment. Graduate students in particular are likely to have held such positions, and we see that the most graduate-degree heavy fields (data science, machine learning engineer, data analyst) have the most transitions from academia.
Perhaps a more interesting question is: what was the last different job title that data scientists held?
Here we see some interesting patterns: data scientists, machine learning engineers, and software engineers are more likely to start straight out of academia. Many of the “other” previous jobs are unrelated, such as catering, tutoring, store clerks, and other positions people can often hold while completing their degrees.
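One way to picture the “last different job title” lookup is to walk backwards through each person’s job history until a title differs from the current one, then tally the resulting (previous, current) pairs. The sketch below is only an assumption about how such a lookup could work, not the method used for this article; the function name, the sample histories, and the premise that titles are already bucketed into categories are all hypothetical.

```python
from collections import Counter


def last_different_title(history):
    """Return the most recent title that differs from the current one.

    `history` is assumed to be ordered oldest to newest, with the current
    role last. Returns None if no different earlier title exists.
    """
    if not history:
        return None
    current = history[-1]
    for title in reversed(history[:-1]):
        if title != current:
            return title
    return None


# Hypothetical, already-bucketed job histories (not real resume data).
resumes = [
    ["academic", "data scientist", "data scientist"],
    ["software engineer", "data analyst", "data scientist"],
    ["software engineer", "software engineer"],
]

# Count transitions as (previous title, current title) pairs.
transitions = Counter()
for history in resumes:
    previous = last_different_title(history)
    if previous is not None:
        transitions[(previous, history[-1])] += 1

for (source, target), count in transitions.items():
    print(f"{source} -> {target}: {count}")
```

Counts of this shape are also the kind of input a chord diagram of role-to-role transitions would need.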
Many roles transition into data scientists or machine learning engineers, but rarely do we see data scientists and machine learning engineers transitioning into any of the other roles. This is likely due in part to the relative sizes of the fields, the infancy of the “data scientist” and “machine learning engineer” titles, and the recent growth in popularity of those titles. However, I believe we are also observing an interesting phenomenon that speaks to how individuals are moving between and progressing⁴ through each role.
This chord diagram illustrates the main transitions we see between these roles. The color of the chord indicates which role people are transitioning from.
Software engineers make up a big slice of the pie. Many transition to analyst roles, while others hop straight to data science.
Data science is equally fed by academia, analysts, and software engineers. Software engineers are far more likely to hop into a data analyst role, although this is in part due to the larger number of analyst roles than data scientist roles.
Again, we see few individuals leaving data science at this moment. It’s unclear if this pattern will change in the future. The key takeaway here is that the data science field is fed by a wide variety of backgrounds, and it is relatively common to see software engineers become data analysts, and data analysts become data scientists. This may represent a viable path for anyone looking to transition out of a software engineering role.
Transitions into data engineering come almost exclusively from software engineering⁵.
Where do data scientists come from? Everywhere! Although the field is predominantly populated by individuals with MAs and PhDs, there are still plenty of individuals with bachelor’s degrees (26%) in the role. No field of study seems to dominate data science at this time; instead, we see a great diversity in backgrounds for data scientists, especially compared to fields like software engineering. In addition, we see a large number of individuals moving from other tech roles, such as software engineering and data analytics, into data science.
While machine learning engineers reflect data scientists in their levels of academic achievement, they seem to be more heavily focused in engineering backgrounds, and are more likely to have transitioned from a software engineer role. Data engineers also have more of an engineering focus, but tend to have lower levels of degree achievement when compared to the other roles in this study.
What does this mean for data science job seekers?
Graduate school is still the dominant way data scientists get into the field. Data science degrees have a growing presence, and now appear to be a somewhat common way to get entry into the field. Any field of study seems viable if one has obtained an advanced degree. If you’re in a graduate program now, there’s almost certainly someone in your field of study working in data science. I suggest you reach out to them and find out how they made the leap!
Software engineers and data analysts seem to transition into data science roles quite regularly, and represent substantial portions of new data scientists. Future jobseekers should consider these routes as well.
What does this mean for employers looking for data scientists?
If you’re looking for a generalist data scientist, don’t throw out a resume just because the field or degree isn’t what you expect. Data scientists are diverse in their education and background. Although most have an advanced degree in some field, there is no one field that dominates the job market.
If you’re having difficulty hiring experienced data scientists or scientists out of academia, consider bringing in individuals from software engineering or data analyst roles, as that is clearly a common pathway to data science.
Also, as we’ll discuss in a later article, make sure you know the role you’re actually hiring for. Do you think you need a data scientist, but feel your role is heavier on engineering? Consider introducing a “machine learning engineer” role. Do you think you need a data scientist, but with more focus on a business background? Consider hiring an analyst. Do you need someone with a focus on database and infrastructure skills? Consider a data engineer, and don’t focus as much on their educational background.
Finally, if you think you do need some sort of generalist data scientists for your team, consider looking for a variety of educational backgrounds. At Indeed, the members of our data science and product science teams span a wide range of fields, including astronomy, sociology, biology, mathematics, economics, and business. Having a diverse data science team — both in demographics and in field of study — is essential for doing great work⁶ ⁷.
¹Note that there is almost certainly a bias here, in that we’re looking at the resumes of job seekers that have already added “data scientist” to their resume. This means we’re going to be looking at individuals who have likely already been in the field for several years, and may not be entirely representative of more recent trends.
²For each job title, we’ve bucketed related job titles as well, e.g. “Senior Data Scientist” will be in the Data Scientist category, and “C++ Programmer” will be in the Software Engineer category.
⁴To be absolutely clear, I do not mean to imply a hierarchy of roles. Many software engineering roles, for example, are far more senior than many data scientist roles. I am simply referring to the directional pattern that seems to be emerging.
⁵Stitch Data did a nice breakdown of data engineering roles, and also noted the major overlap with software engineering.
⁶See also https://press.princeton.edu/titles/8757.html, https://www.mckinsey.com/business-functions/organization/our-insights/why-diversity-matters, http://www.chabris.com/Woolley2010a.pdf for more information on the importance of diversity in the workplace.
⁷It is not my intention to conflate “diversity in field of study” with broader diversity topics. I strongly believe diversity in all dimensions is essential for doing great work and creating a better society, and it will take far more than focusing on degree of study to overcome the overwhelming lack of diversity in tech workers in the US right now. This article from Stitch argues that Data Science does not appear to be doing any better than engineering roles in many aspects of diversity.