The role of data scientist: A back-of-the-envelope model

Published in Data Science at Microsoft, Aug 3, 2020

At Microsoft, we have over 2,300 people with “data scientist” in their titles. We all work for the same company in — ostensibly — similar roles, so you might expect that we’d have a shared understanding of the discipline. But if you surveyed us, asking what it means to be a data scientist, you’d get a surprising range of answers.

The people who focus on descriptive and diagnostic analytics would likely prioritize the rigorous application of statistical methods. For them, a good data scientist needs to know how to choose the appropriate significance test, understand its underlying assumptions, and structure analyses to ensure the integrity of the results. Others, focused more on predictive and prescriptive analytics, might say that a data scientist needs to be familiar with a range of machine learning (ML) algorithms: clustering, classification, regression, forecasting, ensemble methods, and more. Still others focus primarily on deep learning, and they place more value in architectures than algorithms: FFNNs, RNNs, CNNs, LSTMs, etc.

As data science evolves, these differences become more distinct, making it increasingly difficult for a single title to describe them all. This has practical implications for anyone who’s a data scientist, anyone who aspires to be a data scientist, and anyone — like me — who manages a data science team. How do you plan a career, choose a curriculum, or build a team when there’s no map to guide you?

The challenge isn’t unique to Microsoft, of course. The Harvard Data Science Review explored this same topic in their first issue. Xiao-Li Meng, HDSR’s editor-in-chief, set the stage: “Increasingly, there is a general recognition that because DS [data science] has evolved in such a diverse way, it is unwise to use a list of must-have skills to conceptualize it as a single discipline.”

Meng intentionally avoided a definition of data science, choosing instead to reduce ambiguity about the discipline by explaining what it is not. To understand the role of a data scientist, however, we cannot explain what it is not; we need a definition of data science that’s precise enough to describe the discipline, but general enough to cover the range of roles within it.

The essence of data science

To date, the best definition that I’ve come across was proposed by Jeannette M. Wing, director of the Data Science Institute and professor of computer science at Columbia University: “Data science is the study of extracting value from data” (Columbia University website). Despite its brevity — or perhaps because of it — Wing’s definition seems to capture the essence of data science. I especially appreciate that, in her words, “…the word ‘study’ includes both the art and science that guides any field of scientific pursuit,” because I believe that creativity is one of the most valuable skills of a data scientist, but often the least emphasized, and certainly the most difficult to cultivate.

One challenge with Wing’s definition, however, is that it probably overly generalizes the discipline. We might reasonably conclude that anyone engaged in “the study of extracting value from data” is a data scientist. I doubt that’s what Wing intended — I’m almost certain that it’s not — but the rise in data literacy and the availability of point-and-click analysis tools have made it possible for nearly anyone to do the work of a data scientist.

Indeed, many of the 2,300 data scientists at Microsoft used to call themselves analysts (including me). Today, however, we have only about 300 analysts and just two statisticians! Are analyst and statistician simply outmoded titles, or have those people responded to industry demand and expanded their skillset?

Elena Tej Grewal explored a similar question in “One Data Science Job Doesn’t Fit All.” In that post, she explains that Airbnb had originally treated analyst and data scientist as separate roles. However, as she writes, “…team members who were doing analytics work felt like their work was not as valued as machine learning work, and yet their work was critical for the business.” In the end, Airbnb decided to define three roles, all within the data science discipline:

Analytics to define and monitor metrics, create data narratives, and build tools
Algorithms to build and interpret algorithms that power data products
Inference to establish causal relations with statistics

Their approach is similar to Google’s, where they’ve defined two broad categories of data scientist: Type A, who analyze data, and Type B, who build production solutions.

Attacking the danger zone

In the absence of consensus opinion, it’s worth revisiting Drew Conway’s original Venn diagram, which helped popularize data science but also shaped many people’s understanding of the discipline, including my own.

I’ve long been a fan of the diagram, because it captures the unlikely combination of skills that are essential to data science. However, I’ve often struggled with the crisp boundaries where the skills overlap.

The most intriguing of these is the intersection between hacking skills and substantive expertise, which Conway labels the danger zone! These are “…people who are perfectly capable of extracting and structuring data, likely related to a field they know quite a bit about, and probably even know enough R to run a linear regression and report the coefficients; but they lack any understanding of what those coefficients mean.”

I understand Conway’s concern: There are “people who ‘know enough to be dangerous,’” and I’ve seen the mistakes that can be made as a consequence. One challenge with the diagram, however, is that it creates the perception that the intersection of hacking skills and substantive expertise, in the absence of math and statistics knowledge, is implicitly dangerous.

As presented, the distinction is based upon a false dichotomy where people either have hacking skills or they don’t, either have substantive expertise or they don’t…. Forgive the hyperbole, but anyone who can add has math skills, and anyone who’s ever used the AVERAGE formula in Excel has both an understanding of statistics and hacking skills. As Conway suggests, such rudimentary skills are hardly sufficient for data science, but it raises a fundamental question: What level is sufficient?

The question of sufficiency is further complicated by the evolving nature of data science. Today’s data scientist is very different from the data scientist of five years ago. Industry demand has led to the increased availability of training options and improved curricula, and advances in technology have led to more sophisticated techniques and made them much easier to implement.

Who does what?

A more constructive approach might be to base a model on the responsibilities of a role, not the qualifications for it.

Consider the typical division of labor between a program manager (PM) and a software development engineer (SDE). For any given project, the PM is generally responsible for identifying the business opportunity, gathering requirements, and specifying the desired solution. The SDE is generally responsible for designing the technical solution and developing the software.

The chart below models that simple division of labor: The PM focuses on business strategy, and the SDE focuses on technical implementation. These are analogous to Conway’s substantive expertise and hacking skills, so their overlap would put us in the danger zone.

Again, though, the crisp boundaries of the Venn diagram create a false dichotomy. More realistically, both the PM and SDE have responsibilities that cross the line between business and technical in ways that are wholly appropriate. A PM might need to write code for a proof of concept, perhaps to evaluate the viability of an approach or to demonstrate its potential. Likewise, an SDE might need to meet with customers, either to better understand their use-case scenarios or to help them resolve issues.

If we accept that each role is a balance between business and technical, another way to think about the difference between the roles is by the proportion of time that each spends in the two areas.

In the chart below, the y-axis represents a person’s proportion of focus. The x-axis represents the range of ways that a person could divide their time, from 100 percent focused on business strategy to 100 percent focused on software development.

The shaded labels beneath the chart show where a PM and an SDE would typically land on the x-axis. Think of them as probability density functions: The darker the region, the more likely it is for a role of that type to land there. A PM, for example, would typically fall on the left side of the chart, perhaps spending 80 percent of their time focused on business strategy and 20 percent on the technical solution.

Depending upon the skills of an individual or the stage of product lifecycle, the balance between business strategy and technical implementation varies for both PM and SDE.
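To make the two-area chart concrete, here is a minimal sketch in Python. It assumes a position on the x-axis can be written as a number x between 0 and 1, where 0 means 100 percent business strategy and 1 means 100 percent technical implementation. The function name focus_split and the SDE position of 0.8 are hypothetical, chosen only for illustration; the PM position of 0.2 reproduces the 80/20 example above.

```python
def focus_split(x: float) -> dict[str, float]:
    """Return the share of focus in each area for a position x on the x-axis.

    x = 0.0 is 100 percent business strategy; x = 1.0 is 100 percent
    technical implementation (an assumed encoding of the chart's x-axis).
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0 and 1")
    return {"business": 1.0 - x, "technical": x}


# The PM example from the text: 80 percent business, 20 percent technical.
pm = focus_split(0.2)

# Hypothetical SDE position, chosen only for illustration.
sde = focus_split(0.8)

for role, split in (("PM", pm), ("SDE", sde)):
    print(role, {area: f"{share:.0%}" for area, share in split.items()})
```

A single number is enough here because the two shares always sum to 100 percent, which is why a role can be described by where it lands on one axis.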

The back of an envelope

To extend the model to data science, we simply introduce the third focus area, math and statistics. The resulting diagram looks a bit like the back of an envelope.

A data scientist role would typically fall somewhere near the middle of the x-axis.

The first thing you might notice is that the model doesn’t allow data scientists to spend 100 percent of their time focused on math and statistics. That’s intentional. While tasks like building ML models characterize the data science discipline, data scientists spend a healthy portion of their time not building models. Much of the rest of their time is spent understanding the business problem, collecting and cleansing data, understanding how to interpret that data in a business context, exploring hypotheses suggested by stakeholders, and more. As such, the center of the model strikes an equal balance among the three focus areas, similar to the Venn diagram.

The model likewise adds math and statistics as focus areas for both PMs and SDEs, because they’d likely spend a portion of their time focused there, too. For example, a PM might analyze trends in the business or build a simple forecast. An SDE might likewise use libraries like Azure Cognitive Services to build bots or image detection solutions.
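The three-area model can be sketched the same way. The article gives no formulas, so the piecewise-linear shape below is an assumption: it is one simple way to satisfy the two constraints described above, namely an equal three-way balance at the center of the x-axis and a math and statistics share that never reaches 100 percent. The function name envelope_split and the sample positions are hypothetical.

```python
def envelope_split(x: float) -> dict[str, float]:
    """Assumed three-area split for a position x on the x-axis.

    x = 0.0 is the business end, x = 1.0 is the technical end, and
    x = 0.5 is the center, where all three areas get an equal share.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must be between 0 and 1")

    # Distance to the nearer end of the axis, from 0.0 (edge) to 0.5 (center).
    d = min(x, 1.0 - x)

    # Math/statistics and the farther area both grow linearly toward the
    # center, reaching one third there; the nearer area takes the rest.
    minor = (2.0 / 3.0) * d
    major = 1.0 - 2.0 * minor

    if x <= 0.5:
        business, technical = major, minor
    else:
        business, technical = minor, major
    return {"business": business, "math_stats": minor, "technical": technical}


# Hypothetical positions for three roles, for illustration only.
for role, x in (("PM", 0.2), ("Data scientist", 0.5), ("SDE", 0.8)):
    shares = envelope_split(x)
    print(role, {area: f"{share:.0%}" for area, share in shares.items()})
```

Under this assumption the math and statistics share peaks at one third in the center and falls to zero at either end, while roles near the ends still keep a small share of it, consistent with the PM and SDE examples above.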

Other roles

As data science matures, more specialized roles have evolved to address the specific needs of production solutions, including both data engineer and ML engineer. Again, however, we don’t often find bright-line distinctions between these roles and others, but their positions on the continuum give us a sense of where they focus their attention and what skills would be most valuable.

The smaller the team, the less likely that it will have specialists in each of these areas. In the absence of ML engineers, data scientists might need to put their own models into production. In the absence of any technical support, a “full stack” data scientist would be responsible for the entire technical implementation, spanning everything from data acquisition to model deployment and everything in between. Conversely, in the absence of data scientists, PMs need to model their own business.

My own team at Microsoft is fortunate enough to have people who fill most of these roles. (We’ve introduced you to some of them in our series, “The faces of data science.”) Our centralized approach gives us the size to establish roles in the following disciplines: data engineer, data scientist, machine learning scientist, and program manager.

My leads came up with the following descriptions:

Data engineer

Data engineers build the data platform. This includes architecture design, technology choices, development, and maintenance. The platform must bring together a broad set of datasets, store them in a compliant way, and join them across common identities. It also must be able to run machine learning models in production and handle DevOps. Finally, it needs to meet a number of stringent quality bars, including data quality monitoring, anomaly detection, privacy, GDPR, security, reliability, uptime, data contracts, and service-level agreements. Skills of the data engineer include ingesting data, developing (and maintaining) ETL data pipelines, following software engineering practices, employing experience with cloud-based technologies, working with relational databases, and developing big data engines. Our team’s earlier post on our data platform architecture is a good way to learn more about what our data engineers do. It also gives an appreciation for the scale of the data involved as well as the technologies we use.

Data scientist

Data scientists develop a variety of stakeholder-facing deliverables that help the company make data-driven decisions. These range from descriptive analytics to predictive analytics, depending on requirements. They extract meaningful insights from large datasets, and share this context with product, service, and business leaders. Data scientists help design and analyze experiments, leveraging statistical techniques. They also develop AI services to power products, drawing on their programming capabilities. Data scientist skills range from technical (querying, statistics, modeling, analytical problem solving, and data visualization) to business (domain expertise) to soft skills (communication, prioritization, and cross-group collaboration).

Machine learning scientist

Machine learning scientist isn’t an official title at Microsoft, but we use it to identify the people who fill a hybrid role. ML scientists combine the skills of a data scientist and ML engineer, because they develop AI services that run in production. This includes programming analytical services into customer-facing product functionality. Examples include recommender models that suggest what the customer does next, or cost-management services that help customers forecast spending. It also includes predictive models for internal stakeholders, such as sales and support teams, that suggest how to best engage with customers and help them succeed. ML scientists work to continuously tune and improve their models. First, they optimize the models using training data. Then they regularly retrain the models over time as more current data becomes available. They also incorporate feedback from model users and track model performance. Finally, they help maintain these models in production and respond to support tickets, maintaining service-level agreements. Their skills include machine learning and statistical modeling techniques such as decision trees, logistic regression, probability, deep learning, neural networks, Bayesian analysis, natural language processing, and others.

Program manager

Program managers help scale our stakeholder engagement. They document stakeholder needs and maintain regular stakeholder communication, including answering questions, communicating plans, and providing awareness for deliverables. Given the broad set of requests we receive, program managers drive a divisional planning process that enables stakeholders and our data science team members to prioritize the backlog and ensure we partner on the most significant work. They help us track committed work with tracking tools and ensure stakeholders have a clear plan to leverage the insights we produce to drive clear business value. Finally, program managers help drive end-to-end team projects that span the functions of our multi-disciplinary team. Their skills include communication, organization, ability to influence, and the ability to quickly learn new domains.

What’s next?

As the discipline matures, we can expect new roles to appear and the definitions of existing roles to evolve. To adapt our back-of-the-envelope model, we may need to introduce new focus areas. In fact, I considered adding communication skills and creativity, but I’d contend that they warrant consistent levels of focus, regardless of role, not proportionally balanced with other areas of focus.

Ultimately, to succeed in any data science role, a person needs the complete portfolio of skills. The only difference is the proportion of focus that each requires.