Why Your Data Science Team Fails at Research: Uncertainty and Domain Knowledge
This post is the first in a four-part series on doing research within the context of a Data Science team. Views expressed here come from my personal experience in the field. This first part focuses on the concept of research, and compares it to other functions of a typical data science team. I aim to set the table for the next three parts which present a working solution to the challenges laid out here in terms of Process, Management, and Strategy.
Data Science as a profession has grown exponentially during the past decade. Many firms hire data scientists for a variety of reasons, but chief among them is using their data to find hidden gems that would transform their business. Firms are interested in new insights that can radically and directly translate to competitive advantage. But have Data Science as a profession been successful in delivering on this promise? I am not sure the answer to this question is unanimously positive.
Sure, the teams at the powerhouses of analytics have derived a tremendous amount value and there are unicorns like Uber that are entirely build on Data, but the average team working in a non-IT industries (Healthcare, Manufacturing, Distribution, Food) have not been so hugely successful.
These industries seek Data Science with the objective of finding new insight that could translate to competitive advantage. My hypothesis is that Empirical Research, the task of extracting insight from data, is not so straight forward as all the other tasks assigned to the Data Science team. To be successful in research a separate organizational machinery needs to be in place.
Let me start to explain this hypothesis by first examining the roles and responsibilities of a typical Data Science team. I categorize these into four groups: Development, Maintenance, Consulting, and Research. Let’s look a bit more closely:
Development: The main function of every data science team is to build data products via the application of machine learning. A few examples of these include forecasting models for revenue, market basket analysis models, and recommender systems for upsales. All of these products provide insight about new data by learning patterns in the historic data, and to build them, you team must write code, and often build ancillary domain-specific tools to help along the way. Examples of these tools are web apps that help your team quickly profile data, or command-line tools for training and deploying models. These tasks are all complex, but all in all do not involve a great deal of uncertainty.
There are technical challenges of course, for example choosing the right model for each use case and making sure the data pipelines work without error, but most of these issues can be anticipated well in advance of execution of a project. It is relatively straight forward to estimate how long it would take to create a new model, or build an internal tool to help with that.
Also, development of machine learning models does not require deep familiarity with domain knowledge. Data Science methods deal with arrays of numbers or categories and don’t really care what those numbers represent, so as long as the basic framework is put in place correctly, the prediction algorithms will give you relatively satisfactory results.
Domain knowledge can be very valuable and make a huge difference in being able to create highly accurate models, but it is not a requirement for doing so. Advances in computing power allows data scientists to build extremely complex models (such as deep neural networks) that are able to create very good models using minimally processed raw data.
Maintenance: After developing these machine learning products, the team naturally has to maintain them. Maintenance tasks range from routine re-calibration of models to ensuring compatibility with new data architectures and pipelines. Maintenance tasks involve very little uncertainty, deep expertise in data science, They do require some understanding of domain knowledge, but nothing too deep. Some of these tasks should ideally be automated, and the rest should be passed to Business Analyst or Technician level resources. Data Scientists are expensive, so best use of resources is to keep them busy doing tasks that requires their specific expertise — or the business will suffer.
Consulting: The team is often sought out by internal and external customers on how a particular problem can be solved using Data Science tools. For example, a team in the Healthcare Industry may be asked if it would be possible to use machine learning to codify Electronic Medical Records (EMR), or match hospital charge item codes to CPT codes.
Any competent data scientist can answer these types of questions readily, and will be able to recommend a high-level design for execution with some limited investigation into the requirements. There is some uncertainty involved as it is difficult to say how accurate (and hence useful) a predictive model can be before the model is actually built, but it is easy to estimate this with a quick proof of concept.
Consulting relies somewhat strongly on domain knowledge, but often the knowledge is provided as part of the consulting engagement. Clients usually know the problem they are facing and seek the expertise of data scientists on how to solve that problems. The solution may be technically challenging but chances are it does not rely heavily on a specific domain knowledge. The same algorithm that can recommend movies on your Netflix account based on your viewing history, can be used to recommend an alternate career to a job seeker based on job-posting browsing history.
Research: Finally we get to talk about research! A lot of the attraction towards Data Science is for its promise for uncovering new insight from data. Businesses have many questions they want to answer using their data, they know there is a treasure trove of information in their warehouse, but they don’t know how to extract it. As a result, they have questions that are often investigative in nature and cannot be answered directly by creating predictive models. Here are a few examples:
- Dean of a Law School may ask “Why are our students passing the Bar examination at a lower rate than before?”
- CEO of a game company may ask “What should we do to increase player engagement in our online game?”
- Head of Product at a manufacturing firm asks “Which customers are more likely to buy this completely new category of products that we are introducing next season?”
The process required to answer these questions is inherently different from all the other roles I just described, for two reasons (can you guess?): Domain Knowledge and Uncertainty!
There is a great deal of uncertainty involved with answering these questions. By definition it is impossible to know the answer without investigating the data. Investigating the data can take an uncertain amount of time, because one cannot estimate the number of lines of code required to uncover the answer. Also, it could well be that the answers even do not exist within the data available to the firm/institution. The answers could be somewhere outside the existing data. Here’s one example to illustrate this:
- Here’s why the students at the Law School were failing the Bar exam: four very well respected faculty recently retired. This in turn caused the quality of teaching to suffer in the foundation courses they taught, which in turn caused lower bar exam passage rates. However, the data available to the Data Science only included transcripts and did not include instructor names, so it was impossible to know that a transition had happened in those foundation courses.
Another reason why research could be uncertain is that it most often involves social phenomena, which is extremely hard to characterize. Human behavior is a very complex and uncertain construct, and to make things even harder the data available is usually observational data making it harder to establish causal relationships.
Research also requires a deep understanding of domain knowledge. If not impossible, it is very difficult to tell a data-driven story without knowing what the data represents and how each data item is relevant to the others. I said earlier, that data scientists can build very complex models that perform quite well without any knowledge of what the data represents, but almost of all of these models are Black Box Models, meaning it is impossible to interpret or explain them as the mathematics behind them is complex. Building models that are interpretable (Glass Box Models) is necessary for research and requires understanding the domain very deeply.
To summarize, let’s visualize the discussion using a simple chart:
This chart summarizes our argument: Research is inherently set apart from other categories of work assigned to the Data Science Team, it involves orders of magnitude more uncertainty and requires deep understanding of domain knowledge.
Due to this qualitative difference, Data Science Research (or Empirical Research) should be managed differently. A Data Science team that does a good job at developing machine learning products may not be well positioned to succeed in doing research. The problem with many data science teams out there is that they are trying to do empirical social science research using the same formula that has worked for development of machine learning products. This is where they fail, the management of research. They fail to execute, because they don’t know how to deal with uncertainty and how to acquire domain knowledge.
A Data Science team that does a good job at developing machine learning products may not be well positioned to succeed in conducting research.
In the parts that follow, over the next couple of weeks, I will lay out a working model of how to deal with these complexities. I will try to characterize a working model of Processes, Management Styles, and Strategies required for successful empirical research within a Data Science team.