Models for integrating data science teams within organizations
A comparative analysis
Beginning in the first decade of the 21st century, internet companies were able to gain visibility into the business in ways never possible in the age of spreadsheets and relational database management systems. No longer did they need to wait for end-of-quarter financial results in order to gauge product performance; and no longer did they need to rely on extrapolations from samples to get a comprehensive view of what was working for all customers. In addition to improved visibility into the state of the business, the new data storage and aggregation capabilities enabled companies to build data products like search, ranking, and recommender systems.
What became important was to determine how this work could be achieved efficiently and effectively. Designing and building a data science team is a complex problem; so is determining the nature of interactions between data scientists and the rest of the organization.
A DS team isn’t just the people; it is also the process and the interaction of the team with the rest of the company.
In this post, I compare some of the popular models of integrating data science teams within organizations. In determining the best model, I take into account the following factors:
Coordination efficiency. Every team creates new sources of knowledge. Incorporating that knowledge into the business in a timely fashion requires robust organization design. Bad designs lead to failures and inefficiencies in knowledge sharing and coordination.
The goal of work is some output—a strategy, product, marketing plan, budget, account plan, sale, feature, etc. Communication is a way of incorporating stakeholders into a plan *before* it is too far along to change or the cost is too high (or coworkers too angry!)
Steven Sinofsky on Twitter
Management success. Different structures have differing management needs. These requirements could be met by either growing the size of the management team, or by diversifying the skills and responsibilities of the existing management team. Failing to identify management needs leads to headaches for managers and employees alike.
Employee happiness. No discussion of organizational structure is complete without considering employee happiness, motivation, and growth factors. This is not just about reducing the cost of recruiting in response to employee churn, but also about providing employees with the space to do creative and quality work during their tenure. Designing structures without considering employee happiness is a costly failure.
Product success. Data scientists size the opportunity of new ideas, design experiments and metrics, and design and tune models. They promote the correct use of data within the organization. New products shipped without these considerations usually contain deficiencies in instrumentation and implementation, and are potentially not aligned with company strategy and pressing customer needs.
I make a number of assumptions. The organization is a single strategic business unit (SBU). The SBU is partitioned, with respect to tasks and responsibilities, into independent levers or products. A product team is the cross-functional team (composed of engineers, designers, etc.) owning a single product. A data scientist (DS) is an engineer with skills in data processing, analysis, and model building; and data science is their work. A data science team is a team of data scientists and their managers.
The center-of-excellence model
We start with the most centralized of all the models. In the center-of-excellence (CoE) model, also known as the research model, the expectation is that the data science team works independently to identify big bets and build prototypes. Under this model, the data science team is considered to be the company’s innovation arm.
There are some misconceptions that lead organizations to choose the CoE model for their data science team.
a. PhD graduates are hired to do research. Data science teams hire many PhD and Master’s graduates. The focus of most of these programs is research, and so there is a misconception that graduates of these programs are hired to do research. However, the true motivation for hiring PhD and Master’s graduates into DS roles is different. Data scientists are usually required to do engineering, statistical analysis, and model building. The extra years of study in mathematics and statistics are meant to provide value in producing quality analyses and in designing new methodology.
b. Innovation happens in a lab. In cases where organizations rightly expect innovation from the data science team, there is a misconception that the innovation arm of the organization needs to be freed and independent of the day-to-day requirements of the business. When teams do not consider the company’s existing business model and infrastructure, their output does not translate into functioning products. This is why despite some historical success stories, many companies refrain from such an investment even when it is affordable. The question then arises, “who is in charge of innovation?” That will need to be the topic of a future post.
There are important drawbacks to having the data science team operate within the CoE model:
a. Lack of context about the challenges of the business. Without visibility into the day-to-day decision-making challenges, purely centralized data science teams find it difficult to identify the most important problems to tackle. They focus on pie-in-the-sky ideas while the product suffers.
b. Difficulty in closing the loop. In cases where they are successful at identifying and solving an important problem, centralized research teams find it difficult to get the solution adopted by the product teams. The adoption of the proposed solution would likely disrupt a team’s existing roadmap—as the two teams are out of sync. Resolving this conflict usually requires actions by higher management, leading to unwelcome interruptions to existing teams and their roadmaps. If higher management does not step in, research teams become demotivated.
My view is everyone is on the same calendar/cadence. That’s a huge thing for me. If you don’t have that then split resources (all of them) by cadence. Teams on difference cadences can’t collaborate.
Steven Sinofsky on Twitter
c. High cost associated with building a new team to back an initiative. Rather than disrupting existing roadmaps, an alternative path is to build a new product team to back a proposed solution. This team would have cross-functional membership to work on the proposals by the research team, making it a costly but valid endeavor. Valid, because ideas need to be backed by a complete team in order to be assessed correctly and quickly. It would be inadequate to measure the success of an idea if any part of the experience is lacking. A new feature requires design, engineering, instrumentation and measurement, marketing, comms, and sales to realize its potential.
The cost of building a brand new product team further increases if the new product team does not form a partition along with existing product teams. Forming a partition with other product teams is important; otherwise roadmaps and responsibilities would overlap. This puts the new team’s longevity in jeopardy as they try to figure out their raison d’être, while confusing other teams.
d. Non-recurring and nondeterministic output. Under the research model, the product teams might be able to adopt and find value in a single output from the data science team, but wonder if there would be follow-through if they were to go ahead and make the proposed changes.
Benefits and success scenarios
It should be noted that the CoE model works for many types of teams. Centralization helps focus and agency. You should centralize that which you can clearly encapsulate from the rest of the organization. Centralization works when coupling is low and joint meetings are few and far between.
As an example, consider tooling development teams. Once the company decides on a technology or programming language, tooling improvement efforts can happen more or less independently of product launch timelines.
Another example of a successful CoE team is Microsoft Research, a subsidiary of Microsoft formed in 1991. There is no expectation that the institute produce any result that would be applicable to core Microsoft products. Even so, it turns out that Microsoft is leading the patent race in AI as a result of its investment in a research institute.
The accounting model
In the accounting model, also known as the BI model, the data science team produces reports and presentations on a recurring basis (usually monthly and quarterly). The data science team would inform the organization of notable movements in top-level metrics. Once the team identifies an interesting or worrying trend, they would work with product teams to investigate the root cause. Thus, quite frequently, playing detective becomes a main activity of the data science team under the accounting model.
There are four main drawbacks to this model:
a. Difficulty in attribution and closing the loop. As mentioned above, it is nearly impossible to reason based on global trends alone. This drawback becomes particularly pronounced when there are many product teams and hence many moving parts and levers.
b. Reorganization and the emergence of tiger teams. It is important to have analyses and metrics which are tied to levers (product teams) so that they are actionable quickly, at lower cost, and with less need for reorganization. Reorganization happens and new “tiger” teams emerge when the data science team is unable to identify the culprit and existing product teams are unable to own and prioritize a fix.
Tiger teams rarely form a partition with existing product teams and thus disrupt the flow of the organization. The emergence of tiger teams is a drawback of all fully centralized models.
c. Underutilizing technology. When monthly and quarterly reports are the only function of the data science team, the organization cannot fully gauge product quality until a calendar milestone arrives. If launches are leading to less usage in a certain market, the drop happens a launch at a time, not a quarter or even a week at a time. A product opens up to misuse a launch at a time. Data security is breached a launch at a time. Identifying, after many launches, the one that led to decreased usage in Japan is an impossible task; so is determining the launch that created incorrect incentives for abusive behavior.
d. Low quality and stale data. Every launch creates new sources of data that need to be incorporated back into existing metrics and considered in future analyses. Accountant data scientists miss all important updates, and usually rely on stale data for analyses. It is difficult to be involved in instrumentation from the sidelines. This is a drawback of all fully centralized models.
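The “Underutilizing technology” point above can be made concrete with a minimal sketch. It assumes, hypothetically, that usage in one market is measured just before and just after each launch and that nothing else moved the metric in between; the launch names and numbers are invented for illustration. A quarterly report would see only the net change, while launch-level measurement points at the launch behind the drop.

```python
from dataclasses import dataclass


@dataclass
class Launch:
    name: str
    usage_before: float  # usage in the market just before the launch (made-up data)
    usage_after: float   # usage in the market just after the launch (made-up data)


def per_launch_deltas(launches):
    """Return (launch name, change in usage) for every launch."""
    return [(launch.name, launch.usage_after - launch.usage_before) for launch in launches]


def worst_launch(launches):
    """Return the (name, delta) pair with the largest drop in usage."""
    return min(per_launch_deltas(launches), key=lambda pair: pair[1])


# Hypothetical quarter with three consecutive launches.
launches = [
    Launch("launch-a", 1000, 1010),
    Launch("launch-b", 1010, 930),
    Launch("launch-c", 930, 945),
]

# What a quarterly report sees: only the net change over the whole period.
quarterly_change = launches[-1].usage_after - launches[0].usage_before
print(quarterly_change)

# What launch-level measurement sees: the specific launch behind the drop.
print(worst_launch(launches))
```

In practice the before and after values would come from an experiment or holdback rather than a simple comparison, but the contrast between the aggregate and the per-launch view is the same.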
Reporting on quarterly trends of company metrics is a valuable practice. The centralized aspect of the BI team allows for a holistic view of the SBU, which leads to globally optimal decisions that can balance and correct local ones. This work should be part of the data science team’s charter, regardless of the model in practice.
The consultant model
In the consultant model, the data science team is assigned tickets or emailed with questions. Data science managers then prioritize the tickets and questions and assign them to data scientists.
In this model, the data science manager overrides any existing data science roadmaps to prioritize the questions and needs of stakeholders. Due to the symmetrical treatment of all members of the team, this model makes managing a data science team easy and cheap.
There are many drawbacks to this model:
a. Communications overhead. Data scientists in a consulting position usually lack the context to resolve questions effectively in a timely manner. There is communications overhead involved in gaining familiarity with data sources and their creation process. Further, if a follow-up to an analysis is needed and the original consultant data scientist has other ongoing commitments, the work will get assigned to another data scientist. This requires yet another onboarding investment—and thus the cycle continues.
b. Unclear deadlines. It is difficult for stakeholders to know when work would get prioritized and assigned to a consultant data scientist. Processes affecting the volume of incoming requests are not transparent to the data science team and their managers. Even after work gets assigned and prioritized, it is difficult for the data scientist to be able to estimate the amount of time needed to answer questions due to their unfamiliarity with the limitations and nuances of the ever-changing data sources.
c. Short-term ownership. Innovation happens when people plan for years, not days and weeks. Having data scientists act as short-term consultants makes it difficult to incentivize focus on complex or tedious work. This work is needed to ensure quality data, quality experimentation tools, quality data manipulation and visualization tools, and quality results.
d. Unclear ownership. A by-product of short-term ownership is unclear ownership. When projects are one-off and seemingly random, people are more likely to step on each other’s toes. This happens inadvertently but is a non-negligible source of inefficiency. It should be noted that this is a drawback of all fully centralized models.
e. Lack of motivation and unfulfilling work. Data scientists working under this model usually lack motivation as they are rarely involved in the product decision making process. They also usually find the work unfulfilling as they rarely see the results and impact of their work.
f. Low data quality and recurring emergencies. Without maintaining good data practices, products that rely on data as input fall prey to recurring bugs and emergencies. In this model, data scientists are pulled into a project in order to play detective and identify the source of the bug.
Apart from the unfamiliarity of the data scientist with the data creation process and the product’s evolution, there is also the possibility of missing data due to missing instrumentation. It is impossible to find a needle in a haystack when the needle is not instrumented. It is also painful to look for a needle in a haystack that is extended to the fourth dimension (of time).
Finding the culprit is nearly impossible in these situations. As discussed in the drawbacks of the accounting model, the organization usually responds by forming a tiger team, thus inducing further disruption and cost.
g. Unclear coverage of product areas. There are many allocation and prioritization challenges under this model. How does work get prioritized by the data science manager? Which product teams get the most attention? Which decisions are made with data in mind, which are made without, and who would be making these calls?
h. No clear sizing and allocation strategy. As with any fully centralized model, it is always difficult to determine the number of data scientists needed. Does the size of the team grow with the size of the organization or with the number of requests? If the latter, how does one estimate the number of the requests and total scope? There is no simple strategy for determining the size of a centralized team.
The embedded model
In this model, product teams hire their own data scientists. Each engineering manager is in charge of planning for data scientist headcount, hiring, and allocation. The data scientist within each product team has the engineering team members as their peers.
This model brings welcome independence to the teams and relieves the SBU of the management requirements of a centralized data science team. It solves problems with team sizing and communications. It also solves the ownership and motivation issues that exist in fully centralized models.
While there are reductions in data science management cost, this model has important drawbacks:
a. Management complexity. Title and role diversity on the same team leads to management headaches. It is difficult for a single manager to maintain and assess multiple career ladders for different members of the team; managers rarely get it right with just a single ladder. Usually, the manager is inadvertently biased towards assessment against a single ladder: the standard engineering ladder. This incentivizes the data scientist to take on a role symmetrical to other engineers on the team, undermining the original point of hiring a data scientist. Hiring data scientists and putting together the right interview panel is also a challenge within this model.
b. Mentorship deficit and difficulty in maintaining uniform data standards and best practices. Data scientists benefit and learn from working closely with their peers, in particular during analysis reviews. An embedded model does not readily offer a path to a recurring and persistent relationship amongst data scientists. Further, independent data scientists on each team would design their own processes and standards. It should be noted that weak standards are a drawback of all fully decentralized models.
c. Underutilizing technology and data science de-prioritization. Some teams might put off hiring a data scientist due to pressing deadlines and cost. In the absence of good data, services are still deployed. This leads to important shortcomings in data quality and data products that become cumbersome to fix later.
d. Local rather than global optimization. When there is no central ownership over metrics and key results, teams choose metrics and projects that lead to local optimization. Further, in this model, teams are incentivized to compete and ignore cannibalization effects. Local optimization is a drawback of all fully distributed models.
The democratic model
In this model, it is believed that easy and straightforward access to data by product managers, designers, engineering managers, and engineers would lessen or remove the need for a data science role. Many attribute the need for data scientists to a lack of proper infrastructure for fast and easy dashboard creation.
It is valuable to invest in data infrastructure and tooling that makes data access, processing, and visualization available to everyone. This investment is particularly valuable to data scientists as it frees up time for proactive opportunity sizing, experiment design, metric design, model design, and general improvements in methodology.
While ensuring everyone has direct and easy access to data is a noble goal, there are some drawbacks to this model:
a. Difficulty in mastering everything and maintaining data best practices. Usually, people are mostly specialized and interested in a particular set of tasks. Being skilled at a company’s engineering stack is already a big feat. It is fine to offload design work, product work, and data work to specialists. Data scientists enforce good data practices within the organization.
b. Dashboards are not data science.
The laws of shitty dashboards * Attack with Numbers
The product data science model
Between the extremes of the fully centralized model (the CoE model) and the fully decentralized model (the embedded model), there exists a spectrum of hybrid models that take characteristics from each of the aforementioned models. Taking advantage of the strengths of both models while actively making up for their deficiencies is what makes hybrid models successful.
The product data science (PDS) model is inspired, only in part, by the matrix structure. Individuals are simultaneously members of the data science function and a product team. Data scientists—although each a member of a product team—report into a central data science team. Thus, unlike the matrix structure, there is unity of command under the PDS model.
a. Clear ownership and actionable insights. One important benefit of the PDS model is clear ownership of projects by the data scientists, due to their membership in the various product teams. Membership in each product team gives data scientists a thorough understanding of that product, its limits, and its potential. This in turn allows a straightforward mapping of analysis to proposals for action. It is difficult to move fast if newly available insight does not map into reasonable and informed actions.
b. Quality data and quality data products. Data scientists’ close collaboration with a product team improves data quality. Every single launch changes the data, and so it is important to oversee its evolution with careful instrumentation.
c. Standardized data science processes. Data science peers, working on different product teams, come together to establish best practices and onboarding flows within the data science team. They review one another’s code and analyses. They benefit from a unified career ladder, with managers who can assess their impact and can plan for their growth.
d. Global optimization. The direct and recurring collaboration of data science peers from various product teams has other benefits. Due to their collective bird’s-eye view of the business, they are able to connect the dots, identify inconsistencies, and optimize globally. This is very similar to the way design teams operate.
e. Planning and allocation clarity. Another benefit of the PDS model is that it simplifies the task of determining the size of the data science team. Once you figure out how to partition the SBU into product teams, and you figure out the number of engineers per product, the allocation of data scientists can easily be determined as a proportion. I explain more here.
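The proportional allocation described in the last item can be sketched in a few lines. This is only an illustration: the ratio of one data scientist per eight engineers, the floor of one data scientist per team, and the team names and sizes are all hypothetical assumptions, not figures from this post.

```python
def allocate_data_scientists(engineers_per_team, engineers_per_ds=8, minimum=1):
    """Return a data scientist headcount for each product team.

    engineers_per_ds is an assumed ratio (one data scientist per N engineers);
    minimum is an assumed floor so that every product team gets coverage.
    """
    allocation = {}
    for team, engineers in engineers_per_team.items():
        # Integer ceiling division avoids floating-point rounding surprises.
        proportional = (engineers + engineers_per_ds - 1) // engineers_per_ds
        allocation[team] = max(minimum, proportional)
    return allocation


# Hypothetical partition of the SBU into product teams, with engineer counts.
teams = {"search": 24, "ranking": 12, "recommendations": 7}

allocation = allocate_data_scientists(teams)
print(allocation)                # headcount per product team
print(sum(allocation.values()))  # total size of the data science team
```

The point is that once the partition and the engineer counts are known, the total size of the data science team follows mechanically from a single ratio that the organization can revisit as it grows.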
No model is perfect and each has its drawbacks. To quote Sinofsky,
Since there is no optimal or perfect organizational structure […] then the most important thing is to know the weaknesses of your structure and to compensate for them.
Below are some drawbacks of the PDS model.
a. Cost. One of the main arguments against the PDS model is the cost of hiring a data scientist for every product team; and the associated cost of a centralized management team. This assessment does not take into account the savings stemming from the increase in data and product quality, and the more effective use of data for the business. Having said that, organizations should do what they can afford. In the beginning everyone is responsible for engineering and design and data needs. As the SBU grows, one can have separate functions handling each task.
b. Recurring conflicts due to lack of power parity. For success on cross-functional teams, all functional leads should have similar amounts of influence and negotiating power. Without power parity, the benefits of cross-functional collaboration are lessened due to recurring conflicts, late delivery, and suboptimal results.
c. Information overload and the data science manager. Gaining the right amount of knowledge about all products supported by the data science team is not straightforward. However, managers need to be informed and curious about the areas under their purview to be able to effectively assess contributions, investments, timelines, and tradeoffs. They also need to be able to continually communicate the contract between the data science and stakeholder teams and be mindful of the team’s portfolio.
The drawbacks of the PDS model have relatively straightforward solutions. In future posts, I will talk about some best practices for making this model work successfully.
Where a single product is under development, I recommend the PDS model as the most efficient and effective way of leveraging data for the business.
The PDS model is compliant with Grove’s Law,
All large organizations with a common business purpose end up in a hybrid organizational form.
It is also aligned with Hayek’s views on the use of knowledge in society, where he motivates the need for a hybrid approach to organization and decision making since neither end of the spectrum sufficiently meets the speed and correctness requirements of decision making in society.
We cannot expect that this problem will be solved by first communicating all this knowledge to a central board which, after integrating all knowledge, issues its orders. We must solve it by some form of decentralization. But this answers only part of our problem. We need decentralization because only thus can we insure that the knowledge of the particular circumstances of time and place will be promptly used. But the “man on the spot” cannot decide solely on the basis of his limited but intimate knowledge of the facts of his immediate surroundings. There still remains the problem of communicating to him such further information as he needs to fit his decisions into the whole pattern of changes of the larger economic system.
P.S. I had an easier time saying all of this in a Tweet,
Embedded for context, relevance, communication efficiency, and to be in sync; centralized for hiring and promotion purposes, for peer review, and for sharing and maintaining best practices. —@djpardis on Twitter
Functional versus Unit Organizations by Steven Sinofsky
Building Data Science Teams by DJ Patil
Where should you put your data scientists by Daniel Tunkelang
How to play well with others by Josh Wills