Models for integrating data science teams within organizations
A comparative analysis
Beginning in the first decade of the 21st century, internet companies were able to gain visibility into the business in ways never possible in the age of spreadsheets and relational database management systems. No longer did they need to wait for end-of-quarter results in order to gauge product performance; and no longer did they need to rely on extrapolations from samples to get a comprehensive view of what was working for all customers. In addition to improved visibility into the state of the business, the new data storage and aggregation capabilities enabled companies to build data products like search, ranking, and recommender systems.
What became important was to determine how this work could be achieved efficiently and effectively. Designing and building a data science team is a complex problem; so is determining the nature of interactions between data scientists and the rest of the organization.
In this post, I compare some of the popular models of integrating data science teams within organizations. In determining the best model, I take into account the following factors:
Coordination efficiency. Every team creates new sources of knowledge. Incorporating that knowledge into the business in a timely fashion requires robust organization design. Bad designs lead to failures and inefficiencies in knowledge sharing and coordination.
Management success. Different structures have differing management needs. These requirements could be met either by growing the size of the management team, or by diversifying the skills and responsibilities of the existing management team. Failing to identify management needs leads to headaches for managers and employees alike.
Employee happiness. No discussion of organizational structure is complete without considering employee happiness, motivation, and growth factors. This is not just about reducing the high cost of recruiting in response to employee churn, but also about providing employees with the space to do creative and quality work during their tenure. Designing structures without considering employee happiness is a costly failure.
Product success. Data scientists size the opportunity for new ideas, design experiments and metrics, and design and tune models. They ensure that data is used correctly within the organization. New products shipped without these considerations usually contain deficiencies in instrumentation and implementation, and are potentially not aligned with company strategy and pressing customer needs.
I make a number of assumptions. The organization is a single strategic business unit (SBU). The SBU is partitioned into independent levers or products. A product team is the cross-functional team (composed of engineers, designers, etc.) owning a single product. A data scientist (DS) is an engineer with skills in data processing, analysis, and model building; and data science is their work. A data science team is a team of data scientists and their managers.
The center-of-excellence model
Let’s start with the most centralized of all models. In the center-of-excellence (CoE) model, also known as the research model, the expectation is that the data science team would work independently to identify big bets and build prototypes. Under this model, the data science team is considered to be the company’s innovation arm.
There are some misconceptions that lead organizations to the CoE model for data science teams.
a. PhD graduates are hired to do research. Data science teams hire many PhD and Master’s graduates. The focus of most of these programs is research, and so there is a misconception that graduates of these programs are hired to do research. However, the true motivation for hiring PhD and Master’s graduates into DS roles is different. Data scientists are usually required to do engineering, statistical analysis, and model building. The extra years of study in mathematics and statistics are meant to provide value in producing quality analyses and in designing new methodology.
b. Innovation happens in a lab. In cases where organizations rightly expect innovation from the data science team, there is a misconception that the innovation arm of the organization needs to be freed and independent of the day-to-day requirements of the business. Since these teams do not consider the company’s existing business model and infrastructure, their output does not always translate into functioning products. So despite the well-known success stories, many companies refrain from such an investment even when it’s affordable. The question then arises, “who is in charge of innovation?” That will need to be the topic of a future post.
In the context of the unfavorable outcomes I mentioned earlier, there are important drawbacks to having the data science team operate within the CoE model:
a. Lack of context about the challenges of the business. Without visibility into the day-to-day decision-making challenges, purely centralized data science teams find it difficult to identify the most important problems to tackle. They focus on pie-in-the-sky ideas while the product potentially suffers.
b. Difficulty in closing the loop. In cases where the team is successful at identifying and solving an important problem, centralized research teams find it difficult to get the solution adopted by the product teams. Usually, adopting the proposed solution would disrupt a team’s existing roadmap, since the two teams are out of sync. Resolving this conflict usually requires escalation to and actions by higher management, leading to unwelcome disruptions in existing teams. In cases where higher management does not step in, data science teams become demotivated.
c. High costs associated with building a new team to back an initiative. Rather than disrupting existing roadmaps, an alternative path is to build a new product team. This team would have cross-functional membership to work on the proposals by the research team, making it a costly but valid endeavor. Valid, because ideas need to be backed by a complete team in order to be assessed correctly and quickly. It would be inadequate to measure the success of an idea if any part of the experience is lacking. A new feature requires design, engineering, instrumentation, experimentation, marketing, comms, and sales to realize its potential. The costs of building a brand new product team increase further if the new team does not form a partition together with the existing product teams. Forming a partition with respect to other product teams is important; otherwise, roadmaps and tasks would overlap. This puts the team’s longevity in jeopardy as they try to figure out their raison d’être, while confusing other teams.
d. Non-recurring and nondeterministic output. Under the research model, the product teams might be able to adopt and find value in a single output from the data science team, but wonder if there would be follow-through if they were to go ahead and make the proposed changes.
e. Unclear coverage of product areas. There are many allocation and prioritization challenges under this model. How does work get prioritized by the data science manager? Which product teams get the most attention? Which decisions are made with data in mind, which are made without, and who would be making these calls?
f. No clear sizing and allocation strategy. With a center of excellence model, it is difficult to determine the number of data scientists needed. Does the size of the team grow with the size of the organization? If yes, why? There is no clear strategy for determining the size of a pure research team. This is a drawback of all fully centralized models.
Benefits and success scenarios
It should be noted that the CoE model works for many types of teams. Centralization helps focus and agency. You should centralize that which you can clearly encapsulate from the rest of the organization. Centralization works when coupling is low and joint planning meetings are few and far between. As an example, consider the various infrastructure and tooling development teams. Once the company decides on a technology or programming language, tooling improvement efforts can happen regardless of the product launch timelines.
The accounting model
In the accounting model, also known as the BI model, the data science team would produce reports and presentations on a recurring basis (usually monthly and quarterly). The data science team would inform the organization of notable movements in top-level metrics. Once the team identifies an interesting trend, they would work with product teams to investigate the root cause. Thus, quite frequently, playing detective becomes a main part of the data science team’s work under the accounting model.
There are three main drawbacks to this model:
a. Difficulty in closing the loop. As mentioned above, it is nearly impossible to reason based on global trends. This drawback becomes especially pronounced when there are many product teams and hence many moving parts and levers. It is important to have analyses and metrics which are tied to levers (product teams) so that they can be acted on quickly, at lower cost, and with less reorganization.
b. Underutilizing technology. When monthly and quarterly reports are the only function of the data science team, product quality cannot be fully gauged until certain calendar milestones are reached. If product changes are leading to less usage in a certain market or for a certain group of users, the drop happens a launch at a time, not a quarter or even a week at a time. A product opens up to misuse and abuse a launch at a time. Data security is breached a launch at a time. Determining the launch that led to decreased usage in a particular geography or language after many features have shipped is an impossible task; so is determining the launch that led to abusive behavior due to incorrect incentives.
c. Low quality and stale data. Every new launch creates new sources of data that need to be incorporated back into existing metrics and should be considered in future analyses. Accounting data scientists miss all of the important updates, and usually rely on stale data for analyses. It is difficult to be involved in instrumentation from the sidelines. This is a drawback of all fully centralized models.
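To make drawback (b) concrete, here is a minimal sketch of what checking a metric "a launch at a time" could look like. It assumes that a usage metric is recorded per segment (for example, per geography) immediately before and after each launch. The launch names, segments, numbers, and the 5% threshold are all hypothetical, and a naive before/after comparison like this is only an illustration, not a substitute for a properly designed experiment.

```python
def flag_launches(launches, threshold=-0.05):
    """Return (launch, segment, relative_change) for every segment whose
    metric fell by at least |threshold| across a single launch."""
    flagged = []
    for launch in launches:
        for segment, before in launch["before"].items():
            after = launch["after"][segment]
            change = (after - before) / before  # relative change across the launch
            if change <= threshold:
                flagged.append((launch["name"], segment, round(change, 3)))
    return flagged


# Hypothetical launches with a usage metric per geography (illustrative numbers).
launches = [
    {"name": "new-onboarding",
     "before": {"US": 1000, "DE": 400},
     "after": {"US": 1010, "DE": 396}},
    {"name": "ranking-v2",
     "before": {"US": 1010, "DE": 396},
     "after": {"US": 1005, "DE": 330}},
]

flagged = flag_launches(launches)
print(flagged)
```

In this toy data, the aggregate metric barely moves after the second launch, yet one geography drops sharply. A quarterly report would show only the blended trend, long after the responsible launch has been buried under later ones.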
Reporting on quarterly trends of company metrics and their leading indicators is a valuable practice. The centralized aspect of the BI team allows for a holistic view of the SBU, which leads to recommendations and decisions that optimize globally and can balance and correct local decisions. This work is something that the data science team should be tackling as part of their charter, regardless of the model under practice.
The consultant model
In the consultant model, the data science team is assigned tickets or emailed with questions. Data science managers then prioritize the tickets and questions and assign them to data scientists.
In this model, the data science manager overrides any existing data science roadmaps to prioritize the questions and needs of stakeholders. Due to the symmetrical treatment of all members of the team, this model makes managing a data science team easy and cheap.
There are many drawbacks with this model:
a. Communications overhead. Data scientists in a consulting position usually lack the context to resolve questions effectively in a timely manner. There is communications overhead involved in gaining familiarity with data sources and their creation process. Further, if a follow-up to an analysis is needed and the original consultant data scientist has other ongoing commitments, the work will get assigned to another data scientist. This requires yet another onboarding investment—and thus the cycle continues.
b. Unclear deadlines. It is difficult for stakeholders to know when work would get prioritized and assigned to a consulting data scientist. The volume of incoming requests to the data science team can vary depending on product team needs. Even after work gets assigned and prioritized, it is difficult for the data scientist to be able to estimate the amount of time needed to answer questions due to their unfamiliarity with the limitations and nuances of the ever-changing data sources.
c. Short-term ownership. Innovation happens when people plan for years, not days and weeks. Having data scientists act as short-term consultants makes it difficult to incentivize focus on complex or tedious work. This work is needed to ensure quality data, quality experimentation tools, quality data manipulation and visualization tools, and quality results.
d. Lack of motivation and unfulfilling work. Data scientists working under this model usually lack motivation as they are rarely involved in the product decision making process. They also usually find the work unfulfilling as they rarely see the results and impact of their work.
e. Unclear ownership. When projects are one-off and ownership unclear, people step on each other’s toes. This happens inadvertently but is a non-negligible source of inefficiency. It should be noted that this is a drawback of all fully centralized models.
f. Recurring bugs and emergencies. Without maintaining good data practices, products that rely on data as input fall prey to recurring bugs and emergencies. In this model, data scientists are pulled into a project in order to play detective and identify the source of the bug. However, finding the culprit is nearly impossible in these situations. Apart from the unfamiliarity of the data scientist with the data creation process and the product’s evolution, there is also the possibility of missing data due to missing instrumentation. It is impossible to find a needle in a haystack when the needle is not instrumented. It is also painful to find a needle in a haystack that is extended to the fourth dimension (of time).
The embedded model
In this model, product teams hire their own data scientists. Each engineering manager is in charge of planning for data scientist headcount, hiring, and allocation. The data scientist within each product team has the engineering team members as their peers.
This model brings welcome independence to the teams and relieves the SBU of the management requirements of a centralized data science team. It solves problems with team sizing and communications. It also solves the ownership and motivation issues that exist in fully centralized models.
While there are gains in data science management costs, this model has important drawbacks:
a. Management complexity. Title and role diversity on the same team lead to management headaches. It is difficult for a single manager to maintain and assess multiple career ladders for different members of the team. Usually, the manager is inadvertently biased towards assessment against a single ladder—that of the standard engineering ladder. This incentivizes the data scientist to take on a role symmetrical to other engineers on the team, undermining the original point of hiring a data scientist. Hiring data scientists and putting together the right interview panel is also a challenge within this model.
b. Mentorship deficit and difficulty in maintaining uniform data standards and best practices. Data scientists benefit and learn from working closely with their peers, in particular during analysis reviews. An embedded model does not readily offer a path to a recurring and persistent relationship amongst data scientists. In addition, independent data scientists on each team design their own processes and standards. Weak standards are a drawback of all fully decentralized models.
c. Underutilizing technology and data science de-prioritization. Some teams might put off hiring a data scientist due to pressing deadlines and cost. This leads to shortcomings in data quality on these teams that become cumbersome to fix later. Humans like evidence-based decision making, even if it means using stale, incorrect, or anecdotal data. In addition, in the absence of good data, services are still deployed.
d. Local rather than global optimization. When there is no central ownership over metrics and key results, teams choose metrics and projects that lead to local optimization. Further, in this scenario, teams are incentivized to compete and ignore cannibalization effects. Local optimization is a drawback of all fully distributed models.
The democratic model
In this model, it is believed that easy and straightforward access to data by product managers, designers, engineering managers, and engineers would lessen or remove the need for a data science role. Many identify the main need for hiring data scientists to be the lack of proper infrastructure for fast and easy dashboard creation.
It is valuable to invest in data infrastructure and tooling that makes data access, processing, and visualization available to everyone. This investment is in particular valuable to data scientists as it frees up time for proactive opportunity sizing, experiment design, metric design, model design, and general improvements in methodology.
While ensuring everyone has easy access to important data points is a noble goal, there are some drawbacks to this model:
a. Difficulty in mastering everything and maintaining data best practices. Usually, people are mostly specialized and interested in a particular set of topics and tasks. Being skilled at a company’s engineering stack is already a big feat. It is fine to offload design work and product work and data work. Data scientists enforce good data practices within the organization.
b. Dashboards are not data science.
See “The laws of shitty dashboards” (Attack with Numbers).
The product data science model
Between the extremes of the fully centralized model (the CoE model) and the fully decentralized model (the embedded model), there exists a spectrum of hybrid models that take certain characteristics from each of the aforementioned models. Hybrid models succeed by taking advantage of the strengths of both extremes while actively making up for their deficiencies.
The product data science (PDS) model is inspired, only in part, by the matrix structure. In this model, individuals are simultaneously part of the data science function and the product. Data scientists, each a member of a product team, report into a central data science team.
a. Clear ownership. One important benefit is clear ownership of projects by the data scientists, due to their membership on the various product teams.
b. Standardized data science processes. Data scientists work together to establish best practices and onboarding flows within the team. They review one another’s work. They benefit from a unified career ladder, with managers who can assess their impact and can plan for their growth. Membership on the product teams gives data scientists a thorough understanding of that product, its limits, and its potential. This in turn allows a straightforward mapping of analysis to proposals for action. It is difficult to move fast if newly available insight does not map into reasonable and informed actions.
c. Quality data. Data scientists working closely with product teams also improves data quality. Every single launch changes the data, and so it is important to oversee its evolution with careful instrumentation.
d. Global optimization. Peers from various teams coming together has other benefits. Due to their collective bird’s-eye view of the business, they are able to connect the dots, identify inconsistencies, and optimize globally. This is very similar to the way design teams operate.
e. Planning and allocation clarity. Another benefit of the PDS model is that it simplifies the task of determining the size of the data science team. Once you figure out how to partition the business into products, and you figure out the number of engineers per product, the allocation of data scientists can easily be determined as a proportion. I explain more here.
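As a rough illustration of the sizing arithmetic in (e), the sketch below allocates data scientists to each product as a fixed proportion of that product's engineers. The product names, headcounts, the ratio of one data scientist per eight engineers, rounding up, and the minimum of one data scientist per product are all assumptions made for the example; the post does not prescribe specific values.

```python
import math


def allocate_data_scientists(engineers_per_product, ds_per_engineer):
    """Return a data scientist headcount per product, computed as a fixed
    proportion of engineers, rounded up, with at least one per product."""
    return {
        product: max(1, math.ceil(engineers * ds_per_engineer))
        for product, engineers in engineers_per_product.items()
    }


# Hypothetical partition of the business into products, with engineer counts.
engineers = {"search": 24, "recommendations": 10, "onboarding": 4}

# Assumed ratio: one data scientist per eight engineers.
allocation = allocate_data_scientists(engineers, ds_per_engineer=0.125)
total = sum(allocation.values())
print(allocation, total)
```

Because the data science headcount follows from the product partition and the engineering headcount, the size of the central team becomes a consequence of decisions the organization has already made rather than a separate negotiation.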
No model is perfect, and each has its drawbacks. To quote Sinofsky,
Since there is no optimal or perfect organizational structure […] then the most important thing is to know the weaknesses of your structure and to compensate for them.
Below are some drawbacks of the PDS model.
a. Cost. One of the main arguments against the PDS model is that having an embedded data scientist on each team with centralized management is expensive. This assessment does not take into account the cost savings stemming from the increase in data and product quality, and the more effective use of data in assessing successful changes. Nevertheless, organizations should do what they can afford. In the beginning everyone is responsible for engineering and design and data needs. As the SBU grows, one can have separate functions handling each task.
b. Recurring conflicts due to lack of power parity. For success, all functional leads should have similar amounts of influence and authority. Without power parity, the benefits of cross-functional collaboration are eroded by recurring conflicts that go unresolved.
c. Information overload and the data science manager. Gaining the right amount of knowledge about all products supported by the data science team is not straightforward. However, managers need to be informed and curious about the areas under their purview to be able to effectively assess contributions, investments, timelines, and tradeoffs. They also need to be able to continually communicate the contract between the data science and stakeholder teams and be mindful of the team’s portfolio.
Having said that, the drawbacks of the PDS model have relatively straightforward solutions. In future posts, I will talk about some best practices for making this hybrid model work successfully.
Where more than a single product is under development, I recommend the PDS model as the best in efficiency and effectiveness in leveraging data for the business.
The PDS model is compliant with Grove’s Law,
All large organizations with a common business purpose end up in a hybrid organizational form.
It is also aligned with Hayek’s views on the use of knowledge in society.
P.S. I had an easier time saying all of this in a Tweet.
Functional versus Unit Organizations by Steven Sinofsky
Building Data Science Teams by DJ Patil
Where should you put your data scientists by Daniel Tunkelang
How to play well with others by Josh Wills