Designing a data science organization

Published in Data Science at Microsoft, Jul 21, 2020

Data Science continues to be a growing and evolving field. Given this, there are multiple approaches in the industry for how to structure Data Science roles and organizations. In this post, I’ll share our current approach and considerations.

To centralize or decentralize

A key decision companies face is whether to decentralize (or federate) their data scientists among product and business teams, centralize them into a common reporting structure, or maintain a hybrid of the two.

Any of these approaches can work, and each has pros and cons. In Microsoft’s Customer Growth Analytics (CGA) organization, we’ve adopted the centralized model and created a unified data science organization that reports into the VP of Engineering for our division, which is part of the organization responsible for Microsoft Azure. (This creates close alignment between data science and engineering.) There are also different levels of centralization that occur. In our organization, we’ve brought together the following roles into a central data science organization: data scientist, machine learning scientist, data engineer, and program manager.

Given the size of Microsoft, there are multiple “centralized” data science organizations within different product teams (such as Azure, Office, and Windows), so you could argue that Microsoft takes a hybrid approach at the company level. In fact, with over 150,000 employees, there are instances of every model. For example, you can find federated data scientists where a handful of them, focused on a specific domain, are embedded in an associated team.

In some companies (such as small- to medium-sized startups focused around a core set of products in a specific business area), the centralized data science organization actually reports to a VP of Data Science, who might report to the CEO. This provides data science with a prime “seat at the table” from which to promote a data-driven culture for the company. It also helps the team maintain an objective point of view, by not being aligned with any particular part of the business.

Centralization advantages

Here are some of the benefits we’ve observed from bringing together data scientists from across the division into a centralized organization, as our team and the data science discipline have grown (essentially quadrupling over the past five years):

Career paths, peers, and mentorship: I often hear from data scientists coming from the embedded/distributed model and interviewing with our group that they’re looking for a team where they can be with peers from the same discipline. In the embedded model, you might be the only data scientist within a team of software engineers. This can make it hard for you and your leadership to be familiar with the career paths, skills, training, and performance criteria for data scientists. You may also get pulled in to help with tasks that are more aligned with the other disciplines in the team. In contrast, one advantage of a centralized data science organization is that it provides more opportunities for teamwork and collaboration with other data scientists. This enables peer feedback, mentorship, onboarding assistance, and opportunities to learn from each other’s approaches. It’s also easier to learn about and see career paths in the discipline, with more examples nearby. With more data science problems at hand for the team to tackle, there is a broader set of technical problems to choose from, based on your goals and interests. Finally, this model provides opportunities to rotate among subject area domains from within the team.

Consistency: Another benefit for the business is that data scientists from the central organization establish common best practices and leverage consistent approaches. This includes adhering to standards for how the team does its work. Taken together, standards provide simplicity for the leadership team, which is able to see comparable metrics across different parts of the business and be confident they’re developed in the same way.

Leadership and influence: Having data scientists report to a central leader positions data science as a key function in decision making at the leadership level and helps push company strategy to be more data-driven. While this is not required (it’s certainly possible to have leadership and influence outside of any specific organizational model), it doesn’t hurt when data science has a seat at the table in key discussions about company decisions and direction.

Specialization: Bringing together a broader set of data scientists also allows for specialization within team roles, such as data scientist, machine learning engineer, data engineer, data visualization developer, and others. The benefits of this approach are that individuals can go deeper into their craft, improve quality through greater experience, pursue specific training, and learn from dedicated communities.

Resources and infrastructure: Centralization yields economies of scale. For example, a common data platform reduces storage and maintenance costs. It also simplifies the process of managing compliance, security, and reliability by delegating these responsibilities to fewer owners in the group.

Scaled processes: Data science processes such as adapting to data platform migrations, incorporating new technologies, and adopting experimentation platforms become more efficient because they’re done for multiple team members at once. Data scientists also benefit from leveraging each other’s work on related analyses and developing common libraries and tools to share. Finally, resource allocation is more efficient because it’s easier to align the centralized data scientists with top priorities from the latest planning cycle, rather than re-allocating embedded data scientists across teams.

End-to-end perspective: One of the most powerful aspects of this model is that it brings together datasets from all parts of the business to enable a broader understanding of the overall customer experience. The intersection of these datasets enables team members to make more connections and gain greater context.

Decentralization advantages

On the other hand, the decentralized approach has benefits, too:

Range: While we discussed the advantages of specialization above, there are also advantages to having a range of roles and wearing multiple hats, in the spirit of a polymath. You can learn a broad set of skills from having a diverse set of experiences and responsibilities. I’ve also heard this cited as one of the appeals of being a solo data scientist at a startup, for example.

Business context: Sitting in an embedded group makes it easier to maintain the business context of that group through osmosis, including participating in regular team functions and experiencing the general running of the business. This can provide helpful context to understand datasets, ensure sound analyses, and make recommendations that can be implemented. It can also offer just the spark you need to come up with the next great idea.

Stakeholder connection: Sitting in the same organization as your stakeholders makes it much simpler to maintain great connections, and it requires fewer deliberate practices to foster these ties than the centralized model does.

Driving change: One advantage of stakeholder and business proximity is that it can expedite change resulting from data science analyses. Less distance simplifies the cross-organizational planning, coordination, and influence required to drive actions based on data insights. If you’re working on the machine learning behind a customer-facing feature, being part of that product group can also help you stay connected with other features. This can be helpful for designing the end-to-end customer experience and aligning on product design principles and practices.

Stakeholder engagement

It may be natural for someone in one of these organizational structures to imagine the appeals of the other. Overall, however, both approaches have their own benefits. The key is being aware of the benefits and drawbacks of your current situation and developing systems to address the latter. For example, a key need in the centralized model is strong stakeholder engagement.

We’ve found a few practices particularly helpful:

1. Planning processes: There’s no shortage of questions data can answer, and each seems of utmost importance when considered in isolation. Stepping back periodically to prioritize the backlog, and planning with stakeholders how to align for the biggest impact, is a key practice for shared success. Although the ideal cadence for this broader planning depends on your business, it will typically be by sprint (in an agile process), by month, by quarter, or by semester.

2. Steering meetings: Schedule a recurring checkpoint to stay connected. While you may have more frequent project-level touch points, it’s important to have broader strategy discussions too. Here are some agenda items to include in these meetings:

Invite an update from the business stakeholder on what’s top of mind, the business direction, strategy changes or learnings, and key focus areas for the coming period. This conversation typically leads to ideas for collaboration opportunities, but if you don’t create space for it, it’s easy for the stakeholder to forget to mention these topics.
Present progress on recent data science deliverables to get feedback and ensure the results are being used. Feedback from the users of our data science models is key to continuously retraining the models and improving their performance. Qualitative feedback in these forums is valuable as well.
Share the current data science backlog to stay in sync on priorities and timelines. Track dependencies and risks so you and your stakeholder can drive them forward together.
Mention related work happening in other parts of the data science team that this stakeholder may want to leverage or align with.
Provide an opportunity for retrospection to aid continuous improvement. Discuss what’s working well, what isn’t working well, and ideas for how to continue growing the partnership going forward. (This can also occur in a more private meeting setting.)

3. Project intake: When you start a project (either at the beginning of a planning milestone, or because of a new event that has come up since), it’s important to clarify the goals, objectives, and use cases upfront. Here are some key questions we like to ask to align with the person making the request:

What business question are you trying to answer?
What action or decision will you take with this data?
What is the business impact you expect to see?

4. Communication: Maintaining strong communication is essential. Leverage the channels available at your company to enable group chats or emails, track work items, set clear expectations on cross-team deliverables, and publish the results of your work.

You can read more tips about stakeholder engagement and business context here.

Conclusion

We hope this post provides perspective on the trade-offs between centralized and embedded organizational structures, as well as practices to consider in each.

We welcome your input on this evolving topic. How has your company decided to structure the data science organization? What are the pros and cons that you see?

Lisa Cohen is on LinkedIn.