Managing a data science project

Lisa Cohen
Data Science at Microsoft
12 min read · Aug 31, 2021

By Darryl Buswell, Lisa Cohen, Frank Lan, and Tim Lau

The ultimate goal for every data science project or initiative we take on is to maximize our business impact. Over the years, we’ve learned from pitfalls and developed practices to set projects up for success. In this article, we enumerate the stages of data science delivery and considerations along the way.

How do projects arise?

Occasionally, people ask how projects in our centralized data science organization arise. It’s a great question. In fact, one of our recent book club books, “The Phoenix Project,” encourages engineering teams to reflect on the types of work that flow through a company (and then use those insights to prioritize better and reduce inefficiencies). Here are some examples of how our projects arise:

Planning

We start with a coordinated planning process. At one point we did this quarterly, but we moved to half-yearly in order to align with the rest of Microsoft engineering (which uses a six-month “semester” planning process). This is a regular opportunity to reflect on our business goals, recent performance, competitive landscape, industry trends, and new opportunities. Based on these inputs, we establish a strategic focus for the period ahead, and capture these priorities in a vision document. We also author objectives and key results (OKRs) to establish success metrics in a clear and quantifiable way.

Prioritization

A common challenge in planning is prioritization. It’s why stepping back to look at the big picture during planning periods is so important, to ensure we’re investing in opportunities with the most impact. The first step during planning is to identify the “big rocks” we want to tackle. These become the top-down themes for the period. Using these guideposts, the division can align supporting work (and OKRs) to increase cross-efficiencies and produce a stronger collective “press release” story to tell at the end. This helps us avoid the trap of “spreading peanut butter” across a lot of activities (or focusing on the urgent but not important), which results in less impact.

Sources

The data science organization and business teams both raise ideas and proposals for our roadmap plans:

  1. Business-led proposals: In one set of cases, the business teams whom we partner with (such as product management, engineering, marketing, finance, field, and so on) come to us with a proposal. For example, our product engineering team asked how we can leverage data science techniques on customer feedback at scale, in order to prioritize their backlog. In other words, which feature work should they pursue to help the most customers and solve the biggest problems? (In this case, we were able to apply natural language processing techniques and use topic modeling to extract and prioritize root causes from these textual data sources; a minimal sketch of this kind of approach appears after this list.)
  2. Data science–led innovation: We also have projects that initiate from our data science organization itself. For example, in brainstorming how to best achieve the planning priority and OKR we had established for customer support CSAT, the data science team came up with an idea and model to predict CSAT for open cases, which helps support engineers better manage their backlog in order to maximize customer satisfaction. A key aspect of a mature data science organization is raising new ideas and engaging as a strategic business partner. This establishes a two-way relationship, with ideas originating from both directions. The data science organization is in a unique position because it has insight into data across the business. Given this, business leaders want to hear the data science organization’s perspective and recommendations on prospective strategy decisions. Finally, in some cases, it’s hard to pinpoint which “side” an idea came from, as we also work as “one team” with our business partners, and host joint brainstorming sessions together. For example, many of our experiments and updates to the Azure Free Account (adding new services, extending trial duration) have come about from this kind of collaboration.
  3. Data science–led systemic solutions: When reviewing our projects, we also notice patterns and opportunities. For example, a common question we receive is “What is the impact of x?” where x may be a marketing campaign, a new service launch, a website change, and so on. In one case, we developed a multi-attribution model to solve this problem once and for all, at scale. Measurement frameworks, experimentation platforms, and data platform investments are all great examples of data science infrastructure that requires investment up front but accelerates our learnings and increases the accuracy of our results more broadly. (We typically balance these platform investments and “quick wins” or stakeholder-facing “features” so that we can maintain a stream of value while investing in our infrastructure and reducing debt.)
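To make the first example above concrete, here is a minimal sketch of the kind of topic-modeling approach we describe, using scikit-learn’s NMF on a handful of hypothetical feedback strings. The feedback text, topic count, and model choice are illustrative assumptions, not our production pipeline.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical customer feedback; a real pipeline would ingest thousands
# of items and do far more preprocessing.
feedback = [
    "The portal times out when I deploy a VM",
    "Billing page is confusing and slow to load",
    "Deployment fails with a quota error in West US",
    "I can't find where to update my payment method",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(feedback)

# Factor the document-term matrix into two topics.
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(X)

# Print the top terms per topic as candidate "root causes".
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(nmf.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top_terms)}")
```

Topics extracted this way can then be reviewed with the product team and ranked by how many feedback items they cover, which is what makes them useful for backlog prioritization.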

What are the deliverables?

The data science deliverables we produce typically fall into three categories:

  • Analysis: A study using data to describe how a product or program is working. Examples include customer journey research, diagnosis to pinpoint a change in trend, exploratory data analysis, or a summary of topline business statistics.
  • Experiment: A scientific study to test a hypothesis. We use randomized controlled trials to evaluate causal drivers across our products and programs. When a randomized trial isn’t feasible, causal inference is another approach, in which we construct a synthetic control to estimate what would have happened without the intervention.
  • Model: A machine learning or statistical model that is trained on data to produce outputs without being explicitly programmed. An example is a churn prediction model, which leverages historical data to alert us about at-risk customers (a minimal sketch follows this list).
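
As an illustration of the “model” deliverable, here is a minimal sketch of a churn prediction model trained on synthetic data with scikit-learn. The features, label logic, and model choice are all assumptions for the sake of example; a production model would involve real historical data, feature engineering, and careful evaluation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features: monthly usage, support tickets, tenure (months).
X = np.column_stack([
    rng.gamma(2.0, 50.0, n),   # monthly usage
    rng.poisson(1.0, n),       # support tickets
    rng.integers(1, 48, n),    # tenure in months
])

# Synthetic label: churn is more likely with low usage and short tenure.
p_churn = 1 / (1 + np.exp(0.02 * X[:, 0] + 0.05 * X[:, 2] - 3))
y = rng.random(n) < p_churn

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Score the holdout set and surface the highest-risk customers.
risk = model.predict_proba(X_test)[:, 1]
print("Holdout AUC:", round(roc_auc_score(y_test, risk), 3))
print("Top at-risk row indices:", np.argsort(risk)[::-1][:5])
```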

There are also a few additional dimensions within these categories of deliverables, such as in-production versus non-production, internal versus external facing, and real-time versus periodic refresh. While each of these types of deliverables entails specialized approaches, in this article, we focus on themes common to all of them.

Who are the “users”?

Our data science deliverables serve a variety of users, both internal to the company and external. We refer to our internal users as “stakeholders.” Many of these internal stakeholders are business teams whom we partner with to achieve shared goals.

What are the roles?

In a prior article, we outlined the roles in a data science organization, including PM, data scientist, ML scientist, and data engineer.

Who manages the data science project depends on the roles that exist in an organization. In many cases, the data scientist and data science manager carry this responsibility. However, if you have an opportunity to create a data science product manager role, that individual can play a key part in this process. Much like a traditional product manager in tech, the data science product manager is the subject matter expert on the user and on the business domain. For a data science product manager, the product refers to any of the three deliverable types above (analysis, experiment, and model). Many concepts from software product management apply to data science product management as well.

Lifecycle of a data science project

There are many frameworks available to describe the lifecycle of a data science project, including the Team Data Science Process from our Microsoft documentation. In this section, we use a simplified lifecycle built around three key phases: 1) Designing a concept to suit end user needs, 2) Developing a suitable data-driven model that can produce valuable insights, and 3) Deploying a valuable solution that enables end users to gain access to those insights. The process is iterative, and the team may need to move back and forth among phases. Furthermore, for the purposes of this article (about managing data science projects), we focus on the product management considerations and best practices at each stage to ensure successful delivery.

Data science lifecycle

Scope the problem

The first step (and perhaps among the most important) is to form a clear view of the problem as well as the goals for the project. A well-executed design phase helps shape vision and direction, limits the number of iterations and additional cycles for the team to go through during subsequent phases, and helps ensure that what the team ultimately creates is valuable and makes an impact. To do this in a consistent and scalable way, we leverage a “project intake” process and form, which includes the following questions:

What is the problem to be solved? What is your hypothesis regarding how we can solve it?

These discovery questions help create shared context and a common understanding of the problem space and domain. At times a partner may come to us with a very specific request to see data in a particular way. When this happens, we always ask the team to “step back” and explain the business problem they’re trying to solve. This gives us an opportunity to leverage our data science toolbelt and suggest how we can best apply the power of data science to the problem. In these cases, we deliver what our partners “want,” even if it’s not what they were able to articulate initially.

Who will be the end users of the data science solution?

It’s important that this end user not be theoretical — in other words, merely thinking that “team X will probably be interested in this output.” Instead, the data science team must identify a specific individual, team, or population who will ultimately consume or use the output — and then work collaboratively with them throughout the data science lifecycle. Identifying an end user and understanding their needs significantly increases the likelihood that the data science output is not only valuable but also adopted. If appropriate, we also illustrate a workflow regarding how the solution will be used and by whom, to further create clarity and avoid confusion among members of the project team.

What is the action you will take as a result of this data science initiative?

This question helps remove the “interesting” — but non-actionable — project proposals from the queue. If we’re not going to do anything differently as a result, why invest time in the project? On the other hand, for projects that do drive action, and that we do take on, this question also helps prepare everyone involved to plan and commit to the necessary actions that need to take place in addition to the data science work so that the ultimate customer and business outcomes are realized.

What business impact do you expect to see (for example, adoption, revenue, retention)?

This question helps us prioritize items in the backlog so that we can optimize our time in the places where it will have the biggest impact. There will always be more questions and ideas than the data science team can take on, so the team is empowered to apply their judgment and choose the highest impact projects to work on.

In our planning, we consider impact along with risk and cost (as per the impact/effort matrix) in order to optimize our efforts.
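
As a toy illustration of that trade-off, the sketch below ranks hypothetical backlog items by an impact-to-effort ratio; the items and 1–5 scores are invented for the example, and real scoring would come from the planning process.

```python
# Hypothetical backlog items with 1-5 impact and effort scores.
backlog = [
    {"name": "CSAT prediction model", "impact": 5, "effort": 3},
    {"name": "Campaign attribution platform", "impact": 4, "effort": 4},
    {"name": "Dashboard refresh fix", "impact": 2, "effort": 1},
]

# Rank by impact-to-effort ratio so high-impact, manageable-effort
# work floats to the top.
for item in sorted(backlog, key=lambda x: x["impact"] / x["effort"], reverse=True):
    print(f'{item["name"]}: {item["impact"] / item["effort"]:.2f}')
```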

This question also helps define the success metrics for the initiative. Once we have a clear end goal to work toward, we can get creative and brainstorm how to best accomplish that end state.

Which strategic priority or divisional OKR does this support?

This keeps our work aligned and accruing toward a set of core priorities for the period.

Although this intake process may feel like “extra work” to kick off a project, we’ve actually found that it leads to the most efficient project delivery and highest quality end product. It’s when we assume the answers or don’t take the time to clarify these points that we end up delivering outputs that don’t meet our ultimate needs, causing us to rework solutions.
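
One way to keep intake answers consistent and easy to review is to capture the form as a structured record. Here is a minimal sketch using a hypothetical Python dataclass whose fields mirror the questions above, filled in with the support CSAT example from earlier in this article.

```python
from dataclasses import dataclass

@dataclass
class ProjectIntake:
    problem: str            # What is the problem to be solved?
    hypothesis: str         # How do we think we can solve it?
    end_users: list[str]    # Specific people or teams, never theoretical
    action: str             # What will be done differently as a result?
    expected_impact: str    # For example: adoption, revenue, retention
    okr_alignment: str      # Which strategic priority or OKR it supports

intake = ProjectIntake(
    problem="Support engineers can't tell which open cases are at risk",
    hypothesis="Predict CSAT for open cases from case history",
    end_users=["Customer support engineering team"],
    action="Re-prioritize the support backlog by predicted CSAT",
    expected_impact="Higher customer satisfaction",
    okr_alignment="Customer support CSAT OKR",
)
print(intake)
```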

These inputs go into our planning process, in which we greenlight projects based on business impact and how they accrue to the themes of our broader business priorities. Each team uses consistent templates and calls out “hard cuts” to make the review efficient. Once a project is approved, we further flesh out the plans in a business requirements document to keep the project team aligned on goals and expectations. We also enter these projects in work item tracking (Azure DevOps, in our case), which creates a system of record for our plans. Each half-year we communicate our plans out at scale. Then, throughout the period we communicate broad updates so that everyone knows what’s committed and what’s coming when.

Develop the solution

There are many steps that the data science team undergoes while developing the solution, including designing the approach, gathering the data, exploring and cleaning the data, testing solutions, and more. This is an important phase of the project lifecycle in which to focus on de-risking. One key tool for this is developing minimum viable products (or minimum viable prototypes). A benefit of this approach is that it can create some quick wins along the way, but the larger goal is that it generates early feedback and helps the team develop a better ultimate product, more quickly. (This also aligns with the two-week rule of regularly talking with the users of our product.)

During this stage in the process, communication is key to help the project team understand where things stand, what blockers have come up (so that others can help), and to align timing, handoffs, and more. Periodic steering meetings with stakeholder leaders are another opportunity for the team to stay in sync on priorities and expectations (building on the OKRs).

Deploy, measure, and socialize

Solution deployment involves packaging the data science model so that it can be consumed by the end user. As part of this process, the team works closely with the end user to ensure the solution meets their needs.
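
As one example of what that packaging can look like, here is a minimal sketch of serving a trained model behind a scoring API using FastAPI. The endpoint, field names, and artifact path are assumptions for illustration, not a prescribed architecture.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("churn_model.joblib")  # hypothetical model artifact

class Features(BaseModel):
    monthly_usage: float
    support_tickets: int
    tenure_months: int

@app.post("/predict")
def predict(features: Features):
    row = [[features.monthly_usage, features.support_tickets,
            features.tenure_months]]
    # Return the model's churn probability for this customer.
    return {"churn_risk": float(model.predict_proba(row)[0, 1])}
```

A service like this would typically run under an ASGI server such as uvicorn, behind the organization’s standard deployment and monitoring infrastructure.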

Key questions to ask during the deployment phase of the data science workflow include:

  • How should we integrate the output into existing processes and tools?
  • How can we provide explainability and the reasons for our recommendations, rather than leaving the model to function as a black box? (See the sketch after this list.)
  • What support and training do end users need?
  • How do we monitor the health of the output to make sure it operates as expected?
  • What service level agreements are required to ensure business continuity?
  • What data contracts are required to ensure this level of support?
  • Was the project successful, as defined by our measures of success (both in terms of technical performance and business performance)?
  • Is the end user satisfied with the solution? What improvements would we want to make?
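
On the explainability question above, one lightweight option is permutation importance, which measures how much a model’s performance degrades when each feature is shuffled. Here is a minimal, self-contained sketch using scikit-learn on synthetic data; the feature names and data are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # hypothetical feature matrix
y = (X[:, 0] + rng.normal(size=200)) > 0   # label driven mostly by feature 0
model = LogisticRegression().fit(X, y)

# Score each feature by the drop in accuracy when it is permuted.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
names = ["monthly_usage", "support_tickets", "tenure_months"]
for name, score in sorted(zip(names, result.importances_mean),
                          key=lambda t: t[1], reverse=True):
    print(f"{name}: {score:.3f}")
```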

Solution monitoring, model health checks, and retraining

For any live service, we need a sustainable monitoring system to check the ongoing health and performance of automated, deployed workflows. The system can be designed to alert on identified issues, such as degrading model performance, or even to handle them automatically (for example, by triggering retraining).
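
Here is a minimal sketch of such a health check: it compares an evaluation metric (AUC, in this case) for recently scored cases against a threshold and raises an alert when performance degrades. The metric, threshold, and alert channel are all assumptions for illustration.

```python
from sklearn.metrics import roc_auc_score

AUC_THRESHOLD = 0.70  # hypothetical minimum acceptable performance

def send_alert(message: str) -> None:
    # Placeholder: in practice this might page the on-call data scientist
    # or open a work item in the team's tracking system.
    print("ALERT:", message)

def check_model_health(y_true, y_scores) -> bool:
    """Return True if the model looks healthy; otherwise raise an alert."""
    auc = roc_auc_score(y_true, y_scores)
    if auc < AUC_THRESHOLD:
        send_alert(f"Model AUC dropped to {auc:.3f}; consider retraining.")
        return False
    return True

# Hypothetical batch of labeled outcomes and model scores.
check_model_health([0, 1, 0, 1], [0.8, 0.3, 0.7, 0.4])
```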

Feedback and evaluation

The end user remains a customer after any data science output is deployed. We set up processes to receive ongoing feedback from the end user, especially for any products that live on as consumable insights in production. We use this feedback to measure the success and impact of the project against our technical and business performance goals. Finally, the cycle continues as feedback comes in, and ideas arise for new enhancements and opportunities to pursue next.

Updates and changes

Some large initiatives may end up spanning multiple planning periods, leading to projects that are essentially updates or changes to the original scope. In these cases, it’s best to apply all the same approaches above to scope and prioritize the updates to avoid scope creep.

Conclusion

There are many ingredients to set up a data science project for success. We hope the data science lifecycle we’ve described can be a useful way to organize the many considerations and techniques involved. At each stage (designing a concept to suit end user needs, developing a suitable data-driven model that can produce valuable insights, and deploying a valuable solution that enables end users to gain access to those insights), communication is a key tool to establish requirements and work through issues that arise. With these approaches, we hope you and your data science teams can maximize both the delivery and discovery aspects of your data science work to deliver results with business impact.
