ML program management at scale (Part 1 of 2)

Peter Saddow
Data Science at Microsoft
11 min read · Aug 3, 2021

Introduction

As data science and Machine Learning (ML) evolve, the role that program managers play on ML teams is likewise evolving, perhaps to the point of becoming its own new discipline. In this two-part article series, I describe the unique responsibilities of a program manager (PM) on an ML team, the value that a PM adds, and how to be successful in this new role.

In this first article, I go deep into the PM role and its responsibilities. In the second article, I discuss the tools and processes that a PM can use and initiate, respectively, to enable a high-quality and scalable ML team. In the second article I also cover the basic data science concepts that any ML PM needs to understand and offer references for further growing knowledge in this expanding field.

Regardless of whether you’re new to program management or you’re a seasoned PM focused on more traditional software projects, I’ve written this article to help you understand the role that PMs play in helping ML teams deliver models to internal teams that either improve business conditions or solve business problems. The key differences between delivering models to internal teams and delivering them to external customers pertain to the level of customer impact, how feedback is obtained, and the data involved.

This article does not go into details around how planning for the delivery of an ML model fits into the larger business plans of an organization. I will, however, discuss planning considerations within the scope of a single model.

What are the roles and responsibilities of an ML program manager?

A PM leading and supporting an ML team has these general responsibilities:

  1. Stakeholder engagement: Understanding the stakeholder’s business and the needs that an ML model can address. It’s important to set realistic expectations of what an ML model can and cannot do and when an ML solution can be provided.
  2. Technical functional requirements: Creating a functional spec that details the technical requirements of the ML model, including data needs, model development, UX design, A/B testing or model validation, and finally the model outputs as well as real-time or batch scoring. Technical requirements must document the success criteria for measuring the impact of the model to ensure that established goals are being met, as well as how model feedback is incorporated back into the model to improve its accuracy.
  3. Project planning: Identifying and forming a project team with specific roles and responsibilities, which in turn sets expectations with all interested individuals and teams on the delivery of the model. Meeting regularly to identify risks and blockers and then mitigating them through the development of the model is key to success.
  4. ML operations/Livesite support: Establishing the process, governance, and Livesite support necessary to effectively scale and support many models in a production environment.

Depending on the size of the team, a program manager might focus on a single one of these responsibilities or on all of them. Regardless, a program manager is responsible for ensuring they’re all covered, identifying any gaps, and working with leadership and stakeholders to find solutions.

Figure: Roles and responsibilities of a Machine Learning program manager.

Additionally, to assist other disciplines and be effective, a program manager must understand the purpose of each of those disciplines. This understanding helps distinguish a great ML program manager from a good one.

What are the roles on an ML team?

As with any software engineering project, several roles and related sets of responsibilities are needed on an ML team to ensure that the project is successful. Here is an overview. As mentioned, depending on the size of the team, these could be fulfilled by multiple people or by a single individual.

  • Data scientists have skill sets encompassing analysis, creativity, and communication, with a special focus on strong math skills in multivariate calculus or linear algebra. Data scientists understand and draw on a range of analytical techniques in their work such as Machine Learning, Deep Learning, and text analytics.
  • Program managers are responsible for understanding the business domain, performing the initial feasibility analysis, and then defining, facilitating, and leading programs while unblocking teams and otherwise helping out as needed with end-to-end ownership of the project.
  • Data engineers are responsible for architecting, building, and supporting scalable platforms for large and complex data sets. Their job is to provide quality data that meets the parameters of an agreed-upon data contract (or service-level agreement, sometimes called an SLA). Data engineers might need to work across the organization to get access to data sources from other teams.
  • DevOps engineers are responsible for providing the platform for running and monitoring ML models. They work with the data scientists on deploying or “productionalizing” the ML model while monitoring the overall execution of ML models.
  • Software engineers are responsible for developing software solutions. In the ML space, that might include building tools or user interfaces for gathering data, labeling data, or allowing users to interact with model output.
  • Analysts are responsible for gathering insights from the output of the ML model by analyzing and interpreting results using statistical tools and techniques. ML PMs often get involved in the analysis, since they need to understand the ML performance and insights they are delivering to stakeholders and users.
  • Leaders support the team by providing resources and helping to prioritize and plan projects, while also unblocking teams as needed.
  • Stakeholders and users are the ones with the business need calling for the ML solution, or they are the ones interacting with the ML solution.

A program manager’s role can grow and stretch into that of a data scientist, analyst, or software engineer in a small organization, or when help is needed in other areas in a larger organization.

How is a PM involved during each phase of the ML project?

As I discussed in my previous article, “Machine Learning model governance at scale,” the stages of an ML project include the following:

  1. Conception: This is when it’s determined that an ML model is needed. The need may be surfaced by the model owner, leadership, program manager, or a stakeholder. During this initial phase, all parties might not yet be fully committed, and some necessary parties might not yet even be involved.
  2. Prototype/model evaluation: During this stage, the data scientist decides on the algorithm to be used and then validates the initial idea and the assumptions behind the model. This phase helps define the scope and expectations of the project and the expected impact of the model to formalize the plan and move it forward. It’s important during this stage to secure the support of all team members who might be involved. If, however, during this stage the prototype does not validate initial assumptions, it’s important to seriously consider whether to continue.
  3. Production ready: This stage is dedicated to getting the model ready to run in a production environment by meeting or exceeding compliance requirements and any privacy criteria that apply. Model code and all dependent components should be fully automated to ensure the model can run with no manual intervention. The plan on how the model will be supported in production should be documented, including who will be responsible for addressing issues as they arise, the expected timeframe for addressing them, and who else should be contacted or notified, including the responsible owner.
  4. Deployment: In this phase the model is moved to the production environment and set to execute according to a defined schedule and SLA with the stakeholders. Data sources are also onboarded and refreshed according to a set schedule. The engineering team or another team might be responsible for the actual deployment and requires support from the model owner to resolve any deployment issues. It is important to establish a data contract with upstream data providers and stakeholders to ensure expectations are documented.
  5. Production and monitoring: In this stage, the model is running according to the defined schedule and is being monitored. Any issues encountered during execution are addressed according to the supportability plan already established. The root cause of any failures should be investigated and understood to ensure they don’t repeat. In this stage it is important to continuously improve the infrastructure and supportability of the model to improve its quality going forward. An ML PM can help to monitor the production environment and monitor model performance to ensure the stakeholder is provided the results they expect.
  6. Deprecation: This stage applies when a decision is made to no longer support the model because the cost is too high, a newer model exists, or adoption isn’t meeting expectations. Prior to stopping a model running in production, it is important to notify stakeholders and other users and provide a path forward for them. An ML PM needs to ensure the stakeholder is prepared and notified of the deprecation.

Now let’s take a closer look at the PM role in each stage.

Initial project conception

This stage involves the initial idea to build a model to solve a business need. It includes:

  • Gathering information about the business context for model requirements.
  • Identifying the project team.
  • Collaborating with stakeholder and engineering teams.
  • Determining how the PM should start driving the project forward, and teams to be involved in both the short and long term.

Data acquisition and exploration

This phase involves getting access to data needed for the model. It includes questions such as:

  • Is the data of high enough quality?
  • Is the data labeled correctly for training the model?
  • Are there any issues to unblock with getting access to the needed data?
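As a rough illustration, the first two questions could be turned into a repeatable check along the following lines. This is only a sketch under stated assumptions: the record layout, the `label` field name, the helper names, and the 5 percent threshold are hypothetical and not something this article prescribes.

```python
from __future__ import annotations

from typing import Any


def summarize_data_quality(
    records: list[dict[str, Any]],
    required_fields: list[str],
    allowed_labels: set[str],
    label_field: str = "label",  # hypothetical field name
) -> dict[str, float]:
    """Return the share of records with missing fields or unexpected labels."""
    total = len(records)
    if total == 0:
        return {"missing_rate": 0.0, "bad_label_rate": 0.0}

    # A record counts as incomplete if any required field is absent or None.
    missing = sum(
        1 for r in records
        if any(r.get(f) is None for f in required_fields)
    )
    # A record counts as mislabeled if its label is not in the agreed set.
    bad_labels = sum(
        1 for r in records
        if r.get(label_field) not in allowed_labels
    )
    return {
        "missing_rate": missing / total,
        "bad_label_rate": bad_labels / total,
    }


def is_good_enough(summary: dict[str, float], max_rate: float = 0.05) -> bool:
    """Assumed rule of thumb: every rate must stay at or below max_rate."""
    return all(rate <= max_rate for rate in summary.values())
```

A summary like this gives the PM a concrete number to discuss with data engineers and stakeholders, rather than a general impression of data quality.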

Developing a “minimum viable product” (MVP)

This step involves building a prototype to establish the capabilities, limitations, and risks early in the project and seeking buy-in from stakeholders. Here are key steps:

  • The PM works with data scientists on the definition of a prototype, with feedback from stakeholders.
  • At this point, it’s beneficial to establish clear expectations with stakeholders on when the model can be ready for evaluation and production, establish model run frequency, and enumerate risks.

Deployment readiness

This stage involves determining whether the model is ready to be supported and deployed into a production environment. Readiness includes addressing questions such as: Are all data sources fully automated? Are privacy reviews completed with any issues addressed? Are data sources fully supported by providers with a data contract in place?

  • The draft version of the model output is shared with stakeholders and the PM is on point to confirm that model output has been reviewed and approved.
  • If not, feedback to that effect, along with updated timelines, is communicated.

Production

This phase involves providing a level of support that stakeholders expect. Note the following:

  • The key question to address at this point involves clarifying the PM’s role in supporting the model in production. Part of this is ensuring any model issues are resolved quickly, within the established SLA, and providing related communications to stakeholders.
  • When issues arise in production, the three main causes are upstream data sources, model or model pipeline, and platform. It is the PM’s role to facilitate quick identification of the cause and then to work with the affected team to resolve.
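The triage described above could be sketched as a small routing helper. The three categories come from the text; the mapping of each category to an owning team, the function name, and the SLA parameters are assumptions made for illustration, and a real team would substitute its own ownership model.

```python
from enum import Enum


class IssueSource(Enum):
    """The three main causes of production issues named above."""
    UPSTREAM_DATA = "upstream data sources"
    MODEL_PIPELINE = "model or model pipeline"
    PLATFORM = "platform"


# Hypothetical ownership mapping, assumed for this sketch only.
OWNING_TEAM = {
    IssueSource.UPSTREAM_DATA: "data engineering",
    IssueSource.MODEL_PIPELINE: "data science",
    IssueSource.PLATFORM: "DevOps",
}


def route_issue(source: IssueSource, hours_open: float, sla_hours: float) -> str:
    """Return a short routing note, flagging issues open longer than the SLA."""
    team = OWNING_TEAM[source]
    status = "SLA breached" if hours_open > sla_hours else "within SLA"
    return f"{source.value}: route to {team} ({status})"
```

For example, `route_issue(IssueSource.PLATFORM, hours_open=5, sla_hours=4)` produces a note that routes the issue to the platform owners and marks the SLA as breached, which is the kind of summary a PM might include in a stakeholder communication.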

Deprecation

This step comes into play when a model has run its course and needs to be removed as a functional and supported model. Sometimes a model can no longer be supported, which also causes the need for deprecation. It’s important to:

  • Clarify the PM’s role in deprecating the model.
  • Validate with stakeholders that they no longer need the model. If a situation exists that requires the model but the team can no longer support it, can the team provide an alternative?

Throughout the development cycle, the PM must ensure that all relevant parties are aware of current progress. Especially in the ML space, the scope of a project can change quickly: stakeholder requirements and needs shift, data turns out to be unavailable or of poor quality, or the effort to find a suitable model takes longer than expected. Ultimately, the team must also decide when the model is good enough.

To keep team members apprised, a program manager must establish a set of periodic reviews. Here are some types of reviews to consider:

  • Peer review: This is a technical deep dive into the model solution that looks for feedback and input from the team. Peer review should be conducted about the time the data scientist has a good idea of what needs to be built and has a prototype to validate the ideas, but not so late that any feedback would be costly to incorporate or have a significant impact on the schedule.
  • Scrum: This is a quick, frequent forum for data scientists to provide a status update, identify priority conflicts, and delineate next steps. Often this isn’t a forum for finding solutions, as the right people might not be involved in the scrum or coming up with a solution might take too much time. A scrum happens often, once or twice a week, and looks at individual work items.
  • Project review: This is a review of all models currently under development and happens about once per month. This review has the appropriate leads involved who need to be aware of project status and risks, or provide guidance. In this forum sometimes it’s necessary to make hard decisions on whether to continue developing a model or change direction. In the past, we experienced the problem of developing models for an extended period without deploying them, and having a project review helped us to adjust scope and mitigate risks without continuing to waste time and energy.
  • Stakeholder meetings and reviews: These are various meetings with stakeholders for the purpose of understanding requirements, providing development status, gaining ML model acceptance, and determining next steps to keep the model moving forward through feedback and changing business needs.
  • Deployment model review: This is the final check on whether a model meets the requirements for productionalized deployment. At a minimum, the deployment checklist should include checks on whether the data sources are automated, stakeholders are engaged and ready for the model, the model can be supported by the team, and privacy reviews have been completed. This review should happen any time a new model is near deployment or an existing model is to be updated. It helps ensure the data scientists and engineers agree on supporting the model.
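The minimum deployment checklist described in the last item could be captured in a small structure like the one below. This is a minimal sketch: the four checks mirror the text, but the class name, field names, and methods are hypothetical, and a real checklist would likely carry more items along with an owner for each.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class DeploymentChecklist:
    """Minimum checks named in the deployment model review (names assumed)."""
    data_sources_automated: bool = False
    stakeholders_ready: bool = False
    team_can_support: bool = False
    privacy_reviews_completed: bool = False

    def open_items(self) -> list[str]:
        """List the checks that have not yet been satisfied."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def ready(self) -> bool:
        """The model is ready only when no check remains open."""
        return not self.open_items()
```

Listing the open items, rather than returning a single yes or no, gives the review a concrete agenda: each remaining item can be assigned and tracked before the next review.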

Conclusion

In this article, I’ve delved deep into the particulars of the ML PM role. I’ve presented the ML development stages and how PMs can lead their teams to success. A successful ML PM may need to step into other roles to ensure the project stays on track and delivers. To do this, it is important for the ML PM to understand the complete development cycle and the responsibilities of the other disciplines involved. In the next article of this two-part series, I go in depth on establishing the tools and processes needed at all development stages to ensure the ML project is heading in the right direction.

The ML field is changing rapidly and will continue to be a leading field for many years to come. It is therefore critical to have a knowledgeable ML PM involved to ensure the effort and time spent developing models is efficient and scalable. This doesn’t mean every project will automatically be successful. But the learnings from every failure, if used well, can go into creating future successes. A PM can help ensure projects stay on course and that the hard decisions are made on whether to continue investing in a project or to switch gears when necessary.

I would like to thank Ron Sielinski, Casey Doyle, and Sowjanya Yaddanapudi for contributing to and reviewing this work.

Peter Saddow is on LinkedIn.

Check out the next article in this two-part series here:

Senior Technical Program Manager | SQL Server | Business Intelligence Solutions Architect | Microsoft Azure