ML program management at scale (Part 2 of 2)

Peter Saddow
Data Science at Microsoft
Aug 17, 2021 · 11 min read

As data science and machine learning (ML) evolve, the role that program managers play on ML teams is likewise evolving, perhaps to the point of becoming its own new discipline. In Part 1 of this two-part article series, I went deep into the role and responsibilities of the program manager at each development stage. In this article, I focus on the tools and processes needed to lead successful data science projects. These tools and processes are iterative and evolve as your team matures and has a greater impact on the organization.

Processes and tools necessary for scaling ML Operations

To have a successful and efficient ML Operations (ML Ops) environment, or simply to be a successful ML program manager (PM), it is critical to have the right tools and processes in place. As a PM, you might choose to directly own, manage, or develop these tools, or you might work with your engineering team to develop them. Implementing them sooner reduces pain later and enables scaling to support numerous models simultaneously.

Documenting models and model outputs

It is critical to have a platform for documenting each ML model for customers. This is distinct from the internal site or documentation library used by your internal stakeholder teams. Public documentation for users and stakeholders should cover the following:

  • Description of the model
  • User guidance
  • Business problem the model solves
  • Model performance and limitations
  • How to access the model
  • Privacy considerations
  • Implementation details

Because most of this information changes infrequently, a static page or website is suitable, but storing it in a database that tools, processes, and reports can consume is even better.

Documenting model outputs is necessary to ensure consuming teams know how to integrate with and interpret the schema and data. Model outputs and schemas often change over time as the data science owner makes improvements and addresses customer feedback. If you have a small number of models, documenting them in a spreadsheet or document might be fine. When you start to manage roughly 10 or more models with incremental improvements, however, you need an automated solution. The model output documentation should include the following information:

  • Primary key (unique value)
  • Column/field name
  • Field data type
  • Field description

Documenting model outputs
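
If model outputs land in a data frame, even a small script can generate this documentation automatically. Here is a minimal sketch using pandas; the output table, column names, and descriptions are all hypothetical:

```python
import pandas as pd

# Hypothetical model output; in practice, read this from the model's output store.
output = pd.DataFrame({
    "customer_id": [101, 102],    # primary key (unique value)
    "churn_score": [0.82, 0.13],  # model prediction
})

descriptions = {
    "customer_id": "Primary key identifying the customer",
    "churn_score": "Predicted probability the customer churns in 90 days",
}

# Build a data dictionary: column name, data type, and description.
data_dictionary = pd.DataFrame({
    "column": output.columns,
    "data_type": [str(t) for t in output.dtypes],
    "description": [descriptions[c] for c in output.columns],
})
print(data_dictionary)
```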

Ability to track model run status and issues

In my article from last fall, “Machine Learning model governance at scale,” I discussed needing a report to show run status. Having such a report gives visibility into status for leaders, stakeholders, and users without involving the program manager in basic questions about model status. Here is an example of such a report:

Model run report

As a PM, providing this information helps to bring transparency, enable self-service, and grow stakeholder trust.
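
As one possible starting point, here is a minimal sketch of assembling such a report with pandas. The model names and the `runs` log are made up; in practice this data would come from your pipeline or orchestration system:

```python
import pandas as pd

# Hypothetical run log pulled from pipeline metadata.
runs = pd.DataFrame([
    {"model": "churn-predictor",  "run_date": "2021-08-15", "status": "Succeeded", "rows_out": 120_000},
    {"model": "churn-predictor",  "run_date": "2021-08-16", "status": "Failed",    "rows_out": 0},
    {"model": "usage-forecaster", "run_date": "2021-08-16", "status": "Succeeded", "rows_out": 54_000},
])

# Latest run per model, so leaders and stakeholders can self-serve status.
report = (
    runs.sort_values("run_date")
        .groupby("model")
        .tail(1)
        .reset_index(drop=True)
)
print(report)
```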

Tracking dependencies from input data sources through to the stakeholder

Another critical tool is an automated process to determine all ML dependencies. This effort is almost always an afterthought, and so it becomes a rush when organizational or platform changes lead to breakage. Knowing the input sources for each model, the stakeholders and users consuming each model, and the tools that consume the models is important for providing accurate and timely information when there is an upstream data source issue, a platform migration issue, or a change in stakeholder business priorities. If you have many models that have evolved over time with many data scientists and engineers involved, an accurate dependency map becomes increasingly difficult to maintain by hand.

Ideally, it’s helpful to have a visual or table similar to the following:

Model dependencies

In a complex environment it is sometimes difficult to visualize all the dependencies involved. In the Comments section of this article, I would love to hear from you on what has worked in your environment.
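
One lightweight approach, sketched below with hypothetical model and source names, is to store dependencies as a directed graph so that questions like "what breaks if this source changes?" can be answered programmatically (this example uses the networkx library):

```python
import networkx as nx

# Directed graph: an edge A -> B means B depends on A.
deps = nx.DiGraph()
deps.add_edge("telemetry_db", "churn-predictor")
deps.add_edge("billing_feed", "churn-predictor")
deps.add_edge("churn-predictor", "retention-dashboard")
deps.add_edge("churn-predictor", "email-campaign-tool")

# Everything downstream of an input source that is about to change:
impacted = nx.descendants(deps, "billing_feed")
print(impacted)  # {'churn-predictor', 'retention-dashboard', 'email-campaign-tool'}
```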

Tracking stakeholders through data contracts

The ability to track stakeholder usage of models is important for understanding adoption, prioritizing improvements, and determining whom to inform of breaking changes or platform migrations. If you have a small number of models and stakeholders, tracking this information in a spreadsheet is probably sufficient. As you evolve, you can create a data contract with the stakeholder as a document and then upload it to a document library. Note that maintaining data contracts typically becomes a challenge as the number of contracts grows and models and platforms change.

As the team grows and produces more models, you may find as an ML PM that you do not have enough information about the business problem the stakeholder wants the model to solve, or about the team's needs and expectations for the model. In that case, consider a self-service model that enables better scale under a data contract. As shown in the diagram below, with this approach the stakeholder provides all required information, such as the team and who requires access. Once all the information is provided, an ongoing process of reviewing requests and granting access begins. Once this workflow is established, you can iterate to capture more relevant information and provide a process for renewing data access requests after a certain period of time (e.g., three or six months) to ensure the data contract stays up to date.

Data contract flow

The data contract input form should capture the following information (a sketch of the resulting contract record follows the list):

  • Organization or team name
  • Stakeholder contact information
  • Additional contacts (e.g., engineering, business, or program manager contact)
  • The Application ID targeted for access
  • The stakeholder or endpoint consuming the ML model data
Data contract form: team information
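
As a sketch (the field names are illustrative, not a standard), the contract record behind a form like this might look as follows, including a renewal date to support the periodic re-review mentioned above:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class DataContract:
    # Fields mirror the input form above (names are illustrative).
    team_name: str
    stakeholder_contact: str
    additional_contacts: list[str]  # requires Python 3.9+
    application_id: str
    consuming_endpoint: str
    models_requested: list[str]
    created: date = field(default_factory=date.today)

    def renewal_due(self, months: int = 6) -> date:
        # Contracts lapse so that access is re-reviewed periodically,
        # keeping the contract up to date as models and platforms change.
        return self.created + timedelta(days=30 * months)

contract = DataContract(
    team_name="Retention Analytics",
    stakeholder_contact="alias@example.com",
    additional_contacts=["eng-alias@example.com"],
    application_id="app-12345",
    consuming_endpoint="retention-dashboard",
    models_requested=["churn-predictor"],
)
print(contract.renewal_due())
```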

The model selection form should allow the user to specify the ML models or data sets to which they are requesting access. The models offered for selection are the ones the team wants to make available to stakeholders and is ready to support, which might include models in public or private preview. This form should also contain other related information, such as model refresh frequency, SLA language, model output location, and information on how to gain access.

Data contract form: ML model selection

After the data contract is created, confirmation is sent to the stakeholder, ML team, and other relevant contacts.

Data contract: email notification

Feedback framework

Going into a feedback framework in detail is beyond the scope of this article, but be aware that you need a framework for collecting solid customer feedback on model performance and feeding it back into the model automatically. Automating this framework enables the team to scale through continuous, automated model improvement.

ML Operations process or tool for managing deployments

Depending on the maturity of the data science and engineering teams involved, it's helpful to define the types of model deployment. As a PM, you should be aware of all changes to existing models as well as all new models being deployed. Changes can be (1) hotfixes to address critical failures blocking a stakeholder; (2) model improvements, which might be planned or unplanned; or (3) entirely new models.

For any deployment, you should always ask yourself these questions (a sketch of encoding them as a simple checklist follows the list):

  • What will the impact be on the stakeholder? Is there a breaking change?
  • Who will be consuming the model output?
  • Is there an existing data contract in place, and is the stakeholder engaged and ready?
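
Here is a minimal sketch of capturing the deployment types and these pre-deployment questions as a checklist; the field names and gating logic are assumptions, not a prescribed process:

```python
from dataclasses import dataclass
from enum import Enum

class DeploymentType(Enum):
    HOTFIX = "hotfix"            # addresses a critical failure blocking a stakeholder
    IMPROVEMENT = "improvement"  # planned or unplanned model improvement
    NEW_MODEL = "new_model"      # first deployment of an entirely new model

@dataclass
class DeploymentCheck:
    deployment_type: DeploymentType
    breaking_change: bool       # will the output schema or behavior break consumers?
    consumers_identified: bool  # do we know who consumes the model output?
    contract_in_place: bool     # is there an existing data contract?
    stakeholder_ready: bool     # is the stakeholder engaged and ready?

    def ready_to_ship(self) -> bool:
        # A breaking change requires an engaged stakeholder and a data contract.
        if self.breaking_change and not (self.contract_in_place and self.stakeholder_ready):
            return False
        return self.consumers_identified
```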

What should a program manager in this space know about Machine Learning?

To maximize impact and success on an ML team, it’s critical for a PM to have at least a basic understanding of Machine Learning. Here I provide some basic concepts to get you started on your own learning journey in this space.

First, consider the question, "What is ML?" According to Arthur Samuel, computer gaming and AI pioneer, ML is "the field of study that gives computers the ability to learn without being explicitly programmed." An interesting way to think about ML: It's when you have data and you have answers but you must determine the rules. This contrasts with classical programming, in which you have data and you have rules but you must determine the answers.
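
As a toy illustration of this contrast, with made-up data and using scikit-learn: classical code hard-codes the rule, while ML recovers the rule from data plus answers:

```python
from sklearn.tree import DecisionTreeClassifier

# Classical programming: data + rules -> answers.
def is_large_order(amount: float) -> bool:
    return amount > 100  # the rule is written by hand

# Machine learning: data + answers -> rules.
amounts = [[20], [50], [90], [110], [150], [300]]  # data
labels = [0, 0, 0, 1, 1, 1]                        # answers
model = DecisionTreeClassifier().fit(amounts, labels)

# The learned rule now produces answers for new data.
print(is_large_order(120), model.predict([[120]]))  # True [1]
```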

When to use ML — and when not to use it

As you work with stakeholders, the following information is helpful in determining when to use an ML model or some other solution instead, such as a rules-based engine.

Great candidates for using an ML model include the following scenarios:

  • When you have data and answers but lack the rules, and you want to provide answers for new or changing data. (As I mentioned above, in traditional programming you have the data and business rules, and the program generates the answers.)
  • When the rules or logic are so complex that simple rules alone are not sufficient.
  • When you need to scale quickly to thousands or millions of answers in what is not a short-term or one-time effort.
  • When you need to include specialized personalization by user or some other form of granularity.
  • When you need a solution that adapts in real time based on changing data sets.

In contrast, here are scenarios that don't benefit from using an ML model:

  • The problem can be solved by simple rules.
  • The solution requires 100 percent accuracy.
  • The solution has no need to adapt to new data.
  • The solution requires full interpretability.
  • The solution has specific privacy concerns that don't lend themselves to an ML model, or the data would have to be exposed in an insecure way.
  • The data is not high quality: for example, it is biased, stale, or irrelevant.

Supervised versus unsupervised learning

Something to consider pertains to the differences between supervised and unsupervised learning. Supervised learning allows you to predict data output based on prior experience and labeled examples. Unsupervised learning allows you to find new patterns in data when the training data isn't labeled. Supervised learning results tend to be more accurate and easier to evaluate, because labels provide something to measure against; unsupervised results tend to be less accurate and harder to validate. Supervised learning typically requires an upfront labeling and training phase, whereas unsupervised learning can be applied to new, unlabeled data as it arrives. Understanding these differences enables you to gain some insight into how the data scientist is approaching the business problem and to validate the approach with the stakeholder.
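
Here is a minimal sketch of the two modes side by side on the same data (scikit-learn's built-in iris data set; the model choices are just examples):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labels (y) guide the model toward known answers.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))  # predicted species labels

# Unsupervised: no labels; the model groups similar flowers on its own.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:3])      # cluster IDs, not species names
```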

Types of Machine Learning

ML solutions aren’t monolithic. Instead, they consist of several types, which I summarize below.

Ranking

  • Description: Helps with finding what’s most relevant. Ranking models are often ideal solutions because ranking changes over time, is often complex, and has many inputs.
  • Examples: Google search and Amazon product search (the latter of which changes over time and is customized for individuals).

Recommendation

  • Description: Provides users results for what they are most interested in based on past experience. The implication of the results here is along the lines of "We think this is what you want, and the recommendation may not be 100 percent accurate, but we hope it is acceptable."
  • Examples: Microsoft Outlook provides conference room recommendations based on your location and conference room availability; Netflix provides movie recommendations based on recent movies you’ve watched.

Classification

  • Description: Places something into a category. Often the logic isn’t too complex, but the scale of ML provides an advantage. This is a type of supervised learning.
  • Example: An ML model that determines the species of a flower based on a set of reference images.

Regression

  • Description: Forecasts or predicts a numerical value or outcome. This is a type of supervised learning.
  • Example: Predicting how much of a product or service a customer is likely to use over the next 12 months.

Clustering

  • Description: Puts similar things together without knowing what they are. This is a type of unsupervised learning.
  • Example: An ML model that groups images of animals by similarity, without being told the species in advance.

Anomaly

  • Description: Finds uncommon items or deviations from expected patterns. This is a type of unsupervised learning.
  • Example: A model whose purpose is to detect fraudulent charges (see the sketch below).
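
To make that last category concrete, here is a hedged sketch of the fraud example using scikit-learn's IsolationForest; the charge amounts are made up:

```python
from sklearn.ensemble import IsolationForest

# Mostly routine charge amounts, plus one outlier.
charges = [[25.0], [18.5], [30.0], [22.0], [27.5], [950.0]]

detector = IsolationForest(contamination=0.2, random_state=0).fit(charges)
print(detector.predict(charges))  # -1 flags the anomalous charge, 1 is normal
```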

Courses to take

In addition to what I've written about here, consider taking related ML courses to further expand your knowledge in this space.

Conclusion

In this second article of this two-part series, I've attempted to provide a deeper look into the tools and processes that establish and keep ML projects on track. I recommend an iterative approach: Start small by establishing processes and making sense of your environment. As the team grows, you can continue to refine processes and establish tools to automate them. The sooner you start establishing these tools and processes, the easier it becomes for the team to embrace them and enable more scale in the future. Additionally, because ML is a rapidly growing field, it's a must to ensure that your tools and processes are evolving as well.

Given this ongoing evolution, it is critical to have a knowledgeable ML PM involved in these projects to ensure the effort and time spent developing models are efficient and scalable. It is important for an ML PM to partner with stakeholders and data scientists to ensure the best short-term and long-term solutions are being built. As I've mentioned earlier, this doesn't mean every project will be successful. The learnings gained along the way, however, lead to more successes in the future. As an ML PM, you can help ensure that projects stay on course and that the hard decisions are made about whether to continue investing in a project or to switch gears when necessary.

Peter Saddow is on LinkedIn.

Acknowledgments

I would like to thank Ron Sielinski and Casey Doyle for reviewing this work.

Check out part 1 of this two-part article series here:
