The role of a technical program manager in AI projects

Nik Sachdeva
Data Science at Microsoft
8 min read · Apr 5, 2022
Image Source: XKCD

VentureBeat has reported that 87 percent of data science projects fail and never move to production. Technical Program Managers (TPMs) can help change that statistic and enable data science and engineering teams to build successful AI projects.

In this article, I explore some of the program management considerations for Machine Learning (ML) projects and suggest a learning path for TPMs to increase their skill set for projects that have an ML component.

Considerations for ML-focused projects

These apply across the lifecycle of ML projects.

Is it an AI problem?

Consider Jack, who opened his notebook and started churning out Python code; before long, he and Jane were closing in on 95 percent accuracy for their image detection model. Voilà! They could now distinguish a cat from a dog (I know, another cat-dog cliché). Time to push the code into production and relax! And then someone asked: What customer problem are we solving?

In Artificial Intelligence (AI) projects, the ML component is generally a part of an overall business problem and not the problem itself.

A TPM can play a vital role in the realization of value for ML within the larger scope of a project. TPMs determine the overall business problem first and then evaluate whether ML can help address a part of the overall problem space. Considerations include:

  • Engaging experts in human experience and employing techniques such as design thinking to understand the customer needs and human behavior first. For example, if your customer wants to talk about topic modeling instead of personas, you are having the wrong conversation.
  • Focusing on system design principles to identify the architectural components, entities, dependencies, interfaces, and constraints involved. Ask the right questions early and explore design alternatives with the engineering team. Mona Soliman Habib, Principal Data and Applied Scientist at Microsoft, considers this to be one of the core fundamentals of an ML-focused project where TPMs can have significant impact.
  • Thinking hard about the costs of ML and whether you can solve a repetitive problem at scale. Many times, you can solve customer problems with data analytics, dashboards, or rule-based algorithms. I know these solutions might not be quite as cool, but it’s true!

Taming the ambiguity beast

OK, so let’s say you concluded that ML might help in your project. Now what?

ML projects are plagued with a phenomenon we can refer to as “death by unknowns.” It’s like you were told that angels will meet you at the end of a dark tunnel, but no one mentioned that there are Merpeople, Trolls, and a Basilisk waiting for you (Harry Potter fans are smiling). Unlike software engineering projects, ML-focused projects can result in quick success early (e.g., a sudden decrease in error rate), but this may flatten eventually. Here are a few things to consider:

  • The most important part of being a TPM is to set clear expectations. If the customer wants more than 95 percent accuracy with limited data and time, they are being unrealistic.
  • Identify the performance metrics and discuss a “good enough” prediction rate that will bring value to the business. An 80 percent “good enough” rate may save business costs and increase productivity, but if going from 80 to 95 percent would require unimaginable cost and effort, is it worth it?
  • Create a smaller team and undertake a feasibility analysis through techniques such as EDA (Exploratory Data Analysis). A feasibility study is a much cheaper approach to evaluating data quality, customer constraints, and model feasibility. It allows a TPM to better understand customer use cases as well as the current environment (e.g., do we have data access and enough data?) and can act as a fail-fast mechanism (see the sketch after this list). Note that a feasibility study should be agile (measured in weeks) and scoped to bring immediate value to guide project decisions.
  • As in any project, there will be new needs (additional data sources, technical constraints, hiring data labelers, business users’ time, and more), so factor these into your estimate to avoid surprises.
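As a concrete illustration of the fail-fast idea above, here is a minimal feasibility sketch in Python. It assumes a hypothetical tabular dataset (feasibility_sample.csv) with numeric feature columns and a binary label column; the file and column names are placeholders, not part of any real project.

```python
# Fail-fast feasibility check: does a cheap model beat a naive baseline
# at all with the data we actually have? (Dataset and column names are
# hypothetical placeholders.)
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("feasibility_sample.csv")
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
quick_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("naive baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("quick model accuracy:   ", accuracy_score(y_test, quick_model.predict(X_test)))
```

If the quick model barely improves on the naive baseline, that is a useful early signal to revisit data quality or the framing of the problem before committing the full team.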

Notebooks != ML production

You may have seen a YouTube video where a coding ninja worked her Python magic to build a natural language processing engine in notebooks. She even shared the .ipynb file to get you started. Sadly, that will not get you to production. A TPM must be the myth buster here:

  • Understand the end-to-end flow of data management, including how data will be made available (ingestion flows, formats, and frequency), how it will be stored, and how it will be retained. Plan user stories and spikes around these flows to ensure you are building a robust ML pipeline and not a demo (a small structural sketch follows this list).
  • Your engineering team should follow the same rigor in building ML projects as in any software engineering project. We at Microsoft CSE (Commercial Software Engineering) have built a good set of resources from our learnings in our engineering playbook.
  • ML-focused projects are not “one-shot” release solutions; instead, they must be nurtured, evolved, and improved over time. Here’s an analogy that Craig Rodger, Principal Data and Applied Scientist at Microsoft, suggests: “It’s like deciding to adopt a puppy: It may look cheap to begin with, but it should be seen as a 15-year commitment and journey.”
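To make the "pipeline, not a demo" point more tangible, here is a minimal sketch of moving notebook logic into explicit, testable pipeline stages. The module structure, file path, and column name are hypothetical; a real project would wire stages like these into an orchestrator (for example, Azure Machine Learning pipelines) with tests and monitoring.

```python
# A sketch of explicit pipeline stages instead of one long notebook cell.
# Paths and column names are placeholders.
from dataclasses import dataclass

import pandas as pd


@dataclass
class PipelineConfig:
    raw_path: str       # where ingested data lands (format and frequency agreed upfront)
    label_column: str   # target column name


def ingest(config: PipelineConfig) -> pd.DataFrame:
    """Read raw data; in production this pulls from the agreed ingestion flow."""
    return pd.read_csv(config.raw_path)


def validate(df: pd.DataFrame, config: PipelineConfig) -> pd.DataFrame:
    """Fail fast on schema or quality problems instead of finding them at training time."""
    if config.label_column not in df.columns:
        raise ValueError("label column missing from ingested data")
    return df.dropna(subset=[config.label_column])


def run(config: PipelineConfig) -> pd.DataFrame:
    return validate(ingest(config), config)
```

Each stage can then get its own user stories, tests, and acceptance criteria, which is exactly the rigor a notebook alone does not enforce.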

Garbage data in -> Garbage model out

A book can be (and many have been) written on how data quality is a major factor affecting model performance. A TPM may not get directly involved in the data cleansing and engineering activities, but still must understand why the data team keeps saying, “the data is not good quality” (as if it were the sushi from that shady restaurant downtown).

  • During feasibility, have your team generate a report on data quality that includes missing values, duplicates, unlabeled data, expired or invalid data, and incomplete data (such as having only male representation in a people dataset); a sketch of such a report follows this list.
  • Understand data source reliability (e.g., are the images from a production or industrial camera or taken from an iPhone?).
  • Understand data acquisition constraints (legal, contractual, privacy, regulation, and ethics) before leveraging the data sets.
  • Identify whether there is enough data for sampling the required business use case and how the data will be improved over time. The rule of thumb is that there should be enough data for the model to generalize well and avoid overfitting.
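As a sketch of the data quality report mentioned above, the snippet below pulls a few of those checks together with pandas. The file name, the gender and capture_date columns, and the cutoff date are all hypothetical placeholders chosen to mirror the examples in this section.

```python
# First-pass data quality report for a feasibility study.
# Dataset, column names, and the cutoff date are illustrative only.
import pandas as pd

df = pd.read_csv("people_sample.csv")

report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_share_per_column": df.isna().mean().round(3).to_dict(),
    "unlabeled_rows": int(df["label"].isna().sum()),
    # Representation: a dataset with only one group flags a sampling problem early.
    "group_representation": df["gender"].value_counts(normalize=True).round(3).to_dict(),
    # Expired or invalid data: records captured before an agreed cutoff.
    "stale_rows": int((pd.to_datetime(df["capture_date"]) < "2020-01-01").sum()),
}
print(report)
```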

You need more people — like right now

An ML project has multiple stages, and each one may require additional roles. Examples include Design Researchers and Designers for Human Experience, Data Engineers for Data Collection, Feature Engineers, a Data Labeler for labeling structured data, engineers for MLOps and model deployment — the list can go on. A TPM must factor in having these resources available at the right time to avoid any schedule risks.

Your feature is not my feature

TPMs are familiar with features because they are the core release mechanism for any project. In the ML world, however, features have a whole different meaning. Feature Engineering enables the transformation of data so that it becomes usable for an algorithm. The input to Feature Engineering is raw data, and the output is generally called a Feature Vector. Creating the right features is an art and may require experimentation as well as domain expertise. Consider these factors when planning a schedule and allocate time for domain experts in the project. For example, for a natural language processing engine for text analysis of financial documents, we hired financial researchers who were able to run a relevance judgment exercise with the engineering team to identify the right features.
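To ground the raw-data-in, feature-vector-out idea, here is a minimal sketch using TF-IDF on a couple of illustrative financial-text snippets. The snippets and vectorizer settings are assumptions for illustration; deciding which signals actually matter is where the domain experts mentioned above come in.

```python
# Raw text in, feature vectors out: a tiny feature engineering example.
from sklearn.feature_extraction.text import TfidfVectorizer

raw_documents = [
    "Quarterly revenue grew 12 percent year over year.",
    "The company restated its earnings due to an accounting error.",
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
feature_vectors = vectorizer.fit_transform(raw_documents)

print(feature_vectors.shape)                    # (documents, engineered features)
print(vectorizer.get_feature_names_out()[:10])  # a sample of the generated features
```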

It’s a biased world out there

Bias is often the top reason an ML model does not perform as intended. Bias can be introduced into the data in many ways, each of which can have an impact on model performance. A TPM can help the ML team by identifying the right use case scenarios and target personas. For example, for a person-recognition algorithm, if the data source feeds in only one skin tone, the model will not produce reliable results in production scenarios. Think about responsible AI principles from day one to ensure the fairness, security, privacy, and transparency of the models. Here is a shout out to Tempest van Schaik, Senior Data and Applied Scientist, and Bujuanes Livermore, Principal Design Researcher at Microsoft, for relentlessly emphasizing responsible AI principles across our CSE projects.
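A first-pass bias check can be as simple as comparing representation and per-group performance, as in the sketch below. The evaluation file and the skin_tone, prediction, and label columns are hypothetical; fuller assessments would use a dedicated toolkit such as Fairlearn alongside the responsible AI practices mentioned above.

```python
# First-pass fairness check: representation and per-group accuracy.
# File and column names are placeholders.
import pandas as pd

df = pd.read_csv("evaluation_metadata.csv")

# How is each group represented in the evaluation data?
print(df["skin_tone"].value_counts(normalize=True))

# Does accuracy hold up per group, not just in aggregate?
per_group_accuracy = (
    df.assign(correct=df["prediction"] == df["label"])
      .groupby("skin_tone")["correct"]
      .mean()
)
print(per_group_accuracy)
```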

The learning journey for a TPM

As I’ve described, ML projects call for an appreciation of skills beyond those of conventional software development. So, should a TPM start learning about gradient descent and how to build Convolutional Neural Networks? Well, not exactly. (But all power to you if you want to!)

What is required for a TPM is to understand the nuances and complications of an ML project; however, that does not necessarily mean that a TPM needs to become a data scientist or an applied ML engineer. Here is a simple framework to think about your up-skilling for ML projects:

PM fundamentals

Core to a TPM role are fundamentals that include bringing clarity to the team. Design thinking, driving the team to the right technical decisions, managing risk, managing stakeholders, backlog management, and project schedule management are a TPM’s superpowers. A TPM complements the ML team by ensuring that the problem and customer needs are understood, a holistic system design is evaluated, and customer objectives drive stakeholder expectations. Here are some references that may help: Should a Technical Program Manager (TPM) be technical?, The TPM Don’t M*ck Up Framework, and The mind of the TPM.

ML fundamentals

A TPM must spend at least some time understanding how ML Engineering works. Note, this does not mean that a TPM must know in depth how to build models, but a TPM should be aware of the model development and deployment lifecycle and the constraints around feature engineering, model evaluation, model training, model monitoring, and more. Here are some resources to help:

Domain applicability

Understanding the underlying domain is important for TPMs to be able to have the right AI-solution conversations with customers. For example, a TPM who understands the quality inspection scenarios in a manufacturing assembly line can see how a defect detection algorithm will help optimize process efficiency and reduce costs for the manufacturer. Additionally, a TPM who understands the domain can draw on knowledge of the technical constraints within the deployment environment; for example, networking and security constraints in a manufacturing plant can help better define non-functional requirements.

Conclusion

I hope this article helps provide an understanding of some of the specific needs of ML projects from a TPM’s point of view, and areas where TPMs can increase their skill sets to be even more effective in their next AI project.

Nik Sachdeva is on LinkedIn.

Nik Sachdeva is a product leader heading a team of global TPM managers and Technical Program Managers building next-generation Data and AI products using Microsoft Azure.