How to write better requirements for AI/ML products

All models are wrong, but what PRD do you need to prepare for them to be useful?

Sergei Zotov
Management Matters
8 min read · Jun 27, 2022


Photo by Jason Goodman on Unsplash

One of the best quotes in statistics is:

All models are wrong, but some are useful

“Empirical Model-Building and Response Surfaces”, George E. P. Box, Norman R. Draper (1987)

In the same book there’s an even better quote:

All models are wrong; the practical question is how wrong do they have to be to not be useful.

This strongly resonates with my experience of working with data scientists over the last 5 years as a Technical PM.

How to define “useful”?

Software Engineering

Imagine your typical software release. There will be some obvious goals such as:

I want a new button with the text “Buy” on it that sends this form. Here’s a wireframe

or

I want this API method that inputs queries at a 100 req/s rate and outputs this data from this Redis DB

The boundaries and requirements that managers set for scenarios like these are well established by now. Because of that, the goal is clear, and the team always knows how to tell whether the task is complete.

Data Science

When creating machine learning models, there need to be more inputs, and the space of possibilities is almost endless.

Because of that, a Data team can sometimes get stuck in a loop: endlessly experimenting to squeeze out a model that is 0.5% more accurate. But will that affect our ROI much, considering the cost of getting that increase? Or would a model with 5% lower accuracy but significantly higher performance benefit the business more?

A PM’s responsibility is to pull their team out of this loop and lead them to a better business outcome, ideally through more diverse and interesting tasks for the team. Doing that effectively means formulating goals better and setting boundaries that are specific to Data Science tasks.

For example, is this a great goal?

I want an automatic speech recognition model that can be used to transcribe calls

or is there a way to make it clearer? Let’s dissect what requirements Data PMs need to provide to their team.

Business goals

What is the business goal of creating this model?

Even a software PM shouldn’t hand a task to the team without some business context; maybe the team will come up with a better solution. With ML and other DS tasks, this is even more crucial.

For example, a PM wants to decrease the time it takes the model to output a result, and they communicate this to the team as “Let’s train a lighter model”, when it turns out the team doesn’t need to train a new model at all, just tweak a few things in the configs.

PMs, even TPMs, usually lack deep technical expertise. This is why it is crucial to provide non-technical business context to the team. After all, it is all about communication and people, and the team needs this context to deliver better results with better understanding and motivation.

Data

Do we have data for the model?

What field / industry is relevant for the model?

How to prepare data for a great dataset?

ASR (automatic speech recognition) models for medical terminology and for sales calls are very different because of the data they were trained on:

  • The first ASR may not understand what “upsell” and “value proposition” are.
  • The second one will get confused about what “penicillin” is.

Before training any model there has to be some understanding of where to find or how to create a dataset to train on.

One simple piece of advice would be:

The training data and production data should be as similar as possible.

A classification model that was trained on images of cats can be used on images of dogs, but it will certainly lack accuracy.

Unlabeled data can also prevent the team from training the model right away. For example, suppose the team needs to figure out which phone calls were made by a human and which by a voice bot. Data labeling has to happen before training the model. At the very least, a PM should provide some keywords that only humans (or only bots) might say, so those calls can be separated automatically.
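As a rough illustration of that last point, a first pass can be as simple as a keyword heuristic. This is only a sketch, and the keywords and example transcripts below are made up:

```python
# Minimal sketch of keyword-based weak labeling (keywords and transcripts are made up).
# Calls containing bot-only phrases get a "bot" label, calls with human-only phrases
# get "human", and everything else goes to manual labeling.

BOT_KEYWORDS = {"automated assistant", "press one to continue"}
HUMAN_KEYWORDS = {"let me check with my colleague", "hold on a second"}

def weak_label(transcript: str) -> str:
    text = transcript.lower()
    if any(k in text for k in BOT_KEYWORDS):
        return "bot"
    if any(k in text for k in HUMAN_KEYWORDS):
        return "human"
    return "unknown"  # these still need manual labeling

calls = [
    "Hold on a second, let me check with my colleague.",
    "You have reached our automated assistant. Press one to continue.",
]
print([weak_label(c) for c in calls])  # ['human', 'bot']
```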

What is a good dataset?

Depends on the method you want to apply.

If it’s supervised machine learning, then it is data with some features (X) to train on (features that should also be available in production data!) and a target to predict (y).

For example, 2 files:

  • call_replic_1.mp3 as an X feature
  • call_replic_1.txt with transcription of this audio file as a y target

or a CSV file with columns:

  • height, weight, and age as X features
  • systolic blood pressure, diastolic blood pressure as y targets
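As a minimal sketch of that second example, the dataset could be loaded and split into X and y like this (the file name, column names, and model choice are assumptions, with scikit-learn as one possible toolkit):

```python
# Minimal sketch: load a tabular dataset and fit a simple supervised model.
# "health.csv", the column names, and LinearRegression are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("health.csv")
X = df[["height", "weight", "age"]]         # features that also exist in production
y = df[["systolic_bp", "diastolic_bp"]]     # targets we want to predict

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```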

If it’s unsupervised machine learning, then the dataset can go without labels (a target y), because the main purpose of unsupervised ML is to uncover hidden structure in the data. For example, identifying groups of users based on how they use the app.
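Grouping users by behavior could look roughly like the sketch below; the usage features and the number of clusters are purely illustrative:

```python
# Minimal sketch of unsupervised learning: cluster users by app-usage features.
# The feature values, scaling choice, and k=3 are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

usage = pd.DataFrame({
    "sessions_per_week": [1, 2, 14, 15, 30, 28],
    "avg_session_minutes": [3, 5, 12, 10, 45, 40],
})
features = StandardScaler().fit_transform(usage)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)
print(labels)  # each user gets a cluster id; no target column (y) was needed
```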

Datasets can also differ from one another based on time. You can’t accurately predict tomorrow’s weather based on weather data from the 1960s.
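In practice, this usually means splitting time-dependent data by date rather than at random, so the model is validated on data that comes after its training period. A rough sketch, with a hypothetical file and cutoff date:

```python
# Minimal sketch of a time-based train/validation split (file and cutoff are assumptions).
import pandas as pd

df = pd.read_csv("weather.csv", parse_dates=["date"])
cutoff = pd.Timestamp("2021-01-01")
train = df[df["date"] < cutoff]    # train on the past...
valid = df[df["date"] >= cutoff]   # ...validate on the "future"
```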

Either way, it is good to discuss with the Data team what data can be used to train the model. Maybe they’ll come up with an idea for labeling it synthetically or automatically.

Usually, it is a PM’s job to implement Data Science processes and allocate resources, so it is good to think about whether you need to:

  • outsource data labeling to the professionals (there are plenty of companies to do it) or hire an internal team for it. For both solutions, you’ll need to create instructions on how you want them to label the data.
  • implement a data annotation / labeling tool internally (there are lots of open-source libraries that can help with that) or use paid SaaS solutions.

Model quality

Our previously formulated goal is also missing a key metric to measure the quality of the model. Training a model can go on forever: you can always tweak something here and there, but unfortunately, resources are not unlimited.

Therefore, there is a need to set some time and metric boundaries (with one exception: sometimes you can’t set time boundaries for projects that are primarily focused on R&D).

It’s like acceptance criteria for our model to be useful.

For example,

I want to decrease WER (Word Error Rate) on our ASR by 3%, but if we don’t do it in 2 weeks, we will try to make improvements somewhere else.

That sets clear enough boundaries:

  • if we decrease WER by 3%, we’ll complete our task
  • if we decrease WER by 1% in 2 weeks, we won’t spend any time trying to decrease it more

Notice that we use WER here, but in other cases it could be many other metrics: accuracy, precision, log loss, perplexity, F1-score, ROC AUC, etc. As a manager, it is better to keep thinking in business terms or in key metrics known from the get-go (like WER for ASR). Otherwise, it is better to communicate the business problem to the data scientist and choose a fitting metric together.
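For reference, WER itself is just word-level edit distance divided by the number of words in the reference transcript. A minimal sketch, not production code:

```python
# Minimal sketch of Word Error Rate: word-level edit distance / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the call was recorded", "the call is recorded"))  # 0.25
```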

Again, I don’t believe it is even a TPM’s responsibility to choose such a metric. We might suggest one, but the final choice should be made by a tech lead or a machine learning engineer (data scientist).

Production and performance

How to create our data pipeline in production?

You need to communicate your starting vision so that your team can understand which data to extract, from which databases or text files, in what order, and when. There’s also a data preparation process that needs to be thought through.

Does the model require a UI, an API, or just a cron task that executes a simple script?

Is there a need for a data scientist to create that interface, or can it be delegated to a software engineer, who is far more skilled in this area?

What kind of performance does our model require?

If it’s tens of thousands of req/s, there is a need to think thoroughly about the server and interface architecture: is there a technical team and enough server capacity to make it happen?

For example, is it necessary to use a fast programming language like C/C++ or Java here? Probably. Are there engineers on the team who know how to do that? If not, we probably need to hire them, which will certainly push the deadline back.

If it’s something like 2 req/s, we can choose a “slow” high-level language like Python and be happy with it. Our team can also probably deploy this service on existing servers without buying more capacity.
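At that scale, even a thin HTTP wrapper around the model is usually enough. Below is a sketch using Flask, where `transcribe` stands in for whatever inference function the team actually exposes (both the endpoint and the function are hypothetical):

```python
# Minimal sketch of serving a model at low traffic (a few req/s).
# Flask is one common choice; `transcribe` is a hypothetical inference function.
from flask import Flask, jsonify, request

app = Flask(__name__)

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder for the real ASR model call.
    return "transcription goes here"

@app.route("/transcribe", methods=["POST"])
def transcribe_endpoint():
    text = transcribe(request.data)
    return jsonify({"text": text})

if __name__ == "__main__":
    app.run(port=8000)
```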

I’d certainly recommend discussing those things well with the team.

Costs and revenues

We might have forgotten something very important when we discussed the metrics for our model: it would be good to understand the ROI too.

Is that 3% WER decrease really necessary? How will it be reflected in revenue?

Let’s imagine 2 situations.

First situation

Let’s say we have a major client and we don’t want to lose them. In casual conversation, the client shares that they want to explore our competitors and that they’re talking with one whose WER is 2% lower than our solution’s.

That’s a huge risk of losing this client. It might be better for us to at least try to match that competitor’s solution. If we can’t do that in the upcoming weeks, we need to improve the client’s user experience and win their loyalty elsewhere. Otherwise, we could lose that client and would need a backup plan for acquiring other clients.

Second situation

Compare that with, for example, an attempt to build “something”. We don’t have any big data science ideas, so we’re just trying to improve the product somewhere.

The ROI of those two situations is totally different. Can our team spend much time and many resources on the second one if there just isn’t enough ROI? I suppose not. We might instead try to come up with some new hypotheses to explore.
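A back-of-envelope comparison makes this concrete. The numbers below are entirely made up; only the shape of the arithmetic matters:

```python
# Toy ROI comparison with made-up numbers; only the shape of the arithmetic matters.
def roi(expected_return: float, cost: float) -> float:
    return (expected_return - cost) / cost

# Situation 1: two weeks of work to retain a hypothetical $200k/year client.
print(roi(expected_return=200_000, cost=15_000))   # ~12.3

# Situation 2: the same effort spent on a vague "improve something somewhere" idea.
print(roi(expected_return=5_000, cost=15_000))     # ~-0.67, i.e. negative ROI
```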

Ok, what to do with all those requirements?

You need to communicate them to your team. A better way to do it is to document all your thoughts and send them prior to scheduling a call with your team.

That gives the team enough time to understand what their manager is trying to achieve, and the call itself becomes more about clarifying details here and there. It’s simply good managerial practice: this technique saves the team a lot of time.

Conclusion

Back to the first point: all models are wrong. It’s okay for them not to be as successful as we dreamed, as long as:

  • they solve a problem
  • do better than simpler solutions (e.g., regular expressions, simple programming algorithms, math)
  • have a positive ROI

Even awesome ML models like GPT-3 or DALL-E 2 are not 100% accurate! Don’t stress about it.

Just help your team find the right solutions by providing better requirements. This is a very underrated type of communication, but it leads to healthier relationships with the team and a better product.

Let me know your thoughts on it. Should we add something, or is there some point you don’t like?

Hopefully, this article helped you somehow. Feel free to contact me on LinkedIn or leave a comment.
