Data: A key requirement for your Machine Learning (ML) product

Clemens Mewald
Published in The Launchpad
Sep 27, 2018 · 8 min read

As Product Managers, we have to play the product equivalent of three-dimensional chess by trying to solve for user, engineering, marketing, business, and other cross-functional interests. However, focusing on users first typically yields the best results. One tool in the PM’s toolbox to get started is the Product Requirement Document (PRD). It should, to the extent possible, capture the requirements a product needs to fulfill in order to meet users’ needs. This post explains how to talk about data in an ML-aware PRD.

Before we jump in, a quick disclaimer: ML is not a requirement. I have seen PRDs that mentioned “using reinforcement learning to optimize the reward for the user” as a requirement. That is usually a red flag. ML, in most cases, is a means by which a user requirement is met, not the requirement itself. After all, you wouldn’t say that you need a “distributed data processing job to crunch through websites to serve search results”, would you?

Why a PM needs to think about data

Why would a PRD include requirements about data? Because how data are collected and used will have a material impact on the product.

First, you have to consider what you are trying to predict (the output of your ML model) and whether you have the necessary feedback mechanisms in place. Let’s say you are trying to serve only the most relevant notifications to your users. How do you know if a notification is relevant? You could provide a mechanism that lets users “swipe away” notifications. However, does this feedback tell you that they were annoyed by the notification and wanted to get rid of it, or that the notification was useful and they are simply “done” with it? You may want to do some UX research before you settle on exactly how you define this feedback.

Second, you have to consider what data about the notifications (the inputs to your ML model) are necessary in order to make these predictions. What features might be useful? Think about the nature of the notification (e.g. informative, call-to-action) or the time of day when the notification was served. Those are a good start, but I am sure that, with some domain knowledge, you can come up with many more.
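To make this concrete, here is a minimal sketch of what a single labeled training example for such a notification model might look like. All field names and values are hypothetical, purely to show how features pair with a label:

```python
# Hypothetical labeled training example for notification relevance.
# Field names and values are illustrative, not from a real system.
training_example = {
    # Features: what we know about the notification when it is served.
    "notification_type": "call_to_action",  # e.g. "informative", "call_to_action"
    "hour_of_day": 14,                      # local hour when it was served
    "days_since_last_open": 3,              # recency of the user's engagement
    # Label: the explicit feedback the user gave when dismissing it.
    "was_useful": True,
}
```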

Thinking through both the inputs and the outputs of ML models carefully, and then being sure you actually do have that data available, is exactly the level of foresight required of ML Product Managers.

Data make the ML feature possible

Figuring out what data are needed for a specific product or feature is the first and most important step in scoping data requirements. Machine learning models are nothing more than mathematical functions that take features as inputs, produce predictions as outputs, and learn from patterns observed in the training data how best to map those inputs to outputs.

To illustrate my point, I will use housing prices as an example. Below is a table of five actual houses for sale in the Bay Area. [Table: each house’s sqft, bedrooms, bathrooms, and zip code, plus its asking price.] For the sake of clarity, note that the known attributes of the houses are referred to as features, and the value that should be predicted is referred to as the label. Note also that you could choose another label to predict (e.g. using sqft, bedrooms, bathrooms, and price you could predict the zip code a house is in).

The above example for housing prices helps illustrate my first important reminder: you need labeled data for supervised machine learning (I will not cover unsupervised or semi-supervised ML in this post). Having a set of houses for which we know both the features and the labels means that we have labeled data. The last row contains a house for which we don’t know the price. We can use the ML model that we train to predict this value, given the known attributes (features). In the introductory example of notifications, the label was the feedback the user provided when they dismissed the notification. If they indicated that the notification was useful, that is a positive label.
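To sketch how such labeled data feed a supervised model, the snippet below trains a simple linear regression on a few made-up houses (the numbers are purely illustrative, not the listings referenced above) and then predicts the price of a house whose label we don’t know:

```python
# Minimal supervised-learning sketch with hypothetical housing data.
# Requires scikit-learn; all numbers are made up for illustration.
from sklearn.linear_model import LinearRegression

# Features: sqft, bedrooms, bathrooms (zip code omitted here because,
# as a categorical feature, it would first need to be encoded).
features = [
    [1500, 3, 2],
    [2100, 4, 3],
    [900, 2, 1],
    [1750, 3, 2],
]
labels = [850_000, 1_200_000, 620_000, 980_000]  # known prices in USD

model = LinearRegression()
model.fit(features, labels)  # learn the mapping from features to label

# The "last row": a house whose price (label) is unknown.
unknown_house = [[1600, 3, 2]]
print(f"Predicted price: ${model.predict(unknown_house)[0]:,.0f}")
```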

In many product use cases, there is a natural source of labeled data: logs. In my previous blog post I covered an example from Google Forms which, based on the question prompt, automatically picks the question type. In the past, users have provided question prompts and then picked an appropriate question type manually. The question prompts are your features, and the question types that were selected manually are your labels. Here is an example of what this could look like:
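The sketch below mines such logs into labeled examples; the records and field names are hypothetical, not Google Forms’ actual data model:

```python
# Hypothetical log records from a form builder.
logs = [
    {"prompt": "What is your email address?", "chosen_type": "short_answer"},
    {"prompt": "How satisfied are you with our service?", "chosen_type": "linear_scale"},
    {"prompt": "Which toppings would you like?", "chosen_type": "checkboxes"},
]

# Each log record becomes one labeled training example:
# the prompt is the feature, the manually chosen type is the label.
training_data = [(record["prompt"], record["chosen_type"]) for record in logs]
```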

Data acquisition needs a strategy

What data (or features) do you need?

In some cases, Product Managers’ or engineers’ imaginations are limited to the data they know are readily available. However, more often than not you can enrich those data with features from other data sources. It’s a good exercise to start by brainstorming a list of potential features that may be useful for a given ML task, disregarding feasibility or cost.

Let’s say you are trying to predict house prices and all you have is a table with sqft, bedrooms, bathrooms, and zip code. Could you think of other features that you could obtain that would help with this task? Perhaps the age of the house (which can be obtained from public records), or proximity to the closest grocery store (which can be computed using public maps data)?
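As an illustration of computing such an enrichment feature, the sketch below derives “distance to the nearest grocery store” from coordinates using the haversine formula. The house and store coordinates are hypothetical:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical house and grocery-store coordinates (lat, lon).
house = (37.7749, -122.4194)
grocery_stores = [(37.7793, -122.4192), (37.7680, -122.4300)]

# New feature: distance to the nearest grocery store.
nearest_store_km = min(haversine_km(*house, *store) for store in grocery_stores)
```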

Once you have a list of potential features that could help with the prediction task, you can prioritize by availability (do those data exist?), accessibility (do you have the rights and consent needed?), and cost (how costly is it to collect those data?).

How much data do you need?

The answer to the question “how much data are required?” could fill a textbook, but from a PM’s perspective, a few highlights suffice to initiate conversations with the engineering team.

  • In most cases, more data is better than less.
  • If little or no data are available, transfer learning may help. In short, transfer learning allows you to take data and/or ML models from one task (e.g. classify dog breeds) and apply them to other tasks (e.g. classify cars). More on that in a future blog post.
  • In cases where acquiring labeled data costs money (and/or time), define a goal of where you want to get to (in terms of model quality/performance) and a threshold of how much money/time you are willing/able to spend.
  • At some point, more data will not help.
  • If you are looking for more information on this topic, try searching “power analysis for testing distribution similarity”.

To illustrate these statements, here is a simplified graph that highlights the potential situations in which acquiring more data may or may not be beneficial. [Figure: model quality plotted against the amount of training data; performance climbs steeply at first and then flattens out.]

The assumption is that most ML problems are on the steep part of this curve, i.e. acquiring more data will lead to better performance. However, in some cases, where a great deal of labeled training data already exists, there could be diminishing returns, i.e. training on more data doesn’t improve the model quality.

This may seem obvious. However, it is crucial to know where on this curve you lie in order to make an educated decision on whether acquiring more data is worth the expected increase in performance. For cases where you need to acquire labeled data, there are several options available to you (see this previous blog post).
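One pragmatic way to estimate where you are on the curve is to train on increasing fractions of the data you already have and watch how validation performance changes; if the last few increments barely help, you are probably in the flat region. Here is a minimal sketch using scikit-learn’s learning_curve utility, with placeholder data and a placeholder model:

```python
# Sketch: empirically estimate where you are on the data/performance curve.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Placeholder data; substitute your real features (X) and labels (y).
X = np.random.rand(500, 3)
y = X @ np.array([3.0, -2.0, 1.5]) + 0.1 * np.random.randn(500)

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

# If the validation score flattens as the training size grows,
# acquiring more data is unlikely to improve the model much.
for size, scores in zip(train_sizes, val_scores):
    print(f"{size:4d} examples -> mean validation score {scores.mean():.3f}")
```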

Are your data of good enough quality?

The quality of your features and labels can vary greatly. If you are scraping your logs for these data, it is very likely that you will obtain some nonsensical training examples. In the example of Google Forms, some of your logs may contain differing opinions about the optimal question type for a given question prompt. In some cases, this may be intended (an ML model should not be too confident about a prediction if your users disagree about the best question type). In other cases, sources of bad training examples can be as simple as inconsistently formatted feature values or semantics (e.g. some house prices are represented in thousands, others in millions).

At this early stage (writing a requirements document), there is not much you can do about this. Maybe you already have access to those data and can perform spot checks to find problematic examples. In any case, it is important to highlight potential areas of concern for your engineering team so that they can focus their efforts on investigating data quality.
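Even simple spot checks can surface such issues early. The sketch below flags hypothetical house records whose price per square foot falls outside a plausible range, which would catch prices recorded in the wrong units:

```python
# Simple spot check on hypothetical records: flag labels that look off
# by orders of magnitude (e.g. prices recorded in thousands vs. dollars).
records = [
    {"sqft": 1500, "price": 850_000},
    {"sqft": 2100, "price": 1_200},  # likely recorded in thousands
    {"sqft": 900, "price": 620_000},
]

for record in records:
    price_per_sqft = record["price"] / record["sqft"]
    if not 100 <= price_per_sqft <= 2000:  # illustrative plausible range
        print(f"Suspicious record: {record} ({price_per_sqft:.2f} $/sqft)")
```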

In addition, if there are concerns about the quality of your training data, this is a red flag that you should raise in your PRD: poor training data potentially reduce the quality of your ML model and increase the execution risk.

A note on privacy and security

Whenever you deal with data you should take the utmost care about how you store and process those data. Before you start collecting data, check with privacy and security experts on what you can and cannot do. Even within those constraints, consider the right thing to do by your users.

When you set out on your journey, always take the time to think about what benefits your users gain from any specific ML-driven feature. Be sure that those benefits warrant the use of the required data and that your users agree with that assessment. To continue with the notifications example above, allowing users to give explicit feedback on whether a notification was useful may be acceptable, and users may be happy to provide this feedback in exchange for more relevant notifications.

Data requirements checklist

Below is a checklist of specific questions you should initially ask. Depending on where you are in the product life cycle (e.g. conceiving a new product, adding an ML feature to an existing product), these will be more or less relevant.

Data requirements

  • What data are needed/desired?
  • Which features are already known, and which ones would be useful?
  • Are those features available? If not, how costly is it to acquire them?

Data acquisition strategy

  • Where are the above-mentioned data coming from?
  • Are there any quality concerns for existing data?
  • How much data do you think will be needed?

Privacy and security

  • Are data stored and processed in a secure manner?
  • Do you have permission to collect/use the data?
  • From a user perspective, do the benefits of a feature/product outweigh potential concerns they may have with providing data?

In summary, an important way for PMs to contribute to ML products and features is to closely observe the data requirements needed to enable magical product experiences through ML. Although ML is not a requirement, data considerations should appear in PRDs because they can impact the way a product is designed and how users provide feedback through the product. In future posts, I will address other areas that should be covered in PRDs, e.g. how to define metrics and goals for the launch readiness of an ML-powered product/feature.

Clemens Mewald is a Product Lead on the Machine Learning X and TensorFlow X teams at Google. He is passionate about making Machine Learning available to everyone. He is also a Google Developers Launchpad mentor.
