Three Lessons for PMs Building Decision Products

Sam Stone · Published in Open House · Dec 4, 2020 · 10 min read

In this post, we discuss what “decision products” are and detail three lessons we’ve learned at Opendoor about building and using machine-learning-powered decision products:

  1. Know your user and your data — they’re different
  2. Nail the northstar metric that aligns your algorithm with your business
  3. Regularly reconsider the rules of the game for each decision product

What are decision products?

A decision product is a product that uses an algorithm to recommend or make a decision that helps users. For example, what content will a user be most interested in seeing (e.g., a social network newsfeed)? Or how should we price this service to balance supply and demand (e.g., dynamic ride-share pricing)?

We intentionally call them decision products, rather than algorithms or models, to reflect that these products need the attention of product managers just as much as UI-focused products do. It's critical that product managers devote attention to decision products because they present not only data science challenges, but deep business, design, policy, and ethical challenges too. So, yes, while a decision product uses an algorithm, there's much more to it than that.

Plus, there's enormous business upside from good decision products. And bad decision products carry enormous business and societal downsides. If you've read the headlines in the past few years, you've probably seen examples of people adversely impacted by biased algorithms, with ramifications that reach far beyond the businesses that built them.

If you’re not familiar with Opendoor’s business model or the concept of instant home buying (aka “iBuying”), check out this Opendoor overview — the rest of this post will draw heavily on real estate examples from Opendoor’s business.

Lesson 1: Know your user and your data — they’re different

Start with the real world, not the data. Many factors that matter to users are difficult to quantify or categorize, and as a result they don't show up in most machine-readable datasets. If you try to understand the user problem purely from a data-driven perspective, you'll miss these hard-to-capture elements of the user experience.

For example, a factor top of mind for most homebuyers is the home’s condition. But this is a notoriously subjective and difficult-to-quantify factor (much harder than square footage, bedroom count, year built, etc.), so most listings databases don’t even attempt to include it. However, user interviews indicate it’s normally one of the top factors on buyers’ minds when determining how much to offer.

Do these homes look identical? Listing data says they are, even though most buyers would agree the home on the right is in much better condition, having had its kitchen upgraded recently

The real world changes. So should your data. It’s obvious that data needs to be updated. What’s less obvious is that the structure of data also needs to be updated regularly, and sometimes quite suddenly.

For example, in one of our markets, our pricing accuracy had been steadily improving until mid-2017, when we started seeing a flood of inaccuracies. We checked all our systems and couldn’t find any bugs in the code. So then we started talking to home sellers and buyers in that market, and we learned about something that was significantly coloring their view of the market but that our pricing model simply wasn’t capturing: a recent storm had damaged thousands of homes.

All of a sudden, storm damage, or suspicion of storm damage, had emerged as a key factor in buyers' willingness to pay. Storm damage had not been an important factor historically, so our model still looked good when backtested against past data. But the model was performing poorly in this new, storm-affected reality.

Once we understood how buyers’ preference structure had changed, the fix was simple: ask sellers if their home had sustained storm damage recently and price accordingly. But we had to change our data structure to get there.

Understand the layers between the algorithm and the user. It’s common for the UI in which algorithm results are displayed to be owned by a team downstream of the decision product team itself. It’s important the decision product team understands the details of the UI.

For example, when Opendoor offers to buy a home from a seller, we explain our offer using comparable sales (or “comps”). We’ve found it helps build sellers’ trust that their offer from Opendoor is competitive. In addition to ensuring that our decision product generates an accurate offer price, we need to make sure it reveals the comps that factored into the offer price. If we ignored the UI and kept the comps hidden, we still could’ve produced an accurate price for our offer. However, we would’ve missed the opportunity to strengthen trust with our customers through transparency.

An Opendoor offer to a homeowner, showing the comps used to generate the offer price
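To make that point concrete, here is a minimal, hypothetical sketch (not Opendoor's actual API) of what it means for the decision product to expose its evidence: the output carries the comps alongside the price, so the downstream UI team can show sellers the "why," not just the number.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Comp:
    """One comparable sale used as evidence for an offer (illustrative fields)."""
    address: str
    sale_price: int
    note: str  # e.g. "same floor plan, two blocks away, sold last month"

@dataclass
class Offer:
    """The decision product's output: a price plus the comps behind it."""
    offer_price: int
    comps: List[Comp] = field(default_factory=list)
```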

Lesson 2: Nail your northstar metric

When you work with decision products and data scientists, it’s easy to be overwhelmed by metrics. Luckily, not all metrics are created equal. As a product manager, you don’t need to know every single detail of every single metric used to assess an algorithm. But it’s important to understand at least one metric in full detail: the model training error metric, or what we call the “northstar” metric. To understand this, let’s look under the hood of a “learned” algorithm:

An algorithm or model takes in historical data, which includes inputs (aka independent variables or “features”) and outputs (aka dependent variables or “labels”). It then learns a relationship between the inputs and outputs that minimizes the error of its predictions versus historical “ground truth” outcomes.

The specifics of the historical ground truth, and the definition of error, are critically important. This is the crux of the model, and if these aren’t aligned with fundamental business and user needs, that’s not something that can be overcome by downstream improvements — the model is rotten at the core.
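As a minimal sketch of that loop (with made-up data and a simple model, not Opendoor's), the "northstar" below is the training error metric: median absolute percentage error between predicted and actual sale prices on historical ground truth.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical historical data: inputs ("features") and outputs ("labels").
history = pd.DataFrame({
    "square_feet": [1400, 2100, 1750, 3000],
    "bedrooms":    [3, 4, 3, 5],
    "year_built":  [1995, 2008, 1978, 2015],
    "sale_price":  [310_000, 455_000, 365_000, 640_000],  # ground truth
})
features = history[["square_feet", "bedrooms", "year_built"]]
labels = history["sale_price"]

# Learn a relationship between inputs and outputs...
model = GradientBoostingRegressor().fit(features, labels)

# ...and score it with the northstar metric. In practice you'd compute this on a
# held-out backtest window, not the rows the model was trained on.
predictions = model.predict(features)
median_ape = np.median(np.abs(predictions - labels) / labels)
print(f"Median absolute % error: {median_ape:.1%}")
```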

When we first started building a recommendation product for home shoppers, we optimized the v0 algorithm for clicks. It showed homes like the mansion on the left. Fun to look at, but few users are in a position to buy these homes.

A better approach was to focus our model on homes that buyers could actually offer on, not just click on. That algorithm showed homes that look like those on the right. They actually got fewer clicks — but they received more offers and were more likely to lead to sales, which is what ultimately matters for our users and our business.
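In code, the change is small but the consequences are large. A hedged sketch with hypothetical session data: the features stay the same, and only the label (and therefore the northstar) moves from "did the user click?" to "did the user make an offer?"

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical browsing sessions; column names are illustrative only.
sessions = pd.DataFrame({
    "list_price":     [2_500_000, 350_000, 420_000, 380_000],
    "square_feet":    [8_000, 1_600, 1_900, 1_750],
    "days_on_market": [120, 14, 21, 9],
    "clicked":        [1, 0, 1, 1],
    "made_offer":     [0, 0, 1, 1],
})
features = sessions[["list_price", "square_feet", "days_on_market"]]

# v0: the label is clicks, so the model learns to surface eye-catching mansions.
clicks_model = LogisticRegression(max_iter=1000).fit(features, sessions["clicked"])

# v1: the label is offers, so the model learns to surface homes buyers can act on.
offers_model = LogisticRegression(max_iter=1000).fit(features, sessions["made_offer"])
```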

Monitor your model in multiple ways. We’ve learned to monitor our decision products in different ways. To understand this, let’s expand on the diagram of a learned algorithm to include the serving layer (aka “live data”). We train our model on historical data, where ground truth is known; we serve on live data, where ground truth is not known.

The first metric we monitor is what we described above: accuracy in model training, an assessment of how well the model can understand past data. We retrain most of our models daily or weekly, and, if we see backtested accuracy decline, we investigate and do not promote the new model into production. However, this is not enough; we also need to check our live inputs and outputs.
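The retrain-and-promote gate can be as simple as the following sketch (the names and tolerance are assumptions, not our production tooling): compare the candidate's backtested error to the current model's and hold it back if accuracy has declined.

```python
def should_promote(candidate_error: float,
                   production_error: float,
                   tolerance: float = 0.0) -> bool:
    """Promote the retrained model only if its backtested error is no worse
    (within `tolerance`) than the model currently serving in production."""
    return candidate_error <= production_error + tolerance

# Example: production backtests at 4.2% median error, today's candidate at 4.9%.
# Hold the candidate back and investigate instead of promoting it.
print(should_promote(candidate_error=0.049, production_error=0.042))  # False
```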

Monitoring inputs. A key assumption of algorithms is that their production, or live serving, environment looks similar to their training environment. That means we need to ensure that live serving data doesn’t differ in unexpected ways versus training data.

For example, in one of our markets, we began noticing that our backtested accuracy had degraded. After doing some digging, one of our data scientists put together this perplexing chart: list prices for most homes in this market that met our purchase criteria followed a nice bell-curve distribution centered around $500K, but a small group of much less expensive listings, priced at just over $50K, had begun to appear.

Why were homes listed for $50K? Were these incredible deals? Not exactly.

We investigated and found that the less expensive homes were not for sale at all; they were for rent. The low sale prices were actually (very) high monthly rents! It turned out we weren't explicitly filtering out "RENTAL" listings, as opposed to "FOR SALE" listings. We had only excluded listings below a certain price threshold, assuming no monthly rent would ever exceed that threshold. But rents had increased, that assumption had been violated, and our model was confused by what it thought were mansions being sold for very low prices!

The immediate fix was to exclude rentals, but the longer-term fix, which was applied more broadly, was to check the shape of the distribution of each input and make sure it wasn’t changing significantly in any short period.
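A minimal sketch of that distribution check, assuming a simple two-sample Kolmogorov-Smirnov test and an illustrative alert threshold (not our exact monitoring stack):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def input_drift_alerts(training_df, live_df, columns, p_threshold=0.01):
    """Flag numeric inputs whose live distribution differs sharply from training."""
    drifted = []
    for col in columns:
        result = ks_2samp(training_df[col].dropna(), live_df[col].dropna())
        if result.pvalue < p_threshold:
            drifted.append((col, result.statistic, result.pvalue))
    return drifted

# Simulated example: training prices form a bell curve around $500K, but the live
# feed is contaminated with ~$50K "sales" that are really monthly rents.
rng = np.random.default_rng(0)
train = pd.DataFrame({"list_price": rng.normal(500_000, 80_000, 5_000)})
live = pd.DataFrame({"list_price": np.concatenate([
    rng.normal(500_000, 80_000, 950),
    rng.normal(50_000, 5_000, 50),
])})
print(input_drift_alerts(train, live, ["list_price"]))  # flags list_price
```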

Monitoring outputs. We also want to monitor our live predictions, which are used to send out offers to homeowners who want to sell their home to Opendoor.

Evaluating any one live prediction is very hard. Our algorithm always thinks its prediction is good, but how can we tell? If the seller accepts it, that probably means it’s not too low — but maybe it’s too high. And if a seller rejects it, that probably means it’s not too high — but maybe they rejected it for some other reason besides it being too low. Evaluating any one offer, or user session, is a bit like rolling the dice.

However, we can get a better idea of our live algorithm output by looking at offers in aggregate, for example, by hour or by day. While we don’t know the right answer for any particular offer, we have a reasonable idea of “ground truth” in the aggregate — if too few offers were accepted, we’re probably biased low; if too many offers were accepted, we’re probably biased high. So we set up distribution checks, which alert our teams if a group of offers behaves significantly differently from what we expect.
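Here's a hedged sketch of such an aggregate check, with an illustrative expected acceptance rate and band rather than our real thresholds:

```python
from typing import Optional

def acceptance_rate_alert(accepted: int, sent: int,
                          expected_rate: float = 0.30,
                          band: float = 0.10) -> Optional[str]:
    """Warn if a day's offer acceptance rate falls outside the expected band."""
    rate = accepted / sent
    if rate > expected_rate + band:
        return f"Acceptance {rate:.0%} is unusually high; offers may be biased high."
    if rate < expected_rate - band:
        return f"Acceptance {rate:.0%} is unusually low; offers may be biased low."
    return None

print(acceptance_rate_alert(accepted=55, sent=100))  # flags a possible high bias
```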

Lesson 3: Regularly reconsider the rules of the game

Let’s return to the decision product diagram. We’ll make one addition — raw data inputs — in the bottom left. At Opendoor, this is how, for a long time, we thought about some of our pricing products. We came to believe that it covered all the major strategies for improving the product.

We had created “rules of the game” that were very linear: take in market data, train an algorithm on it, make offers. What was missing was a feedback loop, which would allow users (home sellers) to improve their offer. And sellers had a ton of feedback, especially if our offers were too low!

For many homeowners, their home is their most valuable asset, so they are often very conscious of their home's value. For example, when a homeowner thinks their home is worth $300,000, and Opendoor offers only $250,000, they want to give us an earful. And homeowners often have knowledge of their home and neighborhood that Opendoor doesn't. So if we show them the "why" behind our offer, they can point out our mistakes! For example, the seller may know of a particular comp that would push up the price, or may spot a comp we included that isn't actually similar to their home and should be excluded.

What’s notable here is that our incentives are truly aligned with our customers, even though Opendoor is the buyer and the customer is the seller. The homeowner obviously wants a higher offer. And we actually want to give them a higher offer, assuming it’s justified by good evidence. At Opendoor, we start and end with our customers, so our goal is to provide the most competitive and fair offer we can.

The solution? Change the “rules of the game.” By including an explicit feedback loop, we give sellers the ability to provide us new data (or warn us off bad data). With this new process, we could verify feedback and generate new offers.
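As a simplified, hypothetical sketch of those new rules (the names and structure here are illustrative, not our actual system): verified seller feedback corrects the comp set, and a fresh offer is generated from the updated data.

```python
from typing import Callable, Dict, List

def revise_offer(comps: List[Dict],
                 seller_feedback: List[Dict],
                 verify: Callable[[Dict], bool],
                 price_from_comps: Callable[[List[Dict]], int]) -> int:
    """Drop comps the seller flagged (once verified) and re-price the home."""
    excluded_ids = {fb["comp_id"] for fb in seller_feedback
                    if fb["type"] == "exclude_comp" and verify(fb)}
    corrected = [c for c in comps if c["id"] not in excluded_ids]
    return price_from_comps(corrected)
```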

Building good decision products is not easy. And the stakes are high, as poorly designed decision products can have deeply negative effects on businesses. But the lessons discussed above can give you more confidence that your decision products are making good choices.

Since Opendoor's founding, we've built our business around decision products. We expect the role of algorithms and machine learning will continue to grow, especially as we expand our products and user base in our mission to empower everyone with the freedom to move.

Interested in working on products at the intersection of B2C, finance, and machine learning? Check out our jobs page! We’re hiring for product managers, engineers, data scientists, and many other roles!
