This is part 5 of the 6-part tutorial, The Step-By-Step PM Guide to Building Machine Learning Based Products.
We previously discussed how to set up your organization to work effectively. Now let’s see how to help your users benefit from the results.
ML Models and Results Are Not Easily Explainable
Many ML algorithms are black boxes: you feed in a lot of data and get back a model that works in mysterious ways, which makes the results difficult, sometimes impossible, to explain. Many algorithms also contain interaction effects, which make the models even trickier to explain. These are relationships between features that account for parts of the model’s behavior that cannot be explained by simply adding up the effect of each feature individually. Think of it as a compounding effect between features: the whole is greater than the sum of its parts, in strange and intricate ways that are hard for a human to digest.
That said, you and your team will need to be convinced that the results make sense, and that’s easier if you can understand them at a level beyond dry statistical metrics. Understanding the results also helps you identify cases that are not covered, or areas where the results don’t make sense, as we saw in the model building phase. This is even more critical with your users: in many cases they will require an explanation before they trust your results. You won’t have credibility from the get-go, even if your results are 100% accurate, so users may demand an explanation of the outcomes you’re showing them. In extreme cases you may even be legally obligated to explain a result; when rejecting a loan application, for example, by law you have to give the customer a reason for the rejection. To add complexity, your model will not be 100% accurate (even 80% accuracy is considered quite good), so in some cases users will want to understand results that are actually wrong! In other words, the bar is for users to see blatant errors once in a while (which they likely will) and still trust your results overall. That’s a high bar to meet.
The challenge is not limited to models used by external customers; the need to earn user trust applies to internal users as well. Even though they’re rooting for you, internal teams are a lot less likely to adopt results they don’t understand or trust. I’ve seen cases where teams preferred an easy-to-understand rules engine over an ML model likely to produce significantly better results, simply because a rules engine is explainable: humans write the rules, so humans can understand them.
This is not a problem you can set aside until after you’ve built the model(s). It is important to think in advance about the data and model components that users may want to see, and about how to present results in a way that builds their trust. The answer may change your approach to building the model, and it will help you avoid a situation where you have an answer and no way of explaining it to a user. That said, thinking about this in advance does not exempt you from revisiting it once you have results: model building is inherently iterative, and if your model has changed, extensive changes to your presentation approach may be required.
Modeling for Explainability
The need for explainability may influence the way you build your model, including the level of granularity you need to support. Let’s say we’re building a platform for investors to evaluate startups based on Marc Andreessen’s framework, which holds that the three core elements of every startup are team, product, and market. Andreessen believes the market is the most important factor, but suppose other investors believe a good team can find ways to grow a small market, and therefore that the team matters more. You can come up with an overall score or probability of success that combines these three dimensions into some “absolute truth” you believe is correct, but investors may not necessarily buy it. More specifically, investors may want to understand how good a company is on the particular dimension they care most about, so in addition to your model’s overall output, you may need to give them visibility into that dimension. Here are two different approaches:
- Build three separate models for market, product, and team, each evaluating the company on a single dimension. Then build an aggregate model that combines these features (and potentially many others) into an overall result. Investors can look at both the aggregate result and the specific dimension(s) they care most about.
- Build a single aggregate model, then extract from it the features most relevant to each dimension and give users a sense of their value and importance, or show data points that align with each dimension to build confidence in the results.
The right approach depends largely on your problem space, available data, modeling approach, and so on, but it should be explicitly discussed and evaluated before you move on to prototyping the model.
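To make the first approach concrete, here is a minimal sketch of how per-dimension sub-model outputs could feed an aggregate result while remaining individually visible. The scores, weights, and function names are made up for illustration; a real aggregate model would likely be learned rather than a fixed weighted sum.

```python
# Hypothetical sketch: three sub-models score a startup on market,
# product, and team (each in [0, 1]); an aggregate "model" (here just
# a weighted sum) combines them. All numbers are invented.

def aggregate_score(market: float, product: float, team: float,
                    weights=(0.5, 0.25, 0.25)) -> float:
    """Combine per-dimension scores into one overall result."""
    w_market, w_product, w_team = weights
    return w_market * market + w_product * product + w_team * team

# Per-dimension scores stay visible next to the aggregate, so an
# investor who weighs team most heavily can still inspect that score.
scores = {"market": 0.8, "product": 0.6, "team": 0.9}
overall = aggregate_score(**scores)
```

The key design point is that the per-dimension scores are first-class outputs, not internal intermediates, so the UI can surface them alongside the overall result.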
Presenting Results to Users
When deciding how to show results, the goals should be to make them clear, believable, and, most importantly, actionable. There’s no playbook here; every problem calls for a different presentation of its results. I’ll review a few possible approaches and considerations to give you some ideas.
- Backdating. Take historical data and plug it into the model to produce past “predictions” that can be verified against known values. For example, if you build a model that predicts values in year N based on data from year N-1, you can plug in data from two years ago and check whether the prediction for one year ago is correct, since that outcome is already known. This is also a potential way to test your models. Depending on the completeness of your historical data, it may be challenging to get enough coverage without simplifying assumptions and/or model tweaks (e.g., if your model uses data from social networks, you can’t backdate to a time before social networks existed). It is also not feasible for certain models, such as reinforcement learning models.
- Explaining your methods and inputs. Simply telling users what types of data your model takes into account builds trust, by helping them see that its decisions are based on the same kinds of variables they would consider if they had to make the decision themselves. In the Andreessen startup evaluation example, a brief explanation of the market evaluation piece could be: “The market potential of a startup takes into account the number of companies in the market and their total sales, the growth of that market worldwide over the past 5 years, the number of new product launches over time, M&A activity in the space, and macroeconomic trends.” While not a full explanation, it does give the user a glimpse into the black box. This is definitely necessary if you’re introducing a new score, as previously discussed.
- Exposing some of the underlying data. This approach is the easiest for users to understand and believe, since they see the data for themselves, but it is not always the easiest to design, and it has several downsides: you may need to expose data you don’t wish to or can’t expose (because it’s proprietary, due to legal constraints, etc.); the data may not agree with the conclusion in some percentage of cases (modeling is about probabilities); or the data may simply not be there. The algorithm doesn’t need full data coverage for 100% of the entities it evaluates; it can make up for gaps as long as it has a large enough sample of similar entities.
- Simplifying, and showing only select results to facilitate decision making. Unsurprisingly, Amazon does this well with some of its product recommendations. I searched for a knee sleeve; the screenshot below shows the page I landed on from Google search results.
Amazon knows, for each related product, the exact probability that I will buy it. Yet instead of giving me 30 sorted options of similar products, I’m presented with a very easy choice between two: the cheapest, or the best-selling, top-rated product. I know nothing about the criteria used to pick the comparison set, or whether it corresponds to what I would have picked given the option, but at this point they’ve made the decision so easy for me that I don’t really care.
- Defining a new metric. One question to consider is whether you’re creating a new metric (a new type of “score”) or predicting a well-understood one. You can build your model to predict an existing metric (e.g., the revenue of a company or the value of a house); alternatively, you can create a score that embodies a concept without an accepted metric, to enable stack-ranking entities by that concept (“the FICO score for <industry X>”). The decision largely boils down to whether a single metric expresses the business objective you’re trying to reach with your model, or whether it’s a mix of several factors that need to be weighted somehow. For example, if we’re evaluating the attractiveness of a commercial real estate asset for retail use, we may want to create a “retail fit score” composed of several components, such as sales per square foot, all-in cost per square foot, and location brand value contribution (say you believe a Fifth Avenue location adds brand prestige beyond the pure foot traffic it attracts). There’s no single metric that encompasses all of these, so you’ll likely have to model each component individually and then bring them together through some type of relative weighting. An important consideration when going with a new score is that you will likely have to spend more time and effort educating your users about it. Think about having to convince the first financial institutions to use the FICO score when it was introduced…
- Precision doesn’t always matter. Many models output very precise numbers: probabilities, values, etc. If you show users such precise numbers, you risk them taking the numbers more literally than you intend. A home with a predicted value of $583,790 isn’t definitively more valuable than a home with a predicted value of $580,625; the margin of error is probably much greater than the ~$3K difference. Displaying results in such precise terms can be counterproductive, leading customers to read more into the numbers than they should. Consider presenting results as ranges, deciles, grades, or some other less precise measure of value rather than showing the exact output of your algorithm.
- Strategically providing access to raw data. In addition to showcasing the results of its own risk models, Lending Club provides access to its raw data so that others can build their own ML models on top of it. Is this approach relevant for you? Could it drive growth in some other part of your business? Beyond potential monetization options, this approach also helps the ML research community accelerate the pace of innovation. For example, the availability of Microsoft’s COCO and the CIFAR datasets has greatly advanced image classification capabilities.
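The “precision doesn’t always matter” point above can be sketched in a few lines. This is a hypothetical illustration: the $25K bucket size and the grade thresholds are invented, and a real product would tune them to the domain.

```python
# Hypothetical sketch: present a precise model output as a coarse
# range or grade instead of the raw number. Bucket size and grade
# thresholds are made up for illustration.

def value_range(prediction: float, bucket: int = 25_000) -> str:
    """Snap a precise dollar prediction to a $25K range."""
    low = (int(prediction) // bucket) * bucket
    return f"${low:,} - ${low + bucket:,}"

def grade(score: float) -> str:
    """Map a 0-1 model score to a coarse letter grade."""
    for threshold, letter in [(0.8, "A"), (0.6, "B"), (0.4, "C")]:
        if score >= threshold:
            return letter
    return "D"
```

With this presentation, the two homes from the example above ($583,790 and $580,625) land in the same range, which is exactly the message you want the user to take away.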
Again, the choice of user experience depends heavily on the subject matter, the product, and user needs; there’s no one-size-fits-all. It’s entirely possible that none of the options above is remotely relevant for your product. The key takeaway: don’t underestimate the amount of thought and effort the user experience side of the problem requires. Even the best model is useless if users can’t understand, trust, or act upon its output.
An Extra Geeky but Important Note
Explainability is an evolving area of ML research, with researchers actively looking for ways to make models less of a black box. One example is LIME (Local Interpretable Model-Agnostic Explanations), an “explainer” for classifiers (algorithms that map input data into categories or labels) that is applied after a model is built to explain its results in a human-digestible way. Whether it’s relevant and/or sufficient for your purposes depends on your specific case and models.
Another area of research is Layer-Wise Relevance Propagation (LRP), a technique that “deconstructs” the predictions of neural networks to visualize and understand the contribution of individual input variables to the prediction.
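To give a feel for the core idea behind local explainers like LIME, here is a toy sketch in pure Python: it probes a made-up “black box” model around one specific input to estimate how strongly each feature locally drives the prediction. To be clear, this is a crude sensitivity probe, not LIME itself (LIME fits a weighted, interpretable surrogate model over many perturbed samples); the model and input values are invented for illustration.

```python
import random

# Toy stand-in for an opaque trained model, including an
# interaction effect between the two features. Invented numbers.
def black_box(market: float, team: float) -> float:
    return 0.6 * market + 0.3 * team + 0.5 * market * team

def local_sensitivities(model, point, eps=0.01, n_samples=200, seed=0):
    """Estimate a local slope per feature via small random perturbations."""
    rng = random.Random(seed)
    base = model(**point)
    slopes = {}
    for name in point:
        secants = []
        for _ in range(n_samples):
            step = rng.uniform(-eps, eps) or eps  # avoid a zero step
            nudged = dict(point)
            nudged[name] = point[name] + step
            secants.append((model(**nudged) - base) / step)
        slopes[name] = sum(secants) / len(secants)
    return slopes

# Around this particular input, which dimension moves the score most?
slopes = local_sensitivities(black_box, {"market": 0.9, "team": 0.4})
```

For this toy model the answer is checkable by hand: near the chosen point, a small change in `market` moves the output by about 0.8 per unit, versus about 0.75 for `team`, so `market` dominates locally even though both raw coefficients are smaller.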
While the engineering aspects of building an effective machine learning infrastructure are largely outside the scope of this tutorial, keep in mind that product needs can affect engineering decisions and requirements. More on that next.
To read more of my product management writings and to subscribe to my newsletter visit https://producthumans.com/. More content (including ML) coming soon!