Understanding How Data Scientists Understand Machine Learning Models

Using interactive visualization to explore the meaning of machine learning interpretability

Apr 24, 2019

tl;dr: Through an iterative design process with expert machine learning researchers and practitioners at Microsoft, we identified a list of capabilities that explainable machine learning interfaces should support, and designed and developed Gamut, an interactive visualization system that embodies them. We then used Gamut as a design probe to investigate the emerging practice of machine learning interpretability through a user study with professional data scientists.

Published and presented at ACM CHI 2019.

Machine learning (ML) is now being used to address important problems like identifying cancerous cells, predicting poverty from satellite imagery to inform policy, and locating buildings that are susceptible to catching on fire. Unfortunately, these models have been shown to learn and hide biases. Detecting these biases is subtle, particularly for novices, and they cannot be found using common metrics like accuracy. This is troublesome when ML is misused, through intent or ignorance, in situations where ethics and fairness are paramount.

Using Google autocomplete as an estimation, it seems people are split on whether AI and machine learning are helpful or hurtful (or maybe just a crapshoot!).

Lacking explanations for models can lead to biased and ill-informed decisions, such as gender bias in facial analysis systems, historical cultural stereotypes propagated from text data into widely used AI components, and recidivism predictions biased by race. To combat this, there are now legal requirements under the GDPR that enforce a “right to explanation” for any automated decision-making system that could impact a person’s safety, legal, or financial status.

Addressing the above problems, understanding what models have learned, and explaining their predictions is the problem of model interpretability.

But what is interpretability? Everyone seems to agree it describes a human understanding of an AI system, but what exactly should be understood is still open: a system’s internals (e.g., components and models), its operations (e.g., the math), its data mapping (e.g., input and output relationships), or the representation used in an explanation? No formal, agreed-upon definition of interpretability exists yet.

Despite their problems, machine learning models are still being used today. In our work, we instead sought to operationalize interpretability, that is, to turn this fuzzy concept into something more easily usable and actionable. By breaking down interpretability into a suite of techniques, we can more quickly help data scientists ensure machine learning systems work with humans, not against them. This could help tool builders design models and visualizations where model biases and shortcomings are more readily discoverable.

A Design Probe for Understanding Interpretability

Our approach is to use a design probe: an instrument that is deployed to find out about the unknown and return with useful or interesting data. This well-tested method in human-computer interaction consists of building a field-testing prototype that considers the needs and desires of a targeted user population, and then inspiring and encouraging the users to reflect on the prototype and emerging technologies.

We use a design probe to understand the emerging practice of model interpretability, because while building and deploying ML models is now an increasingly common practice, interpreting models is not.

Machine Learning Interpretability Capabilities

To build the prototype, we conducted formative research with 9 professional data scientists at Microsoft who use machine learning on a daily or weekly basis and gave them the following prompt:

Prompt: In a perfect world, given a machine learning model, what questions would you ask it to help you interpret both the model and its predictions?

After all our sessions, we grouped similar questions together and distilled the following 6 capabilities that an explainable interface to ML models should support. (An example question is listed next to each capability in the context of a real-estate model that predicts the price of a home given its features.)

Local instance explanations. Given a single data instance, quantify each feature’s contribution to the prediction. → “Why does that house cost that much?”
Instance explanation comparisons. Given a collection of data instances, compare what factors lead to their predictions. → “What is the difference between these two houses?”
Counterfactuals. Given a single data instance, ask “what-if” questions to observe the effect that modified features have on its prediction. → “What if I added a bedroom to this house?”
Nearest neighbors. Given a single data instance, find data instances with similar features, predictions, or both. → “What are similar homes?”
Regions of error. Given a model, locate regions of the model where prediction uncertainty is high. → “Where is the model wrong?”
Feature importance. Given a model, rank the features of the data that are most influential to the overall predictions. → “What features of the dataset are most important to the model?”
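
To make the list above more concrete, here is a minimal sketch of five of the six capabilities on a toy additive house-price model. Everything in it is an assumption made for illustration: the shape functions, feature names, dollar amounts, and helper functions are hypothetical and are not taken from Gamut or the paper. Regions of error is left out because it would need ground-truth prices or uncertainty estimates, which this toy model does not have.

```python
import math

# Hypothetical shape functions for a toy additive model (illustrative only).
# Each maps one feature value to a price contribution in dollars.
SHAPE_FUNCTIONS = {
    "sqft": lambda v: 100.0 * v,
    "bedrooms": lambda v: 10_000.0 * min(v, 4),  # flattens out after 4 bedrooms
    "age_years": lambda v: -500.0 * v,
}
INTERCEPT = 50_000.0


def explain(instance):
    """Local instance explanation: each feature's contribution to the prediction."""
    return {name: f(instance[name]) for name, f in SHAPE_FUNCTIONS.items()}


def predict(instance):
    """Prediction of the additive model: intercept plus all contributions."""
    return INTERCEPT + sum(explain(instance).values())


def compare(a, b):
    """Instance explanation comparison: per-feature contribution of b minus a."""
    ea, eb = explain(a), explain(b)
    return {name: eb[name] - ea[name] for name in SHAPE_FUNCTIONS}


def counterfactual(instance, **changes):
    """What-if: change in prediction after overriding some feature values."""
    modified = {**instance, **changes}
    return predict(modified) - predict(instance)


def nearest_neighbors(query, dataset, k=1):
    """Nearest neighbors by Euclidean distance over raw, unscaled features.

    A real tool would likely normalize features first; here sqft dominates.
    """
    def distance(other):
        return math.sqrt(sum((query[n] - other[n]) ** 2 for n in SHAPE_FUNCTIONS))
    return sorted(dataset, key=distance)[:k]


def feature_importance(dataset):
    """Feature importance: rank features by mean absolute contribution."""
    totals = {name: 0.0 for name in SHAPE_FUNCTIONS}
    for instance in dataset:
        for name, contribution in explain(instance).items():
            totals[name] += abs(contribution)
    means = {name: total / len(dataset) for name, total in totals.items()}
    return sorted(means, key=means.get, reverse=True)


house_a = {"sqft": 1500, "bedrooms": 3, "age_years": 20}
house_b = {"sqft": 2000, "bedrooms": 4, "age_years": 5}
house_c = {"sqft": 1550, "bedrooms": 3, "age_years": 18}

print(predict(house_a))                      # 220000.0
print(explain(house_a))                      # "Why does that house cost that much?"
print(compare(house_a, house_b))             # "What is the difference between these two houses?"
print(counterfactual(house_a, bedrooms=4))   # "What if I added a bedroom?" -> 10000.0
print(nearest_neighbors(house_a, [house_b, house_c]))  # "What are similar homes?"
print(feature_importance([house_a, house_b, house_c]))
```

With an additive model like this one, the local questions are cheap to answer because a prediction is just a sum of per-feature terms. For other model classes, each capability would need its own explanation technique.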

While there is no guarantee of completeness, we found this list to be useful for operationalizing interpretability in explainable ML interfaces.

Picking Models to Explain

To test how data scientists use an interface to explain machine learning models, we also need models to explain! Selecting the probe’s model class required balancing many ideal characteristics: the model should provide a global explanation (an explanation that roughly captures everything a model has learned), be easy to understand computationally, show a clear contribution of each feature to the entire model, and remain highly accurate and realistic.

Of course, there is no model that is optimal for all these requirements. However, one model that has recently attracted attention in the ML community lends itself well to our desires: the generalized additive model (GAM). Thanks to modern ML techniques, GAM performance on predictive tasks on tabular data competes favorably with more complex, state-of-the-art models, yet GAMs remain intelligible and more expressive than simple linear models. And understanding a GAM requires only the ability to read a line chart. Lastly, GAMs have local explanations (e.g., instance prediction explanations), but also lend themselves to global explanations (e.g., model explanations), which other models lack; this allows us to test the relative value users place on having global understanding versus a purely local understanding of a model.

We won’t go into the details of GAMs here, but they closely resemble traditional linear models, like linear regression.
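
As a rough sketch of that relationship, the two model families can be written side by side in standard textbook notation. The symbols below are assumptions for illustration and are not taken from the paper: x_j is the j-th feature, β_0 an intercept, f_j a learned per-feature shape function, and g a link function (the identity for regression).

```latex
% Linear regression: each feature contributes a straight line.
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p

% GAM: each feature contributes through its own learned shape function f_j.
g\big(\mathbb{E}[y]\big) = \beta_0 + f_1(x_1) + f_2(x_2) + \dots + f_p(x_p)
```

Because each f_j depends on a single feature, it can be drawn as a line chart of contribution versus feature value, which gives a global explanation of that feature. For one data instance, the values f_j(x_j) are the per-feature contributions that make up a local explanation.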

Gamut: Interactive Visualization Design Probe

Using the capabilities listed above, we designed and developed Gamut: an interactive visualization system that tightly integrates three coordinated views to support the exploration of GAMs: the Shape Curve View shows global explanations of entire features on a model with data density histograms for context, the Instance Explanation View shows local explanations for single data instances and supports interactive visual comparison between instances, and the Interactive Table shows the raw data and can be sorted, filtered, and can compute nearest neighbors based on query instances. The three views are tightly integrated and interactive, and embody our 6 capabilities from above.

The Gamut interactive interface integrates three main views to support model exploration and explanation.

For concrete examples of using Gamut, check out the paper; for a demo of Gamut and an explanation of each of the views and visualizations, check out our overview video.

User Study

We ran a user study where 12 professional data scientists at Microsoft spent ~1.5 hours each using Gamut to understand different models. We asked the participants to think aloud so we could get insight into their analysis processes. The participants were first given the features of a dataset and asked to write down questions they would have about a model trained on said data. They then used Gamut to answer these questions, as well as other questions we had prepared for them beforehand.

Takeaways

People need interpretability for different reasons, so consider interpretability capabilities for future interfaces. 🔬

We found that interpretability is not a singular, rigid concept. While most people agree interpretability should describe a human understanding of a machine learning system, the community hasn’t yet agreed on what part of the system should be understood. This is reflected in our work interviewing professional data scientists at Microsoft, and motivated us to operationalize interpretability so that data scientists can understand, debug, remove bias, and improve their models today without an agreed-upon definition.

“What are the features? How are you getting those features? What are the quality of those features? They’re just literally saying, ‘I’m forecasting the number — here’s the number you use.’ I’m going, ‘That just is not satisfying.’”
— Study participant

But what other tasks do data scientists need interpretability for? Using interpretability, data scientists engaged in hypothesis generation: testing and validating questions about a model, confirming or rejecting prior beliefs about the data, and ultimately beginning to trust a model. Interpretability also helped them understand the data; in this case, the desired outcome is not a model but the insights about the data that the modeling process produces. When a model was the desired product, interpretability helped data scientists build and iterate on models faster and more confidently. Lastly, data scientists used interpretability to communicate their results to various stakeholders. A holistic, practical, general-purpose tool for interpreting machine learning models should support all these tasks.

Tailor explanations for specific audiences. 🌎

Data scientists are constantly communicating model results and analysis to different types of people: management, their technical peers, and other stakeholders. Therefore, model and prediction explanations should exist on a spectrum, where data scientists can tailor explanations to specific audiences, considering a balance of simplicity and completeness.

“When you’re going to craft your story, …you’re going to have to figure out what you want emphasize and what you want to minimize. But you have to always lay out everything. Know your audience and purpose.”
— Study participant

We also noticed that two explanation paradigms, global explanations (e.g., model explanations) and local explanations (e.g., instance prediction explanations), are in fact complementary. While work on local explanations has shown their usefulness, our finding motivates further research in creating higher-level, global explanations for entire features of a model, or the model itself. While using Gamut, our participants used each paradigm independently to answer questions about model and data, and in a few circumstances, used one to inform the other. We noticed a preference towards explanation paradigms too: ML novices tended to use local explanations more often, ML familiars gravitated towards global explanations and higher-level feature trends, and ML experts used both together, for example, using global explanations for context-setting, like a backdrop, behind local explanations.

Design and integrate effective interaction. ⚡

While having an interface displaying data instance predictions and visualizations was helpful for getting an overview of what a model had learned, it became clear that interaction was key to realizing interpretability. Interaction was the primary mechanism for exploring, comparing, and explaining predictions, and when prompted, our participants couldn’t conceive of a non-interactive means to answer the questions they had written down at the beginning of the study, even though the common practice for visualizing GAMs is flipping through static charts.

“I want to understand bit by bit how the dataset features work with each other, influence each other. That is my starting point.”
— Study participant

Users liked how tightly Gamut’s multiple views were linked, citing the lack of interactive tools as a missed opportunity in their current work. Having an interface for real models with real data also helped ground discussions of interpretability, a notoriously nebulous discussion point. Lastly, interaction helped the data scientists solidify their understanding of a model by promoting active inquiry into its features and inspection of different subgroups of data.

We hope the lessons learned from this work help inform the design of future interactive interfaces for explaining more kinds of models, including those with natural global and local explanations (e.g., linear regression, decision trees), as well as more complex models (e.g., neural networks).

Future tools for interpreting models should support many practical tasks such as the ones we identified, enable interactive exploration and explanation of models, and support multiple types of explanations for diverse stakeholders.

From our work, it is clear there is a pressing need for better explanatory interfaces for machine learning, suggesting that HCI, design, and data visualization all have critical roles to play in a society where machine learning will increasingly impact humans.

Gamut · CHI 2019

A Design Probe to Understand How Data Scientists Understand Machine Learning Models. CHI 2019.

fredhohman.com/

Authors

Fred Hohman (@fredhohman) is a PhD student at Georgia Tech.
Andrew Head (@drewmikehead) is a PhD Candidate at UC Berkeley.
Rich Caruana is a Principal Researcher at Microsoft Research.
Rob DeLine is a Principal Researcher at Microsoft Research.
Steven Drucker (@sdrucker) is a Principal Researcher at Microsoft Research.

Acknowledgments

We thank Sarah Tan, Jina Suh, Chris Meek, and Duen Horng (Polo) Chau for their constructive feedback. We also thank the data scientists at Microsoft who participated in our interviews and studies. This work was supported by a NASA Space Technology Research Fellowship.