**Multiple Perspectives on the Multiple Comparisons Problem in Visual Analysis**

The more visual comparisons an analyst makes, the more likely they are to find spurious patterns — a version of the Multiple Comparisons Problem (MCP) well known in statistical hypothesis testing. We discuss recent research from Zgraggen, Zhao, Zeleznik & Kraska (CHI 2018) that investigates this problem through a careful study of how a group of students identify insights in data using a visualization tool. We describe why studying MCP is exciting in its implications for work at the intersection of visualization, human-computer interaction, and statistics. However, we also question several assumptions made in studying MCP as a visualization process so far. At stake is the integrity of visualization tools for supporting exploratory data analysis (EDA) in ways that align with organizational values for data analysis, and our understanding of what it means to do “good” versus “biased” data analysis.

What is the relationship between hypothesis testing in statistics and examining a set of visualizations? An intriguing idea proposed by some statisticians (including Andreas Buja, Dianne Cook, Andrew Gelman, and Hadley Wickham) is that when we scrutinize visualizations looking for patterns, we are in fact doing a series of *visual hypothesis tests*.

The correspondence between a *visual comparison* — for instance, examining whether sales volume appears to differ based on whether it is a weekend — and a *statistical hypothesis test* is admittedly imprecise. However, many agree that visual comparisons are analogous at least in spirit to hypothesis tests. And this has implications for analysis.

We begin with a review of the Multiple Comparisons Problem (MCP) in the context of visualization. We next cover Zgraggen et al.’s CHI 2018 study; readers familiar with the paper can skip ahead if desired. We then offer a critique and thoughts on future work.

**Multiple Comparisons as a Visualization Problem**

As harmless as they may seem, if visual comparisons are akin to hypothesis tests then we should control their potential to produce false discoveries. If you’ve run hypothesis tests to confirm that relationships observed in data are reliable, you probably used adjustment techniques — Bonferroni, Tukey’s Honest Significant Difference, or others — to account for the fact that the probability of finding a spurious pattern due solely to chance increases as one runs more tests. Researchers have suggested adapting visualization systems in ways that mimic such adjustments to combat this *Multiple Comparisons Problem* (Wall et al. 2017, Zhao et al. 2017, Zgraggen et al. 2018).
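To see why adjustment matters, consider a back-of-the-envelope calculation (our illustration, not taken from the cited papers): with independent tests at a 0.05 threshold, the chance of at least one spurious "discovery" grows rapidly with the number of comparisons, and Bonferroni counteracts this by simply dividing the threshold by the number of tests.

```python
# Illustrative sketch: the family-wise error rate grows quickly with
# the number of comparisons; Bonferroni shrinks the per-test threshold.

def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent null tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_alpha(m, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m

for m in (1, 10, 50):
    print(m, round(familywise_error_rate(m), 3), bonferroni_alpha(m))
```

At ten comparisons the chance of at least one false positive already exceeds 40%; fifty visual comparisons in a 15-minute session pushes it above 90%.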

This idea is important — and exciting! — for several reasons. This line of thinking breaks with the assumption that analysts are reliable observers, free from bias. It also challenges the idea that analysis tools do not need to model an analyst’s goals, beliefs, or behavior. Attempts to address the Multiple Comparisons Problem suggest that a visualization system *must* maintain a model of analyst behavior, such as in the form of logged interactions. Systems can then compare this model to one representing an “ideal observer” in order to detect bias (Wall et al. 2017, Zhao et al. 2017).

**The Study: “Investigating the Effect of the Multiple Comparisons Problem in Visual Analysis”**

In the latest addition to this line of work, Zgraggen et al. conducted an experiment to determine how often an EDA process produces insights that are in fact spurious discoveries. They observed how people — specifically, students at a university — describe patterns that they observe as they create visualizations to analyze a data set. The participants were given up to 15 minutes to explore the data, and instructed to write down observations they wanted to report pertaining to the population from which the data was sampled, such as “the site’s customer population”.

A “true” insight can be defined as an observation made from sample data that holds for the population. A “false” insight is an observation made from sample data that does not hold for the population. A central concern is the false discovery rate: the expected proportion of false discoveries among only the discoveries made (i.e., rejected null hypotheses).
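As a concrete (made-up) example of the definition:

```python
# A minimal sketch of the false discovery rate: the share of
# "discoveries" (rejected nulls) that turn out to be false.

def false_discovery_rate(true_positives, false_positives):
    discoveries = true_positives + false_positives
    return false_positives / discoveries if discoveries else 0.0

# e.g., an analyst reports 5 real patterns and 15 spurious ones:
false_discovery_rate(5, 15)  # 0.75
```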

The study is impressively thorough in anticipating threats to its goal of measuring insights in a controlled way. The authors first used a pilot study to identify the types of insights people tended to observe (e.g., descriptions like “the mean of the age distribution is between 20 and 30”, or “age is correlated with income”). They then devised a process for simulating data samples for which ground truth could be established regardless of the type of insight. For example, because insights often describe multivariate relationships, they embedded relationships between *n*/2 pairs of variables in the ground truth (population) data. Randomly assigning each pair of variables a non-zero correlation between -1 and 1, they generated data by sampling from bivariate normal random variables parameterized by these correlation coefficients. Means and variances were set using real-world data from the same domain. They repeated this process to generate a distinct data set for each participant.
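To make the generation procedure concrete, here is a rough Python sketch for one variable pair. The standard-normal means and variances are placeholders of ours — in the study these were set from real-world data for the domain:

```python
import math
import random

random.seed(0)

def simulate_pair(n):
    """Sample n (x, y) rows with a randomly chosen ground-truth correlation.

    A rough sketch of the generation step described above; the study set
    means and variances from real-world data rather than N(0, 1).
    """
    rho = random.uniform(-1, 1)  # ground-truth correlation for this pair
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        z = random.gauss(0, 1)
        # Mixing an independent normal gives Corr(x, y) = rho exactly
        # in the population the rows are drawn from.
        y = rho * x + math.sqrt(1 - rho * rho) * z
        rows.append((x, y))
    return rho, rows

rho, sample = simulate_pair(100)
```

Because `rho` is known by construction, any insight a participant reports about the relationship between `x` and `y` can be scored against ground truth.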

In addition to insights that participants wrote down, Zgraggen et al. also accounted for *implicit* insights, conclusions that participants drew but did not report. Each participant completed a post-analysis interview, in which screen footage of the views they constructed was played back. The participant was asked to describe which parts of a visualization they observed (reinforced by eye-tracking data) and what visual features they had been looking for. Most implicit insights were conclusions that a pattern one was seeking did not exist.

All insights, explicit and implicit, were classified within a confusion matrix (i.e., True Positive, False Positive, True Negative, False Negative). The results that Zgraggen et al. report based on this analysis include the average False Discovery Rate (FDR) across participants: in their analysis, nearly 75% of all insights count as false discoveries! This result is concerning, to say the least, and raises important questions for the design of analysis tools.

Zgraggen et al. evaluate solutions for the high FDR by comparing three confirmation strategies: (1) confirming the insight by conducting hypothesis tests on the *same data* (a direct — yet clearly problematic — approach, as running a confirmatory test on the data that generated a hypothesis inflates the likelihood of confirming that hypothesis), (2) confirming the insight by conducting hypothesis tests on *new held-out data*, and (3) mixing confirmation and exploration by tracking visual comparisons and calculating a p-value using a *multiple comparisons adjustment* procedure. Confirming hypotheses on the same data set reduces the FDR by a whopping 63 percentage points, down to 11%, while confirming on a new data set or mixing exploration and confirmation further reduces the FDR to 4.6% or 6%, respectively.
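Strategy (2) can be sketched as follows. The permutation test here is our illustrative choice of confirmatory procedure, not necessarily the test used in the paper; the idea is only that the hypothesis is generated on one portion of the data and confirmed on a portion the analyst never explored:

```python
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def confirm_on_holdout(xs, ys, n_perm=2000, seed=1):
    """Permutation-test p-value for 'x and y are correlated' on held-out data."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    ys_perm = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys_perm)
        if abs(pearson_r(xs, ys_perm)) >= observed:
            hits += 1
    return hits / n_perm

# Usage: explore freely on one half of the data, then confirm any
# suspected correlation on the untouched half:
# p = confirm_on_holdout(held_out_x, held_out_y)
```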

However, it is also important that we consider how these results may (or may not) generalize to real-world situations. In the rest of this article, we visit some related questions, with the goal of informing future research efforts.

**The Critique: Is MCP the Wrong Problem?**

Like many in the visualization research community, we are excited about the broader view of people’s role in analysis put forth by this work. At the same time, we question whether EDA is accurately characterized by a MCP framework, and whether the solutions suggested by a MCP framework are in fact a step in the right direction for visualization research.

We can start by considering some assumptions in the MCP work, like the notion that any insights that are not supported by population data are false. Straightforward as this may seem, this classification gets more complex when we consider why data analysis is used to make decisions in the world, in contrast to a controlled study.

*Observing a Sample, or Making Inferences about a Population?*

In Zgraggen et al.’s work, the participants are given a short time — up to 15 minutes — to examine data and write down “any reliable observations” that could be used to understand the population or make recommendations. Subjects were told that the datasets were “a small but representative sample.”

But, when examining a data set for the first time, it is natural to first build a description of the data one is seeing. For instance, an observation like “the mean age is between 20 and 30” seems like such a description. Were the participants as aware as the researchers of the difference between the sample data and the “reliable observations” they were asked to supply? To assess this question, one might analyze subjects’ accuracy relative to the sample in addition to the population distribution. A manipulation check — such as eliciting each participant’s *confidence in their insights* — might also help confirm that the participants and researchers were on the same page.

Assuming participants were paying close attention, how should they have interpreted the suggestion that the datasets were representative? Is it wrong to assume that this means the samples described the population distribution faithfully? Is it fair to compare insights based on this assumption to a ground truth distribution that was not faithfully represented by the sample?

*What is the Goal of an Analysis? Insight or Action?*

Study participants were instructed to identify insights, including both point estimates and relationships among variables. The study importantly extends prior treatments of “insight-based” methodologies, going beyond counts or rates of insights by modeling the accuracy of insights. However, the ultimate goal of analysis is often not to collect insights for their own sake, but to inform decisions: actions that have corresponding costs and benefits. As a result, labeling observations “incorrect” when they do not hold for a population (and “correct” when they do), may overlook larger concerns.

Zgraggen et al. describe a scenario in which Jean, an employee at a small non-profit, analyzes retention rate data to decide what thank-you gift should be sent to donors (perhaps better-liked gifts prompt additional donations?). Based on a visualization of historical data, Jean observes that the USB drive gift sent last year appears to have the strongest effect, and so decides to send that again. The authors characterize Jean’s insight as incorrect: the data Jean observed was sampled from a uniform distribution so no relationship can be confirmed. They go on to state that an “analyst ‘loses’ if they act on such false insights.” However, in many analysis settings, including the one described for Jean’s decision, an action must occur. Is it wrong to send the gift that resulted in the highest returns in the prior case? By describing the USB drive as the most expensive gift, the scenario implicitly concedes that features of the decision context impact the “correct” decision. What if the USB drive was the same price as the other options? The inclusion of this detail subtly problematizes the use of correctness labels as a domain-general approach.

What constitutes an action is also ambiguous. Is it realistic to think that an analyst “loses” if they allow themselves to be informed by (e.g., encode in memory) an apparent trend in data, even if that trend does not stand up to Null Hypothesis Significance Testing?

*Is the Ideal Analyst a Blank Slate?*

This question points to the implications of discussing MCP in visualization. Zgraggen et al.’s study intentionally used participants who lacked domain knowledge (and, given they were primarily undergraduate students, likely limited in analysis expertise as well). This implies that prior knowledge about similar data or the domain itself is not consequential. We contend that this is where a frequentist MCP perspective may even be dangerous to visualization research and practice.

In the Null Hypothesis Significance Testing framework assumed in most discussions of MCP, there is no explicit mechanism for incorporating the analyst’s prior knowledge. Is this an accurate characterization of how data analysis occurs in the real world? In many organizations, the same analysts analyze the same kinds of data repeatedly. Over time, the analyst naturally develops a sense of what trends are “normal”, which helps them identify anomalies or isolate more subtle trends. While preconceived notions can sometimes lead to bias as well, in a world where priors have no influence on the correctness of an insight, the “truth” value of data becomes slippery. Should the future of visualization systems really hinge on approaches that relegate analysis to the data set alone?

Contrast this approach with a Bayesian model of statistical inference. Rather than defining the truth of an insight through “objective” frequencies obtained by combining a likelihood function and the data, Bayesians find it more realistic to talk about it in terms of degrees of belief. Degrees of belief are a function of the data and the process that is suspected to have produced it. However, a key component of Bayesian statistical inference is the specification of *priors*, descriptions of the patterns that one would expect to see given past data or domain knowledge. Priors can help an analyst understand the degree to which the new data should shift their beliefs (for instance allowing us to talk about how *surprising* data are), and enable the accumulation of knowledge across related data sets.
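As a toy illustration of how priors enter the picture (all numbers invented), here is a conjugate update of a Normal belief about a population mean. The posterior is a precision-weighted blend of the analyst’s prior expectation and the new sample:

```python
def update_normal_mean(prior_mean, prior_var, data, noise_var):
    """Conjugate update of a Normal prior over a mean, given observations.

    The posterior mean sits between the prior mean and the sample mean,
    pulled toward whichever is known more precisely.
    """
    n = len(data)
    sample_mean = sum(data) / n
    post_prec = 1 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + n * sample_mean / noise_var) / post_prec
    return post_mean, 1 / post_prec

# An analyst who expects a mean age near 25 (with some uncertainty)
# sees four new observations averaging 29.5:
post_mean, post_var = update_normal_mean(25.0, 4.0, [29, 31, 28, 30], 9.0)
```

The posterior lands between the prior expectation and the new sample mean, and the posterior variance shrinks — a formal version of the analyst who knows what is “normal” and notices how far new data departs from it.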

**What’s Next: Tools for Better Visual Data Analysis**

Multiple tools have been proposed to combat the MCP in visual analysis. The approaches taken range from modeling low level interactions (e.g., clicks) against an “ideal” standard in which the analyst equally distributes their attention across the data (Wall et al. 2017) to using view information to infer comparisons and consequently hypotheses (Zhao et al. 2017). The idea of visualization systems building and continuously updating models of the analyst’s task, particularly when the analyst is allowed to refine the model (Zhao et al. 2017), is an exciting step forward for visualization.

We think a particularly promising way to re-envision visualization systems for analysis is to consider more direct techniques to elicit an analyst’s task and goals during EDA. Rather than creating tools that respond to certain patterns in low-level interaction data by “policing” the analyst, could we allow an analyst to articulate their goals and expectations about data through sketching or other interactions? There are many possible ways that a system could then use this data to adapt the views and operations available to the analyst, from using simple prompts to encourage the analyst to seek explanations for new insights in light of their expectations, to showing Bayesian posterior predictive distributions. By presenting uncertainty information by default, we may also improve the alignment between analysts’ insights and statistical models of inferential validity.

In conclusion, we found Zgraggen et al.’s study to be an important addition to the visualization literature, and a valuable prompt for thinking more deeply about the role (and assessment) of interactive data visualization. Did this discussion whet your own ideas about visual hypothesis testing, interacting with prior knowledge, or the exploratory data analysis process in the wild? Tell us what you think!

*This post was written by Jessica Hullman and Jeffrey Heer, with extensive input from the members of the **UW Interactive Data Lab**.*