Making room for implicit error: a visualization approach to managing data discrepancy

Let’s begin with a brief activity.

The figure below is a choropleth map showing the incidence rate of the Zika virus in countries across Latin America and the Caribbean. Each country is colored based on the cumulative number of confirmed Zika cases, relative to the size of the country’s population. The color spectrum runs from a pale yellow — indicating a low incidence rate — to a dark red — indicating a high incidence rate.

Choropleth map showing incidence rate of the Zika virus in countries across Latin America and the Caribbean. Countries are shaded based on the cumulative number of confirmed Zika cases, relative to the size of the country’s population.
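To make the coloring rule concrete, here is a minimal sketch of how an incidence rate and a color class could be computed for each country. The case counts, populations, class breaks, and palette are all hypothetical values chosen for illustration; they are not the data or the color scheme behind the figure above.

```python
from bisect import bisect_right

# Hypothetical figures for illustration only; not real Zika data.
countries = {
    "Country A": {"confirmed_cases": 1200, "population": 4_000_000},
    "Country B": {"confirmed_cases": 300, "population": 400_000},
    "Country C": {"confirmed_cases": 50, "population": 10_000_000},
}

# Assumed class breaks (cases per 100,000 people) and a pale-yellow-to-dark-red palette.
BREAKS = [1, 10, 50]
PALETTE = ["#ffffb2", "#fecc5c", "#fd8d3c", "#bd0026"]


def incidence_per_100k(confirmed_cases, population):
    """Cumulative confirmed cases relative to population size."""
    return confirmed_cases * 100_000 / population


def color_for(rate):
    """Pick the palette entry for the class that the rate falls into."""
    return PALETTE[bisect_right(BREAKS, rate)]


shades = {
    name: color_for(incidence_per_100k(**figures))
    for name, figures in countries.items()
}
```

Note that the shade depends only on the reported counts, so any discrepancy in how a country reports its cases flows straight through to the map.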

Now suppose for a moment that we are global health experts, and we have been tasked with making recommendations about where to send more aid.

Looking at the choropleth, Brazil immediately stands out as having one of the highest incidence rates. A more careful scan reveals that Belize has an even higher incidence rate than Brazil, and that Nicaragua and Costa Rica have the next highest incidence rates following Brazil.

So from the looks of it, we really need to focus our analysis on these four countries.

Now what if I told you (and bear in mind that the following examples are purely hypothetical) that Brazil reports all cases of Zika — including any suspected cases — as confirmed cases, and that in reality, it should probably appear a bit lighter in the choropleth? And what if I told you that Colombia, on the other hand, only reports cases confirmed through a rigorous laboratory investigation and should, in reality, appear a bit darker in the choropleth? And it doesn’t stop there: Ecuador only reports cases of Zika detected in pregnant women, and should appear darker; Nicaragua tends to overreport in general, and should appear quite a bit lighter; and Jamaica has the opposite problem — it tends to underreport — and should appear quite a bit darker.

Taken together, this paints a very different picture…

(left) Original choropleth, (right) choropleth adjusted to reflect expert knowledge about data discrepancies.

…and could lead to very different recommendations about where to send aid.

And suppose that this new adjusted choropleth was based on the background knowledge of one global health expert, and to another expert, the adjusted choropleth looked completely different:

(left) Original choropleth showing incidence rates of the Zika virus, (middle, right) choropleth adjusted to reflect the knowledge of expert A and expert B.

How do we resolve this? How can we capture and incorporate expert knowledge about discrepancies in data in order to increase data quality, synchronize knowledge across experts, and perhaps even begin to analyze the discrepancies themselves?

We believe visualization can help!

Background

This research was based in a visualization collaboration with global health experts who were working to combat the Zika virus in countries across Latin America and the Caribbean. The examples above are an attempt to convey the revelation that snuck up on us, the visualization researchers, in the early stages of the project. Over and over again, we’d show our collaborators a choropleth of their data, and they’d say, yeah, but this country does X and so it really shouldn’t look like that.

As we dug deeper, we came to understand that the data we were visualizing was laced with discrepancies, rendering visualization effectively useless for analysis. No matter what clever new visualization techniques we came up with, they weren’t going to help our collaborators with their decision-making.

The exciting piece was: 1) that our domain collaborators had intimate and extensive knowledge of these data discrepancies, and 2) that showing them choropleths of the original data was helping to trigger and organize this knowledge.

So while visualization couldn’t support their analysis, it could support the externalization and incorporation of expert knowledge about data discrepancies.

Based on this revelation, we pivoted the focus of our research to better understanding expert knowledge about discrepancies, which led to our notion of implicit error, and to understanding how to leverage visualization to capture implicit error and incorporate it back into the visual analysis pipeline.

Implicit Error

Ok, so what do we mean by implicit error?

The concept of measurement error is fundamental to experimental and observational science: it is the estimated difference between a measured value and the true value as it exists in the world.

Grounded in our experience working with global health experts, we use the term implicit error to describe measurement error that is believed to be inherent to a dataset, assumed to be present and prevalent, but is not explicitly defined or accounted for. Implicit error exists as tacit knowledge in the minds of domain experts. It is perceived as effectively unquantifiable, and it is accounted for subjectively during expert interpretation of the data.

implicit error: measurement error that is believed to be inherent to a dataset, assumed to be present and prevalent, but is not explicitly defined or accounted for.

Where does implicit error come from?

In the context of the Zika outbreak, official data is generated through a sequence of stages, from detection to reporting.

Zika outbreak data generation pipeline.

Each country has its own pipeline, and each pipeline is shaped by the country’s political, economic, geographic, and demographic context, resulting in discrepancies like this:

The union in region X goes on strike often, and doesn’t report Zika case data.

and this:

Country Y overhauled its surveillance system, leading to an increase in detected cases.

What you end up with are subtle variations in the way these data are generated and processed, the vast majority of which go undocumented and unaccounted for. So while implicit error is perceived and subjective, it is grounded in a general expectation of variation across distributed, heterogeneous data generation pipelines.

We speculate that this extends much more broadly beyond Zika and global health.

What does implicit error look like?

Based on qualitative analysis of a sample set of implicit errors collected from our global health collaborators, we found that we could capture the important elements of implicit error, as perceived by experts, via a set of six characterizing traits: source, type, direction, magnitude, confidence, and extent.

In the context of Zika outbreak data, sources of implicit error range from inconsistencies across data generation pipelines — e.g. country X reports all confirmed and suspected cases as confirmed cases, to retrospective adjustments to previously reported data — e.g. after retrospective review, laboratory-confirmed cases were adjusted by X’s Ministry of Health as of 25 August 2016 and X number of confirmed cases were reclassified as suspected.

The source of an implicit error is often critical for speculating about error type, which can be either systematic or random. For example, the inconsistency above — country X reports all confirmed and suspected cases as confirmed cases — is likely a systematic error, whereas the retrospective adjustment — after retrospective review, laboratory-confirmed cases were adjusted by X’s Ministry of Health as of 25 August 2016 and X number of confirmed cases were reclassified as suspected — is likely a random error. Characterizing the type of an implicit error has important implications for implicit error mitigation, as systematic error can often be reduced via downstream modeling or adjustments to the data generation pipeline itself.

Implicit errors can also be characterized by their direction, describing the sign of the difference between the reported value and the true value (positive/negative/unknown), and by the magnitude of the difference, which in our work with Zika outbreak data was most often expressed qualitatively (e.g. reported confirmed cases really just shows the tip of the iceberg). The confidence trait describes the domain expert’s confidence in their knowledge or understanding of the error. Finally, the extent trait describes the data that are perceived to be impacted by the error.

We think of these characterizing traits as the data of implicit error, as they allow for visual exploration and analysis. It is interesting to consider how implicit error, which is perceived by experts, relates to the more formal statistical definitions of forms of error. Implicit error could be construed as the starting point for potential downstream modeling of the error.

In addition to the set of characterizing traits, we found that including a contextual description of an expert’s knowledge of the implicit error was critical both for evaluating the underlying error in the data and transferring knowledge of the implicit error across experts. We think of these contextual descriptions as the information of implicit error.

To illustrate these concepts, let’s take a look at the following characterization of an implicit error:

  • source: inconsistency
  • type: systematic
  • direction: negative
  • magnitude: unknown
  • confidence: very certain
  • indicator extent: number of cases of Zika in pregnant women
  • geographic extent: country X
  • temporal extent: all weekly reports

From these traits we can see that we are dealing with an error in the number of cases of Zika detected in pregnant women. It is perceived to be a systematic error, the true value is thought to be higher than the reported value, and the expert is very certain that the error is occurring.

The additional contextual description — country X only reports cases of Zika in pregnant women detected within the first trimester — provides critical insight into the nature of the error, why the error may exist, and the impact that it likely has on reported values. For example if, hypothetically, an expert knew that more and more cases of the Zika virus were presenting in the second and third trimesters, she would know that this was a significant discrepancy.
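For readers who think in code, the characterization above could be captured in a small record type. This is a minimal sketch under our own assumptions: the class name, field names, and helper method are hypothetical, and the sign convention (reported value minus true value) is inferred from the description of the direction trait.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImplicitError:
    """One expert-reported implicit error (hypothetical schema)."""

    # Characterizing traits: the "data" of implicit error.
    source: str
    error_type: str          # "systematic" or "random"
    direction: str           # "positive", "negative", or "unknown"
    magnitude: str           # often qualitative, or "unknown"
    confidence: str
    indicator_extent: str
    geographic_extent: str
    temporal_extent: str
    # Contextual description: the "information" of implicit error.
    description: str

    def true_value_likely_higher(self) -> bool:
        # Assumed convention: direction is the sign of (reported - true),
        # so "negative" means the reported value undercounts.
        return self.direction == "negative"


example = ImplicitError(
    source="inconsistency",
    error_type="systematic",
    direction="negative",
    magnitude="unknown",
    confidence="very certain",
    indicator_extent="number of cases of Zika in pregnant women",
    geographic_extent="country X",
    temporal_extent="all weekly reports",
    description=(
        "country X only reports cases of Zika in pregnant women "
        "detected within the first trimester"
    ),
)
```

Keeping the free-text description next to the traits mirrors the point above: the traits can be filtered and encoded, while the description carries the context that another expert needs in order to evaluate the error.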

Where visualization fails and where it can help

As mentioned earlier, a high prevalence of implicit error — as was the case with the Zika outbreak data — precludes visualization from being used for any real analysis.

In early discussions with our collaborators we pitched the idea of allowing experts to manually adjust the colors in a choropleth to reflect their knowledge of discrepancies, similar to what we showed in the opening activity, to which one collaborator responded, “well, I can just do that in my head.”

What this approach doesn’t take into account is the cognitive load of storing many different mental adjustments, and then on top of that, trying to make comparisons and run other visual analysis tasks. Furthermore, how could we ever go about synchronizing these mental adjustments across experts, validating them, and compiling them in order to build institutional knowledge and to begin studying the errors themselves?

This is where visualization comes in!

Visualization has long been recognized as a powerful mechanism for eliciting knowledge and generating new knowledge. Work within the visual analytics community has looked deeply at the role that visualization plays in externalizing expert knowledge and incorporating it back into the visual analysis pipeline. This body of work, however, focuses on knowledge that can be represented and interpreted computationally. Our work looks to extend and adapt existing models in order to support the externalization and incorporation of implicit error — a more qualitative and subjective form of knowledge, which cannot be fully interpreted computationally, but which can provide tremendous insight to experts.

Our adapted model can be described in three stages: the identify stage, the externalize stage, and the analyze stage.

Process model for externalizing implicit error using visualization.

In what follows, we’ll discuss each of the stages at a high level, using a proof of concept that we developed — a tool for externalizing implicit error in Zika outbreak data — to help illustrate. For details on the model itself, please check out our paper!

Stage 1: identify

The goal of the first stage, the identify stage, is to identify the presence of implicit errors associated with an official dataset. In this stage, visualization plays a critical role in triggering and recalling expert insights about implicit error.

In our proof of concept, we developed a linked view system that allowed global health experts to explore and compare official Zika outbreak data, along with data about response efforts. The system uses standard GIS approaches. Users can explore the data at three different levels of resolution: the regional level (grouping countries into regions), the country level (showing individual countries), and the subnational level.

Proof of concept illustrating the identify stage of the process model.

Stage 2: externalize

Once an implicit error has been identified, the next stage supports its externalization. In our proof of concept, we used annotation as our primary externalization mechanism. This decision was grounded both in the literature and in our experience working with global health experts, in which annotation proved to be an effective and intuitive approach.

Proof of concept illustrating the externalize stage of the process model.

Externalization involves capturing both the characterizing traits (source, type, direction, magnitude, etc.) and the contextualizing description.

Stage 3: analyze

In the final stage, the analyze stage, the implicit error traits and descriptions are encoded and overlaid on top of the official data. This allows experts to begin explicitly incorporating implicit errors into their interpretation, and to explore possible patterns in the errors themselves. In our proof of concept, users can explore implicit errors in their submitted form, as markers with annotations, or they can explore visual encodings of the various characterizing traits.
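As a rough illustration of what encoding the traits might involve, the sketch below groups submitted errors by country and derives a single direction label that an overlay could map to a glyph or hatch pattern. It is not the implementation of the proof of concept: the record fields, the example annotations, and the "mixed" label are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical annotations; only the fields needed for this summary are shown.
annotations = [
    {"country": "Country X", "direction": "negative"},
    {"country": "Country X", "direction": "negative"},
    {"country": "Country Y", "direction": "positive"},
    {"country": "Country Y", "direction": "unknown"},
    {"country": "Country Z", "direction": "positive"},
    {"country": "Country Z", "direction": "negative"},
]


def summarize_by_country(records):
    """Count implicit errors per country and summarize their direction."""
    by_country = defaultdict(Counter)
    for record in records:
        by_country[record["country"]][record["direction"]] += 1

    summary = {}
    for country, directions in by_country.items():
        known = {d for d in directions if d != "unknown"}
        if not known:
            overall = "unknown"
        elif len(known) == 1:
            overall = next(iter(known))
        else:
            overall = "mixed"  # experts disagree, or several errors pull both ways
        summary[country] = {
            "count": sum(directions.values()),
            "overall_direction": overall,
        }
    return summary


overlay = summarize_by_country(annotations)
```

A summary like this deliberately stops short of adjusting the official values. It only tells the reader where, and in which direction, experts believe the reported numbers are off.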

Proof of concept illustrating the analyze stage of the process model.

We just barely scratched the surface of the analyze stage in our proof of concept. The idea, however, is that as the tool is used and errors accumulate, the analyze layer can be developed to better support emerging visualization tasks and to address any scalability issues that arise.

Discussion

Implicit error vs. uncertainty

One of the major underlying questions throughout this project was: where does our notion of implicit error fit into the huge body of work surrounding uncertainty and uncertainty visualization?

The two concepts are certainly related. They often stem from the same sources, and have the same impact on reported values. However, the vast majority of uncertainty work deals with quantified measures of uncertainty and measurement error. We therefore argue that the qualitative and subjective nature of implicit error requires a different set of considerations and visualization approaches.

Generating D’

Another related question that emerged again and again throughout this project was: should we be using implicit error to generate a D’, that is, an adjusted dataset that incorporates implicit error in order to better reflect experts’ interpretation of the data? In other words, should we be generating a representation of experts’ best guess at the true values for a given dataset?

Should we be generating D’ — an adjusted dataset that incorporates implicit error in order to better reflect experts’ interpretation of the data?

The opening activity likely gave the false impression that we would be pursuing this in our work. Our goal, in the end, was not to formally combine implicit error and observed data. But supposing that, down the line, we did quantify aspects of implicit error…is this really the answer? What function might this representation serve? And would we lose important aspects of implicit error through quantification?

This was a topic of great interest to us as visualization researchers — one that requires a much more thorough investigation.

Takeaways

In closing, we leave you with the following takeaways:

  1. implicit error exists, and we believe it extends much more broadly than Zika and global health data.
  2. visualization offers a powerful mechanism for externalization and analysis.
  3. we anticipate that the externalization and analysis of implicit error will reveal an entire new set of questions and opportunities for the field of visualization.
  4. we hope that our framework can provide a starting point for further and deeper investigation into this area.

To learn more about this work, please check out our paper!