Into the Voids: exploring data voids

--

Into the Voids is a project created during the 2020 Assembly Fellowship at the Berkman Klein Center at Harvard University. One of three tracks in the Assembly: Disinformation Program, the Assembly Fellowship convenes professionals from across disciplines and sectors to tackle the spread and consumption of disinformation. Each fellow participated as an individual, and not as a representative of their organization. Assembly Fellows conducted their work independently with light advisory guidance from program advisors and staff.

The Into the Voids project and this post were authored by EC, Rafiq Copeland, Jenny Fan, and Tanay Jaeel; the project team has backgrounds in human rights, policymaking, design, and project management.

In a noisy information ecosystem where an abundance of answers lies at the ends of people’s fingertips, what happens when, suddenly, there is an absence of credible or authoritative information?

While this potential dearth of high-quality information has always been a known facet of search systems, the term “data void,” describing this phenomenon, was introduced by Michael Golebiewski and danah boyd in their 2019 Data & Society report. In our Assembly Fellowship with the Berkman Klein Center, we set out to better understand the nature of data voids, including:

  1. What data exists to measure and further understand the extent of data voids?
  2. How should we think about the harms posed by data voids?

This post highlights our research (read our full white paper here), which adds empirical data to this important discussion to help better measure and understand the different problems presented by data voids. We also offer a harms framework that distinguishes between different types of data voids, in order to better tailor potential interventions.

In our research, we traced a well-known data void — the term “crisis actor” — for the duration of its lifecycle to map how search interest ebbs and flows in response to breaking news events.

What does the data show about data voids?

“Quantifying” the lifecycle of data voids can help society better understand how information about sensitive topics is surfaced, and how the potential for harm rises in the absence of credible information. To contextualize this problem, we created a timeline of credible information, including mainstream media coverage, around a number of well-known search terms to visualize how information emerges around relevant data voids.

We first compiled a list of breaking news search terms (data voids) associated with misinformation as the subjects for our research. For search terms that were innocuous on their face, we identified associated misinformation or conspiracies that began to spread online, so that we could track when those topics were covered by the mainstream media.

Next, we gathered publicly available Google Trends data to quantify the volume of searches for a particular term over time. To approximate the presence of media articles on a topic, we examined Media Cloud (an archive of media stories from across the web, built by the MIT Center for Civic Media and the Berkman Klein Center) and Wikipedia article data, given the prominent role that Wikipedia plays in search engine results.

In Table A (below), we plotted the peak week of search activity for each term (in red) and layered in the specific times that authoritative media articles entered the discussion (in blue), as well as the specific times that edits were made to relevant Wikipedia articles (in yellow). With this data, we have a timeline of when credible news sources posted about a term, relative to when searches for that data void were spiking. In this case, we analyzed data voids related to the search queries “Iowa Caucuses” and “Sutherland Springs”.
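As a rough sketch of this comparison, the snippet below takes hypothetical weekly search-interest values (scaled 0–100, as Google Trends reports them) and hypothetical article publication dates, finds the peak search week, and counts how much credible coverage appeared before versus after the spike. The dates and values are illustrative placeholders, not our actual dataset.

```python
from datetime import date

# Hypothetical weekly search-interest values (0-100, Google Trends style),
# keyed by the Monday that starts each week.
search_interest = {
    date(2017, 10, 30): 2,
    date(2017, 11, 6): 100,   # hypothetical peak week
    date(2017, 11, 13): 18,
    date(2017, 11, 20): 5,
}

# Hypothetical publication dates of credible debunking articles.
article_dates = [date(2017, 11, 9), date(2017, 11, 15), date(2017, 11, 21)]

def coverage_relative_to_peak(interest, articles):
    """Return the peak search week and how many credible articles
    appeared on or before its start versus after it."""
    peak_week = max(interest, key=interest.get)
    before = sum(1 for d in articles if d <= peak_week)
    after = len(articles) - before
    return peak_week, before, after

peak, before, after = coverage_relative_to_peak(search_interest, article_dates)
print(peak, before, after)
```

In this toy example, all of the credible coverage lands after the search spike, which is the pattern we observed for Sutherland Springs; a "closed" void, as with the Iowa Caucuses, would show coverage before the peak instead.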

This data yielded several important conclusions. We see that mainstream media’s ability to respond to misinformation (and thus “close” a data void) is mixed: during the Iowa Caucuses, the media was swift and responsive in reporting that the caucus was not “rigged” and calling this out specifically as misinformation, so that by the time user search activity spiked, there was credible information available. In the case of the mass shooting in Sutherland Springs, however, the majority of debunking of Antifa-related misinformation happened after users had already moved on and search demand for that topic had dwindled. (Further analysis can be found in our full white paper.)

Table A: Timeline of search activity, media coverage, and Wikipedia edits for alleged data void search terms

How should we think about the harms posed by data voids?

With so many diverse topics, we explored “harm” as a potential distinguishing factor between benign and more malicious types of data voids. To arrive at a rough heuristic for expected harm, first consider these five questions about the subject matter of the data void. Not all categories will be relevant in every case, so weight the relative importance of each category accordingly.

  • Who is the data void affecting?
  • What topic is the data void addressing?
  • Where is the impacted area of the data void?
  • How fast is this data void developing?
  • How long has this data void lasted?

Next, look at the broader media audience for context surrounding an emerging data void:

  • Scope of exposure: What is the audience size of those who might be exposed to and internalize mis/disinformation around a data void?
  • Likelihood of immediate action: How likely is immediate action to be taken as a result of data voids?
  • Severity of impact: How harmful are the consequences of these actions?

Harms framework to explore the risks posed by data voids

Using this framework, we start to see how some data voids pose greater harm than others. For the sake of example, consider a data void related to the effects of climate change vs. a data void related to conspiracies about an ethnic minority in a conflict-ridden area. While both are harmful, when observing key characteristics about the void (in this case, the topic relating to a marginalized community and the location’s high incidence of violence), we see a greater potential for harm from the latter example.
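As a minimal sketch of how such a comparison could be made concrete, the snippet below scores the three audience-context factors on an assumed 1–5 scale and combines them with illustrative weights. The scale, the weights, and the example ratings are all assumptions made for illustration; they are not values from our white paper.

```python
# Illustrative per-factor weights: severity is weighted most heavily,
# then likelihood of immediate action, then scope of exposure.
WEIGHTS = {"scope": 1.0, "likelihood": 1.5, "severity": 2.0}

def harm_score(ratings, weights=WEIGHTS):
    """Weighted average of 1-5 factor ratings: 1 = benign, 5 = severe."""
    total_weight = sum(weights.values())
    return sum(ratings[f] * w for f, w in weights.items()) / total_weight

# Hypothetical ratings for the two examples discussed above.
climate_void = {"scope": 4, "likelihood": 1, "severity": 2}
conflict_void = {"scope": 3, "likelihood": 4, "severity": 5}

print(harm_score(climate_void))   # lower expected harm
print(harm_score(conflict_void))  # higher expected harm
```

Even with arbitrary weights, the conflict-zone void scores higher because the framework emphasizes likelihood of immediate action and severity of impact, which matches the intuition described above.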

Through both our data analysis and our harms framework, we hope to have made additional quantitative and qualitative contributions to our understanding of data voids.

Our project presents several questions for future study:

  1. What is the user journey that leads people to data voids?
  2. When good information is reasonably available during breaking news events, why do harmful fringe narratives persist even after the data void has been filled?
  3. How can platforms insulate themselves against bad actors when a data void exists?
  4. How can search and social media platforms ensure that authoritative content features prominently to users?

Perhaps to properly address the risk posed by data voids, we need not only accurate information, but also trust. Informational resources need an underlying connection to the user that gives them a reason not only to read the information, but to believe it. Only with that trust will users accept the information surfaced in search results, and proactively defend it when faced with misinformation in the future.

Until we understand how to fill data voids with both data and trust, there will always be a gap between what is known and what is believed.

For more information on the Into the Voids project, visit the team’s website. Learn more about the Assembly: Disinformation program at www.bkmla.org.

--

Assembly at the Berkman Klein Center

Assembly @BKCHarvard brings together students, technology professionals, and experts drawn to explore disinformation in the digital public sphere.