Data Inequity & Data Science

Marynia Kolak
Published in Atlas Insights
Mar 29, 2021

Adapted from a Lightning Talk presented at WiDS:Chicago

What is data inequity, and how does it impact data science? We can approach this question in multiple ways, as different researchers and scientists do.

Structural factors shape the data we work with; they often correlate with the very things we're interested in studying, and if not accounted for, they can change the insights we draw.

In my niche of geographic data science, we commonly see data like this —

Dot Density Map of Chicagoland area by Eric Fisher using 2000 Census data by racial/ethnic group

You’re looking at a dot density map of the Chicagoland area, with each color representing a different racial group or ethnicity. The segregation is striking, and reflects decades of complex policies and patterns of de-investment. It’s clear the data has a strong spatial signal, and that the patterns aren’t random.

If you were to run a statistical analysis on this data without accounting for those spatial patterns, your basic assumption of independently and identically distributed (i.i.d.) data would be violated. You’ll get results, but what will those numbers mean? The magnitude and direction of your estimates could be off, as could your understanding of the causal mechanisms driving the data.
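This kind of spatial structure is exactly what spatial autocorrelation statistics are built to detect. As a minimal sketch (not from the talk; the grid, weights, and values are all made up), here is global Moran’s I computed with plain NumPy on a toy map where high values cluster in one corner:

```python
import numpy as np

def morans_i(values, W):
    """Global Moran's I for a 1-D array of values and a spatial
    weights matrix W (W[i, j] > 0 when units i and j are neighbors)."""
    z = values - values.mean()
    n = len(values)
    s0 = W.sum()
    return (n / s0) * (z @ W @ z) / (z @ z)

# Toy example: a 4x4 grid with a "hot spot" in one corner, using
# rook-contiguity weights (cells are neighbors if they share an edge).
side = 4
grid = np.zeros((side, side))
grid[:2, :2] = 1.0  # clustered high values
values = grid.ravel()

n = side * side
W = np.zeros((n, n))
for r in range(side):
    for c in range(side):
        i = r * side + c
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < side and 0 <= cc < side:
                W[i, rr * side + cc] = 1.0

I = morans_i(values, W)
print(round(I, 3))  # 0.556, well above E[I] = -1/(n-1) ≈ -0.067
```

A value near the null expectation of −1/(n−1) would suggest no spatial pattern; a value this far above it signals strong positive clustering, which is why the i.i.d. assumption fails for maps like the one above. In practice, libraries such as PySAL’s `esda` handle the weights and inference for you.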

Sometimes it’s even more complicated: patterns in how the data is generated can lead to a biased sample that hides key features.

Proportion of births registered versus per capita annual income (log scale) for 151 countries with available data (http://www.who.int/whosis), by Peter Byass in The Unequal World of Health Data

In this graph, we see country-level reporting of the percentage of births registered, compared to the per capita income of that country (original study here). Countries with higher incomes are more likely to have 100% of births registered. Higher-poverty countries may not have the resources in place to report; this then skews any analysis that attempts to link birth registration to socioeconomic outcomes.
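That reporting pattern can be mimicked with a small simulation (every number here is hypothetical, only loosely shaped like the Byass figure): when poorer, higher-mortality countries are less likely to appear in the data, a naive average over reporting countries understates the true burden.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 151  # countries, matching the count in the WHO figure

# Hypothetical setup: log income, and a mortality rate that falls with income.
log_income = rng.uniform(2.5, 4.5, n)                     # log10 per-capita income
mortality = 120 - 25 * log_income + rng.normal(0, 5, n)   # deaths per 1,000 births

# Reporting completeness rises with income, echoing the registration graph.
p_report = np.clip((log_income - 2.5) / 2.0, 0.05, 1.0)
reported = rng.random(n) < p_report

true_mean = mortality.mean()
observed_mean = mortality[reported].mean()
print(f"true mean: {true_mean:.1f}, observed mean: {observed_mean:.1f}")
# The observed mean is lower than the true mean: poorer, higher-mortality
# countries are systematically missing from the sample.
```

The gap between the two means is the bias; no amount of careful modeling on the reported subset alone recovers it, because the missingness is tied to the very quantity being studied.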

The data generating process is key to the analysis, and biases that play out in data collection can undermine an algorithm’s goals.

Some hospitals and cities built their policy-driving algorithms on health records or crime data without accounting for systemic biases, and ended up perpetuating those outcomes in even worse ways, though that wasn’t their intention.

Sometimes missing data could even be intentional; my colleague Kevin Credit recently found that missing race data for traffic stop violations in Wisconsin increased after the murders of Eric Garner and Michael Brown and the start of the Ferguson protests.

Black Belt counties highlighted when visualizing 7-day positivity rates in July 2020 (uscovidatlas.org)

In a world driven by data, not being represented in data has serious consequences. Best intentions won’t be enough if we don’t consider data inequities and how they impact the findings and proposals driven by data science.

We have to recognize this and think of creative solutions, while continuing to advocate for better data. At the Covid Atlas project, we still don’t have ways to highlight racial disparities below the state level, because that data doesn’t exist for the whole country.

We’re brainstorming solutions with our research coalition; for example, here we highlight the Black Belt, a region in the South directly impacted by the legacy of slavery, as a map filter to demonstrate the disproportionate burden on marginalized communities. We can use the spatial structures of data to help understand inequities in real time, because racism, poverty, and related phenomena send out a strong spatial signal.

We still have a ways to go, but it’s a start…
