The issue with data voids and data cleaning

With almost 2 billion websites in the world, information seems limitless. To access these websites, we often use search engines such as Google, Bing, and Yahoo. A user’s query is processed and returns a list of indexed websites ranked by relevance, presumably without the algorithmic bias one finds in social media. Yet although the overall amount of data available is vast, not all searches and topics yield the same depth or breadth of results. Many searches instead surface data voids, which “occur when obscure search queries have few results associated with them, making them ripe for exploitation by media manipulators with ideological, economic, or political agendas” (Golebiewski & Boyd, 2019, p. 2). While some results may be hidden due to perceived irrelevance or some other algorithmic deficiency, in many cases results are simply absent because little data has been collected on a topic, or because the data that exists has been misreported or misrepresented. Data voids are rife with vulnerabilities and are often exploited. The lack of credible information, and oftentimes the unwillingness to report certain types of data, can cause prejudice, a false and exaggerated perception of reality, and irreparable damage to certain communities; these harms must be prevented.

In general, when a search engine of any type does not have a normal amount of information to retrieve, it is more likely to return lower-quality or unreliable pages. Data voids tend to occur in a couple of common circumstances. One is when a term is simply not frequently searched. Golebiewski and Boyd give the example of mass shooter Dylann Roof’s statement on his radicalization toward white supremacy. He said, “[the Trayvon Martin case] prompted me to type in the words ‘black on white crime’ into Google, and I have never been the same since that day. The first website I came to was the Council of Conservative Citizens. There were pages upon pages of these brutal black-on-white murders.” Since “black-on-white crime” was not a common search term outside white supremacist circles, the only results he received were from alt-right sources reporting on stories in which Black people were accused or suspected of murder. After his case hit the news, the search term became more popular, was populated with more reliable sources, and no longer necessarily returned only problematic pages. This case highlights the dangers that data voids pose to our society, and the pattern recurs in other situations. In times of breaking news, certain search terms, such as the location of an event or the name of a person, initially hold few results. In the window before reliable coverage appears, bad actors who understand the internet well enough can fill those search terms with “red pills” that send users down rabbit holes of disinformation. Responding to these kinds of data voids is necessary but requires cooperation from search engines: high-quality sources would need to be surfaced in voids prone to exploitation. Search engine companies work to improve their algorithms, but in their efforts to bridge voids they must also strike a balance with protecting free speech.
This also raises the question of whom data voids serve and who, if anyone, is responsible for monitoring and policing them.

It is important to note that data is not unbiased. Through the process of data cleaning, the same data points can be made to tell completely different stories. Data cleaning in its basic form involves fixing or removing incorrect, corrupted, improperly formatted, or duplicate data from a dataset. Cleaning data is a necessary part of the data publication cycle, but when it extends beyond this initial purpose, it risks distorting the data or stripping away its context. When such data is connected to an area where a data void currently exists, the result can be massive harm. Take, for example, the kidnapping of schoolgirls in Nigeria. In 2014, over 200 Nigerian schoolgirls were kidnapped, a horrible story that became twisted in the media. The popular website FiveThirtyEight, which specializes in opinion poll analysis, politics, economics, and data analysis, reported that there had been more than 3,608 kidnappings of young women in Nigeria in 2013 and 2,285 in the first four months of 2014. It led the story with the alarming headline, “Kidnapping of Girls in Nigeria Is Part of a Worsening Problem,” referencing the new instance of schoolgirl kidnapping, and included a graph showing an exponential rise in kidnappings over a few years. Upon further investigation, it became clear that the problem lay in FiveThirtyEight’s use of its source, GDELT, which records media observations of political events, for example the number of news articles about various events, and is primarily used to measure the temperature of the political climate. FiveThirtyEight counted mentions of kidnappings and treated each as an individual occurrence of a kidnapping, making it look like there was an upward trend where perhaps none existed. As stated in chapter 6 of Data Feminism, this is a classic case of the decontextualization of data.
It is important to keep in mind when looking at data that context matters and that all data and knowledge are situated. This is especially true when a source uses data to make broad, overarching statements about the quality of life in a country, as FiveThirtyEight did. Since kidnappings in Nigeria were not a highly searched topic beforehand, there was a data void that the FiveThirtyEight article moved into. The article was therefore widely read and likely internalized by many people, who now hold an incorrect perception of a country and a people.
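The mentions-versus-events conflation at the heart of the FiveThirtyEight error can be made concrete with a small sketch. The records below are hypothetical, not drawn from the real GDELT feed, but they show how counting every news mention as a separate incident inflates the apparent event count, and how deduplicating by an event identifier before counting restores the context:

```python
# Hypothetical records: each entry is one news *mention* of a kidnapping
# event, keyed by an assumed event_id field (illustrative data only).
mentions = [
    {"event_id": "K-001", "source": "outlet_a", "date": "2014-04-15"},
    {"event_id": "K-001", "source": "outlet_b", "date": "2014-04-15"},
    {"event_id": "K-001", "source": "outlet_c", "date": "2014-04-16"},
    {"event_id": "K-002", "source": "outlet_a", "date": "2014-04-20"},
]

# Naive count: treats every media mention as a distinct kidnapping.
naive_count = len(mentions)

# Context-aware count: deduplicate by event id before counting,
# so heavily covered events are not counted multiple times.
event_count = len({m["event_id"] for m in mentions})

print(naive_count)  # 4 mentions
print(event_count)  # 2 distinct events
```

The gap between the two numbers grows with media attention: a single widely covered kidnapping can generate hundreds of mentions, which is exactly how an "exponential rise" can appear in coverage data without any corresponding rise in events.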

Even without knowing who should be policing data voids, it is clear that they have the potential to utterly sap the power of a group of people. In chapter 1 of Data Feminism, we hear the story of Serena Williams and the struggles she experienced during childbirth. While Williams was able to deliver a healthy baby girl, she almost lost her life in the process. Williams acknowledged that her ultimately positive outcome was likely due to her status as an athlete and public figure. Unfortunately, many women of color, especially Black women, are not as fortunate. Numerous women of color, including my own mother, experience poor treatment during childbirth. My mother’s concerns, preferences, and pain levels were not taken seriously, and she nearly died during delivery. According to the CDC, Black women are three times more likely to die in childbirth than white women, a statistic that does not vary by economic status or education level. It has also been shown that their concerns during medical procedures are addressed at a far lower rate than those of their white counterparts. In investigating this further, we run into a data void: there is virtually no tracked data available regarding complications and injuries sustained during childbirth.

We can find this void in other areas of medicine as well. In breast cancer research, for example, Black women are 41% more likely to die from breast cancer than white women, yet only 3% of participants in FDA trials and studies for breast cancer drugs between 2008 and 2018 were Black women. This is a drastic disparity, proportional neither to the population of affected women nor to the population of Black women in the United States. As William Callaghan, chief of the CDC’s Maternal and Infant Health branch, said, “what we choose to measure is a statement of [who] we value in health.” It is clear from these data voids that Black women are not being valued equally in the medical sphere.

It should come as no surprise, then, that there is a clear trend of distrust of the medical world in Black communities. This manifested during the COVID-19 crisis, when Black Americans were vaccinated at a far lower rate than other racial and ethnic groups; studies have cited distrust in medical communities as a culprit. The exclusion of Black people from studies and the data voids it causes have directly led to poor medical care for a huge percentage of our country’s population and to a mistrust that has manifested, and will continue to manifest, in poor public health. This is a case where the perpetrators of the data void, and those suppressing the power of a community, are apparent, and yet there is no accountability.

Since the negative power of data voids is so drastic, filling them should be a priority. This means holding data analysts accountable for ethical, context-aware data cleaning. It also means holding researchers accountable when they ignore marginalized groups out of convenience. We must acknowledge that data, and the way it is reported, is not unbiased and carries heavy implications for our society.
