Big Data: It’s Complicated

Galaan Abdissa
Published in HCI with Galaan
Jan 24, 2020 · 4 min read

BUZZWORD

In the era of artificial intelligence and machine learning, big data plays a fundamental role in producing important results. Data is becoming more and more complex given the many features it contains, and because of this complexity and the interconnectivity between datasets, quantifying narratives and information is not always accurate. The main problem with big data comes from biases within a dataset and from how that data is manipulated by a machine learning algorithm. In Barocas and Selbst's research paper, Big Data's Disparate Impact, the problems with training data stem from biases perpetuated by society. Because a machine learning algorithm takes training data as input and produces an output, that output is then recycled and continues to mirror social prejudices. Hardt, author of How big data is unfair, describes training data as a "social mirror": the data fed into supposedly objective mathematical models reflects injustices and biases in society that have not been adequately addressed. As a result, these prejudices can harm vast groups of minorities without any deliberate intent.

Furthermore, biases in machine learning are also rooted in the lack of representation in sample sizes. Hardt notes that sample-size disparities can come from cultural differences, and in many cases the sheer absence of numbers representing minorities becomes the issue. When machine learning starts to make assumptions based on majority data, problems arise when results derived from the majority harm the minority. Interestingly, the White House Office of Science and Technology Policy's report on big data raises the concern of poorly designed services for underrepresented populations. This tends to be the case for matching systems that do not account for historical biases.
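To make the sample-size point concrete, here is a minimal sketch with entirely synthetic data (the group sizes, the "shift" parameter, and the label rules are illustrative assumptions, not anything from Hardt's post): one model is trained on 5,000 examples from a majority group and only 100 from a minority group whose pattern is different, and it simply learns the majority's pattern.

```python
# A toy sketch of a sample-size disparity. All data is synthetic and the
# numbers are assumptions chosen only to make the effect visible.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Two synthetic features per person; the label rule differs by group
    (a stand-in for the 'cultural differences' Hardt mentions)."""
    X = rng.normal(shift, 1.0, size=(n, 2))
    y = (X[:, 0] + shift * X[:, 1] > shift * (1 + shift)).astype(int)
    return X, y

# Majority group: 5,000 examples. Minority group: only 100.
X_maj, y_maj = make_group(5000, shift=0.0)
X_min, y_min = make_group(100, shift=1.5)

# One model is trained for everyone on the pooled data.
X = np.vstack([X_maj, X_min])
y = np.concatenate([y_maj, y_min])
model = LogisticRegression().fit(X, y)

# The model fits the majority's pattern and largely misses the minority's,
# so accuracy is noticeably lower for the group with fewer examples.
print("majority accuracy:", model.score(X_maj, y_maj))
print("minority accuracy:", model.score(X_min, y_min))
```

Nothing in this code "intends" harm; the disparity falls out of who is and is not represented in the training data.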

Illustration by James Bareham / The Verge

EXAMPLES OF BIASES IN MACHINE LEARNING

There have been numerous cases where biases have shown up in machine learning. For one, it is very difficult for an algorithm to handle noise and subjectivity. Uber and other ride-sharing platforms that rely heavily on ratings are affected by biased training data: if a passenger becomes belligerent or threatening toward the driver, the passenger still holds all the leverage, because lowering the driver's score results in fewer potential customers. Uber and similar ride-sharing companies are attempting to resolve this problem to protect their "employees".

Another case is bias in machine-learning policing algorithms that correlate crime with geography. Some people argue that because such algorithms use location, rather than individuals, as training data, no bias exists in the model. However, given that geography is a strong indicator of racial and socio-economic segregation, policing algorithms end up skewed toward targeting minority groups (a toy sketch of this follows below).

In addition to policing algorithms and ride-sharing companies, one prime issue sits in the palm of our hands. Social media is another arena damaged by biases in the algorithms of Twitter, Facebook, Instagram, and TikTok. Already, users are inundated with advertisements and political campaigns targeted by location, race, and mutual friends. This vast network of people produces an "echo chamber" of filter bubbles that entrench differences. Because of prejudices in the training data and the labels already attached to users, social media algorithms are built to retain users, and the easiest way to do that is to assume what they like from their data.

The criminal justice system also has its struggles with bias in recidivism algorithms. Because of historical prejudices against black men, black men are imprisoned at a higher rate than white men, often for simple misdemeanors. Given past accounts of bias in race-drawn conclusions, the institution itself is problematic when it incarcerates mass numbers of black men.
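The "the model never sees race, only location" defense is easy to test on synthetic data. In this toy sketch (the group labels, the 90% group-to-zip-code correlation, and the skewed arrest rates are all illustrative assumptions, not real figures), race is never given to the model, yet its predictions still split sharply along group lines because zip code acts as a proxy.

```python
# A toy sketch of proxy discrimination: drop the sensitive attribute,
# keep a feature correlated with it, and the disparity comes right back.
# All data and rates below are synthetic assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 10_000

group = rng.integers(0, 2, n)                        # 0 = group A, 1 = group B
# Residential segregation: group membership strongly predicts zip code.
zip_code = np.where(rng.random(n) < 0.9, group, 1 - group)
# Historical arrest records are skewed against group B, independent of behavior.
arrested = (rng.random(n) < np.where(group == 1, 0.30, 0.10)).astype(int)

# "Race-blind" model: the only feature it ever sees is zip code.
model = LogisticRegression().fit(zip_code.reshape(-1, 1), arrested)
risk = model.predict_proba(zip_code.reshape(-1, 1))[:, 1]

print("mean predicted risk, group A:", risk[group == 0].mean())
print("mean predicted risk, group B:", risk[group == 1].mean())
# Group B gets a much higher predicted "risk" even though race was never an input.
```

The point is not that any real policing system uses exactly this model, only that omitting a sensitive attribute does nothing when segregation bakes it into another feature.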

After critically looking at the Georgia State University example highlighted in the White House Big Data Report, there are some questions to ask about the "risk factors" used to assess the university's students. If the university took race or socio-economic status into account as "risk factors" and determinants of a student's level of success, the example becomes very problematic, because the model is then skewed toward concluding that certain individuals, on the basis of race or socio-economic background, will have a harder time graduating on time. That assumption directly ties academic success to socially constructed labels. Similarly, with stop and frisk, the New York City administration under Michael Bloomberg thought it was effective for police officers to use their best judgement and various "risk factors" to decide whom to stop and frisk. These processes become highly problematic, and frankly annoying, when minorities are subjected to disproportionate surveillance.

ETHICS

https://www.knowmail.me/blog/ethical-dilemmas-age-ai/

Thankfully, there is hope. IBM is proposing to appoint chief AI ethics officials to oversee machine-learning data. Assessments that critically examine machine learning decisions made from biased data hold the technology and the datasets accountable. Similar work is being done throughout the tech industry, and ethical AI is emerging as a frontier focus for the future of technology. It isn't so much a matter of implementing a while loop as it is closing the loopholes in biased data (hehe, no pun intended) for a fair and transparent society in the foreseeable future.
