Photo by Pietro Jeng on Unsplash

Big (Biased) Data

How can bias seep into our data and what does this mean for applications of big data?

Polina Stepanova

--

The adage ‘numbers don’t lie’ comes to mind when one considers the reverence assigned to analysis backed by quantitative data. However, we often forget that it is humans who collect, interpret and present this data, and if I’ve learned anything from reading about the human brain, it’s that it often can’t be trusted when making decisions or drawing conclusions. This leads to an uncomfortable realisation: as data, and especially big data, increasingly becomes our crutch, our building block and our key to unlocking potential, so too grows the threat of the bias within.

This article will explore the various biases one may encounter in selecting and interpreting data and the consequent implications for intelligent machines or research in general.

The first bias one may encounter is selection bias, which creeps in when a data sample is chosen: if the sample is not selected at random, it may not be representative of the population being analysed. This is best brought to life through a real-life example.

During World War 2, in an attempt to reduce the number of planes lost in German airspace, the Royal Air Force decided to collect data on where returning planes had been hit by enemy fire and strengthen those areas to better withstand a barrage. This prompted a major research operation, the results of which showed the wings, nose and tail to be most susceptible to enemy fire. However, on the cusp of a decision to use scarce metal to reinforce these areas across the entire fleet, someone pointed out that the data came only from planes that had made it home, not from the ones that were shot down. The damage on the survivors marked the places where a plane could be hit and still fly; if anything, it was the untouched areas that needed reinforcing. In short, using the wrong sample population, a classic case of survivorship bias, could have led to a perilous waste of resources and done little to solve the problem.

Bias also affects data collection, as can be seen with the signalling problem. This is a fairly modern phenomenon in which online data is assumed to be representative of the population at large, despite massive unseen gaps, or “signal failures”, in certain communities or sample groups. For this example, I assume smartphone apps are a little closer to home than British warplane mechanics.

The city of Boston once came up with an intriguing idea: introduce a smartphone app which lets citizens report potholes, allowing authorities to access more real-time data and fix potholes more efficiently. However, more data does not necessarily mean better or more reliable data: smartphone ownership differs across income groups, and familiarity with apps tends to be lower among older residents. Therefore, when judging the success of the initiative, or extrapolating the results to identify areas where potholes are most common, the data will be incomplete and may lead to the wrong conclusions.
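To make this concrete, here is a minimal sketch in Python with entirely invented numbers and neighbourhood names: two areas have the same number of potholes, but because smartphone adoption differs, the raw report counts tell a very different story.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented scenario: two neighbourhoods with the SAME true number of potholes,
# but very different rates of smartphone/app adoption.
true_potholes = {"Riverside": 120, "Hillcrest": 120}
report_probability = {"Riverside": 0.7, "Hillcrest": 0.2}  # chance each pothole gets reported

reports = {
    area: rng.binomial(n=true_potholes[area], p=report_probability[area])
    for area in true_potholes
}

print(reports)
# e.g. {'Riverside': 86, 'Hillcrest': 27} -- the raw counts suggest Riverside's
# streets are roughly three times worse, even though both areas are equally bad.
```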

Today, this bias rears its head more and more often, as people with frequent internet access, who are most active online and on social media, make up a disproportionately large share of the data from which researchers and companies extract and analyse information.

There are many pitfalls to watch out for when interpreting data too. An amusing quote by Bobby Bragan outlines one problem: “Say you were standing with one foot in the oven and one foot in an ice bucket. According to the percentage people, you should be perfectly comfortable.”

Though more than a little facetious, the quote brings to light an important concept: summary figures such as averages and percentages can easily mislead, and so can raw comparisons between variables. Economists, for one, will be familiar with confounding variable bias: if you compared data sets of the amount of ice cream sold in a given month and the number of drowning incidents, you might see a strong correlation. However, the weather is a confounding variable. Because it is hot, more people are buying ice cream and more people are going for a swim and having accidents, so ice cream and drowning are linked only through the temperature, not through each other, however it may initially seem.
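To see the confounder at work, here is a minimal simulation in Python; the temperatures, sales and incident counts are invented purely for illustration. Both series are driven by temperature and not by each other, yet they come out strongly correlated, and the apparent relationship largely vanishes once temperature is controlled for.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical daily temperatures over a year (the confounding variable).
temperature = 15 + 10 * np.sin(np.linspace(0, 2 * np.pi, 365)) + rng.normal(0, 2, 365)

# Both quantities depend on temperature plus independent noise;
# neither depends on the other.
ice_cream_sales = 50 + 8 * temperature + rng.normal(0, 20, 365)
drowning_incidents = 1 + 0.3 * temperature + rng.normal(0, 2, 365)

# Pearson correlation between ice cream sales and drownings.
r = np.corrcoef(ice_cream_sales, drowning_incidents)[0, 1]
print(f"Correlation (ice cream vs drownings): {r:.2f}")

# Controlling for the confounder: correlate the residuals left over after
# removing the linear effect of temperature from each series.
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

r_partial = np.corrcoef(residuals(ice_cream_sales, temperature),
                        residuals(drowning_incidents, temperature))[0, 1]
print(f"Correlation after controlling for temperature: {r_partial:.2f}")
```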

The “correlation is not causation” principle also manifests when comparing data sets. Say you looked at the marriage rate in Alabama and the number of people electrocuted by power lines in the US from 1999 to 2010 (for some reason). You would discover a correlation coefficient of 0.904 between these two data sets. Similarly, if you looked at the population of wild rabbits and the number of active warzones by geographical area, you would see an inverse correlation. But of course, it would be silly to assume that rabbits are the unseen peacekeepers of the world. These are obvious cases, but the message is clear: correlation between two sets of data can lead to the wrong conclusions.
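Part of the danger is how easy such numbers are to produce. The sketch below uses made-up yearly figures (not the real marriage or electrocution statistics): any two series that happen to drift in the same direction over a decade will yield an impressive-looking coefficient.

```python
import numpy as np

# Made-up yearly figures for 1999-2010; these are NOT the real statistics,
# just two gently declining series with a bit of noise.
years = np.arange(1999, 2011)
marriage_rate = np.array([10.8, 10.6, 10.4, 10.1, 10.0, 9.8,
                          9.6, 9.4, 9.3, 9.1, 8.9, 8.7])
electrocutions = np.array([54, 52, 51, 49, 50, 47, 46, 44, 45, 42, 41, 40])

r = np.corrcoef(marriage_rate, electrocutions)[0, 1]
print(f"Pearson correlation, {years[0]}-{years[-1]}: {r:.3f}")  # close to 1
```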

Now let’s say we pull all these biases into an intelligent machine or a self-learning algorithm. Recently there has been a lot of discussion revolving around COMPAS, a risk assessment tool created by a privately held company and used by the Wisconsin Department of Corrections. The tool assesses a person’s risk of re-offending, or even of failing to show up to their court date. When such a tool is used in a courtroom, the implications of data bias become far more serious than an errant pothole or two.

As Google AI Chief John Giannandrea summarised, “The real safety question, if you want to call it that, is that if we give these systems biased data, they will be biased.”

Selection bias and the signal problem may skew the data if, for example, criminal trends are only taken from areas where crimes are well documented rather than from smaller settlements where records may not be easily accessible or detailed. Certain types of crime are also better reported than others, which may erroneously inform the algorithm about which types of crime are common. Any biases police forces around the country may hold relating to race, gender, sexuality or income will also be reflected in the data and fed into the algorithm. And if the algorithm started spotting a correlation between two variables without seeing the confounding variable, it would assess someone’s risk of re-offending incorrectly, thereby exacerbating the bias. No surprise, then, that AIs have already been accused of very human biases such as racism.

Unfortunately, as it is privately owned, we still know little about how COMPAS actually functions; for now only the owners, and perhaps the purchaser, can see how the software makes its decisions. The protection of trade secrets is hardly new, however, so perhaps the real issue is that there is no law in place yet to regulate these sorts of algorithms. But as intelligent algorithms continue to permeate the technology we use daily, from smartphones and online stores predicting what we like to cars that will have to decide how to behave in a collision, the matter of big bias could become a big problem.


--

Polina Stepanova

Here to fuel my curiosity, share insight and write for fun. Passionate about storytelling, behavioural science, tech, gaming and assorted geeky matters.