Understanding statistics is essential to protecting your freedom

I am concerned about Big Data. I have written about this elsewhere, but here I want to emphasize some of the basic statistical errors data analysts make and the profound negative effects those errors have on our lives.

Let’s start with Abraham Wald. Abraham Wald was a mathematical genius. He worked across a number of research fields, founded statistical sequential analysis, and contributed to the early development of operations research. He also knew how to get answers to real statistical problems during World War II.

One famous task Wald had to complete was to help decide where to place armor plating on bombers. Wald started where we would start today: to make any headway, we need data. The data was collected from all planes that returned from missions. The ground crews were asked to note the location of hits. When all the data was collected, the following pictures emerged:

Wald’s bullet hole data

On the left is the diagram given to ground crews to fill out. The right-hand image shows all the recorded hits. Notice that most of the plane is now black, except for the area around the cockpit and the tail. Where should the armor plating go? Wald’s insight was to realize that his dataset was biased. He suggested putting armor plating on the parts of the plane where there were no bullet holes. Why? Those white spaces represent missing data, and that missing data comes from the planes that did not return. In other words, a plane hit in the white areas was unlikely to make it back to be recorded. The data was not what statisticians call a “simple random sample”.
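To see the effect concretely, here is a minimal simulation sketch of the selection problem Wald spotted. All of the numbers (hit locations, survival probabilities, number of missions) are invented for illustration and have nothing to do with the historical data.

```python
import random

# Toy simulation of Wald's insight. All numbers are invented.
# Each plane takes a few hits, spread uniformly over four sections;
# hits to the engine or cockpit are far more likely to bring it down.
SECTIONS = ["fuselage", "wings", "engine", "cockpit"]
SURVIVAL_PROB = {"fuselage": 0.95, "wings": 0.95, "engine": 0.40, "cockpit": 0.40}

def fly_mission(n_hits=3):
    """Return the hit locations and whether the plane made it home."""
    hits = [random.choice(SECTIONS) for _ in range(n_hits)]
    survived = all(random.random() < SURVIVAL_PROB[h] for h in hits)
    return hits, survived

random.seed(0)
recorded = {s: 0 for s in SECTIONS}  # hits the ground crews get to see
actual = {s: 0 for s in SECTIONS}    # hits on every plane, returned or not

for _ in range(10_000):
    hits, survived = fly_mission()
    for h in hits:
        actual[h] += 1
        if survived:  # ground crews only ever inspect returning planes
            recorded[h] += 1

for s in SECTIONS:
    print(f"{s:>8}: recorded {recorded[s]:5d} of {actual[s]:5d} actual hits")
# Engine and cockpit hits are heavily under-counted, so the recorded data
# looks cleanest exactly where the plane is most vulnerable.
```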

Random samples are important when using statistical methods to infer patterns from data. However, most data we deal with has been selectively sampled. This is especially the case when we are dealing with data generated by individuals’ choices. For example, consider surveys. These data are not random samples. Let’s think about why this is the case. How many of you have been asked to complete surveys on the quality of service from a retailer? How many of you have answered them? I know that most of the time my thinking is that I have better things to do, so why waste time on a survey? This very fact leads to non-random sampling: only people who can be bothered to fill out the survey will submit their views. The rest of the population remains unsampled! Chances are that the people who respond are systematically different from those who do not.
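As a sketch of this self-selection effect, the toy simulation below assumes (purely for illustration) that less satisfied customers are more likely to return the survey, and compares the surveyed average with the true population average.

```python
import random

# Toy model of survey self-selection. Assumption (invented): the less
# satisfied a customer is, the more likely they are to return the survey.
random.seed(1)
population = [random.randint(1, 5) for _ in range(100_000)]   # true scores, 1-5
response_prob = {1: 0.30, 2: 0.20, 3: 0.10, 4: 0.05, 5: 0.05}

respondents = [score for score in population
               if random.random() < response_prob[score]]

true_mean = sum(population) / len(population)
survey_mean = sum(respondents) / len(respondents)
print(f"true mean satisfaction:     {true_mean:.2f}")
print(f"surveyed mean satisfaction: {survey_mean:.2f}")
# The survey understates satisfaction because the filter deciding who
# responds is correlated with the very thing being measured.
```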

Selected samples are everywhere. It is very rare to have pure random samples. There is always a filter sitting between the population and the final sample you work with. And no matter how large you make your selected sample, you will always be missing data. “Big Data” practitioners assume that because their data is so large they must be getting close to covering the entire population. This is incorrect and a dangerous mistake to make.
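The point about size can be sketched as well. The example below uses an invented selection rule (units with larger values are more likely to be recorded) and compares a small genuinely random sample with a selected sample hundreds of times larger.

```python
import random

# Invented selection rule: units with positive values are twice as likely
# to end up in the "big" dataset as units with negative values.
random.seed(2)
population = [random.gauss(0, 1) for _ in range(1_000_000)]

def mean(xs):
    return sum(xs) / len(xs)

small_random_sample = random.sample(population, 1_000)
big_selected_sample = [x for x in population
                       if random.random() < (0.6 if x > 0 else 0.3)]

print(f"true population mean:        {mean(population):+.3f}")
print(f"random sample of 1,000:      {mean(small_random_sample):+.3f}")
print(f"selected sample of {len(big_selected_sample):,}: {mean(big_selected_sample):+.3f}")
# The tiny random sample lands near the truth; the selected sample is
# hundreds of times larger and yet stays biased. Growing it further only
# makes the biased answer more precise, not more correct.
```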

Where does my concern come from? Like most people, I read about the Snowden leaks. The NSA and GCHQ collect reams of data on our internet and cellphone usage. Most people are unafraid of this. “I have nothing to hide” is a common response. The only way you can hold onto this idea is to assume that someone (a person) is sitting and poring over the data, looking at the patterns and realizing that you have done nothing wrong. For example, a human being can determine when something is most likely a joke. However, this folksy idea is wrong. There is no doubt that algorithms, designed to look for statistical patterns, are doing the hard work. Most of these algorithms require no human intervention and run thousands of alternative statistical models to find patterns (the “Big Data” approach). Notice the difference: the computer is looking for a pattern and will not stop until it finds one. There is an assumption of guilt built into the algorithm. Is that a problem? Yes, especially when we are dealing with selected samples. As Viktor Mayer-Schönberger and Kenneth Cukier state in their book “Big Data: A Revolution That Will Transform How We Live, Work, and Think”:

In a big-data world, by contrast, we won’t have to be fixated on causality; instead we can discover patterns and correlations in the data that offer us novel and invaluable insights. The correlations may not tell us precisely why something is happening, but they alert us that it is happening.

This is worrying. Consider a statistical pattern discovered amongst terrorists’ browsing behavior prior to perpetrating an attack. Maybe they start visiting lots of websites about the afterlife. We don’t care why; it’s irrelevant. All we need to do is find this pattern somewhere else and we can find the bad guys. Let’s search all the metadata for this pattern. We find a bunch of people doing the same thing. Let’s bring these people in. Is this a good idea? Maybe our metadata is missing something. Maybe some of our targets are people who have lost loved ones and have not mentioned it on Facebook (missing data due to selection). How will we be able to tell the baddies from the goodies? We won’t. Since the damage of one missed baddie is so great, let’s arrest all of them. We don’t care why these people browsed this way; all we care about is that they did. Right? Isn’t this worrying?
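The numbers behind this worry are easy to sketch. The back-of-the-envelope calculation below uses invented figures for the detection rates and the number of genuine plotters; it illustrates the base-rate problem only, not any real program.

```python
# Back-of-the-envelope base-rate calculation. All figures are invented.
population = 100_000_000          # people whose metadata is searched
plotters = 100                    # genuine bad guys among them
innocents = population - plotters

true_positive_rate = 0.99         # plotters who show the browsing pattern
false_positive_rate = 0.001       # innocents who happen to show it too

flagged_plotters = plotters * true_positive_rate
flagged_innocents = innocents * false_positive_rate
flagged_total = flagged_plotters + flagged_innocents

print(f"people flagged:                       {flagged_total:,.0f}")
print(f"of whom genuine plotters:             {flagged_plotters:,.0f}")
print(f"chance a flagged person is a plotter: {flagged_plotters / flagged_total:.3%}")
# Roughly 100,000 people get flagged to catch about 99 plotters; the
# overwhelming majority of hits are bereaved browsers, the curious, and
# other false positives the metadata alone cannot distinguish.
```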

The data governments collect on us are imperfect summaries of who we are. We choose to post things on Facebook. We choose to send certain messages. We also choose to have face-to-face conversations and to buy things with cash, which go unrecorded in the data centers. The metadata collected on us is a selected sample. It takes a human being to determine whether we are getting the full picture or not. A computer (as of now) cannot figure out what kind of data might be missing. It lacks that imagination. Humans are the only ones who can do that. Understanding statistics allows us to ask questions about how our data is interpreted and allows us to engage with more serious questions about privacy and freedom.

Since writing this I have had some discussions with big data practitioners. Their key objection is that there is plenty of evidence that these methods work very well. For example, consider driverless cars, which use machine learning techniques. I agree: these algorithms perform remarkably well in those environments. But what differentiates those areas from the sphere I discuss above is that corrections to the algorithm are immediate. In other words, when the car crashes you know the algorithm is bad. However, where are the equivalent crashes when dealing with profiling of the sort described above? The corrections are absent.