
Can You Trust Published Data?

Most probably not.

Samer Costello
Data Wonk
May 9, 2016 · 8 min read

Picture this scenario. A man walks into a supermarket, and a dashing young lady in a yellow Ariall shirt approaches him with a smile. “Excuse me, good sir, would you mind answering a question for me?” she asks. The man nods in agreement. “Do you use Ariall or ONO as a washing detergent at home?” she inquires.

At this point, one of three situations typically occurs:
1. The man doesn’t know which detergent his family uses at home. He admits this, and the lady moves on to the next incoming person.
2. The man responds with one of the options, be it because he wants to keep talking to the lady, or because he is in a hurry and just wants to be done with her, or even out of mild curiosity, simply to see where the conversation goes. His motivations don’t matter, nor does which of the two options he picks; what matters is that he’s given her a false value that will be recorded onto the chart and, a few recordings later, becomes indiscernible from all the rest.
3. The man knows which detergent they use at home. He responds and gives her an accurate value.

Granted, there may be other responses and different circumstances, but the important thing to note is this: in that sort of situation, data collection can easily be skewed by the respondents’ own biases. Some may overstate values simply to impress, a typical case when men are asked about their incomes. Others may understate the same values for various reasons, such as the fear of attracting the taxman’s attention. Some may be racially or ethnically biased.
The point here is that the respondents cannot be trusted, at least not fully. For the sake of future reference, let’s call this Exhibit A: Inherent Bias.

The second issue is in the question itself: “Do you use Ariall or ONO?” The researcher no doubt sought to find out how use of the two products varies across the populace. He wanted a simple comparison. Though his aims may have been noble, his method is not without fault.

“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” — John Tukey

The question restricts the respondent’s choices. He may use neither, or may have used both at different periods of his life. This is one of the ways in which a researcher may lead the respondents: he has already determined which items he’s going to test for, probably assuming that the two are the leading brands in the market. This assumption, be it wrong or right, will give inaccurate results that may conflict with sales data later on.
Let’s call this Exhibit B: Hypothesis Bias.

Assume this research is being done by an agency with advertising in mind. They want to show that most people in the area use ONO, in a bid to convince Ariall to buy advertising space on their billboards in the locality. Their first result shows 8 out of 10 respondents use Ariall. This isn’t favourable to their cause, so they shelve this result and try again, and they keep doing this until, eventually, they get the result they want.

But how do they get this result? Simple. By keeping the sample small, they’re likely to get results that vary far more than is typical.

“If you torture the data long enough, it will confess” — Ronald H. Coase

Take a coin toss, for example. In theory, a fair coin should land heads 50% of the time and tails 50% of the time, but small samples rarely behave that way. If you toss it 10 times, you may end up with 8 heads and 2 tails. But if you toss it, say, 1000 times, the counts will more or less conform to the theoretical 50:50 ratio.
It therefore isn’t difficult for the agency to get the value they want. We’ll call this Exhibit C: Sampling Bias.
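To make this concrete, here is a minimal Python simulation of the coin-toss argument (not from the original article, just an illustration): small samples swing wildly, large samples settle near 50:50, and an agency willing to re-run a small survey until it gets a lopsided split doesn’t have to wait long.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def heads_ratio(n_tosses):
    """Toss a fair coin n_tosses times and return the fraction of heads."""
    return sum(random.random() < 0.5 for _ in range(n_tosses)) / n_tosses

# Ten tosses can easily give lopsided splits like 0.8 or 0.2 ...
print([round(heads_ratio(10), 1) for _ in range(5)])
# ... while a thousand tosses stays close to the theoretical 0.5.
print([round(heads_ratio(1000), 2) for _ in range(5)])

# The "keep re-sampling until you like the answer" trick: with samples of
# ten, an 8-out-of-10 split in favour of one side shows up quickly.
attempts = 0
while heads_ratio(10) < 0.8:
    attempts += 1
print(f"Got an 8/10 split after {attempts} retries")
```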

They then publish their final result, with no mention of the previously shelved experiments. This falsely implies that if one were to repeat the experiment, they’d get similar results, which is highly unlikely. We’ll call this Exhibit D: Unreported Bias.

Let’s assume the agency also had to prove that the people living there had the means to buy Ariall’s washing machine detergent, a premium product compared to ONO’s washing powder. They could either target high-income individuals, or use statistical values to delude the public. The former is harder to achieve than the latter: it’s quite difficult to tell, even from how a person dresses, whether they are of middle-income means or rich, and finding very rich people among the general populace, say at the entrance to a supermarket, is also improbable. So, what does the agency do?

They simply use the word ‘average’ in their report.
“Residents of Area X earn an average of Kes.10,000 a month” it states.
But what do they mean by average? Is it mean, mode or median?

When a distribution is roughly normal, the mean, mode and median sit close together in the middle, and it hardly matters which one you quote. Income distributions, however, are usually skewed: a handful of very high earners can drag the mean far above what the typical resident actually takes home. When the word average is used and the skew is hidden from us, the figure becomes outright misinformation. If Ariall uses the “average value” of Kes.10,000 to determine whether most of the residents could afford their product, they would be totally wrong. I’ll call this issue Exhibit E: Reporting Bias.
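A toy example in Python (the incomes are invented purely for illustration) shows how a single high earner can produce the quoted “average” of Kes.10,000 while the typical resident earns far less:

```python
# Hypothetical monthly incomes (Kes.) for ten residents of Area X:
# nine modest earners and one outlier pulling the mean upwards.
from statistics import mean, median, mode

incomes = [3000, 3000, 3500, 4000, 4000, 4000, 5000, 5500, 6000, 62000]

print(f"mean   : {mean(incomes):,.0f}")    # 10,000 - the headline 'average'
print(f"median : {median(incomes):,.0f}")  # 4,000  - what the typical resident earns
print(f"mode   : {mode(incomes):,.0f}")    # 4,000  - the most common income
```

All three are “averages”, but only one of them supports the claim that most residents can afford a premium product.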

So, Statistics Are Biased. Is There Hope?

Yes.
But first, a word of caution. Never, and I mean never, ever take a statistic at face value. Always try to find out where the value came from, and determine the context for yourself.

So, how do we avoid an Exhibit A: Inherent bias situation?
There are several methods you can use. You can employ a variety of people to do the questioning, and average their results. You could use questionnaires, or digital survey machines like mSurvey. Or, you could simply not ask them.

Wait a minute. If we don’t ask them, how do we get the data?

Herein lies the beauty of living in this modern age. Compared to earlier times, when expressing yourself was pretty much restricted to writing ballads or literally shouting from the rooftops, telling your story today has become almost second nature. An archive of all these expressions, ranging from the sensible to the preposterous, exists, ready for sampling, in the vast, open corridors of the internet.

Young teens find it easier to “be themselves” online than offline
— EU Kids Online project

It is said that our digital personas are representations of our inner selves, or at the very least a window into who we aspire to become. This information, when asked for in a survey or an interview, may not be willingly given, but the same person feels no such reluctance in sharing it on a social platform.
These expressions, when scraped from these platforms and fed through complex algorithms, can produce fairly accurate results. These results can then be used to better understand the behaviour of one’s user base, markets, consumers, or even the competition. Tools like Dive (our in-house analytics platform) can be designed specifically to obtain these results.
However, having the tool, or even just the results, is not enough to make a concrete decision. Insights can only be acquired through the study of context, and the application of validity checks to see if the results concur when different data sets are employed.

This form of data acquisition takes care of Exhibit B: Hypothesis Bias, but not completely. What the researcher looks for using the tool (be it keywords, hashtags or handles) must be chosen in a way that takes into consideration as many relevant factors as possible.
The most common error resulting from using too few explanatory variables is the “post hoc ergo propter hoc” fallacy, meaning “after this, therefore because of this”. Many people, scientists included, mistakenly assume that if Z comes after Y, then Y causes Z. It can just as easily be the case that X causes both Y and Z. It could also be that W causes Y while decreasing the probability of X, and X would normally prevent Z. Statistics alone can show correlation, but it tells very little about causation.
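A quick Python sketch of the common-cause case (synthetic numbers, purely illustrative): X drives both Y and Z, so Y and Z end up strongly correlated even though neither causes the other.

```python
import random
from statistics import correlation  # available from Python 3.10

random.seed(0)

# X is the hidden common cause; Y and Z each depend on X plus noise,
# but have no direct causal link to each other.
x = [random.gauss(0, 1) for _ in range(10_000)]
y = [xi + random.gauss(0, 0.5) for xi in x]
z = [2 * xi + random.gauss(0, 0.5) for xi in x]

print(f"corr(Y, Z) = {correlation(y, z):.2f}")  # ~0.87: strong, yet neither causes the other
```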

When it comes to Exhibit C: Sampling Bias, the solution is quite simple: have the numbers. A pre-check when using social media tools ensures that search terms meet or surpass preset minimum volume thresholds. Fashion-related terms, for example, are very prominent on social media and on fashion blogs, and hence can easily be used to determine fashion trends within the country. The same cannot be said for chicken supplements. The threshold check deters the use of social media as a way of investigating the chicken supplement case, and hence saves resources. Similar prerequisites can be set for other sources of information.
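Such a pre-check could be as simple as the sketch below. The counts, the threshold and the helper function are all hypothetical; they merely stand in for whatever volume measure a given platform actually exposes.

```python
MIN_MENTIONS = 500  # assumed minimum number of mentions before a term is worth sampling

def passes_precheck(term: str, mention_counts: dict, threshold: int = MIN_MENTIONS) -> bool:
    """Return True if the term shows up often enough to give a usable sample."""
    return mention_counts.get(term, 0) >= threshold

# Hypothetical mention counts scraped over some fixed window.
mention_counts = {"ankara dresses": 12_400, "chicken supplements": 37}

for term in mention_counts:
    verdict = "usable" if passes_precheck(term, mention_counts) else "too few mentions - skip"
    print(f"{term}: {verdict}")
```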

Sadly, there isn’t much that can be done about Exhibit D: Unreported Bias. It is simply the responsibility of the researcher to conduct accurate experiments and be open with the results they achieve. It’s hard to find any company or agency with a very open policy regarding its own data, other than open-source organisations and governments. However, the credibility of the data they release can be somewhat enhanced if they hire independent third-party firms to conduct the research, as those are less likely to be biased.

Exhibit E: Reporting Bias is mostly the prerogative of the media. As much as some reporters would like to fashion fancy headlines (otherwise known as clickbait), they must ensure that, at the very least, suitable clarifications are made so as not to mislead their readers. Researchers should also endeavour to write their findings in a way that is easily understandable. This should make publication easier, and may even encourage journalists (even those who are not data-oriented) to take an interest in the story the findings could tell.

There are indeed several other issues with statistics not discussed here, and there are those who would use such unethical practices to deceive the public (through general media), misinform executives (within organisations), or even lead their country down a treacherous path (through government-commissioned reports) to serve their own agenda. This statistical foul play is not easy to spot when read as a headline in a newspaper or as a value on some chart during a board presentation. However, I believe that with the little knowledge imparted in this article, and your own determination to find out the truth behind the data, we can collectively begin making better-informed decisions that will eventually help us improve ourselves, our organisations, and society as a whole.


Samer Costello
Data Wonk

I’ve learned I don’t know anything. I have also learned that people will pay for what I know. CTO @Odipodev | Data Activist