How to Lie with Statistics
There is something about numbers, isn’t there? They have an uncanny ability to suspend our common sense. When they appear, we believe some truth is about to be imparted. Today we live in an information world, and much of that information is derived mathematically, using numbers.
Statistics are the mathematical tools we use to analyse data. They help keep us informed about what is happening in the world around us. They offer us perspective on the past and can help make the future a bit more predictable.
Yet, statistics can be used to manipulate, sensationalise and confuse. Sneaky use of statistics is quite common in news, media and even medical research. However, once you know these statistical tricks it is difficult to un-see them. This post is based on the book How to Lie with Statistics by Darrell Huff, written in 1954, so it’s old, but it shows we’re still falling for the same stories.
Samples: an incomplete picture of the whole
“Average Americans brush their teeth 1.02 times a day.” When we hear this statistic, we should ask questions like:
- How could they have figured this out?
- Does it make sense that it could have been researched properly?
- If it has been researched, could people have lied?
Samples are, essentially, incomplete pictures of the whole. How much of the whole? That is the question. When a sample is large enough and selected properly, it tells us something.
We see many conclusions from samples that are too small, biased or both. If our sample is large enough and selected properly, it will represent the whole well enough for most purposes. If it is not, it may be far less accurate than an intelligent guess.
A result of a sampling study is no better than the sample it is based on
By the time data has been filtered through layers of statistical manipulation and reduced to a decimal-pointed average, the result can look very convincing. A close look at the sample may dent your confidence.
A pure random sample is the only kind that can be examined with complete confidence by statistical theory. There is one major issue, though: for many projects it is so difficult and expensive to obtain that the sheer cost rules it out. A more economical substitute, almost universally used in fields such as opinion polling and market research, is stratified random sampling.
To get a stratified random sample, we divide the population into several groups and sample from each group in proportion to its known share of the population. There are a few big issues, however:
- The information about the proportions may not be correct
- On top of that, how do you get a random sample within the stratification?
The obvious thing is to start with everyone’s name and go after randomly chosen names. But this is too expensive.
So we go into the streets, which biases our sample against stay-at-home people. We go from door to door by day and miss employed people. We switch to evening interviews and neglect the movie-goers and night-clubbers.
It is very important to choose the right sample selection process and to do it correctly, to get accurate findings. If you’re the client, it’s important to inquire about the sample selection process to understand the possible biases in the results.
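The stratified sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_sample`, the two strata, their shares, and the placeholder names in the sampling frame are all invented for the example, and it assumes we already hold a complete list of everyone in each stratum, which is exactly the part that is expensive in practice.

```python
import random

def stratified_sample(population, strata_shares, n, seed=0):
    """Draw about n members so each stratum matches its assumed population share.

    population:    dict of stratum name -> list of members (the sampling frame)
    strata_shares: dict of stratum name -> assumed share of the whole population
    """
    rng = random.Random(seed)
    sample = {}
    for stratum, share in strata_shares.items():
        quota = round(n * share)  # how many to draw from this stratum
        # rng.sample draws without replacement; every member is equally likely
        sample[stratum] = rng.sample(population[stratum], quota)
    return sample

# Hypothetical sampling frame: names and shares are made up for illustration.
frame = {
    "urban": [f"urban_{i}" for i in range(700)],
    "rural": [f"rural_{i}" for i in range(300)],
}
shares = {"urban": 0.7, "rural": 0.3}

picked = stratified_sample(frame, shares, n=50)
print({name: len(members) for name, members in picked.items()})
```

If the assumed shares are wrong, or if the draw inside each stratum is replaced by "whoever is on the street", the quotas still look right while the sample is biased.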
Is there only one Average?
The word “average” leaves room for manipulation. There are three kinds of average, not one as is commonly assumed, and the same data can give three different values depending on which one is calculated.
The three types of average are:
- Mean: The mean is the usual average, the sum of the values divided by how many there are
- Median: The median is the middle value, after sorting the data in increasing order
- Mode: The mode is the value that occurs more often than any other.
We tend to assume that these are the same, but different situations call for different kinds of average to describe them accurately.
Sometimes we just choose the one that supports our argument.
In a normal distribution (bell curve), the three will be close to each other, but in a skewed distribution (e.g. annual household income), they can be vastly different.
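A small example makes the gap between the three averages concrete. The income figures below are invented for illustration (in thousands per year, with one very high earner); the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical annual household incomes, in thousands. One outlier skews the data.
incomes = [20, 20, 25, 30, 35, 40, 500]

mean_income = statistics.mean(incomes)      # sum divided by count
median_income = statistics.median(incomes)  # middle value of the sorted data
mode_income = statistics.mode(incomes)      # most frequent value

print(f"mean   = {mean_income:.1f}")
print(f"median = {median_income}")
print(f"mode   = {mode_income}")
```

With these numbers the mean is more than three times the median, so whoever reports "the average income" can pick the figure that suits the argument.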
Discarded Data, the data which is absent
Statistics used in the marketing of consumer products can be tricky. Unsurprisingly, they tend to favor the product being presented.
First, the sample size can be very small. With small samples, results vary widely from one sample to the next. With 10 coin flips you can get 8 heads, but you’re not likely to get 80 heads in 100 coin flips.
By hiding the conditions under which a study was run, or the pros and cons of its setting, a researcher can steer almost any result toward the desired conclusion.
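The coin-flip comparison can be checked directly with the binomial distribution. The sketch below assumes a fair coin and independent flips; `prob_at_least` is a helper name chosen for this example.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_small = prob_at_least(8, 10)    # 80% heads or more in 10 flips
p_large = prob_at_least(80, 100)  # 80% heads or more in 100 flips

print(f"P(at least 8 heads in 10)   = {p_small:.4f}")
print(f"P(at least 80 heads in 100) = {p_large:.2e}")
```

The same 80% share of heads happens roughly once in twenty tries with 10 flips, and is vanishingly rare with 100 flips.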
Companies can keep running experiments until they get the results they want, discarding the experiments that failed to produce “significant findings.”
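To see why repeating an experiment until it "works" is a problem, consider a rough calculation. It assumes each experiment is independent, the product has no real effect, and a result counts as "significant" at the conventional 5% level; none of these details come from the book, they are assumptions for the sketch.

```python
def prob_any_false_positive(num_experiments, alpha=0.05):
    """Chance that at least one of several independent null experiments
    comes out 'significant' purely by luck."""
    return 1 - (1 - alpha) ** num_experiments

for k in (1, 5, 20, 50):
    print(f"{k:>2} experiments: {prob_any_false_positive(k):.1%}")
```

Under these assumptions, running 20 experiments and reporting only the winner gives better-than-even odds of a "significant finding" for a product that does nothing.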
In summary, statistics are often used to prove a proposition that has no real basis. The presenter amplifies a very small difference between two things and tries to prove that one is superior to the other, when the difference is well within normal variation.
Graphs can blur, exaggerate and hide
Numbers alone are not always enough to make a report comprehensible. Pictures are easy to understand, and there is no better way to make numbers comprehensible to a lot of people.
Let’s look at some graph examples.
Non-Zero Baseline
At first glance, this graph makes it look as if three times as many Democrats as Republicans supported the decision. On closer inspection, note the scale on the vertical axis: it does not start at zero. Only slightly more Democrats than Republicans supported the decision (62% vs. 54%).
If you really want to make a shocking statement, make sure you include only part of the data. Take this example of a misleading graph designed to prove global warming: in the graph and data below, only the January to July data is included!
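The size of the distortion follows from simple arithmetic on the bar heights. The sketch below assumes the vertical axis starts at 50%, which is a guess that happens to reproduce the "three times" impression; the actual baseline of the chart is not stated here.

```python
def visual_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights when the axis starts at `baseline`."""
    return (a - baseline) / (b - baseline)

democrats, republicans = 62, 54  # percentages from the text

honest = visual_ratio(democrats, republicans)                 # axis starts at 0
truncated = visual_ratio(democrats, republicans, baseline=50) # assumed axis start

print(f"axis from 0:  bars differ by a factor of {honest:.2f}")
print(f"axis from 50: bars differ by a factor of {truncated:.2f}")
```

The data are identical in both cases; only the starting point of the axis changes how large the gap looks.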
Correlation vs Causation
Smoking drags a student’s grades down: this was a finding that made a good number of people happy. The study met all the standards of statistics, but it rested on the ancient fallacy that says “if B follows A, then A has caused B”. Correlations are part of our everyday life, and it is easy to be misled into treating one as the cause of some event.
It could very well be the other way round: low grades may lead to smoking. Another possibility is that both smoking and low grades result from a third factor, such as not taking books seriously.
A trick used very commonly in the media is to link an issue with another, unrelated issue just to serve the presenter’s purpose. In some cases there is a genuine positive correlation, but it may hold only up to a point, beyond which more of the input starts to hinder rather than help.
Have a look at the image above. At first glance, ice cream sales and shark attacks rise and fall together. Should we believe that sharks are attracted to ice cream, and that this is why shark attacks happen? The answer, of course, is no! Shark attacks are likelier to happen in summer, which is also the season when ice cream consumption increases.
The main idea here is that two events may be correlated while other factors are driving the changes in both. The manipulation is to pick one of the hundreds of possible factors, one that is not actually the cause, and claim that it brought about the result.
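The ice cream and shark example can be imitated with simulated data. Everything below is made up for illustration: daily temperature is the hidden third factor, and both ice cream sales and shark attacks are generated from temperature plus random noise, with no link between them. The partial correlation formula then asks how much correlation is left once temperature is accounted for.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)  # fixed seed so the run is repeatable
temperature = [rng.uniform(10, 35) for _ in range(2000)]     # hidden cause
ice_cream = [5 * t + rng.gauss(0, 20) for t in temperature]  # depends on temperature only
sharks = [0.2 * t + rng.gauss(0, 1) for t in temperature]    # depends on temperature only

r_raw = pearson(ice_cream, sharks)
r_partial = partial_corr(
    r_raw, pearson(ice_cream, temperature), pearson(sharks, temperature)
)
print(f"ice cream vs sharks:               r = {r_raw:.2f}")
print(f"same, with temperature held fixed: r = {r_partial:.2f}")
```

The raw correlation is strong even though neither variable influences the other, and it largely disappears once the common cause is controlled for.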
Semi-Attached Figure
If you can’t prove what you want to prove, demonstrate something else and pretend that they are the same thing — Darrell Huff
The semi-attached figure is a tool that can be used to cope with any situation that is not in the presenter’s favor.
Consider some examples:
- “Clear weather is more dangerous than foggy weather, as more accidents occur in clear weather.” In fact, more accidents happen in clear weather simply because there is more clear weather than foggy weather.
- A cold remedy that kills germs in a test tube kills all the different types of germs there, not just the one behind the cold. The trick is to say nothing about the other germs and talk only about the cold germ.
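The clear-weather example comes down to comparing raw counts instead of rates. The figures below are hypothetical, chosen only to show how the conclusion flips once accidents are divided by the hours of each kind of weather.

```python
# Hypothetical figures for one year on one stretch of road.
weather = {
    "clear": {"accidents": 900, "hours": 8000},
    "foggy": {"accidents": 100, "hours": 500},
}

# Accidents per hour of exposure, the figure that actually measures danger.
rates = {
    name: data["accidents"] / data["hours"] for name, data in weather.items()
}

for name, data in weather.items():
    print(f"{name}: {data['accidents']} accidents, {rates[name]:.3f} per hour")
```

With these invented numbers, clear weather has nine times as many accidents, yet an hour of fog is the more dangerous hour.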
Another trick is based on the fact that the same data can be expressed in different ways. Here are some facts about a company:
- 1% return on sales
- 15% return on investment
- a 10 million dollar profit
- an increase in profits of 40% compared with some old average
- a decrease of 60% from last year
Each of these figures tells part of the story, so each is partially true, but the whole story has a different meaning.
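The five figures above can all describe one and the same company. As a rough check, the sketch below assumes they all refer to the same 10 million dollar profit and works backwards to the sales, investment, and earlier profits that would make every statement true at once. The derived amounts are implied by that assumption, not taken from the book.

```python
profit = 10.0  # this year's profit in million dollars, from the list above

sales = profit / 0.01            # 1% return on sales
investment = profit / 0.15       # 15% return on investment
old_average = profit / 1.40      # profit is 40% above some old average
last_year = profit / (1 - 0.60)  # profit is 60% below last year's

print(f"sales:              {sales:8.1f} million")
print(f"investment:         {investment:8.1f} million")
print(f"old average profit: {old_average:8.1f} million")
print(f"last year's profit: {last_year:8.1f} million")
```

Which of the five statements gets quoted depends on whether the presenter wants the company to look modest, profitable, growing, or in decline.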
How to Statisticulate/Manipulate
It is not always the statisticians who produce manipulated statistics. A sound set of findings may be distorted by salespeople on its way to market. Sometimes the policy designers or statisticians are not skilled enough to find or interpret the exact relationship between two phenomena, and that leads to poor policy.
Percentages offer a fertile field for confusion. And like the ever-impressive decimal, they can lend an aura of precision to the inexact.
There is a tale of a roadside merchant selling a rabbit burger:
He was asked to explain how he was able to sell a rabbit burger so cheap. “Well,” he said, “I have to put in some horsemeat too. But I mix ’em 50:50: one horse, one rabbit!”
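The merchant's "50:50" is true by head count and wildly misleading by weight. The animal weights below are rough guesses for illustration only.

```python
# Assumed weights in kilograms; rough, illustrative values.
horse_kg = 500.0
rabbit_kg = 2.0

share_by_count = 1 / (1 + 1)                        # one horse, one rabbit
share_by_weight = rabbit_kg / (horse_kg + rabbit_kg)

print(f"rabbit share by count:  {share_by_count:.0%}")
print(f"rabbit share by weight: {share_by_weight:.2%}")
```

A percentage means little until you know what it is a percentage of.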
How to talk back to Statistics
Ask 5 simple questions.
Who says so?
The “Who” question affects the reliability of the information.
The first thing to look for is bias: the laboratory with something to prove for the sake of a theory, a reputation, or a fee.
Look for conscious bias. The method may be a direct misstatement, or it may be an ambiguous statement that cannot be proven false. It may be the selection of favorable data and the suppression of unfavorable data. Units of measurement may be shifted. A different kind of average, such as the median or the mode, may be hidden behind the unqualified word “average”.
Look carefully for unconscious bias. It is often more dangerous.
How do they know?
In what way was the information for the study gained? Was it reliable? Did people respond honestly?
What’s missing?
The most common technique for presenting distorted information is to hide information. So looking for hidden or missing information can reveal a lot more of the truth than the author tells.
Did someone change the subject?
Changing the subject between the evidence and the conclusion is like changing the direction of the study, so it presents a different kind of result.
Does it make sense?
Any statistical calculation invites us to infer a conclusion from it, and we are tempted to do so on the strength of the calculation alone. But not every inference is meaningful.
So, as an ordinary reader rather than an expert, the last question to ask is: does it make sense, or is it irrational and out of context?
Remember: Statistics don’t lie, people do.
- Darrell Huff, How to Lie with Statistics (1954)