Three Ways Statistics Are Lying To You

Dave Rauschenfels
Published in Analytics Vidhya
5 min read · Sep 3, 2020

Mark Twain once said that “Facts are stubborn things, but statistics are pliable.”

In statistics, the truth lies in the analysis, and that truth is fluid because there are scores of techniques for cutting and charting the same information. But not all analyses are created equal. The field is rich with opportunities for analysts to map data and, occasionally, to misrepresent it. With information streaming onto social media and the news every minute, it is essential to recognize trickery, whether it stems from fraud or from incompetence.

For example, this is a graph of Coronavirus deaths in Minnesota for the first 150 days of the pandemic using data provided by the Minnesota Department of Health. The top graph counts the deaths per day. The lower graph sums all of the Coronavirus deaths for the first 150 days of the pandemic.

Minnesota Coronavirus Deaths

At first glance the daily Coronavirus death counts look like random, subsiding noise. Yet when summed over time, the data clearly shows a flattened curve. There are, however, many more ways to graph or spin the same data.
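The relationship between the two graphs can be sketched in a few lines of Python. The daily counts below are invented for illustration and are not the Minnesota Department of Health data; the point is only that the lower graph is a running total of the upper one.

```python
from itertools import accumulate

# Hypothetical daily counts, invented for illustration (not MDH data).
daily_deaths = [0, 2, 5, 9, 12, 10, 7, 4, 2, 1]

# Running total: element i is the sum of daily_deaths[0..i].
cumulative_deaths = list(accumulate(daily_deaths))

# Differencing the running total recovers the daily counts, so both
# charts carry the same information in a different shape.
recovered = [cumulative_deaths[0]] + [
    later - earlier
    for earlier, later in zip(cumulative_deaths, cumulative_deaths[1:])
]

print(cumulative_deaths)
print(recovered == daily_deaths)
```

Because daily counts are never negative, the running total can only rise or level off, which is why the summed chart looks so much smoother than the daily one.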

1. Sample Bias

Sample Bias in Coronavirus deaths

The first way to spin the data is Sample Bias. This bias arises when a researcher cherry-picks information to support a half-truth. In this example I have excluded fifty days of death data. Student News Daily recently reviewed the New York Times and found that it had excluded New York from its chart on the growth of new positive Coronavirus cases. But not all bias is the result of half-truths. For example, news stations occasionally like to poll their audience on a variety of topics. These polls tend to be flawed because they rely on voluntary responses from the audience. The results can be further skewed by undercoverage and even by the wording of the question.
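To see how much a cherry-picked window can change the story, here is a minimal sketch that fits a straight-line trend to a full series and to a trimmed one. The numbers and the `slope` helper are hypothetical and are not the data behind the chart.

```python
def slope(ys):
    """Least-squares slope of ys against the day index 0..n-1."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

# Hypothetical daily counts that rise, peak, and then fall.
daily = [1, 3, 6, 10, 14, 15, 13, 10, 7, 5, 3, 2]

full_slope = slope(daily)               # trend over every day
cherry_picked_slope = slope(daily[:5])  # trend over the rising days only

print(f"full series:   {full_slope:+.2f} per day")
print(f"cherry-picked: {cherry_picked_slope:+.2f} per day")
```

The full series trends slightly downward, while the trimmed series appears to climb steeply. Nothing in the data changed, only which days were kept.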

2. Wrong Scales

Wrong scale of deaths in log

Scaling bias is another tool of the half-truth researcher. It is achieved simply by rescaling the graph. In this illustration I changed the earlier linear scale on the left axis to a much more alarming logarithmic scale. The deception is still detectable from the non-zero start of the left axis. News organizations can use this method to obscure the truth in plain sight, knowing that their audience rarely reads the scales, or they can leave the scale out altogether.
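A rough way to quantify the effect of rescaling is to ask how high up the axis a given value is drawn. The sketch below uses hypothetical axis limits of 1 and 100 and two illustrative helper functions; it does not reproduce the chart above.

```python
import math

def linear_position(value, low, high):
    """Fraction of the axis height at which value sits on a linear axis."""
    return (value - low) / (high - low)

def log_position(value, low, high):
    """Same fraction on a logarithmic axis; low must be greater than zero."""
    span = math.log10(high) - math.log10(low)
    return (math.log10(value) - math.log10(low)) / span

low, high = 1, 100  # hypothetical axis limits
value = 10          # hypothetical daily count

on_linear = linear_position(value, low, high)
on_log = log_position(value, low, high)

print(f"linear axis: {on_linear:.0%} of the way up")
print(f"log axis:    {on_log:.0%} of the way up")
```

The same count of 10 sits near the bottom of the linear axis but halfway up the logarithmic one, so small numbers look far more prominent after the switch.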

Logarithmic scales still have legitimate applications in the financial markets, where they are used to represent percentage changes. They are attractive because equal percentage moves take up equal distances on the axis, so a 100% increase in the price of a stock or commodity looks the same whether it starts from $10 or from $100.
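The reason log scales suit percentage moves is simple arithmetic, sketched here with made-up prices. The `log_distance` helper is illustrative only.

```python
import math

def log_distance(start, end):
    """Vertical distance between two prices on a base-10 log axis."""
    return math.log10(end) - math.log10(start)

# Two hypothetical stocks that both double in price (a 100% increase).
cheap_stock = log_distance(10, 20)
expensive_stock = log_distance(100, 200)

# On a linear axis the same two moves differ by a factor of ten.
linear_cheap = 20 - 10
linear_expensive = 200 - 100

print(math.isclose(cheap_stock, expensive_stock))
print(linear_cheap, linear_expensive)
```

Both doublings cover the same distance on the log axis, while on a linear axis the move in the expensive stock would dwarf the move in the cheap one.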

3. Statistical Non-Correlation

Statistical Non-Correlation of deaths

In statistics it is often the task of the researcher to show that a variable X correlates with a variable Y. Y may increase as X increases (a positive correlation), or Y may decrease as X increases (a negative correlation). There is also the possibility that X and Y both depend on a third variable Z. For example, X and Y could be the heat and pressure that both depend on a burn rate Z. The final option is that the variables are totally unrelated.
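The third-variable case can be simulated. In the sketch below, heat and pressure are both generated from a burn rate plus independent noise; the coefficients, noise levels, and sample size are assumptions chosen for illustration, and the `pearson` helper is written out by hand so the example needs only the standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical model: Z drives both X and Y; X never influences Y directly.
burn_rate = [rng.uniform(0, 10) for _ in range(500)]       # Z
heat = [2.0 * z + rng.gauss(0, 1) for z in burn_rate]      # X
pressure = [3.0 * z + rng.gauss(0, 1) for z in burn_rate]  # Y

raw = pearson(heat, pressure)

# Subtract the part of each variable explained by Z. The coefficients are
# known here only because the data is simulated.
heat_left = [x - 2.0 * z for x, z in zip(heat, burn_rate)]
pressure_left = [y - 3.0 * z for y, z in zip(pressure, burn_rate)]
adjusted = pearson(heat_left, pressure_left)

print(f"raw correlation:  {raw:.2f}")
print(f"after removing Z: {adjusted:.2f}")
```

Heat and pressure look strongly correlated until the shared dependence on the burn rate is taken out, at which point almost nothing remains.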

In the above chart the number of Coronavirus deaths per day in Minnesota is displayed on the left, with the Intensive Care Unit (ICU) hospitalizations on the center right. For clarity, the average number of deaths on the center left is likewise compared to the average number of ICU hospitalizations on the right. On the surface the ICU hospitalizations may appear correlated with the Coronavirus deaths, and to the untrained observer they are. This is a bogus correlation.

The problem with statistics is that it has been carefully crafted over the centuries to find correlations in information, and sometimes those correlations are fictional. Besides false correlations, mathematicians also have to contend with the illusions of Simpson’s Paradox, in which a trend that appears in individual sets of data vanishes or even reverses when the sets are combined. For example, the University of California, Berkeley was once accused of favoring male applicants over female applicants on the basis of its overall admission figures. Yet when the admission statistics were analyzed for specific disciplines, women fared as well as or better than men in most of them.
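Simpson’s Paradox is easiest to believe once you see the arithmetic. The figures below are invented for illustration and are not the Berkeley admissions data: two hypothetical departments, each with a count of admitted and total applicants for men and for women.

```python
# Hypothetical figures, invented for illustration (not the Berkeley data).
# Each entry is (admitted, applied).
applications = {
    "Dept A": {"men": (80, 100), "women": (18, 20)},
    "Dept B": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applied):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applied

def overall(group):
    """Admission rate for one group with both departments pooled."""
    admitted = sum(dept[group][0] for dept in applications.values())
    applied = sum(dept[group][1] for dept in applications.values())
    return rate(admitted, applied)

for name, groups in applications.items():
    print(name, {g: f"{rate(*counts):.0%}" for g, counts in groups.items()})

overall_men = overall("men")
overall_women = overall("women")
print("Overall", f"men {overall_men:.0%}", f"women {overall_women:.0%}")
```

In this made-up example women are admitted at a higher rate in each department, yet at a lower rate overall, because most of them applied to the department that rejects most applicants.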

My deception becomes obvious once trend lines are added in the following graph.

Adjusted Correlation of deaths

It is these interpretations that tear open the discipline to incompetence, fraud and the general undermining of the scientific process.

Outsmarting The Fake News

In the age of fake news it is easy to be manipulated, but you don’t have to be a fool. You just need to ask yourself four questions any time you see a chart.

  • Is there a bias in the sampling of the information?
  • Is there another variable that is affecting the data?
  • Could additional research contradict this finding?
  • Is the data being overly generalized?

Dave Rauschenfels
Field Service Engineer with a passion for technology and entertaining readers.